Efficient Query Processing Over Inconsistent Databases
by
Ariel Damián Fuxman
A thesis submitted in conformity with the requirements for the degree of Ph.D. in Computer Science
Graduate Department of Computer Science
University of Toronto
Copyright © 2007 by Ariel Damián Fuxman
The presence of inconsistent data is known to be a major problem in enterprises. However,
data analysts often make business decisions based on inconsistent data; and their
database systems rarely give any warning or indication about this situation. In fact,
current database management systems are largely unable to give such a warning because
they rely upon the fundamental assumption that the underlying data is consistent. In
this thesis, we tackle this problem by providing a set of tools that enable users to obtain
meaningful answers from databases even if they are partially inconsistent.
Integrity constraints have long been used by database management systems in order
to maintain data consistency. The typical data design process focuses on developing a set
of constraints that ensure that every possible database reflects a valid, consistent state
of the world. However, integrity constraints may not always be enforced or satisfied for
a number of reasons. For example, when data is integrated from multiple sources, each
source may satisfy a constraint (for example, a key constraint), but the merged data may
not (for example, if the same key value exists in multiple sources). More generally, when
data is exchanged between independently designed sources with different constraints, the
exchanged data may not satisfy the constraints of the destination schema. As another
example, in some environments, checking the consistency of constraints may be too
expensive, particularly for workloads with high update rates. Hence, the database may
become inconsistent with respect to the (unenforced) integrity constraints. In addition
to these long-standing problems, the trend toward autonomous computing is making the
need to manage inconsistent data more acute. In autonomous environments, we can no
Chapter 1. Introduction 2
longer assume that data are married with a single set of constraints that define their
semantics. As constraints are used in an increasing number of roles (from modelling
the query capabilities of a system, to defining mappings between independent sources),
there is an increasing number of applications in which data must be used with a set
of independently designed constraints. In such applications, a static approach where
consistency (with respect to a fixed set of constraints) is enforced on the database may
not be appropriate. Rather, a dynamic approach in which inconsistent data is tolerated,
but consistency is taken into account at query time, permits the constraints to evolve
independently from the data.
One strategy for managing inconsistent databases is data cleaning [DJ03]. Data
cleaning techniques seek to identify and correct errors in the data, and can be used to
restore an inconsistent database to a consistent state. Data cleaning, when applicable,
can be very successful. However, it is necessarily a semiautomatic process, which makes
it infeasible or unaffordable for some applications. Furthermore, committing to a single
cleaning strategy may not always be appropriate. A user may wish to experiment with
different cleaning strategies, or may desire to retain all data, even inconsistent data,
for tasks such as lineage tracing. Finally, data cleaning is only applicable to data that
contains errors. However, the violation of a constraint may also indicate that the data
contains exceptions, that is, clean data which simply does not satisfy a constraint.
In this thesis, we consider inconsistent databases that may violate a set of primary
key constraints. This type of constraint (together with foreign key constraints) is the
most commonly used in commercial database systems. Furthermore, databases that
violate primary key constraints are ubiquitous in enterprises. For example, in the domain
of Customer Relationship Management (CRM), data sources often contain conflicting
information about the same customer. Notably, commercial CRM tools provide limited
support for merging tuples corresponding to the same customer into one tuple in the
integrated database. Although they typically support some form of conflict resolution
rules (e.g., rules that take the average between two conflicting incomes of the same
customer), these rules may be difficult to design. In the absence of conflict resolution
rules, some CRM tools transfer all conflicting tuples to the integrated database. Thus,
even if the sources satisfy the key constraints, the integrated database may not.
1.2 Consistent Query Answering
While it is well known how to answer queries over consistent databases, we must give
a clear and precise semantics to the notion of a “meaningful” answer obtained from an
inconsistent database. In this thesis, we make use of a semantics based upon the notions
of possible worlds and certain answers, concepts that are widely used not only in the
context of database theory and data integration [Lip79, Lip81, AKG87, AD98], but also
in the field of knowledge representation [Lev81, Moo85]. These notions were first adapted
to the context of inconsistent databases by Arenas, Bertossi and Chomicki [ABC99], who
defined the semantics of consistent query answers.
The semantics of consistent query answers relies on the intuition that an inconsistent
database can be cleaned (or “repaired”) by adding or deleting tuples in such a way that
the resulting database satisfies some given integrity constraints. The semantics is agnostic
about which tuples should be added or removed. Therefore, each inconsistent database
may be associated with more than one clean, consistent database. A consistent answer is
then an answer that is obtained from every possible consistent database. Intuitively, this
means that the consistent answers are obtained no matter how the database is cleaned.
The semantics of consistent query answers provides a sound and elegant basis for the
study of the problem of query answering over inconsistent databases. However, despite
considerable work on its theoretical underpinnings [ABC99, CB00, ABC+03b, CLR03a,
CLR03b, BB03a, BB03b, CM05], to the best of our knowledge, little work has been
done on its practical applications. A key contribution of this thesis is to bridge the
gap between theory and practice by providing an efficient and scalable system to obtain
consistent query answers from inconsistent databases. In particular, we report the design
and evaluation of ConQuer, a system for managing inconsistent data.1 In ConQuer, a
user may postulate a set of integrity constraints, possibly at query time, and the system
automatically retrieves all (and only) the query answers that are consistent with respect
to the constraints. ConQuer also helps users take advantage of the query results in order
to interactively clean the inconsistent database.
1 ConQuer stands for Consistent Querying. ConQuer’s web page can be found at www.cs.toronto.edu/db/conquer.
     emplKey  salary
t1   John     1000
t2   John     2000
t3   Mary     1000
Figure 1.1: An inconsistent database
The major challenge in consistent query answering is the potentially huge number
of consistent databases that can be associated with a given inconsistent database. In
the case of primary key constraints, which is the focus of this thesis, the number of
consistent databases is exponential in the size of the inconsistent database. This problem
is tackled in ConQuer by implementing a query rewriting approach. Given a query q,
ConQuer rewrites q into another query Q that has the following property: for every
inconsistent database, the rewritten query Q retrieves the consistent answers for the original
query q. The rewriting is done independently of the data, and works on every inconsistent
database. This approach has two fundamental advantages. First, it avoids constructing
the (potentially huge number of) consistent databases associated with the inconsistent
database. Second, the rewritten query is a SQL query that can be executed using any
commercial relational database management system (in ConQuer, we use IBM’s DB2).
In an extensive set of experiments, reported in Chapter 7, we show that the overhead
in the execution of the rewritten queries is reasonable, when compared to the original
(non-rewritten) ones.
In the next example, we illustrate the semantics of consistent answers and the query
rewriting approach.
Example 1.1. Consider the database of Figure 1.1, which contains information about
employees and their salaries. In particular, the schema of the database has one relation
called employee, with two attributes: emplKey (the name of the employee) and salary.
Assume that a user specifies that the key of the relation should be the attribute
emplKey. Note that the database violates this key constraint, perhaps because its data
has been integrated from many operational sources. In particular, there are two tuples
for employee John, one stating that he makes a salary of 1000, and the other stating that
he makes a salary of 2000. Suppose that we do not know which one of these alternatives is
correct, but we still want to be able to draw meaningful answers from the database. Let
us consider the consistent databases (i.e., databases that satisfy the key constraint) that
can be built from the inconsistent database. We would like these databases to be not
only consistent, but also “as close as possible” to the inconsistent database. This leaves
us with two possible consistent databases (shown in Figure 1.2), obtained by deleting
exactly one tuple for John in each of them.
     emplKey  salary          emplKey  salary
t1   John     1000       t2   John     2000
t3   Mary     1000       t3   Mary     1000
Consistent database 1    Consistent database 2
Figure 1.2: Consistent databases for the inconsistent database of Figure 1.1
Consider a query q1 that retrieves the employees whose salary is less than
or equal to 1000.
q1: select distinct emplKey
from employee
where salary <= 1000
If we execute this query directly over the inconsistent database, we obtain {John, Mary}.
Intuitively, this is not a “consistent” answer because it may be the case that John has a
salary over 1000. In fact, if the consistent database turns out to be the database on the
right-hand side of Figure 1.2, then John would not appear in the answer.
One strategy to obtain the “consistent answer” would be to apply query q1 to each
of the consistent databases of Figure 1.2. While this may be feasible in this simple
example, it is clearly impractical when the number of tuples violating the constraint
grows. In particular, even for the schema and single constraint of this example, the
number of consistent databases is exponential in the size of the inconsistent database.
For this reason, in ConQuer, we never build the consistent databases explicitly. Instead,
we follow a query rewriting approach, where we rewrite the original query (q1 in this
case) into another query that can be executed directly on the inconsistent database and
is guaranteed to always return the consistent answers for the original query.
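The strategy of applying q1 to each consistent database can be sketched in a few lines of Python. This is only an illustrative sketch and not how ConQuer works: the function names consistent_databases and q1 are hypothetical, tuples are modelled as (emplKey, salary) pairs, and each consistent database is assumed to keep exactly one tuple per key value, as in Figure 1.2.

```python
from itertools import product

# The inconsistent database of Figure 1.1 as (emplKey, salary) pairs.
employee = [("John", 1000), ("John", 2000), ("Mary", 1000)]

def consistent_databases(rows):
    """Yield every database that keeps exactly one tuple per emplKey value."""
    groups = {}
    for key, salary in rows:
        groups.setdefault(key, []).append((key, salary))
    # One choice per key group: the number of combinations is the product
    # of the group sizes, which is exponential in the worst case.
    for choice in product(*groups.values()):
        yield set(choice)

def q1(db):
    """select distinct emplKey from employee where salary <= 1000"""
    return {key for key, salary in db if salary <= 1000}

repairs = list(consistent_databases(employee))
# A consistent answer is one that is returned from every consistent database.
consistent_answers = set.intersection(*(q1(db) for db in repairs))
print(len(repairs), consistent_answers)
```

For the database of Figure 1.1 this prints 2 and {'Mary'}. With n key values that each have two conflicting tuples, the same loop would enumerate 2^n databases, which is why the rewriting approach avoids it.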
In this case, it is quite simple to obtain a rewriting of q1. Notice that John appears
associated with two different salaries in the inconsistent database: one satisfying the
query, the other not. This suggests that in the rewriting we should return the employees
that satisfy q1 (i.e., have a salary less than or equal to 1000) in every tuple of the
inconsistent database where they appear. This can be obtained using the following
query:
Q1: select distinct e.emplKey
    from employee e
    where e.salary <= 1000
    and not exists (select *
                    from employee e2
                    where e2.emplKey = e.emplKey
                    and e2.salary > 1000)
Notice the use of a nested subquery related by not exists. The purpose of this
subquery is to filter out those key values that satisfy q1 in some tuples, but violate it in
others. In our example, this subquery filters John out of the answer because he appears
in tuple t2 with a salary above 1000.
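The effect of the rewriting can be checked on the data of Figure 1.1. The sketch below is an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory SQLite database rather than the DB2 system used in ConQuer, and the inner alias is written e2 because a primed name is not a valid SQL identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No key is declared on emplKey, so the conflicting tuples can coexist.
conn.execute("create table employee (emplKey text, salary integer)")
conn.executemany(
    "insert into employee values (?, ?)",
    [("John", 1000), ("John", 2000), ("Mary", 1000)],
)

# Original query q1.
q1 = "select distinct emplKey from employee where salary <= 1000"

# Rewritten query Q1: keep a key value only if none of its tuples violates q1.
Q1 = """
select distinct e.emplKey
from employee e
where e.salary <= 1000
  and not exists (select *
                  from employee e2
                  where e2.emplKey = e.emplKey
                    and e2.salary > 1000)
"""

original = sorted(row[0] for row in conn.execute(q1))
rewritten = sorted(row[0] for row in conn.execute(Q1))
print(original, rewritten)
```

On this data the original query returns John and Mary, while the rewritten query returns only Mary.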
Despite the simplicity of the previous example, it has been shown in the literature
[CLR03a, CM05] that there are Select-Project-Join queries for which there is no rewriting
into SQL (under a very likely complexity-theoretic assumption). However, we observe
that the presence of these negative results does not necessarily preclude the existence of
classes of queries for which there is a SQL rewriting. In fact, in Chapter 3, we show a
large and practical class of Select-Project-Join queries for which there is a SQL rewriting.
In Chapter 5, we show that this is a maximal class of queries, in the sense that minimal
relaxations of its conditions lead to queries for which there is no SQL rewriting.
Most of the previous work on consistent query answering (except [ABC+03b]) focuses
on queries with set semantics and no aggregation. However, practical query languages
like SQL have bag semantics (duplicates are not eliminated unless explicitly requested),
and support aggregation functions and grouping of results. In Chapter 2, we present
a generalization of the semantics of consistent answers for queries with bag semantics,
grouping and aggregation. In Chapter 4, we provide query rewritings that work under
this semantics.
In the thesis, we are concerned not only with the correctness of the rewritings (i.e.,
ensuring that they retrieve all and only the consistent answers), but also with their
efficiency when executed using existing database technology. We address efficiency issues
and their empirical validation in Chapters 6 and 7.
1.3 Contributions
The main contributions of this thesis are the following:
• We identify a large and practical class of Select-Project-Join queries for which the
problem of computing consistent answers is tractable. The class consists of queries
that can have two kinds of joins. First, they can have joins between key attributes.
Second, they can have joins from non-key attributes of a relation (possibly a foreign
key) to the primary key of another relation. Arguably, these two types of joins are
the most commonly used in practice (and certainly the most common in industry
standard benchmarks like TPC-H). (Chapter 3)
• For the class of tractable queries that we identify, we provide a query rewriting
algorithm that produces a query in first-order logic that returns the consistent answers.
The algorithm runs in polynomial time in the size of the query. The rewritings
are sound and complete, in the sense that they return all (and only) the consistent
answers. Since first-order queries can be written in SQL, the rewritings in
first-order logic are a first step towards reusing existing commercial database technology.
This work was first published at the International Conference on Database Theory
(ICDT) [FM05], and an extended journal version has been invited to the Journal
of Computer and System Sciences (JCSS) [FM06]. (Chapter 3)
• We consider not only Select-Project-Join queries with set semantics, but also queries
with bag semantics, grouping and aggregation. These extensions are needed to
enable practical use in decision support applications. For this purpose, we extend
the semantics of consistent answers originally proposed by Arenas, Bertossi and
Chomicki [ABC99, ABC+03b]. We provide sound and complete algorithms
under this semantics for the most common SQL aggregation functions (count, min,
max, sum). This work has been published at the ACM International Conference
on the Management of Data (SIGMOD) [FFM05a]. (Chapters 2 and 4)
• We show a large class of Select-Project-Join queries for which the conditions of
applicability of our rewriting algorithm are not only sufficient but also necessary.
In particular, we show a class in which the problem of computing the consistent
answers is coNP-complete (and, assuming P ≠ NP, inexpressible in first-order logic)
for every query of the class that violates the conditions of the class of queries for
which we give a rewriting algorithm. This type of result is stronger than the
complexity results given in the consistent query answering literature [CLR03a, CM05],
which consist of showing intractability of a class by exhibiting at least one query for
which the problem is intractable. As a corollary of our result, we get a dichotomy
for this class of queries: given a query q in our class, either the problem of
computing the consistent answers for q is first-order rewritable (and thus it is in PTIME),
or it is a coNP-complete problem. (Chapter 5)
• We present the implementation of ConQuer, a system for querying inconsistent
databases. We also explain in detail the SQL rewritings produced by the system.
ConQuer has been demonstrated at the International Conference on Very Large
Databases (VLDB) [FFM05b]. (Chapter 6)
• We study the running time of ConQuer’s SQL rewritings on a commercial database
system, in particular IBM DB2. To this end, we present a detailed performance
study using the data and queries of the TPC-H decision support benchmark. The
study focuses on the overhead of the rewritings, using the original (non-rewritten)
queries as a baseline. We study the scalability of the approach (with databases of
up to 172 million tuples), and the effect of the degree of inconsistency (in terms
of the percentage of tuples that are inconsistent and the number of conflicting
tuples per key value). The experiments show that our approach can be applied to
large databases, several orders of magnitude larger than those considered in other
approaches for querying inconsistent databases. (Chapter 7)
1.4 Organization of the Document
The rest of this document is organized as follows. In Chapter 2, we present the formal
framework for querying inconsistent databases that will be used throughout the thesis.
In Chapters 3 and 4, we present query rewritings and focus on proving their correctness.
In Chapter 3, we consider a large and practical class of conjunctive queries (that is,
Select-Project-Join queries) and present rewritings in first-order logic. In Chapter 4, we
consider queries with bag semantics, grouping and aggregation, and present rewritings
in an extension of first-order logic with grouping and aggregation functions. In Chapter
5, we show the maximality of the class of queries that is the input to the rewriting
algorithms.
In Chapter 6, we present ConQuer, a system for efficiently querying inconsistent
databases. We present in detail the SQL query rewritings produced by ConQuer for
queries with and without aggregation. The efficiency of these rewritings is empirically
validated in Chapter 7 with an extensive set of experiments. We present related work in
separate sections at the end of each of the chapters. In Chapter 8, we finish the document
with conclusions and directions for future work.
Chapter 2
Formal Framework
In this chapter, we present the formal framework that will be used throughout the thesis.
In this framework, an inconsistent database is associated with a space of consistent
databases called repairs. In Section 2.1, we formally define the notion of repair. Then, in
Section 2.2, we introduce the semantics for query answering over inconsistent databases.
This semantics involves the exploration of all repairs of an inconsistent database. Since
the number of repairs can be very large, in this thesis we advocate a query rewriting
approach, where queries are rewritten in such a way that their consistent answer can be
obtained by posing another query directly on the inconsistent database, without explicitly
building any repair. In Section 2.3, we formally define the notion of a query rewriting.
Finally, in Section 2.4, we introduce the integrity constraints that are the focus of this
thesis.
2.1 Repairs
A schema R is a finite collection of relation symbols, each of which has an associated
arity. A database instance (or database) I over R is a function that associates each
relation symbol r of R to a relation I(r). A relation I(r) of arity k is a set of k-tuples
whose elements belong to some underlying fixed domain.1 Whenever it is clear from
context, we will abuse notation and use the same symbol r to denote both a relation
symbol and a relation. Given a tuple $\vec{t}$ occurring in relation I(r), we denote by $r(\vec{t})$ the
association between $\vec{t}$ and r.
1 Although we will consider both set and bag semantics for queries, we always assume the relations of a database instance (including inconsistent databases) to be sets.
A database instance I is consistent with respect to a set of integrity constraints Σ if
I satisfies Σ in the standard model-theoretic sense, that is, $I \models \Sigma$. (As is customary, an
integrity constraint may be any first-order formula [AHV95].) Throughout this thesis,
we will consider databases that may violate a given set of integrity constraints. That is,
given R and a set of integrity constraints Σ over R, a database I may be inconsistent with
respect to Σ, that is, $I \not\models \Sigma$.
Intuitively, we will assume that an inconsistent database can be cleaned (or “repaired”)
by adding or deleting tuples in such a way that the resulting database satisfies
the given integrity constraints. We will be agnostic about which tuples should be added
or removed. Therefore, each inconsistent database may be associated with more than one
possible clean, consistent database. Furthermore, no matter how the clean databases are
obtained, we would like them to be “as close as possible” to the original, inconsistent
database (that is, to minimize the number of tuples that are added or removed). We will
call each consistent database a repair.
The notion of repair was originally introduced by Arenas, Bertossi and Chomicki
[ABC99]. A repair is a database instance that satisfies the given integrity constraints,
and which has a minimal distance to the inconsistent database. The distance between
two database instances $I$ and $I'$ is defined as their symmetric difference, i.e., $\Delta(I, I') =
(I - I') \cup (I' - I)$. The formal definition of repair is the following.
Definition 2.1 (Repair [ABC99]). Let $I$ be a database instance, and $\Sigma$ be a set of
integrity constraints. We say that an instance $I'$ is a repair of $I$ with respect to $\Sigma$ if:2
• $I' \models \Sigma$, and
• there is no instance $I''$ such that $I'' \models \Sigma$ and $\Delta(I, I'') \subset \Delta(I, I')$ (i.e., $\Delta(I, I')$ is
minimal under set inclusion in the class of instances that satisfy $\Sigma$).
Example 2.1. Let R be a schema with one relation symbol employee. Assume that
employee has two attributes: emplKey (the name of the employee) and salary, and
that the only constraint in Σ is that attribute emplKey is the key of relation employee.
Let I = {employee(John, 1000), employee(John, 2000), employee(Mary, 1000)}. The
database I is inconsistent with respect to Σ because it violates the key constraint stating
that every employee has exactly one salary.
There are two repairs of I wrt Σ: I1 = {employee(John, 1000), employee(Mary, 1000)}
and I2 = {employee(John, 2000), employee(Mary, 1000)}. Notice that, according to
Definition 2.1, the databases {employee(John, 2000)} and {employee(Mary, 1000)} are
not repairs because their distance with respect to I is not minimal under set inclusion.
2 Whenever Σ is clear from the context, we will just say that $I'$ is a repair of $I$.
The minimality condition for repairs is crucial in the definition. Otherwise, the empty
set would trivially be a repair of every database that violates a set of key constraints.
Notice that repairs do not need to be unique. For example, if the given set of
constraints consists of key dependencies, the number of repairs can be exponential in the
size of the inconsistent database.
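Definition 2.1 can be checked by brute force on the instance of Example 2.1. The sketch below is illustrative only, and the names satisfies_key, subsets and repairs are hypothetical. It assumes a single key constraint on emplKey, and it restricts the candidate instances to subsets of I, which is safe for key constraints because adding a tuple can never remove a key violation and only enlarges the symmetric difference.

```python
from itertools import combinations

# The inconsistent instance I of Example 2.1 as (emplKey, salary) pairs.
I = frozenset({("John", 1000), ("John", 2000), ("Mary", 1000)})

def satisfies_key(instance):
    """True if no two tuples share the same emplKey value."""
    keys = [key for key, _ in instance]
    return len(keys) == len(set(keys))

def subsets(instance):
    """Yield every subset of the instance as a frozenset."""
    items = list(instance)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def repairs(instance):
    """Consistent subsets whose symmetric difference to the instance is
    minimal under set inclusion."""
    consistent = [s for s in subsets(instance) if satisfies_key(s)]
    result = []
    for candidate in consistent:
        delta = instance ^ candidate  # symmetric difference
        # Reject the candidate if some consistent instance is strictly closer.
        if not any((instance ^ other) < delta for other in consistent):
            result.append(candidate)
    return result

found = repairs(I)
print(len(found))
```

The two instances found are I1 and I2. The instance {employee(John, 2000)} and the empty instance are consistent, but they are rejected by the minimality test. Under a single key constraint, the number of repairs is the product of the sizes of the groups of tuples that share a key value.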
2.2 Query Answering Semantics
The notion of repair can be used to give a precise meaning to query answering over
inconsistent databases. Intuitively, each repair corresponds to one particular way of
cleaning the database. Since we are agnostic about how the database should be cleaned,
it makes sense to consider the answers that would be obtained from every repair. This
notion is formalized with the concept of consistent answers, which we define next.
Definition 2.2 (Consistent Answer [ABC99]). Let R be a schema. Let Σ be a set
of integrity constraints. Let I be an instance over R (possibly inconsistent with respect
to Σ). Let q be a query over R. We say that a tuple $\vec{t}$ is a consistent answer for q with
respect to Σ if $\vec{t} \in q(I')$ for every repair $I'$ of I with respect to Σ. We denote this as
$\vec{t} \in \mathit{consistent}_\Sigma(q, I)$.
This definition was originally given by Arenas, Bertossi and Chomicki [ABC99]. It is
based on the semantics of certain answers [Lip79, Lip81, AKG87] that has been used in
database theory, and possible worlds, which is well-known in knowledge representation
[Lev81]. In the case of consistent answers, the space of possible worlds corresponds to
the repairs of the inconsistent database.
Example 2.1. (continued) Consider a query that retrieves all the employees from
the database, expressed as q1(e) = ∃s.employee(e, s). Recall that there are two
repairs of I wrt Σ: I1 = {employee(John, 1000), employee(Mary, 1000)} and I2 =
{employee(John, 2000), employee(Mary, 1000)}. The result of applying q1 on both I1
and I2 is {(John), (Mary)}. Thus, the consistent answers for q1 on I are the tuples
(John) and (Mary).
Now, consider a query that retrieves employees together with their salaries, expressed
as q2(e, s) = employee(e, s). Notice that q2 is the identity on the repairs. Thus, the
consistent answers can be obtained as the intersection of I1 and I2. In consequence, the only
consistent answer for q2 on I is (Mary, 1000). Notice that the tuples (John, 1000) and
(John, 2000) are not consistent answers. The reason is that neither of them is present
in both repairs. Intuitively, this reflects the fact that John’s salaries are inconsistent data,
and we do not want to retrieve possibly erroneous results.
For convenience, we will use the following notation for the consistent answers of
Boolean queries.
Definition 2.3. Let R be a schema. Let Σ be a set of integrity constraints. Let
I be a database instance over R. Let q be a Boolean query over R. We say that
$\mathit{consistent}_\Sigma(q, I) = \mathit{true}$ if $I' \models q$ for every repair $I'$ of I with respect to Σ. We
say that $\mathit{consistent}_\Sigma(q, I) = \mathit{false}$ if there exists at least one repair $I'$ of I with respect
to Σ such that $I' \not\models q$.
Notice the asymmetry between the case for $\mathit{consistent}_\Sigma(q, I) = \mathit{true}$ and
$\mathit{consistent}_\Sigma(q, I) = \mathit{false}$. While for the former, every repair must satisfy the query,
for the latter it suffices to have just one non-satisfying repair. This is not intrinsic to
Boolean queries: by Definition 2.2, it is also the case that $\vec{t} \notin \mathit{consistent}_\Sigma(q, I)$ if there
exists at least one repair $I'$ such that $\vec{t} \notin q(I')$.
The definition of consistent answers is independent of the language used to express
the input query q, and it makes perfect sense for queries that, for example, return tuples
from the active domain of the database. However, for queries that compute aggregates
over groups of tuples, it may be useful to relax this definition, as we motivate next.
Example 2.1. (continued) Let q3(s, v) be a SQL query that counts the number of
occurrences of each salary in the database:
select salary as s, count(*) as v
from employee
group by salary
Recall that there are two repairs of I with respect to Σ: I1 = {employee(John, 1000),
employee(Mary, 1000)} and I2 = {employee(John, 2000), employee(Mary, 1000)}. The
result of applying query q3 to the repairs is the following: q3(I1) = {(1000, 2)}, and
q3(I2) = {(1000, 1), (2000, 1)}. Since the intersection of these results is empty, according
to Definition 2.2, the set of consistent answers for q3 is empty. However, notice that the
salary 1000 appears in every query result (but together with a different number for the
count of occurrences). Intuitively, it would be desirable to report this salary in the result.
In the previous example, the value 1000 appears in every query result. However, it
appears a different number of times in each of them. How do we report the number of
times that it appears? In the semantics that we define next, we employ tight bounds
for this purpose. In this particular example, we will say that the minimum (greatest
lower bound) is one, since the salary 1000 is counted exactly once in q3(I2); and that the
maximum (lowest upper bound) is two, since the salary 1000 is counted exactly twice in q3(I1).
In the following definition, we formalize this notion. The definition applies to any query
that computes an aggregate over a group (in our example, the aggregate is the count
of occurrences of each salary). We will denote with aggconsistentΣ(q, I) the modified
semantics for consistent answers for a query q on an instance I with respect to a set of
constraints Σ.
Definition 2.4 (Consistent Answer for Queries with Aggregation). Let R be
a schema. Let Σ be a set of integrity constraints. Let I be a database instance over
R. Let q be a query over R with free variables $\vec{z}$ and $v$, where $v$ is a variable over a
numeric domain (possibly computed by an aggregate function). We say that
$(\vec{t}, \mathit{glb}, \mathit{lub}) \in \mathit{aggconsistent}_\Sigma(q, I)$ if all the following conditions hold:
• for every repair $I'$ of I wrt Σ, there is some $d$ such that $(\vec{t}, d) \in q(I')$ and $\mathit{glb} \leq d \leq \mathit{lub}$; and
• there is some repair $I'$ of I wrt Σ such that $(\vec{t}, \mathit{glb}) \in q(I')$; and
• there is some repair $I'$ of I wrt Σ such that $(\vec{t}, \mathit{lub}) \in q(I')$.
We also say that glb is the greatest lower bound of $\vec{t}$ in q, and that lub is the lowest
upper bound of $\vec{t}$ in q.
This definition is particularly well suited to the case of queries with bag semantics,
grouping and aggregation, which are prevalent in practice. For instance, consider the
query q3(s, v) of Example 2.1:
select salary as s, count(*) as v
from employee
group by salary
In this case, q3 has free variables s and v. The variable s corresponds to the attribute
salary, on which there is a grouping condition; the numerical argument v, for which we
give tight ranges, corresponds to the result of count(*). Essentially, for a query $q(\vec{z}, v)$,
$\mathit{aggconsistent}_\Sigma(q, I)$ gives the consistent answers on I with respect to Σ for each value
of $\vec{z}$ (the salary in our example), together with a tight range for the possible associated
numerical values.
Example 2.1. (continued) Let us obtain the $\mathit{aggconsistent}_\Sigma$ answers for q3 on I.
Recall that the result of applying q3 to the repairs of the inconsistent database is: q3(I1) =
{(1000, 2)}, and q3(I2) = {(1000, 1), (2000, 1)}. Then, we have that $\mathit{aggconsistent}_\Sigma(q3, I)$ =
{(1000, 1, 2)}. This means that the salary 1000 appears in every query result, and the
value of count(*) for 1000 has a greatest lower bound of one and a lowest upper bound
of two. Notice that the salary 2000 does not appear in $\mathit{aggconsistent}_\Sigma(q3, I)$. The
intuitive reason is that 2000 is not a consistent answer, since it does not occur in repair I1.
According to the definition of $\mathit{aggconsistent}_\Sigma$ above, 2000 is not in the answer because
it fails to satisfy the first condition of Definition 2.4. This condition is violated because
I1 is a repair such that (2000, d) ∉ q3(I1), for every d.
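The bounds of Definition 2.4 can be computed by brute force once the repairs are known. The following is a minimal sketch, not the rewriting developed in Chapter 4. The names q3 and agg_consistent are hypothetical, and the sketch assumes that the query returns at most one numeric value per group on each repair, so that a query result can be stored as a dictionary from group to value.

```python
from collections import Counter

# The two repairs of Example 2.1 as (emplKey, salary) pairs.
I1 = {("John", 1000), ("Mary", 1000)}
I2 = {("John", 2000), ("Mary", 1000)}

def q3(db):
    """select salary as s, count(*) as v from employee group by salary"""
    return dict(Counter(salary for _, salary in db))

def agg_consistent(query, repairs):
    """Return the set of (group, glb, lub) triples of Definition 2.4."""
    results = [query(repair) for repair in repairs]
    # First condition: the group must appear in the result on every repair.
    groups = set.intersection(*(set(result) for result in results))
    answers = set()
    for group in groups:
        values = [result[group] for result in results]
        # The minimum and maximum are each attained on some repair,
        # so the second and third conditions hold for them.
        answers.add((group, min(values), max(values)))
    return answers

print(agg_consistent(q3, [I1, I2]))
```

For the repairs I1 and I2 this prints {(1000, 1, 2)}. The salary 2000 is dropped because it is missing from q3(I1).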
To the best of our knowledge, the problem of computing consistent answers for queries
with aggregation has only been studied before by Arenas et al. [ABC+03b]. In particular,
they were the first to propose a generalization of the semantics of consistent answers,
where ranges rather than exact values are returned. In their work, they consider a class
of SQL queries with no grouping, no selection conditions (i.e., no conditions in the where
clause) and on exactly one relation. In Chapter 4, we will present results for a much
larger class of queries. For the class of queries considered by Arenas et al., our and their
semantics coincide. However, we need to extend their semantics in order to be able to
deal with grouping.
2.3 Query Rewritings
The definition of consistent answers introduced in the previous section involves the explo-
ration of a potentially huge number of repairs (in the case of keys, it can be exponential in
the size of the inconsistent database). In this thesis, we approach this problem by design-
ing algorithms that compute consistent answers directly from the inconsistent database,
without explicitly building the repairs. Given a query q, our algorithms will return an-
other query Q such that, for every instance I, the consistent answers for the original
query q can be obtained by just evaluating Q on I. We call Q a query rewriting for the
problem of computing the consistent answers of q.
In order to give a formal definition of query rewriting, we first define the computational
problems associated with computing consistent answers using the consistentΣ and
aggconsistentΣ operators (the latter for the case in which the query computes numerical
values over a group of tuples).
Definition 2.5. Let R be a schema. Let q be a query over R. Let Σ be a set of integrity
constraints.
The problem CONSISTENT(q, Σ) is the following: given an instance I over R, and
tuple ~t, is it the case that ~t ∈ consistentΣ(q, I)?
The problem AGGCONSISTENT(q, Σ) is the following: given an instance I over R, tuple
~t and real numbers glb and lub, is it the case that (~t, glb, lub) ∈ aggconsistentΣ(q, I)?
We can now define the notion of query rewriting for the problems CONSISTENT(q, Σ)
and AGGCONSISTENT(q, Σ). The definition is given for a fixed (but undefined) query
language.
Definition 2.6 (L-query rewriting). Let R be a schema. Let Σ be a set of integrity
constraints. Let q be a query over R. Let Q be a query expressed in a query language L
(possibly different from the language used to express q).
We say that Q is an L-rewriting of CONSISTENT(q, Σ) if for every instance I over R
We will show in Chapter 5 that the problem of computing consistent answers for the
above queries is intractable. The first query consists of a join between nonkey attributes;
the second one involves a cycle of nonkey-to-key joins; and in the third, there are two
joins from nonkey variables to part, but not the entire key, of the corresponding relations.
In order to be more precise in specifying such conditions, we need the notion of the join
graph of a query, which has a node for each literal of a query. Notice that the conditions
Chapter 3. Rewritings for Conjunctive Queries 24
that we just gave are concerned with joins where at least one nonkey variable is involved.
Therefore, the join graph will be a directed graph, where directionality is determined by
the nonkey variables involved in the join.
Definition 3.1 (join graph). Let q be a conjunctive query. The join graph G of q is a
directed graph such that:
• the vertices of G are the literals of q;
• there is an arc from Ri to Rj if i ≠ j, and there is some variable w such that w is
existentially-quantified in q, w occurs at the position of a nonkey attribute in Ri,
and w occurs in Rj.
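A minimal sketch of Definition 3.1 follows. The representation is an assumption made for illustration: a literal is a triple (relation, key terms, nonkey terms), variables are lowercase strings, and anything else is treated as a constant. The function `join_graph` returns the arcs as pairs of literal positions.

```python
def is_var(term):
    # Assumed convention: variables are lowercase strings, constants are not.
    return isinstance(term, str) and term.islower()

def join_graph(literals, free_vars):
    """Arcs (i, j) such that some existentially-quantified variable occurs
    at a nonkey position of literal i and anywhere in literal j."""
    arcs = set()
    for i, (_, _, nonkey_i) in enumerate(literals):
        for j, (_, key_j, nonkey_j) in enumerate(literals):
            if i == j:
                continue
            terms_j = set(key_j) | set(nonkey_j)
            if any(is_var(w) and w not in free_vars and w in terms_j
                   for w in nonkey_i):
                arcs.add((i, j))
    return arcs

# A key-to-nonkey chain: employee -> city -> prov, and employee -> dept.
q4 = [("employee", ("e",), ("d", "c")), ("city", ("c",), ("p",)),
      ("prov", ("p",), ("Canada",)), ("dept", ("d",), ("Peter",))]
print(sorted(join_graph(q4, {"e"})))   # [(0, 1), (0, 3), (1, 2)]

# A nonkey-to-nonkey join yields arcs in both directions, hence a cycle.
qnk = [("r1", ("x",), ("y",)), ("r2", ("x2",), ("y",))]
print(sorted(join_graph(qnk, set())))  # [(0, 1), (1, 0)]
```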
Notice that key-to-key joins do not introduce any arcs to the join graph. Since the
class of first-order rewritable queries that we will present shortly is defined in terms of
the join graph, its queries can have arbitrary key-to-key joins. Further, the free variables
of a query do not introduce arcs to the join graph. As a special case, if all the variables
of a query are free, then its join graph has no arcs. Such queries correspond to the
class of quantifier-free queries, and have already been shown to be first-order rewritable
[ABC99]. If we think in terms of equivalent SQL queries, the fact that all variables are
free means that every attribute of every relation in the from clause must appear in the
select clause.¹ This is a strong condition that restricts the practical applicability of
the class. As an empirical observation, none of the queries in the TPC-H specification
[TPC03], the industry standard for decision support systems, satisfy this restriction. For
this reason, we will focus on a class of conjunctive queries that may have existential
quantification (in relational algebra terms, arbitrary projections). Handling queries with
existentially-quantified variables is a major challenge, which we address in this chapter.
In Figure 3.1, we show the join graphs for q1 and q2 (we label the arcs with the variable
involved in the joins for illustration purposes). Observe in the figure that both join graphs
have a cycle. For our rewriting algorithm, we will focus on queries that have an acyclic
join graph. Additionally, when we consider how two literals Ri and Rj are joined, we will
require that if any of the key attributes of Ri are joined with a nonkey attribute of Rj,
then all of the key attributes of Ri join with nonkey attributes of Rj. We will then say
that the query has only full nonkey-to-key joins. For example, in the query q3 above, of
¹ The only exception is the attributes that are equated in the where clause. In that case, only one
of the equated attributes needs to appear in the select clause.
the form ∃x, x′, w, w′, z, z′, m. R1(x, w) ∧ R2(m, w, z) ∧ R3(x′, w′) ∧ R4(m, w′, z′), the joins
between R1 and R2, and between R3 and R4, are not full since they do not involve the
entire key of R2 and R4, respectively.
Definition 3.2. Let q be a conjunctive query. Let Ri(~xi, ~yi) and Rj(~xj, ~yj) be a pair of
literals of q. We say that there is a full nonkey-to-key join from Ri to Rj if every variable
of ~xj appears in ~yi.
We observe that if G is an acyclic join graph for a query all of whose nonkey-to-key
joins are full, then G must be a forest. We show this with the following proposition.
Proposition 3.3. Let q be a query all of whose nonkey-to-key joins are full. Let G be
the join graph of q. If G is acyclic, then G is a forest.
Proof. Assume towards a contradiction that G is a directed acyclic graph that is not a
forest. Then, there is a node v in G that receives arcs from two different nodes vi and vj
of G. Let R(~x, ~y), Ri(~xi, ~yi), and Rj(~xj, ~yj) be the literals at the nodes of v, vi, and vj,
respectively. Since there are arcs from vi and vj to v, there are variables wi and wj in
~yi and ~yj, respectively, that appear in R. Since G is acyclic, wi and wj must appear in
~x. Also, wj cannot appear in a nonkey position of Ri (or, otherwise, there would be a
cycle between the nodes vi and vj). Since there is a nonkey-to-key join from Ri to R on
variable wi, and variable wj does not occur at a nonkey position of Ri, the join is not
full; contradiction.
3.1.3 The Class Cforest of First-Order Rewritable Queries
We will now characterize a broad class of conjunctive queries for which the problem of
computing consistent answers under key constraints is tractable and first-order rewritable.
The characterization is given in terms of the join graph of the queries. In particular, we
will require three conditions. First, all the nonkey-to-key joins of the query must be full.
Second, the join graph must be a forest. As we showed in Proposition 3.3, this includes
all queries with full nonkey-to-key joins with acyclic join graph. Finally, the query should
have no repeated relation symbols. We call this class Cforest since we require the join
graph of its queries to be a forest, and we give the formal definition next.
Definition 3.4. Let q be a conjunctive query without repeated relation symbols and all of
whose nonkey-to-key joins are full. Let G be the join graph of q. We say that q ∈ Cforest
if G is a forest (i.e., every connected component of G is a tree).
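The three conditions of Definition 3.4 can be checked mechanically. The sketch below is one possible reading, under assumptions of our own: literals are triples (relation, key terms, nonkey terms), variables are lowercase strings, and a forest is detected by requiring that every node has at most one incoming arc and that following parent links never loops.

```python
def is_var(term):
    # Assumed convention: variables are lowercase strings, constants are not.
    return isinstance(term, str) and term.islower()

def in_cforest(literals, free_vars):
    names = [name for name, _, _ in literals]
    if len(names) != len(set(names)):
        return False                      # repeated relation symbol
    parent = {}
    for i, (_, _, nonkey_i) in enumerate(literals):
        joining = {w for w in nonkey_i if is_var(w) and w not in free_vars}
        for j, (_, key_j, nonkey_j) in enumerate(literals):
            if i == j or not joining & (set(key_j) | set(nonkey_j)):
                continue                  # no arc from literal i to literal j
            key_vars = {x for x in key_j if is_var(x)}
            if joining & set(key_j) and not key_vars <= set(nonkey_i):
                return False              # a nonkey-to-key join that is not full
            if parent.setdefault(j, i) != i:
                return False              # node j receives arcs from two nodes
    for start in range(len(literals)):    # follow parent links to detect cycles
        seen, node = set(), start
        while node in parent:
            if node in seen:
                return False
            seen.add(node)
            node = parent[node]
    return True

q4 = [("employee", ("e",), ("d", "c")), ("city", ("c",), ("p",)),
      ("prov", ("p",), ("Canada",)), ("dept", ("d",), ("Peter",))]
qnk = [("r1", ("x",), ("y",)), ("r2", ("x2",), ("y",))]
partial = [("r1", ("x",), ("w",)), ("r2", ("m", "w"), ("z",))]
print(in_cforest(q4, {"e"}), in_cforest(qnk, set()), in_cforest(partial, set()))
# True False False
```

The query `partial` is rejected because the join from r1 reaches only part of the key of r2, and `qnk` is rejected because its nonkey-to-nonkey join creates a cycle.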
Figure 3.1: Cyclic join graphs of intractable queries
A fundamental observation about Cforest is that it is a very common, practical class
of queries. Arguably, the most commonly used joins go from a set of nonkey attributes of
one relation (which may be a foreign key)² to the key of another relation (which may be
a primary key). Furthermore, such joins typically involve the entire primary key of the
relation (and, hence, they are full joins in our terms). Finally, cycles are rarely present
in the queries used in practice. Admittedly, the restriction not to have repeated relation
symbols does rule out some common queries (those in which the same relation appears
twice in the from clause of an SQL query). Still, many queries used in practice do not
have repeated relation symbols.
As an empirical observation, only one out of 22 queries in the TPC-H specification
[TPC03], the industry standard for decision support queries, has a nonkey-to-nonkey
join. All the queries in the standard are acyclic, and all the nonkey-to-key joins of the
queries are full.
3.2 Query Rewriting Algorithm
In this section, we present the query rewriting algorithm RewriteForest that works for
the class of conjunctive queries Cforest introduced in the previous section. We start the
presentation with a number of examples that highlight some of the intuition underlying
the algorithm.
In the next example, we illustrate the rewriting for a query consisting of only one
² Notice that we are not dealing with the problem of inconsistency with respect to foreign keys, but
only with respect to key dependencies.
literal. We also show that even for such a simple query, the query itself is not a rewriting
for the problem of computing its own consistent answers.
Example 3.1. As in Example 2.1, consider a schema R with one relation symbol
employee, which has two attributes: emplKey (the name of the employee) and salary.
Furthermore, consider a set Σ consisting of only one constraint stating that the attribute
emplKey is the key of relation employee.
Let q1 be a query that retrieves all the employees from the database that make
a salary of 1000, expressed as q1(e) = employee(e, 1000). First of all, notice that q1
itself is not a query rewriting of CONSISTENT(q1, Σ). Consider a database instance I1 =
{employee(John, 1000), employee(John, 2000)}. It is easy to see that (John) ∈ q1(I1).
However, (John) ∉ consistentΣ(q1, I1) because the repair I = {employee(John, 2000)}
is such that (John) ∉ q1(I).
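Example 3.1 can be replayed with a small sketch. The function `consistent_q1` computes the consistent answers by brute force over the repairs. The function `Q1` evaluates a candidate rewriting directly on the inconsistent instance; we assume it has the shape employee(e, 1000) ∧ ∀s′.(employee(e, s′) → s′ = 1000), which is what RewriteLocal appears to produce for a constant at a nonkey position. All function names are illustrative.

```python
from itertools import product
from collections import defaultdict

def repairs(tuples):
    """One tuple per value of emplKey (the first attribute)."""
    groups = defaultdict(list)
    for t in tuples:
        groups[t[0]].append(t)
    return [set(choice) for choice in product(*groups.values())]

def q1(instance):
    """q1(e) = employee(e, 1000)"""
    return {e for (e, s) in instance if s == 1000}

def consistent_q1(tuples):
    # An answer is consistent if it is returned on every repair.
    return set.intersection(*(q1(r) for r in repairs(tuples)))

def Q1(tuples):
    """Assumed rewriting: e earns 1000 in some tuple, and every tuple
    with key e has salary 1000. No repair is ever built."""
    return {e for (e, s) in tuples
            if s == 1000 and all(s2 == 1000 for (e2, s2) in tuples if e2 == e)}

I1 = [("John", 1000), ("John", 2000)]
print(q1(I1), consistent_q1(I1), Q1(I1))   # {'John'} set() set()
```

On this instance the original query returns John, whereas both the brute-force computation and the candidate rewriting return no answers.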
We now proceed to present RewriteForest, the query rewriting algorithm for queries
in Cforest (shown in Figures 3.2, 3.3, and 3.4). Given a query q such that q ∈ Cforest
and a set of key constraints Σ (containing one key per relation), RewriteForest(q, Σ)
returns a first-order rewriting Q for the problem of obtaining the consistent answers
for q with respect to Σ. The main procedure of the algorithm is shown in Figure 3.2.
The first-order rewriting Q that it returns is obtained as the conjunction of the input
query q, and a new query called Qconsist. The query Qconsist is used to ensure that q is
satisfied in every repair. It is important to notice that Qconsist will be applied directly to
the inconsistent database (i.e., we will never explicitly generate the repairs). The query
Qconsist is obtained by recursion on the tree structure of each of the components of the
join graph of q (recall that since q is in Cforest, the join graph is a forest). The recursive
procedure is called RewriteTree, and is shown in Figure 3.3.
The first part of RewriteTree produces a rewriting Qlocal for the literal R(~x, ~y) at the
root of the input tree. This rewriting is done independently of the rest of the query, and
it is produced by the procedure RewriteLocal (shown in Figure 3.4). The query Qlocal
deals with the constants that appear in ~y in the same way as we illustrated in Example
3.1. It also deals with the free variables that appear at nonkey positions of the query in
the way that we illustrate in the next example.
Example 3.3. Consider the query q3 that retrieves all employees and their salaries from
the database, expressed as q3(e, s) = employee(e, s). Notice that the only difference with
the query q1 of Example 3.1 is that the constant 1000 is replaced by the free variable
Algorithm RewriteForest(q, Σ)
Input: q(~z), a query of the form ∃~w.φ(~w, ~z)
       Σ, a set of key constraints, one per relation used in q
Output: Q, a first-order query that computes consistentΣ(q, I) for every database I
Let G be the join graph of q
Let T1, . . . , Tm be the connected components of G
for i := 1 to m do
    Let Ri(~xi, ~yi) be the literal at the root of Ti
    Let φi be the conjunction of literals of Ti
    Let ~wi = {w : w is a variable that occurs in φi and ~w, and w ∉ ~xi}
    Let ~zi = {z : z is a variable that occurs in φi and ~z, and z ∉ ~xi}
    Let qi(~xi, ~zi) = ∃~wi.φi(~xi, ~wi, ~zi)
    Let Qi(~xi, ~zi) = RewriteTree(qi, Σ)
end for
Let Qconsist(~w, ~z) = ⋀_{i=1...m} Qi(~xi, ~zi)
Let Q(~z) = ∃~w.(φ(~w, ~z) ∧ Qconsist(~w, ~z))
return Q
Figure 3.2: Query rewriting algorithm for conjunctive queries in Cforest
s. The algorithm RewriteLocal creates a new, universally-quantified variable s′ for the
free variable s, and equates s′ to s. The resulting query rewriting for q3 is the following:
The second part of RewriteTree recursively creates a query Qi for each subtree Ti
of T rooted at R. Let ~y0 be the variables at nonkey positions of R (excluding those
that also appear in ~x). Then, one of the conjuncts of the rewritten query returned by
RewriteTree is of the form ∀~y0.(R(~x, ~y) → ⋀_{i=1...m} Qi(~xi, ~zi)). Notice that the variables of
~y0 (i.e., the variables at nonkey positions of the root literal R) are universally quantified.
The intuition behind this is that, as we illustrated in Example 3.2, the query must
be satisfied by all the nonkey values of a given key (in that example, all the possible
departments for the given employee).
Algorithm RewriteTree(q, Σ)
Input: q(~x, ~z), a query in Cforest of the form ∃~w.φ(~x, ~w, ~z),
       whose join graph T is a tree with root literal R(~x, ~y)
       Σ, a set of key constraints, one per relation
Output: Q, a first-order query that computes consistentΣ(q, I) for every database I
Let T be the join graph of q
Let R(~x, ~y) be the literal at the root node of T
Let qlocal(~x, ~z) = ∃~w.R(~x, ~y)
Let Qlocal(~x, ~z) = RewriteLocal(qlocal, Σ)
if φ has exactly one literal then
    Q = Qlocal
else
    Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the children of R in T
    for i := 1 to m do
        Let Ti be the subtree of T rooted at Ri
        Let φi be the conjunction of literals of Ti
        Let ~wi = {w : w is a variable that occurs in φi and ~w, and w ∉ ~xi}
        Let ~zi = {z : z is a variable that occurs in φi and ~z, and z ∉ ~xi}
        Let qi(~xi, ~zi) = ∃~wi.φi(~xi, ~wi, ~zi)
        Let Qi(~xi, ~zi) = RewriteTree(qi, Σ)
    end for
    Let ~y0 = {y : y is a variable that occurs in ~y and ~w, and y ∉ ~x}
    Let Q(~x, ~z) = Qlocal(~x, ~z) ∧ ∀~y0.(R(~x, ~y) → ⋀_{i=1...m} Qi(~xi, ~zi))
end if
return Q
Figure 3.3: Recursive algorithm on the tree structure of the join graph
The next example illustrates an application of the algorithm.
Example 3.4. Let R be a schema with four relation symbols: employee, dept, city,
and prov. Assume that employee has three attributes: emplKey (employee name),
cityFKey (city name), and deptFKey (department name); dept has two attributes:
deptKey (department name) and mgrName (manager’s name); city has two attributes:
cityKey and provFKey; and prov has two attributes: provKey (province name) and
countryName (country name). Assume that there are four key constraints in Σ, stating
that emplKey is the key of the relation employee; cityKey is the key of relation city;
deptKey is the key of the relation dept; and provKey is the key of the relation prov.
Consider a query q4 that retrieves the names of all employees that are located in
Algorithm RewriteLocal(q, Σ)
Input: q(~x, ~z), a query of the form ∃~w.R(~x, ~y), where
       none of the variables of ~w appear in ~x
       Σ, a set of key constraints
Let σ be an injective function mapping natural numbers to variables not present in R
Initialize Eq as an empty set
for each position p of ~y do
    Let w be the variable that appears at position p of ~y
    Let z = σ(p)
    if there is a constant d at position p of ~y then
        Add the equality z = d to Eq
    end if
    if w appears in ~x or w appears in ~z then
        Add the equality z = w to Eq
    end if
    for every position p′ of ~y such that p ≠ p′ and w occurs in ~y at position p′ do
        Let z′ = σ(p′)
        Add the equality z = z′ to Eq
    end for
end for
if Eq ≠ ∅ then
    Let ~y∗ be a vector of variables of the same arity as ~y, and
        such that if z is at position p of ~y∗, then σ(p) = z
    Let Qeq be the conjunction of the equalities of Eq
    Let Qlocal(~x, ~z) = ∃~w.R(~x, ~y) ∧ ∀~y∗.(R(~x, ~y∗) → Qeq)
else
    Let Qlocal(~x, ~z) = ∃~w.R(~x, ~w)
end if
return Qlocal
Figure 3.4: Query rewriting for a given literal
Figure 3.5: Join graph of query q4.
Canada and whose manager is Peter:
q4(e) = ∃d, c, m, p. employee(e, d, c) ∧ city(c, p) ∧ prov(p, Canada) ∧ dept(d, Peter)
The join graph of q4 is given in Figure 3.5. Notice that the join graph of q4 is a tree.
Furthermore q4 has full nonkey-to-key joins and no repeated relation symbols. Thus, q4
is in Cforest.
Let q′′ be the query q′′(c) = ∃p.city(c, p) ∧ prov(p, Canada); let q′′′ be the query
q′′′(p) = prov(p, Canada); and let qIV (d) = dept(d, Peter). The first-order query rewrit-
ing Q4 of q4 is obtained by applying the algorithm RewriteForest(q4, Σ) as follows.
Q4(e) = ∃d, c, m, p. employee(e, d, c) ∧ dept(d, m) ∧ city(c, p) ∧ prov(p, Canada) ∧ Qconsist(e)
where:
Qconsist(e) = RewriteTree(q4, Σ) =
    ∃d, c. employee(e, d, c) ∧ ∀d, c.(employee(e, d, c) → (Q′′(c) ∧ QIV(d)))
Q′′(c) = RewriteTree(q′′, Σ) =
    ∃p. city(c, p) ∧ ∀p.(city(c, p) → Q′′′(p))
Q′′′(p) = RewriteTree(q′′′, Σ) =
    prov(p, Canada) ∧ ∀w′.(prov(p, w′) → w′ = Canada)
QIV(d) = RewriteTree(qIV, Σ) =
    dept(d, Peter) ∧ ∀u′.(dept(d, u′) → u′ = Peter)
Notice the reuse of variables in the rewritten queries. In particular, each existentially-
quantified variable of q4 that appears at a nonkey position in a literal of q4 is first
existentially quantified, and then universally quantified in the rewriting Q4.
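The recursion behind Example 3.4 can be sketched as a small formula generator. This is a simplified reading of Figures 3.2 to 3.4, not a faithful implementation: it assumes the input is already in Cforest, represents literals as (relation, key terms, nonkey terms), treats terms starting with a lowercase letter as variables, names fresh variables v0, v1, ... (assumed not to clash with query variables), and emits each repeated-variable equality once rather than in both directions.

```python
def is_var(term):
    # Assumed convention: variables start with a lowercase letter.
    return term[0].islower()

def atom(name, key, nonkey):
    return f"{name}({', '.join([*key, *nonkey])})"

def existential(terms, key, free):
    return [t for t in dict.fromkeys(terms)
            if is_var(t) and t not in key and t not in free]

def rewrite_local(lit, free):
    name, key, nonkey = lit
    fresh = [f"v{p}" for p in range(len(nonkey))]      # plays the role of sigma
    eqs = []
    for p, term in enumerate(nonkey):
        if not is_var(term) or term in key or term in free:
            eqs.append(f"{fresh[p]} = {term}")
        eqs += [f"{fresh[p]} = {fresh[p2]}" for p2 in range(p + 1, len(nonkey))
                if nonkey[p2] == term and is_var(term)]
    ex = existential(nonkey, key, free)
    out = (f"exists {','.join(ex)}." if ex else "") + atom(name, key, nonkey)
    if eqs:
        out += (f" and forall {','.join(fresh)}."
                f"({atom(name, key, fresh)} -> {' and '.join(eqs)})")
    return out

def rewrite_tree(i, lits, children, free):
    name, key, nonkey = lits[i]
    out = rewrite_local(lits[i], free)
    if children[i]:
        y0 = existential(nonkey, key, free)
        subs = " and ".join(f"({rewrite_tree(j, lits, children, free)})"
                            for j in children[i])
        out += f" and forall {','.join(y0)}.({atom(name, key, nonkey)} -> {subs})"
    return out

def rewrite_forest(lits, free):
    children = {i: [] for i in range(len(lits))}
    has_parent = set()
    for i, (_, _, nonkey_i) in enumerate(lits):
        for j, (_, key_j, nonkey_j) in enumerate(lits):
            if i != j and any(w in [*key_j, *nonkey_j]
                              for w in existential(nonkey_i, (), free)):
                children[i].append(j)
                has_parent.add(j)
    roots = [i for i in children if i not in has_parent]
    all_terms = [t for _, key, nonkey in lits for t in [*key, *nonkey]]
    ex = sorted(existential(all_terms, (), free))
    prefix = f"exists {','.join(ex)}." if ex else ""
    phi = " and ".join(atom(*lit) for lit in lits)
    consist = " and ".join(f"({rewrite_tree(r, lits, children, free)})" for r in roots)
    return f"{prefix}({phi} and {consist})"

q4 = [("employee", ["e"], ["d", "c"]), ("city", ["c"], ["p"]),
      ("prov", ["p"], ["Canada"]), ("dept", ["d"], ["Peter"])]
Q4 = rewrite_forest(q4, {"e"})
print(Q4)
```

The printed formula has the same nesting as Qconsist, Q′′, Q′′′ and QIV above, with the fresh variable v0 in place of w′ and u′.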
Recall that queries with repeated relation symbols are not allowed in the class Cforest.
We now give an example of a query with repeated relation symbols for which our al-
gorithm fails to give the consistent answers. Although not addressed in this work, it
would be interesting to characterize the class of queries with repeated relation symbols
for which our algorithm is indeed correct.
Example 3.5. Let R be a schema with one relation symbol r, which has three attributes:
A,B, C. Assume that A is the key of the relation r. Let q be the Boolean query
q = ∃x, y, z.r(x, y, a) ∧ r(y, z, b), where a and b are constants. If we apply our query
rewriting algorithm, we obtain the following:
Q = ∃x, y, z. r(x, y, a) ∧ r(y, z, b) ∧ ∀y′, z′.(r(x, y′, z′) → z′ = a) ∧
∀y.(r(x, y, a) → ∃z. r(y, z, b) ∧ ∀z′, w′.(r(y, z′, w′) → w′ = b))
Let I be the database instance I = {r(c, d, a), r(d, e, b), r(d, f, a), r(f, g, b)}. In this
case, there are two repairs of I with respect to Σ: I1 = {r(c, d, a), r(d, e, b), r(f, g, b)}
and I2 = {r(c, d, a), r(d, f, a), r(f, g, b)}. Clearly, I1 |= q and I2 |= q. However, I ⊭ Q.
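The claims of Example 3.5 can be checked with a short brute-force sketch. The function `holds_Q` hand-codes the rewritten query, reading its last equality as a condition on the third attribute of r (that is, w′ = b); the helper names are illustrative.

```python
from itertools import product
from collections import defaultdict

def repairs(rel):
    groups = defaultdict(list)
    for t in rel:
        groups[t[0]].append(t)           # attribute A (first position) is the key
    return [set(choice) for choice in product(*groups.values())]

def holds_q(inst):
    """q = exists x, y, z. r(x, y, a) and r(y, z, b)"""
    return any(c == "a" and any(k == y and c2 == "b" for (k, _, c2) in inst)
               for (_, y, c) in inst)

def holds_Q(inst):
    def child_ok(y):
        # exists z. r(y, z, b), and every tuple with key y has b in third position
        rows = [t for t in inst if t[0] == y]
        return any(t[2] == "b" for t in rows) and all(t[2] == "b" for t in rows)
    for (x, y, c) in inst:
        if c != "a" or not any(t[0] == y and t[2] == "b" for t in inst):
            continue                     # need r(x, y, a) and r(y, z, b)
        rows_x = [t for t in inst if t[0] == x]
        if all(t[2] == "a" for t in rows_x) and all(child_ok(t[1]) for t in rows_x):
            return True
    return False

I = [("c", "d", "a"), ("d", "e", "b"), ("d", "f", "a"), ("f", "g", "b")]
print([holds_q(r) for r in repairs(I)], holds_Q(I))   # [True, True] False
```

Both repairs satisfy q, yet the rewritten query evaluates to false on I, which is the mismatch the example describes.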
We finish this section by pointing out that the complexity of the query rewriting
algorithm is linear in the number of literals of the input query. To see this, notice that
the algorithm visits each node of the join graph exactly once.
3.3 Correctness of the Algorithm
In this section, we show that the algorithm RewriteForest presented in the previous
section is correct for all queries in the class Cforest. In particular, we prove the following
theorem.
Theorem 3.5. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z) be a conjunctive query over R such that
q ∈ Cforest. Let Q(~z) be the first-order query returned by RewriteForest(q, Σ). Let I be
an instance over R.
Then, ~t ∈ Q(I) iff ~t ∈ consistentΣ(q, I).
Our proof relies on a few simple properties of repairs of inconsistent databases where
the set of integrity constraints contains a single key dependency per relation. We establish
these properties in Section 3.3.1. In Section 3.3.2, we show a structural property of the
queries in Cforest that is important in order to guarantee the correctness of the algorithms
RewriteTree and RewriteForest: the literals from distinct trees of the join graph may
only share variables that appear as key attributes at the root of their trees.
In Section 3.3.3, we introduce the notion of a “pessimistic” repair. The name comes
from the fact that, for a given query q and database I, if a tuple fails to satisfy the query
on some repair, then it also fails to satisfy the query on the pessimistic repair. More
precisely, for any inconsistent database I, there is a repair M such that if M |= q(~c),
then consistentΣ(q(~c), I) = true. This enables the algorithm to independently consider
each instantiation of the variables for the key of the root literal.
We then proceed to prove the correctness of the building blocks of the rewriting
algorithm. First, in Section 3.3.4, we prove the correctness of the module RewriteLocal,
for “atomic” queries, that is queries with a single literal (and hence no joins). In Section
3.3.5, we prove the correctness of the recursive algorithm RewriteTree that works on
queries whose join graph is a tree. Finally, in Section 3.3.6, this is generalized to the case
of queries whose join graph is a forest, which gives the correctness proof for the rewriting
algorithm RewriteForest for conjunctive queries in class Cforest.
3.3.1 Properties of Repairs
We first show a few important properties of repairs when the set of integrity constraints
consists of one key dependency per relation. These properties will be used throughout
the proofs of this and the next chapter.
Proposition 3.6. Let I be a database instance. Let I′ be a repair of I wrt Σ. Then
I′ ⊆ I.
Proof. Let I ′ be an instance such that I ′ |= Σ. Assume that there is a tuple ~t such that
~t ∈ I′ and ~t ∉ I. Let I′′ = I′ − {~t}. It is easy to see that by removing tuples from
an instance, we do not introduce violations with respect to a set of key dependencies.
Hence, I ′′ |= Σ. Clearly, ∆(I, I ′′) ⊂ ∆(I, I ′). Therefore, I ′ is not a repair of I wrt Σ.
Proposition 3.7. Let I be an instance. Let I′ be a repair of I wrt Σ. Let R(~c, ~d) be a
tuple of I. Then, there exists some ~d′ such that R(~c, ~d′) is a tuple of I′.
Proof. Let I′ be an instance such that I′ |= Σ and R(~c, ~d′) ∉ I′, for every ~d′. Let
I′′ = I′ ∪ {R(~c, ~d)}. Since R(~c, ~d′) ∉ I′ for every ~d′, I′′ |= Σ. Clearly, ∆(I, I′′) =
∆(I, I ′)− {R(~c, ~d)}. Since ∆(I, I ′′) ⊂ ∆(I, I ′), I ′ is not a repair of I wrt Σ.
Proposition 3.8. Let I be an instance. Let R(~c, ~d) be a tuple of I. Then, there exists
some repair I′ of I such that R(~c, ~d) ∈ I′.
Proof. Let I∗ be a repair of I wrt Σ. By Proposition 3.7, there exists ~d′ such that
R(~c, ~d′) ∈ I∗. Let I′ = (I∗ − {R(~c, ~d′)}) ∪ {R(~c, ~d)}. Since I∗ is a repair, I∗ |= Σ. Since I′
does not introduce any violation to the key dependencies of Σ, I ′ |= Σ. Assume that I ′
is not a repair of I. Then, there exists a repair I∗∗ of I such that ∆(I, I∗∗) ⊂ ∆(I, I ′).
By Proposition 3.6, I∗ ⊆ I, and thus I ′ ⊂ I. Furthermore, by Proposition 3.6, I∗∗ ⊆ I.
Thus, I − I∗∗ ⊂ I − I′. Therefore, I′ ⊂ I∗∗. Let I′′ = (I∗∗ − {R(~c, ~d)}) ∪ {R(~c, ~d′)}.
Clearly, I∗ ⊂ I′′. Thus, I∗ is not a repair; contradiction.
3.3.2 A Structural Property of Cforest
In the next lemma, we show a structural property of the queries in Cforest that is important
in order to guarantee the correctness of the algorithm. In particular, we show that distinct
trees of the join graph may only share free variables (which do not contribute arcs to the
join graph) or variables that appear as key attributes at the root of their trees.
Lemma 3.9. Let q(~z) be a query such that q ∈ Cforest. Let G be the join graph of q.
Let Ti and Tj be distinct connected components of G. Let Ri(~xi, ~yi) and Rj(~xj, ~yj) be the
literals at the roots of Ti and Tj, respectively. Let w be a variable that occurs in a literal
of both Ti and Tj. Then, either w is free (w ∈ ~z) or w is in the key of the roots of both
trees (w ∈ ~xi ∩ ~xj).
Proof. Let ~wi = {w : w is a variable that occurs in some literal of Ti, w ∉ ~xi and w ∉ ~z}.
Let ~wj = {w : w is a variable that occurs in some literal of Tj, w ∉ ~xj and w ∉ ~z}.
Assume that there is some variable w such that w appears in ~wi and ~wj. Let S1(~u1, ~v1)
and S2(~u2, ~v2) be literals of Ti and Tj, respectively such that w appears in S1 and S2.
We must now consider the next two cases. First, suppose that w occurs in ~v1. Then,
by definition of join graph, there is an arc from S1 to S2 in G. But S1 and S2 are in
distinct connected components of G; contradiction. Second, suppose that w occurs in
~u1. By definition of ~wi, S1 is not at the root of Ti (i.e., S1 ≠ Ri). Hence, there must
be a nonkey-to-key join from another literal, S3(~u3, ~v3), in Ti to S1. Since q is in Cforest,
all the nonkey-to-key joins of q are full. Thus, the variable w also appears in a nonkey
position in ~v3. Hence, there must be an arc in the join graph from S3 to S2. But S2 and
S3 are in distinct connected components of G; contradiction.
3.3.3 A “Pessimistic” Repair
In this subsection, we introduce the notion of a “pessimistic” repair. The name comes
from the fact that, for a given query q (in a class that we will define shortly) and database
I, if a tuple fails to satisfy the query on some repair, then it also fails to satisfy the query
on the pessimistic repair. More precisely, for every inconsistent database I, there is a
repair M such that if ~c ∈ q(M), then ~c ∈ consistentΣ(q, I). This is a fundamental
property for the following reason. Consider a Boolean query q = ∃~x, ~w.φ(~x, ~w) and a
query q′(~x) = ∃~w.φ(~x, ~w). That is, q and q′ have the same literals, but some of the
(existentially-quantified) variables of q are free in q′. Suppose that we would like to
check whether consistentΣ(q, I) = true. This holds if, for every repair I′ of I, I′ |= q. In
particular, since M is a repair of I, M |= q. Thus, there is some ~c such that ~c ∈ q′(M).
By Lemma 3.10 below, it follows that ~c ∈ consistentΣ(q′, I). This property will be
exploited in the design of our algorithms in order to check the consistency of each tuple
of ~x independently. Notice that the property does not hold in general for conjunctive
queries, as we show in the next example. However, it does hold for the queries that
satisfy the conditions of Lemma 3.10.
Example 3.6. Consider a schema R with two binary relations r1 and r2. Consider a set Σ
that consists of a key dependency for r1 and a key dependency for r2 (the key dependencies
will be obvious from the queries). Let qnk be the Boolean query ∃x, x′, y.r1(x, y)∧r2(x′, y).
Notice that qnk is not in Cforest because it contains a nonkey-to-nonkey join. Let I be an
instance such that I = {r1(a1, b1), r1(a1, b2), r1(a2, b3), r1(a2, b4), r1(a3, b5),
r1(a3, b3), r2(c1, b1), r2(c1, b3), r2(c2, b4), r2(c2, b5), r2(c3, b2), r2(c3, b3)}. It can be checked
that for every repair I′ of I, I′ |= qnk.
Now, consider the query q′nk(x) = ∃x′, y.r1(x, y)∧ r2(x′, y). That is, qnk and q′nk differ
only in the fact that x is existentially-quantified in the former, and free in the latter. Let
I1 be a repair of I such that I1 = {r1(a1, b1), r1(a2, b3), r1(a3, b5), r2(c1, b3), r2(c2, b4), r2(c3, b3)}.
Let I2 be a repair of I such that I2 = {r1(a1, b1), r1(a2, b3), r1(a3, b5), r2(c1, b1), r2(c2, b4),
r2(c3, b2)}. Notice that (a1) ∉ q′nk(I1), (a2) ∉ q′nk(I2), and (a3) ∉ q′nk(I1). Thus, even
though consistentΣ(qnk, I) = true, we have that (a) ∉ consistentΣ(q′nk, I),
for every a. Therefore, it is not possible to check whether consistentΣ(qnk, I) = true
by independently checking each instantiation of the free variables of q′nk.
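Both claims of Example 3.6 can be confirmed by enumerating the 64 repairs of the instance. The sketch below assumes, as the queries suggest, that the first attribute is the key of both r1 and r2; the helper names are illustrative.

```python
from itertools import product
from collections import defaultdict

def repairs(rel):
    """All ways of keeping exactly one tuple per key value (first attribute)."""
    groups = defaultdict(list)
    for t in rel:
        groups[t[0]].append(t)
    return [set(choice) for choice in product(*groups.values())]

r1 = [("a1", "b1"), ("a1", "b2"), ("a2", "b3"), ("a2", "b4"), ("a3", "b5"), ("a3", "b3")]
r2 = [("c1", "b1"), ("c1", "b3"), ("c2", "b4"), ("c2", "b5"), ("c3", "b2"), ("c3", "b3")]

def q_prime(i1, i2):
    """q'_nk(x): the values x such that r1(x, y) and r2(x', y) for some x', y."""
    ys = {y for (_, y) in i2}
    return {x for (x, y) in i1 if y in ys}

answers = [q_prime(i1, i2) for i1 in repairs(r1) for i2 in repairs(r2)]
boolean_consistent = all(answers)                  # q_nk holds iff q'_nk is nonempty
per_tuple_consistent = set.intersection(*answers)  # consistent answers of q'_nk
print(boolean_consistent, per_tuple_consistent)    # True set()
```

Every repair satisfies the Boolean query, but a different value of x witnesses it in different repairs, so no single value is a consistent answer.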
The result that we give below assumes an input query q(~x) that is in Cforest, whose
join graph T is a tree, and whose free variables ~x are exactly the variables of the key of T ’s
root. In the algorithm RewriteForest, the input query will be broken into subqueries
that satisfy this condition.
Lemma 3.10. Let q(~x) be a query in Cforest, whose join graph T is a tree and where
R(~x, ~y) is the literal at the root of T . Let I be an instance. Then, there is a repair M
such that for all ~c, if ~c ∈ q(M), then ~c ∈ consistentΣ(q, I).
Proof. Let M be the instance built by invoking the procedure
BuildPessimisticRepair(q, I) given in Figure 3.6. Assume that q is of the form
q(~x) = ∃~w.φ(~w, ~x). We will prove the claim by induction on the number of literals of φ.
Base case. Assume that φ consists of exactly one literal R(~x, ~y). Let ~t be the tuple
selected by the algorithm in the iteration for literal R and the vector of values ~c. Assume
towards a contradiction that consistentΣ(∃~w.R(~x, ~y)[~x/~c], I) = false. Then, there is
some repair I′ of I such that I′ ⊭ ∃~w.R(~x, ~y)[~x/~c]. Since ~t ∈ I and I′ is a repair of I,
by Proposition 3.7, there is some tuple ~t′ in I′ and some ~d′ such that ~t′ = R(~c, ~d′). Since
I′ ⊭ ∃~w.R(~x, ~y)[~x/~c], we have that {~t′} ⊭ ∃~w.R(~x, ~y)[~x/~c].
Notice that ~t and ~t′ can be added to M only during the iteration for the vector of
values ~c. Since {~t} |= ∃~w.R(~x, ~y)[~x/~c] and {~t′} ⊭ ∃~w.R(~x, ~y)[~x/~c], the algorithm never
selects tuple ~t. But ~t ∈ M; contradiction.
Inductive step. Assume that φ has more than one literal. Let T1, . . . , Tm be the
subtrees of T such that the root of Tj is a child of the root of T , for 1 ≤ j ≤ m. For each
1 ≤ j ≤ m, let Sj(~xj, ~yj) be the literal at the root of Tj. Let φj be the conjunction of
the literals of Tj. Let ~wj = {w : w is a variable of φj, and w ∉ ~xj}. Let qj(~xj) = ∃~wj.φj(~xj, ~wj).
Let Mj = BuildPessimisticRepair(qj, I).
Assume that M |= q(~x)[~x/~c]. Let ~t be the tuple of I selected by the algorithm in
the iteration for literal R and the vector of values ~c. Then, ~t ∈ M, and there is some
~d such that ~t = R(~c, ~d). Since M |= q(~x)[~x/~c], we have that for every j such that
1 ≤ j ≤ m, there is some valuation ν for the variables of ~y, and some ~cj such that
ν(~y) = ~d, ν(~xj) = ~cj, and Mj |= qj(~xj)[~xj/~cj].
Algorithm BuildPessimisticRepair
Input: q(~x), a query in Cforest of the form ∃~w.φ(~w, ~x),
       whose join graph T is a tree with root literal R(~x, ~y)
       Σ, a set of key constraints, one per relation
       I, an instance
Output: M, a repair of I
Initialize M as an empty instance
if φ has exactly one literal then
    for each ~c such that there is some R(~c, ~d) in I do
        if there is some ~d such that R(~c, ~d) ∈ I,
           and {R(~c, ~d)} ⊭ ∃~w.R(~x, ~y)[~x/~c] then
            Let ~t = R(~c, ~d)
        else
            Let ~t be any tuple of I such that ~t = R(~c, ~d), for some ~d
        end if
        Add ~t to M
    end for
else
    /* φ has more than one literal */
    Let S1, . . . , Sm be the children of R in T
    for j := 1 to m do
        Let Tj be the subtree of T whose root is Sj
        Let φj be the conjunction of literals of Tj
        Let ~wj = {w : w is a variable that occurs in φj and ~w, and w ∉ ~xj}
        Let qj(~xj) = ∃~wj.φj(~xj, ~wj)
        Let Mj = BuildPessimisticRepair(qj, I)
        Add Mj to M
    end for
    for each ~c such that there is some R(~c, ~d) in I do
        if there is some ~d, some j, some valuation ν for the variables of ~y,
           and some ~cj such that R(~c, ~d) ∈ I, ν(~y) = ~d, ν(~xj) = ~cj, and
           Mj ⊭ qj(~xj)[~xj/~cj] then
            Let ~t = R(~c, ~d)
        else
            Let ~t be any tuple of I such that ~t = R(~c, ~d), for some ~d
        end if
        Add ~t to M
    end for
end if
Figure 3.6: Algorithm to construct a “pessimistic” repair
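The base case of BuildPessimisticRepair (a query with a single literal) can be illustrated with a minimal Python sketch. It assumes a single relation whose tuples are (key, nonkey) pairs, and a caller-supplied predicate `satisfies` that stands in for the test {R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c]. The function name, the tuple encoding, and the sample data are illustrative assumptions; the recursive case of Figure 3.6 is not covered.

```python
from collections import defaultdict


def build_pessimistic_repair(instance, satisfies):
    """Base case only: one relation, tuples encoded as (key, nonkey).

    For each key value, prefer a tuple that falsifies the query; fall back
    to an arbitrary tuple with that key when every tuple satisfies it.
    """
    by_key = defaultdict(list)
    for key, nonkey in instance:
        by_key[key].append((key, nonkey))

    repair = []
    for key, group in by_key.items():
        falsifying = [t for t in group if not satisfies(t)]
        # Exactly one tuple per key is kept, so the key constraint holds.
        repair.append(falsifying[0] if falsifying else group[0])
    return repair


# Hypothetical data: employee(name, salary) with key "name".
employee = [("Mary", 1000), ("Ali", 1000), ("John", 1000), ("John", 2000)]
pessimistic = build_pessimistic_repair(employee, lambda t: t[1] == 1000)
```

In this sketch, a key value satisfies the query in the returned repair only if all of its tuples in the original instance satisfy it, which is the property that Lemma 3.10 relies on.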
Assume towards a contradiction that consistentΣ(q(~x)[~x/~c], I) = false. Then, there
is some repair I of I such that I ⊭ q(~x)[~x/~c]. Since ~t ∈ I and I is a repair of I, by
Proposition 3.7, there is some tuple ~t′ in I and some ~d′ such that ~t′ = R(~c, ~d′). By Lemma
3.9, none of the variables of ~wi appear in ~wj, for every i and j such that i ≠ j, 1 ≤ i ≤ m,
1 ≤ j ≤ m. Thus, there is some j, some valuation ν for the variables of ~y, and some tuple
of values ~c′j such that 1 ≤ j ≤ m, I ⊭ qj(~xj)[~xj/~c′j], ν(~y) = ~d′, and ν(~xj) = ~c′j. Thus,
consistentΣ(qj(~xj)[~xj/~c′j], I) = false. By inductive hypothesis, Mj ⊭ qj(~xj)[~xj/~c′j].
Since Mj |= qj(~xj)[~xj/~cj], the algorithm never selects ~t in the construction of M. But
~t ∈ M; contradiction.
3.3.4 Correctness of RewriteLocal
We now give a correctness proof of RewriteLocal, the module of the algorithm that
handles “atomic” queries, that is queries with a single literal (and hence no joins). These
atomic queries may have arbitrary selections and projections on any subset of the nonkey
attributes (more precisely, any of the nonkey attributes may be projected out of the
query result). We consider here only equality selections, but it is quite easy to see how to
extend the algorithm and the proof to more general selection conditions (including not
only inequalities, but also arbitrary first-order expressions relating the variables of the
literal).
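The condition that the rewriting of an atomic query must capture can be sketched in Python as follows. This is not the RewriteLocal algorithm itself; it is a minimal illustration, under the assumption of a single relation encoded as (key, nonkey) pairs and a hypothetical `selection` predicate, of the semantic test "some tuple with the given key exists, and every tuple with that key satisfies the selection". The brute-force function enumerates all repairs to cross-check the test on small instances.

```python
from collections import defaultdict
from itertools import product


def group_by_key(instance):
    """Group (key, nonkey) tuples by their key value."""
    groups = defaultdict(list)
    for key, nonkey in instance:
        groups[key].append((key, nonkey))
    return groups


def consistent_atomic(instance, key, selection):
    """Local test: a tuple with this key exists and all such tuples pass."""
    group = group_by_key(instance).get(key, [])
    return bool(group) and all(selection(t) for t in group)


def repairs(instance):
    """Enumerate repairs: pick exactly one tuple for every key value."""
    groups = group_by_key(instance)
    for choice in product(*groups.values()):
        yield list(choice)


def consistent_by_enumeration(instance, key, selection):
    """Reference semantics: the answer must hold in every repair."""
    return all(
        any(t[0] == key and selection(t) for t in repair)
        for repair in repairs(instance)
    )


# Hypothetical data: employee(name, salary) with key "name".
employee = [("Mary", 1000), ("Ali", 1000), ("John", 1000), ("John", 2000)]
earns_1000 = lambda t: t[1] == 1000
```

The local test inspects only the tuples that share the key value, whereas the reference function is exponential in the number of conflicting keys.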
Lemma 3.11. Let q(~x, ~z) be a query of the form ∃~w.R(~x, ~y). Let I be a database instance.
Let Qlocal(~x, ~z) be the first-order query returned by RewriteLocal(q, Σ). Then,
I |= Qlocal(~x, ~z)[~x/~c][~z/~t] iff consistentΣ(q(~x, ~z)[~x/~c][~z/~t], I) = true.
Proof. (⇒) Assume that I |= Qlocal(~x, ~z)[~x/~c][~z/~t]. Then, there is a tuple R(~c, ~d) such
that {R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Assume towards a contradiction that
consistentΣ(∃~w.R(~x, ~y)[~x/~c][~z/~t], I) = false. Then, there is some repair I such that
I ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t]. By Proposition 3.7, there is a tuple R(~c, ~d′) in I.
Following the construction of Qlocal in RewriteLocal, let σ be an injective function
that maps natural numbers to variables not present in R. Let ~y∗ be a vector of variables
of the same arity as ~y and such that if z is at position p of ~y∗, then σ(p) = z. Let ν and
ν′ be valuations for the variables of ~x and ~y∗ such that ν(~x) = ~c, ν(~y∗) = ~d, ν′(~x) = ~c,
and ν′(~y∗) = ~d′.
Since {R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t] and {R(~c, ~d′)} ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t], there
is some variable z at some position p of ~y∗ such that
1. ν(z) ≠ ν′(z), and there is a constant at position p in ~y; or
2. ν(z) ≠ ν′(z), and there is some variable w such that w occurs at position p of ~y,
and w occurs in either ~x or ~z; or
3. there are variables w and z′, and a position p′ such that w occurs at position p of
~y, w occurs at position p′ of ~y, p ≠ p′, z′ = σ(p′), and ν′(z) ≠ ν′(z′).
Assume (1) that there is a constant d at position p in ~y. Since
{R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t], ν(z) = d. Since ν(z) ≠ ν′(z), there is a constant d′
such that d ≠ d′ and ν′(z) = d′. Notice in the algorithm RewriteLocal that since
I |= Qlocal(~x, ~z)[~x/~c][~z/~t], we have that I |= ∀~y∗.R(~x, ~y∗) → z = d. Since I ⊆ I, R(~c, ~d′) ∈ I.
Thus, {R(~c, ~d′)} |= ∀~y∗.R(~x, ~y∗) → z = d. Therefore, ν′(z) = d; contradiction.
Assume (2) that there is some variable w such that w occurs at position p of ~y,
and w occurs in either ~x or in ~z. Let c = ν(w). Since {R(~c, ~d)} |= ∃~w.R(~x, ~y∗)[~x/~c][~z/~t],
ν(z) = c. Since ν(z) ≠ ν′(z), ν′(z) ≠ c. Notice in the algorithm RewriteLocal that since
I |= Qlocal(~x, ~z)[~x/~c][~z/~t], we have that I |= ∀~y∗.R(~x, ~y∗) → z = w[w/c]. Since I ⊆ I,
R(~c, ~d′) ∈ I. Thus, {R(~c, ~d′)} |= ∀~y∗.R(~x, ~y∗) → z = w[w/c]. Therefore, ν′(z) = c;
contradiction.
Assume (3) that there are variables w and z′, and a position p′ such that w occurs
at position p of ~y, w occurs at position p′ of ~y, p ≠ p′, z′ = σ(p′), and ν′(z) ≠ ν′(z′).
Notice in the algorithm RewriteLocal that since I |= Qlocal(~x, ~z)[~x/~c][~z/~t], we have that
I |= ∀~y∗.R(~x, ~y∗) → z = z′. Since I ⊆ I, R(~c, ~d′) ∈ I. Thus,
{R(~c, ~d′)} |= ∀~y∗.R(~x, ~y∗) → z = z′. Therefore, ν′(z) = ν′(z′); contradiction.
(⇐) Assume that consistentΣ(q(~x, ~z)[~x/~c][~z/~t], I) = true. Assume towards a
contradiction that I ⊭ Qlocal(~x, ~z)[~x/~c][~z/~t]. Then, at least one of the following conditions
holds:
1. I ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t]; or
2. there is a constant d at position p in ~y and a variable z such that z = σ(p) and
I ⊭ ∀~y∗.R(~x, ~y∗) → z = d[~x/~c][~z/~t]; or
3. there is some variable w such that w occurs at position p of ~y, w occurs in either
~x or ~z, and I ⊭ ∀~y∗.R(~x, ~y∗) → z = w[~x/~c][~z/~t]; or
4. there is some variable w that occurs at position p of ~y, and at a position p′ of ~y
such that p ≠ p′, σ(p) = z, σ(p′) = z′ and I ⊭ ∀~y∗.R(~x, ~y∗) → z = z′[~x/~c][~z/~t].
Assume that I ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Let I be an arbitrary repair of I. Since I ⊆ I,
I ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t]; contradiction.
Suppose that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Furthermore, assume that there is a constant
d at position p in ~y and a variable z such that z = σ(p) and
I ⊭ ∀~y∗.R(~x, ~y∗) → z = d[~x/~c][~z/~t]. Then, there is a tuple R(~c, ~d) in I such that
{R(~c, ~d)} ⊭ ∀~y∗.R(~x, ~y∗) → z = d[~x/~c][~z/~t]. This means that there is some constant e
at position p of ~d such that d ≠ e. Thus, {R(~c, ~d)} ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t]. By
Proposition 3.8, there is a repair I of I such that R(~c, ~d) ∈ I. Assume that
I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Let R(~c, ~d′) be a
tuple of I such that {R(~c, ~d′)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Since I is a repair of I, I satisfies
the key constraints of Σ. Thus, ~d = ~d′. Therefore, {R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t];
contradiction.
Suppose that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Furthermore, assume that there is some
variable w such that w occurs at position p of ~y, w occurs in either ~x or ~z, and
I ⊭ ∀~y∗.R(~x, ~y∗) → z = w[~x/~c][~z/~t]. Then, there is a tuple R(~c, ~d) in I such that
{R(~c, ~d)} ⊭ ∀~y∗.R(~x, ~y∗) → z = w[~x/~c][~z/~t]. Let ν be a valuation for the variables of ~x and ~z such
that ν(~x) = ~c and ν(~z) = ~t. Let c = ν(w). Then, there is some constant e at position p of
~d such that c ≠ e. Thus, {R(~c, ~d)} ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t]. By Proposition 3.8, there is a
repair I of I such that R(~c, ~d) ∈ I. Assume that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Let R(~c, ~d′) be
a tuple of I such that {R(~c, ~d′)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Since I is a repair of I, I satisfies
the key constraints of Σ. Thus, ~d = ~d′. Therefore, {R(~c, ~d)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t];
contradiction.
Suppose that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Furthermore, assume that there is some
variable w that occurs at position p of ~y, and at a position p′ of ~y such that p ≠ p′,
σ(p) = z, σ(p′) = z′ and I ⊭ ∀~y∗.R(~x, ~y∗) → z = z′[~x/~c][~z/~t]. Then, there is a
tuple R(~c, ~d) in I such that {R(~c, ~d)} ⊭ ∀~y∗.R(~x, ~y∗) → z = z′[~x/~c][~z/~t]. Let ν
be a valuation for the variables of ~y∗ such that ν(~y∗) = ~d. Then, there are constants
d and e at the respective positions p and p′ of ~d such that d ≠ e. Thus,
{R(~c, ~d)} ⊭ ∃~w.R(~x, ~y)[~x/~c][~z/~t]. By Proposition 3.8, there is a repair I of I such that
R(~c, ~d) ∈ I. Assume that I |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Let R(~c, ~d′) be a tuple of I such that
{R(~c, ~d′)} |= ∃~w.R(~x, ~y)[~x/~c][~z/~t]. Since I is a repair of I, I satisfies the key constraints
We are now ready to give the correctness proof of our rewriting algorithm, for all queries
in class Cforest. The intuition of the proof is the following. Assume that we are given
a query q in Cforest. Then, each of the connected components of the join graph of q
is a tree. Recall that RewriteTree, the algorithm for which we proved correctness in
the above lemma, requires that the input query satisfies the following conditions. First,
the join graph of the query must be a tree. Second, the free variables of the query
must include all the variables at key positions of the literal at the root of this tree.
In order to be able to use RewriteTree, RewriteForest produces a subquery for each
tree of the join graph such that the variables at the key of the corresponding tree’s
root are free. In this way, a first-order rewriting can be produced for each subquery by
invoking the algorithm RewriteTree. For each i, let Qi(~xi, ~zi) be the rewriting obtained
by invoking RewriteTree(qi, Σ). The query returned by RewriteForest has the form
Q(~z) = ∃~w.(φ(~w, ~z) ∧ ⋀i=1...m Qi(~xi, ~zi)), where φ(~w, ~z) is the conjunction of literals of
the original query q, and the variables of each ~xi are in ~w. The correctness of this formula
relies on the structural property of Section 3.3.2 and the notion of a “pessimistic” repair of
Section 3.3.3. First, by Lemma 3.10, it suffices to find one instantiation for the variables
of each ~xi. Thus, the variables of ~xi can be free in Qi. Second, the subqueries do not
share existentially-quantified variables. This is ensured by the structural property proved
in Lemma 3.9.
Theorem 3.5. Let R be a schema. Let Σ be a set of integrity constraints, consisting
of one key dependency per relation of R. Let q(~z) be a conjunctive query over R such
that q ∈ Cforest. Let Q(~z) be the first-order query returned by RewriteForest(q, Σ). Let
I be an instance over R.
Then, ~t ∈ Q(I) iff ~t ∈ consistentΣ(q, I).
Proof. Let G be the join graph of q. Since q ∈ Cforest, G is a forest. Let T1, . . . , Tm be
the connected components (trees) of G. Assume that q is of the form ∃~w.φ(~w, ~z), where
φ is a conjunction of literals. For each 1 ≤ i ≤ m, let Ri(~xi, ~yi) be the literal at the root
of Ti. Let φi be the conjunction of the literals of Ti. Let ~wi = {w : w is a variable that
occurs in φi and ~w, and w ∉ ~xi}. Let ~zi = {z : z is a variable that occurs in φi and ~z,
and z ∉ ~xi}. Let qi(~xi, ~zi) = ∃~wi.φi(~xi, ~wi, ~zi). Let Qi(~xi, ~zi) = RewriteTree(qi, Σ).
(⇒) Assume that I |= Q(~z)[~z/~t]. Then, there is a valuation ν for the variables of φ
such that:
1. ν(~z) = ~t, and
2. I |= φ(~w, ~z)[ν], and
3. for every i such that 1 ≤ i ≤ m, there are ~ci and ~ti such that ν(~xi) = ~ci, ν(~zi) = ~ti,
and I |= Qi(~xi, ~zi)[~xi/~ci][~zi/~ti]
Let I be a repair of I. Assume towards a contradiction that I ⊭ q[~z/~t]. Thus,
I ⊭ q[ν]. By Lemma 3.9, none of the variables of ~wi appear in ~wj, for every i and j
such that i ≠ j, 1 ≤ i ≤ m, 1 ≤ j ≤ m. Then, I ⊭ qi(~xi, ~zi)[~xi/~ci][~zi/~ti] for some i such
that 1 ≤ i ≤ m. Thus, consistentΣ(qi(~xi, ~zi)[~xi/~ci][~zi/~ti], I) = false. By Lemma 3.12,
I ⊭ Qi(~xi, ~zi)[~xi/~ci][~zi/~ti]; contradiction.
(⇐) Assume that ~t ∈ consistentΣ(q, I). Assume towards a contradiction that
I ⊭ Q(~z)[~z/~t]. Let ν be a valuation for the variables of φ such that ν(~z) = ~t. Then, either
(1) I ⊭ q(~z)[ν]; or (2) there is some i such that I ⊭ Qi(~xi, ~zi)[ν].
We will build a repair M of I as follows. For each i, let Ii be the projection of
I on the relation symbols of φi. By Lemma 3.10, there is a repair Mi such that if
Mi |= qi(~xi)[~xi/~ci], then consistentΣ(qi(~xi)[~xi/~ci], Ii) = true. We add all the tuples of
Mi to M.
We now show that M ⊭ q(~z)[ν]. Assume that I ⊭ q(~z)[ν]. Since M ⊆ I,
M ⊭ q(~z)[ν]. Now, assume that there is some i such that 1 ≤ i ≤ m and I ⊭ Qi(~xi, ~zi)[ν]. By
Lemma 3.12, consistentΣ(qi(~xi, ~zi)[ν], I) = false. By Lemma 3.10, Mi ⊭ qi(~xi, ~zi)[ν].
Thus, M ⊭ q(~z)[ν].
So, for every valuation ν such that ν(~z) = ~t, we have that M ⊭ q(~z)[ν]. Thus,
~t ∉ consistentΣ(q, I); contradiction.
3.4 Related Work
In their seminal paper on consistent query answering, Arenas, Bertossi and Chomicki
[ABC99] propose a first-order rewriting algorithm. The algorithm applies to a broad
class of constraints but a restricted class of queries, called quantifier-free conjunctive
queries. In these queries, all variables are free (i.e., there is no existential quantification).
If we think in terms of equivalent SQL queries, the fact that all variables are free means
that every attribute of every relation in the from clause must appear in the select
clause. This is a strong restriction that rules out many practical queries. As an empirical
observation, none of the queries in the TPC-H specification [TPC03], the industry stan-
dard for decision support systems, satisfy this restriction. Chomicki and Marcinkowski
[CM05] propose a rewriting for another restricted class, where no variables are shared
between literals (and therefore, there are no joins). In this chapter, we focused on a class
of conjunctive queries that may have existential quantification, and we argued that the
class captures many queries that arise in practice.
Except for the aforementioned work [ABC99, CM05], to the best of our knowledge,
none of the work in the consistent query answering literature has focused on first-order
rewritings. Instead, they typically produce rewritings into disjunctive logic programs
{(1000, 3)} and q1(I2) = {(1000, 2), (2000, 1)}. By Definition 2.4, aggconsistentΣ(q1, I) =
{(1000, 2, 3)}. That is, the salary 1000 is an answer that appears at least twice and at
most three times in the result of applying q1 on the repairs.
Let us focus on obtaining the greatest lower bound for q1. From the previous chapter,
we know how to obtain consistent answers for conjunctive queries without aggregation
under set-theoretic semantics. We would like to reuse such results here. An obvious
strategy (shown to be incorrect shortly) is to first remove grouping and aggregation
from q1, obtain the consistent answers under set-theoretic semantics, and finally apply
grouping and aggregation to the intermediate result. That is, first compute the consistent
answers for the following query q′1(s):
select s
from employee(e, s)
We can express q′1 in conjunctive query notation as follows: q′1(s) = ∃e. employee(e, s).
Let QConsistent′(s) be the first-order query obtained by applying RewriteForest(q′1, Σ),
the algorithm introduced in the previous chapter. Suppose that we now apply the operator
count(*) to the result of QConsistent′(s) as follows:
select s, count(*)
from QConsistent′(s)
group by s
It is easy to see that this strategy leads to a wrong result. Since the result of the
consistent answers to q′1 (consistentΣ(q′1, I)) is {(1000)}, we would incorrectly conclude
that the greatest lower bound for 1000 is one, when in fact it is two. Clearly, the cause
for the incorrect result is that cardinalities are lost in the set-theoretic consistent answers
that we computed as an intermediate step. But, is there any way of obtaining the correct
bounds for the aggregate query, and yet be able to reuse the notion of set-theoretic
consistent answers as an intermediate step? The answer is positive: we can use a “root
key value at a time” principle. In this case, this corresponds to making the variable e
(for employee name) free because it is at the key position of employee(e, s), the literal
at the root (and only node) of q′1. We will obtain the consistent answer one employee
at a time in the intermediate result, and then project out the employees (since they
are not retrieved by q1). The intermediate result will be guaranteed to have the correct
cardinalities despite the fact that it is obtained using set semantics. The intuitive reason
is that repairs are sets of tuples that satisfy the key constraints, and hence every employee
name appears exactly once in each repair.
Following the previous discussion, let q′′1 be the query q′1, where the variable e is made
free. That is, let q′′1(e, s) = employee(e, s). The set-theoretic consistent answers for q′′1 are
consistentΣ(q′′1 , I) = {(Mary, 1000), (Ali, 1000)}. We can now project out the employee
names and count the number of occurrences of salary 1000, arriving at the correct lower
bound for count(*) in q1.
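The difference between the two strategies can be reproduced with a minimal Python sketch. The instance below is an assumption: it is one database that is consistent with the answers quoted in this example (the employee name "John" and his two conflicting salaries are hypothetical). The helper names are illustrative and are not the output of RewriteForest.

```python
from collections import Counter, defaultdict

# Assumed instance of employee(e, s), where e is the key.
EMPLOYEE = [("Mary", 1000), ("Ali", 1000), ("John", 1000), ("John", 2000)]


def consistent_employee(instance):
    """Set-theoretic consistent answers of q''1(e, s) = employee(e, s).

    A tuple is in every repair exactly when its key has a single salary.
    """
    salaries = defaultdict(set)
    for e, s in instance:
        salaries[e].add(s)
    return {(e, s) for e, s in instance if len(salaries[e]) == 1}


def naive_glb(instance):
    """Incorrect strategy: project e away under set semantics, then count."""
    consistent_salaries = {s for _, s in consistent_employee(instance)}
    return dict(Counter(consistent_salaries))


def key_at_a_time_glb(instance):
    """Keep the root key e free, then project it away while counting."""
    return dict(Counter(s for _, s in consistent_employee(instance)))
```

On the assumed instance, `naive_glb` reports a count of one for salary 1000 because duplicates collapse in the set of salaries, whereas `key_at_a_time_glb` counts one occurrence per employee and reports two.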
Let us now turn our attention to the computation of the lowest upper bound of q1.
Since aggconsistentΣ(q1, I) = {(1000, 2, 3)}, the salary 1000 is an answer that appears
at most three times in the results of applying q1 to the repairs. We can use q′′1(e, s) =
employee(e, s) to obtain the lowest upper bound of salary 1000 as follows:
select s, count(*) as lub
from q′′1(e, s)
group by s
However, this query also retrieves the tuple (2000, 1) which should not be in the result
of aggconsistentΣ(q1, I) because the salary 2000 does not appear in q1(I1). This means
that we must make sure that the values for the grouping variables are in the consistent
answers for q′′1 . We can do this by employing the first-order rewriting QConsistent(e, s)
of query q′′1 , which can be obtained by invoking the algorithm RewriteForest. Now, we
can rule out 2000 from the final result because there is no tuple for salary 2000 in the
result of QConsistent(e, s). This can be achieved with the following query:
select s, count(*) as lub
from employee(e, s) ∧ ∃e′.QConsistent(e′, s)
group by s
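The upper-bound query can be checked against a brute-force enumeration of repairs with a short Python sketch. As before, the instance is an assumption chosen to match the answers quoted in this example ("John" is a hypothetical employee), and the function names are illustrative.

```python
from collections import Counter, defaultdict
from itertools import product

# Assumed instance of employee(e, s), where e is the key.
EMPLOYEE = [("Mary", 1000), ("Ali", 1000), ("John", 1000), ("John", 2000)]


def consistent_employee(instance):
    """Tuples (e, s) whose key e has a single salary in the instance."""
    salaries = defaultdict(set)
    for e, s in instance:
        salaries[e].add(s)
    return {(e, s) for e, s in instance if len(salaries[e]) == 1}


def lub_count(instance):
    """Count employee tuples per salary, keeping only consistent salaries."""
    consistent_salaries = {s for _, s in consistent_employee(instance)}
    return dict(
        Counter(s for _, s in set(instance) if s in consistent_salaries)
    )


def repairs(instance):
    """Enumerate repairs: pick exactly one tuple for every key value."""
    groups = defaultdict(list)
    for e, s in instance:
        groups[e].append((e, s))
    return [list(choice) for choice in product(*groups.values())]


def brute_force_range(instance):
    """Range of count(*) per salary over all repairs (reference semantics)."""
    per_repair = [Counter(s for _, s in r) for r in repairs(instance)]
    # Only salaries that appear in every repair are consistent groups.
    groups = set.intersection(*(set(c) for c in per_repair))
    return {
        s: (min(c[s] for c in per_repair), max(c[s] for c in per_repair))
        for s in groups
    }
```

The filter on consistent salaries is what removes the group 2000, which occurs in only one of the two repairs of the assumed instance.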
Query Rewriting Algorithm
In Figure 4.1, we give the rewriting algorithm for aggregate conjunctive queries with
the count(∗) aggregation function. The algorithm works for queries q of the form
select ~z, count(*)
from q∗(~z)
group by ~z
where q∗ is a conjunctive query in Cforest. The reason for requiring q∗ to be in Cforest is
that, as we motivated in the previous example, we would like to build upon the results for
first-order rewriting of conjunctive queries under set-theoretic semantics. In the previous
chapter, we showed how to obtain such rewritings for the conjunctive queries in class
Cforest.
By definition, the join graph of all queries in Cforest is a forest. We can then instantiate
the values for the key attributes at each root literal of the join graph of q∗, using the
“root key value at a time” strategy that we illustrated in the previous example. More
precisely, let G be the join graph of q∗. We will construct a conjunctive query q′ that
has the same literals as q∗, but all the variables that are at the key of some root of G are
free in q′.
Following the algorithm, let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots
of all trees in G. Let ~x = ⋃i=1...m ~xi, and let ~z′ = ~z − ~x. Let φ(~w, ~z) be the conjunction
of literals of q∗, and let ~w′ = ~w − ~x. We define q′ as q′(~x, ~z′) = ∃~w′.φ(~x, ~w′, ~z′). The
advantage of query q′ is that since the variables at the key of all root literals are free,
each tuple appears exactly once in the answer to q′ in the repairs (we will show this
formally in Lemma 4.4). Thus, set and bag-set semantics coincide in the answer to q′.
We can exploit this fact by computing the set-theoretic consistent answers for q′ as an
intermediate result towards producing the consistent answers to the aggregate query q.
The first-order query rewriting QConsistent for q′ is obtained by invoking the algorithm
RewriteForest given in Figure 3.2 of Chapter 3.
The greatest lower bound is computed with the following query, which counts the
number of occurrences of tuples for ~z (the grouping variables) in the consistent answer
to q′.
QGlb(~z, low) = select ~z, count(*)
from QConsistent(~x, ~z′)
group by ~z
Notice that the free variables of QConsistent, ~x and ~z′, contain the variables of ~z, but
may have additional variables. In the final result, we are projecting out these additional
variables, since they are not in the select clause of the query q.
The lowest upper bound is obtained by counting the number of tuples that satisfy
q′(~x, ~z′) and checking that some instantiation of the grouping variables of ~z appears in the
RewriteCount(q, Σ)
Input: A query q of the form
select ~z, count(*)
from q∗(~z)
group by ~z
where q∗ is a conjunctive query in Cforest
Σ, a set of key constraints (one per relation)
Output: Q, an aggregate first-order query that computes aggconsistentΣ(q, I)
for every database I
Let G be the join graph of q
Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots of all trees of G
Let ~x = ⋃i=1...m ~xi
Let ~z′ = ~z − ~x
Let φ(~w, ~z) be the conjunction of literals of q∗
Let ~w′ = ~w − ~x
Let q′(~x, ~z′) = ∃ ~w′.φ(~x, ~w′, ~z′)
Let QConsistent(~x, ~z′) be the query obtained by invoking RewriteForest(q′, Σ)
Let QGlb(~z, low) = select ~z, count(*)
from QConsistent(~x, ~z′)
group by ~z
Let ~x′ = ~x− ~z
Let QLub(~z, up) = select ~z, count(*)
from q′(~x, ~z′) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~z
Let Q(~z, low, up) = QGlb(~z, low) ∧ QLub(~z, up)
return Q
Figure 4.1: Query rewriting algorithm for queries with count(*).
consistent answers of q′. This is obtained with the query ∃~x′.QConsistent(~x, ~z′), where
~x′ are the variables of ~x that are not free variables of q.
QLub(~z, up) = select ~z, count(*)
from q′(~x, ~z′) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~z
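For a concrete instance of QGlb and QLub, the following Python sketch runs both queries in SQLite for the single-literal query over employee(e, s). The `qconsistent` view is a hand-written rewriting for this one query, stated here as an assumption; it is not produced by RewriteForest. The table layout, the function name, and the sample rows are likewise hypothetical.

```python
import sqlite3


def count_bounds(rows):
    """Return (glb, lub) as lists of (s, count) rows for
    'select s, count(*) from employee group by s' under key e."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table employee (e text, s integer)")
    conn.executemany("insert into employee values (?, ?)", rows)

    # Assumed rewriting of q'(e, s) = employee(e, s): keep a tuple only
    # if no other tuple with the same key disagrees on the salary.
    conn.execute(
        """
        create view qconsistent as
        select distinct e1.e, e1.s
        from employee e1
        where not exists (
            select 1 from employee e2
            where e2.e = e1.e and e2.s <> e1.s)
        """
    )

    # QGlb: count consistent tuples per grouping value.
    glb = conn.execute(
        "select s, count(*) from qconsistent group by s order by s"
    ).fetchall()

    # QLub: count tuples of q' whose grouping value is consistent.
    lub = conn.execute(
        """
        select q.s, count(*)
        from (select distinct e, s from employee) as q
        where q.s in (select s from qconsistent)
        group by q.s
        order by q.s
        """
    ).fetchall()
    conn.close()
    return glb, lub


# Hypothetical rows; "John" has two conflicting salaries.
glb, lub = count_bounds(
    [("Mary", 1000), ("Ali", 1000), ("John", 1000), ("John", 2000)]
)
```

Joining `glb` and `lub` on the grouping value gives the rows of Q(~z, low, up) for this instance.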
4.2.2 Queries with the sum, min, and max Functions
In Figure 4.2, we present the query rewriting algorithm for queries with the sum, min,
and max aggregation functions. The main difference with the rewritings produced by
RewriteCount is that aggregation is performed here in two levels. At the inner level of
the rewriting, we aggregate the values for u (the value that is aggregated in the original
query), and we group by the key-root attributes (vector ~x in the figure). We then project
out the key-root attributes that are not in the select clause of the input query, and
apply the aggregation function of the input query.
For example, the greatest lower bound of the max function is computed as follows:
QGlb(~z, low) =
select ~z, max(bottom)
from
select ~x, ~z′, min(u) as bottom
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
group by ~z
Notice that, as in RewriteCount, the lower bound is obtained by selecting tuples from
QConsistent(~x, ~z′). In addition, we now have a conjunct q′′(~x, ~z′, u), which retrieves the
values for the aggregate attribute u. The inner level of aggregation consists in this case
of the computation of the bottom attribute, as the minimum for the values retrieved for
u. The outer level applies the max function (i.e., the function of the original query) to
the values of the bottom attribute.
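The two levels of aggregation for max can be sketched in Python as follows. The relation employee(e, dept, sal) with key e, its rows, and all function names are hypothetical; the query is assumed to be "select dept, max(sal) from employee group by dept". The second function enumerates repairs to cross-check the bounds on this small instance.

```python
from collections import defaultdict
from itertools import product

# Hypothetical instance of employee(e, dept, sal), where e is the key.
EMPLOYEE = [
    ("a", "d1", 10), ("a", "d1", 30),
    ("b", "d1", 20),
    ("c", "d1", 5), ("c", "d2", 50),
]


def q_consistent(instance):
    """(e, dept) pairs such that every tuple with key e has this dept."""
    depts = defaultdict(set)
    for e, d, _ in instance:
        depts[e].add(d)
    return {(e, d) for e, d, _ in instance if len(depts[e]) == 1}


def max_bounds(instance):
    """Two-level aggregation for max: inner per (e, dept), outer per dept."""
    consistent = q_consistent(instance)
    consistent_depts = {d for _, d in consistent}
    bottom, top = {}, {}
    for e, d, u in instance:
        if (e, d) in consistent:          # inner level for the lower bound
            bottom[(e, d)] = min(u, bottom.get((e, d), u))
        if d in consistent_depts:         # inner level for the upper bound
            top[(e, d)] = max(u, top.get((e, d), u))
    glb, lub = {}, {}
    for (_, d), b in bottom.items():      # outer level: max(bottom)
        glb[d] = max(b, glb.get(d, b))
    for (_, d), t in top.items():         # outer level: max(top)
        lub[d] = max(t, lub.get(d, t))
    return {d: (glb[d], lub[d]) for d in glb}


def brute_force_max_bounds(instance):
    """Reference: range of max(sal) per dept over all repairs."""
    groups = defaultdict(list)
    for t in instance:
        groups[t[0]].append(t)
    per_repair = []
    for repair in product(*groups.values()):
        maxima = {}
        for _, d, u in repair:
            maxima[d] = max(u, maxima.get(d, u))
        per_repair.append(maxima)
    common = set.intersection(*(set(m) for m in per_repair))
    return {
        d: (min(m[d] for m in per_repair), max(m[d] for m in per_repair))
        for d in common
    }
```

On this instance, employee c is excluded from the lower bound because its department is uncertain, but its d1 tuple still participates in the upper bound.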
RewriteAgg(q, Σ)
Input: A query q of the form
select ~z, [max(u)|min(u)|sum(u)]
from q∗(~z, u)
group by ~z
where q∗ is a conjunctive query in Cforest
Σ, a set of key constraints (one per relation)
Output: Q, an aggregate first-order query that computes aggconsistentΣ(q, I)
for every database I
Let G be the join graph of q
Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots of all trees of G
Let ~x = ⋃i=1...m ~xi
Let ~z′ = ~z − ~x
Let φ(~w, ~z, u) be the conjunction of literals of q∗
Let ~w′ = ~w − ~x
Let q′(~x, ~z′) = ∃ ~w′, u.φ(~x, ~w′, ~z′, u)
Let QConsistent(~x, ~z′) be the query obtained by invoking RewriteForest(q′,Σ)
Let q′′(~x, ~z′, u) = ∃ ~w′.φ(~x, ~w′, ~z′, u)
Let ~x′ = ~x− ~z − u
if the aggregate function is max then
QGlb(~z, low) =
select ~z, max(bottom)
from
select ~x, ~z′, min(u) as bottom
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
group by ~z
QLub(~z, up) =
select ~z, max(top)
from
select ~x, ~z′, max(u) as top
from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~x, ~z′
group by ~z
endif
Figure 4.2: Query rewriting algorithm for queries with aggregation
if the aggregate function is sum then
QGlb(~z, low) =
select ~z, sum(bottom)
from
select ~x, ~z′, min(u) as bottom
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
having bottom ≥ 0
∨select ~x, ~z′, min(u) as bottom
from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~x, ~z′
having bottom < 0
group by ~z
QLub(~z, up) =
select ~z, sum(top)
from
select ~x, ~z′, max(u) as top
from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~x, ~z′
having top > 0
∨select ~x, ~z′, max(u) as top
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
having top ≤ 0
group by ~z
endif
if the aggregate function is min then
QGlb(~z, low) =
select ~z, min(bottom)
from
select ~x, ~z′, min(u) as bottom
from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~x, ~z′
group by ~z
QLub(~z, up) =
select ~z, min(top)
from
select ~x, ~z′, max(u) as top
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
group by ~z
endif
Let Q(~z, low, up) = QGlb(~z, low) ∧ QLub(~z, up)
return Q
4.3 Correctness of the Algorithms
In this section, we prove the correctness of the query rewriting algorithms of this chapter.
We consider the following class of queries, which we call Caggforest.
Definition 4.1. Let q be an aggregate conjunctive query. We say that q is in class
Caggforest if q is of the form
select ~z, [count(*)| F(u)]
from q∗(~z, u)
group by ~z
where q∗ is a conjunctive query in Cforest, and F is one of the aggregation functions
min, max or sum.
The main result of this section is the following theorem:
Theorem 4.2. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z, v) be a query in Caggforest. Let Q(~z, l, u)
be the first-order aggregate query returned by RewriteCount(q, Σ) or RewriteAgg(q, Σ)
(depending on the aggregate function of the query).
Let I be an instance over R. If q has the aggregate function sum, assume that the
aggregated attribute ranges over positive numbers on I.
Then, for every tuple ~t, and pair of real numbers low and up, we have that
(~t, low, up) ∈ aggconsistentΣ(q, I) iff (~t, low, up) ∈ Q(I).
Notice that for the sum operator we have an additional requirement: the aggregated
variable must take only positive numbers. The rewriting for sum, however, does produce
sound bounds for arbitrary numbers (positive or negative), as we prove in Section 4.3.3.
The algorithms use the first-order query rewritings of the previous chapter as a build-
ing block. The semantics of those rewritings is set-theoretic, whereas the aggregate
functions we consider in this chapter take bags as input. In Section 4.3.1, we show that
for a subclass of the conjunctive queries in Cforest, the cardinality of the query results on
every repair is exactly one. Thus, for this subclass, it is not necessary to keep track of
tuple multiplicities in the intermediate results. Recall that in Chapter 3, we showed that
for every query q in a subclass of Cforest, there is a “pessimistic” repair M such that q(M)
retrieves all the consistent answers to q. We will use the notion of pessimistic repair to
prove that the bounds produced by the rewritings are tight. We will also need the dual
notion of an “optimistic” repair, which we introduce in Section 4.3.2. In Section 4.3.3, we
show that the ranges produced by the query rewritings are sound, in the sense that the
value of the aggregation function falls within the range on every repair. In Section 4.3.4,
we show that the ranges produced by the query rewritings are tight, in the sense that
they are satisfied in at least one repair. Finally, in Section 4.3.5 we put it all together,
and give the proof of correctness of the rewritings.
4.3.1 Building Upon First-Order Rewritings
The semantics of first-order rewritings is set-theoretic, whereas aggregate functions take
bags as input. In this subsection, we show that for a class of conjunctive queries that
is relevant in the query rewriting algorithms, the cardinality of the tuples in the result
of applying a query to the repairs is always one. As a consequence, for such queries, it
suffices to obtain a set-theoretic first-order rewriting. The result of applying the first-
order rewriting to the inconsistent database can be used as an intermediate step towards
obtaining the consistent answers for conjunctive queries with aggregation.
The queries with the aforementioned property are the conjunctive queries in Cforest,
where all the variables at key positions of some root of the join graph are free. The
proof is given in Lemma 4.4. The lemma makes use of an auxiliary result, that we give
next, which focuses on queries in Cforest that satisfy the additional condition that the
join graph must be a tree (instead of a forest). Intuitively, we show that in each repair
I, each tuple ~t in the query result is obtained “due to” the same set of tuples in I. More
formally, we show that if S and S ′ are sets that contain exactly one tuple per relation of
I and such that ~t ∈ q(S) and ~t ∈ q(S ′), then S ′ = S.
Lemma 4.3. Let q(~z) be a query in Cforest. Assume that the join graph T of q is a
tree, and that all the variables at key positions of the literal at the root of T are free in q
(that is, there is a literal R(~x, ~y) at the root of T such that ~x ⊆ ~z). Let I be a database
instance over the schema of q, and Σ be a set consisting of at most one key dependency
per relation of q. Let I be a repair of I wrt Σ. Let S and S ′ be sets that contain exactly
one tuple per relation of I and such that ~t ∈ q(S), and ~t ∈ q(S ′). Then, S ′ = S.
Proof. The proof is by induction on the number of literals of q.
Base case. Assume that q has exactly one literal. Assume towards a contradiction
that S ≠ S′. Then, there are distinct tuples ~t0 and ~t′0 in I such that ~t ∈ q({~t0}) and
~t ∈ q({~t′0}). Let R(~x, ~y) be the only literal of q. Since all the variables at key positions of
the root literal of T are free, and ~z are the free variables of q, we have that ~x ⊆ ~z. Thus,
there are vectors of values ~c, ~d and ~d′ such that ~d ≠ ~d′, ~t0 = R(~c, ~d), and ~t′0 = R(~c, ~d′).
Thus, I ⊭ Σ. But I is a repair of I wrt Σ; contradiction.
Inductive step. Assume that q has more than one literal. Let R be a literal of q
that appears at a leaf of T (recall that T is a tree). Let ~t0 and ~t′0 be tuples of S and S ′,
respectively, such that ~t0 = R(~c, ~d) and ~t′0 = R(~c′, ~d′).
Let M be a set that consists of all the tuples of S, except the one for literal R.
Let M ′ be a set that consists of all the tuples of S ′, except the one for literal R. By
inductive hypothesis, M = M ′. Notice that M and M ′ are the only subsets of S and S ′,
respectively, that satisfy these conditions since S and S ′ contain exactly one tuple per
relation of I.
Let R′(~x′, ~y′) be the parent of R in T . Then, there is a tuple ~t1 in R′ and valuations
ν and ν′ such that ~t1 ∈ S, ~t1 ∈ S′, {~t0,~t1} |= R′(~x′, ~y′) ∧ R(~x, ~y)[~z/~t][ν], and
{~t′0,~t1} |= R′(~x′, ~y′) ∧ R(~x, ~y)[~z/~t][ν′]. Notice that ν(~y′) = ν′(~y′). Since q ∈ Cforest, there is a full
nonkey-to-key join from R′ to R. Thus, all the variables of ~y′ appear in ~x. Therefore,
ν(~x) = ν′(~x); and ~c = ~c′. Assume towards a contradiction that ~t0 ≠ ~t′0. Then, there are
tuples R(~c, ~d) and R(~c′, ~d′) in I such that ~c = ~c′ and ~d ≠ ~d′. This means that I ⊭ Σ.
But I is a repair of I wrt Σ; contradiction.
In the next lemma, we show that for queries in Cforest such that the variables at key
positions of all root literals are free, the cardinality of each tuple in the query result is
exactly one.
Lemma 4.4. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z) be a conjunctive query over R such that
q ∈ Cforest. Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals
at the root of each connected component (tree) of G. Assume that ~x1, . . . , ~xm are free
variables in q (i.e., they occur in ~z).
Let I be an instance over R. Let I be a repair of I wrt Σ. Let B be a bag such that
B = q(I) under bag semantics. Let ~t be such that ~t ∈ q(I). Then, |~t|B = 1.
Proof. Assume towards a contradiction that |~t|B > 1. Then, there are distinct sets S and
S ′ that contain exactly one tuple per literal of q and such that ~t ∈ q(S), and ~t ∈ q(S ′).
Since q ∈ Cforest, G is a forest. For each 1 ≤ i ≤ m, let Ti be the tree whose root is Ri.
Let φi(~w, ~z) be the conjunction of the literals of Ti. Let qi(~z) = ∃~w.φi(~w, ~z). Recall that
~xi (the variables at the key of the root literal of Ti) are free, and therefore occur in ~z.
Thus, qi satisfies the conditions of Lemma 4.3.
Since S ≠ S′, ~t ∈ q(S), and ~t ∈ q(S′), there must be some i and some sets M and M′
such that M ≠ M′, M ⊆ S, M′ ⊆ S′, M and M′ have one tuple for each relation symbol
in φi, ~t ∈ qi(M), and ~t ∈ qi(M′). But this contradicts Lemma 4.3 above.
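As an informal illustration of Lemma 4.4 (it is not part of the formal development), the following Python sketch evaluates two projections over a small repair under bag semantics. The relation Emp(name, dept) with key name, and its contents, are hypothetical and chosen only for this example.

```python
from collections import Counter

# Hypothetical repair of Emp(name, dept): the key `name` is unique.
repair = {("ann", "sales"), ("bob", "hr"), ("cid", "hr")}

# q1(n) = exists d. Emp(n, d): the key variable n is free.
by_name = Counter(name for name, _ in repair)

# q2(d) = exists n. Emp(n, d): the key variable n is quantified away.
by_dept = Counter(dept for _, dept in repair)

print(by_name)  # every name occurs exactly once
print(by_dept)  # "hr" occurs twice
```

When the key of the root literal is free, each answer is produced by exactly one tuple of the repair, which is the situation the lemma describes. The second query does not satisfy the hypothesis, and its answers may have multiplicity greater than one.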
4.3.2 An “Optimistic” Repair
Recall that in Chapter 3 we showed that for every query q in a subclass of Cforest, there is
a “pessimistic” repair M such that q(M) retrieves all the consistent answers to q. In Section
4.3.4, we will use M to prove the tightness of the query rewritings. For example, if
we apply an aggregate query on M, the value that we get for the count(*) aggregate
function corresponds to the greatest lower bound computed by the rewriting produced
by RewriteCount(q, Σ).
For the lowest upper bound, we will need the notion of an “optimistic” repair N . The
name “optimistic” comes from the fact that in this repair, if a tuple ~t can be obtained
from some repair of the inconsistent database, then the tuple is also in q(N ). In Lemma
4.6, we show the existence of such a repair.
Before proving the existence of the optimistic repair, we formally define the notion
of possible answers. This notion can be considered as dual to the notion of consistent
answers. While a consistent answer is one that holds in the query results obtained from
all the repairs, a possible answer is one that holds in the query result from at least one
repair.
Definition 4.5 (Possible Answers). Let R be a schema. Let Σ be a set of integrity
constraints. Let I be an instance over R (possibly inconsistent with respect to Σ). Let
q be a query over R. We say that a tuple ~t is a possible answer for q with respect to Σ
if there exists a repair I of I with respect to Σ such that ~t ∈ q(I). We denote this as
~t ∈ possibleΣ(q, I).
For a Boolean query q over R, we say that possibleΣ(q, I) = true if there exists a
repair I of I with respect to Σ such that I |= q. We say that possibleΣ(q, I) = false if
for every repair I of I with respect to Σ, I ⊭ q.
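The duality between consistent and possible answers can be made concrete by enumerating repairs explicitly. The sketch below is only an illustration under simplifying assumptions: a single relation, a key on its leading attributes, and a hypothetical relation Emp(name, dept). The helper names repairs and answers are not from the thesis, and the enumeration is exponential in the number of key violations, so it is meant for toy inputs only.

```python
from itertools import product

def repairs(instance, key_len):
    """Yield every repair of `instance` under a key on its first `key_len` attributes.

    A repair keeps exactly one tuple from each group of tuples sharing a key value.
    """
    groups = {}
    for t in instance:
        groups.setdefault(t[:key_len], []).append(t)
    for choice in product(*groups.values()):
        yield set(choice)

def answers(query, instance, key_len):
    """Return (consistent, possible) answers of `query` over all repairs."""
    results = [set(query(r)) for r in repairs(instance, key_len)]
    consistent = set.intersection(*results)  # holds in every repair
    possible = set.union(*results)           # holds in at least one repair
    return consistent, possible

# Hypothetical relation Emp(name, dept) with key `name`; "ann" violates the key.
emp = {("ann", "sales"), ("ann", "hr"), ("bob", "hr")}

def depts(r):
    # q(d) = exists n. Emp(n, d)
    return {d for (_, d) in r}

consistent, possible = answers(depts, emp, key_len=1)
print(consistent)  # {'hr'}
print(possible)    # contains both 'hr' and 'sales'
```

Here "sales" is a possible answer but not a consistent one, because the repair that keeps ("ann", "hr") does not contain it.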
Lemma 4.6. Let q(~x) be a query in Cforest, whose join graph T is a tree and where
R(~x, ~y) is the literal at the root of T. Let I be an instance. Then, there is a repair N
such that for all ~c, if ~c ∈ possibleΣ(q, I), then ~c ∈ q(N).
Proof. Let N be the instance built by BuildOptimisticRepair(q, I) (the algorithm
given in Figure 4.3). We will prove the claim by induction on the number of
literals of q.
Base case. Assume that q consists of exactly one literal R(~x, ~y). Let ~t be the
tuple selected by the algorithm in the iteration for literal R and the vector of values ~c.
Assume towards a contradiction that N ⊭ ∃~w.R(~c, ~y). Then, {~t} ⊭ ∃~w.R(~c, ~y). Since
possibleΣ(∃~w.R(~c, ~y), I) = true, there is some repair I of I such that I |= ∃~w.R(~c, ~y).
Thus, there is a tuple ~t′ such that {~t′} |= ∃~w.R(~c, ~y). Notice that ~t and ~t′ can be added
to N only during the iteration for the vector of values ~c. Since {~t} ⊭ ∃~w.R(~c, ~y) and
{~t′} |= ∃~w.R(~c, ~y), the algorithm never selects tuple ~t. But ~t ∈ N; contradiction.
Inductive step. Assume that q has more than one literal. Let φ(~w, ~x) be the
conjunction of literals of q. Let T1, . . . , Tm be the subtrees of T such that the root of
Tj is a child of the root of T , for 1 ≤ j ≤ m. For each 1 ≤ j ≤ m, let Sj(~xj, ~yj)
be the literal at the root of Tj. Let φj be the conjunction of the literals of Tj. Let
~wj = {w : w is a variable of φj, and w ∉ ~xj}. Let qj(~xj) = ∃~wj.φj(~xj, ~wj). Let
Nj = BuildOptimisticRepair(qj, I).
Assume towards a contradiction that ~c ∉ q(N). Let ~t be the tuple of I selected by the
algorithm in the iteration for literal R and the vector of values ~c. Then, ~t ∈ N, and there
is some ~d such that ~t = R(~c, ~d). Since ~c ∉ q(N), there must be some j, some valuation
ν for the variables of ~y, and some ~cj such that 1 ≤ j ≤ m, ν(~y) = ~d, ν(~xj) = ~cj, and
~cj ∉ qj(Nj).
Since possibleΣ(q(~c), I) = true, there is some repair I of I such that ~c ∈ q(I).
Thus, there is some tuple ~t′ in I, some ~d′, and some valuation ν for the variables of ~y
such that ~t′ = R(~c, ~d′), ν(~y) = ~d′, and the following condition holds: for every j and
tuple of values ~c′j such that 1 ≤ j ≤ m and ν(~xj) = ~c′j, we have that ~c′j ∈ qj(I). Thus,
possibleΣ(qj(~cj), I) = true. By inductive hypothesis ~c′j ∈ qj(Nj). Thus, the algorithm
selects ~t′ in the construction of N , rather than ~t. But ~t ∈ N ; contradiction.
Algorithm BuildOptimisticRepair
Input: q(~x), a query in Cforest of the form ∃~w.φ(~w, ~x),
         whose join graph T is a tree with root literal R(~x, ~y)
       Σ, a set of key constraints, one per relation
       I, a database instance
Initialize N as an empty instance
if φ has exactly one literal then
  for each ~c such that there is some R(~c, ~d) in I do
    if there is some ~d such that R(~c, ~d) ∈ I,
       and {R(~c, ~d)} |= ∃~w.R(~c, ~y) then
      Let ~t = R(~c, ~d)
    else
      Let ~t be any tuple of I such that ~t = R(~c, ~d), for some ~d
    end if
    Add ~t to N
  end for
else
  /* φ has more than one literal */
  Let S1, . . . , Sm be the children of R in T
  for j := 1 to m do
    Let Tj be the subtree of T whose root is Sj
    Let φj be the conjunction of literals of Tj
    Let ~wj = {w : w is a variable that occurs in φj and ~w, and w ∉ ~xj}
    Let qj(~xj) = ∃~wj.φj(~xj, ~wj)
    Let Nj = BuildOptimisticRepair(qj, I)
    Add Nj to N
  end for
  for each ~c such that there is some R(~c, ~d) in I do
    if there is some ~d and some valuation ν for the variables of ~y such that R(~c, ~d) ∈ I,
       ν(~y) = ~d, and there is no j and ~cj such that ν(~xj) = ~cj and ~cj ∉ qj(Nj) then
      Let ~t = R(~c, ~d)
    else
      Let ~t be any tuple of I such that ~t = R(~c, ~d), for some ~d
    end if
    Add ~t to N
  end for
end if
Figure 4.3: Algorithm to build the “optimistic” repair
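The base case of Figure 4.3 (a query with a single literal) can be sketched in Python as follows. This is a simplified illustration rather than the full recursive algorithm. The function name build_optimistic_repair_single, the predicate satisfies (which stands in for the test {R(~c, ~d)} |= ∃~w.R(~c, ~y)), and the sample relation are assumptions made for the example.

```python
def build_optimistic_repair_single(instance, key_len, satisfies):
    """Base case of the optimistic repair for a single literal R.

    `instance` is a set of tuples of R, keyed on the first `key_len` attributes.
    `satisfies(t)` is a hypothetical predicate telling whether tuple t matches
    the literal (for example, its constants). For each key value we keep a
    matching tuple when one exists, and an arbitrary tuple otherwise.
    """
    groups = {}
    for t in sorted(instance):  # sorted only to make the choice deterministic
        groups.setdefault(t[:key_len], []).append(t)
    repair = set()
    for tuples in groups.values():
        matching = [t for t in tuples if satisfies(t)]
        repair.add(matching[0] if matching else tuples[0])
    return repair

# Hypothetical Emp(name, dept) with key `name`, and the literal Emp(n, 'sales').
emp = {("ann", "sales"), ("ann", "hr"), ("bob", "hr")}
n = build_optimistic_repair_single(emp, 1, lambda t: t[1] == "sales")
print(sorted(n))  # [('ann', 'sales'), ('bob', 'hr')]
```

The result keeps ("ann", "sales") because some repair makes "ann" an answer to the literal, which is the behaviour the proof of Lemma 4.6 relies on in its base case.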
4.3.3 Sound Ranges
In this subsection, we show that the ranges produced by the query rewritings are sound,
in the sense that the value of the aggregation function falls within the returned range on
every repair.
The next lemma shows that the rewritings produced by RewriteCount compute sound
ranges.
Lemma 4.7. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z, v) be a query of the following form:
select ~z, count(*)
from q∗(~z)
group by ~z
where q∗(~z) is a conjunctive query in Cforest.
Let Q be the first-order aggregate query returned by RewriteCount(q, Σ). Let I be a
database instance over R. Let I be a repair of I wrt Σ. Let ~t be a tuple, and low and up
be a pair of real numbers such that (~t, low, up) ∈ Q(I) and ~t ∈ consistentΣ(q∗, I). Let d
be such that (~t, d) ∈ q(I). Then, low ≤ d ≤ up.
Proof. Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the
roots of all trees of G. Let φ(~w, ~z) be the conjunction of literals of q∗. Let ~x = ⋃_{i=1...m} ~xi,
let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let q′(~x, ~z′) = ∃~w′.φ(~x, ~w′, ~z′). Let ~x′ = ~x − ~z. Let
QConsistent(~x, ~z′) be the query obtained by invoking RewriteForest(q′, Σ).
Lower Bound. Since (~t, low, up) ∈ Q(I), the lower bound low of ~t is computed with
the following query:
QGlb(~z, glb) = select ~z, count(*)
from QConsistent(~x, ~z′)
group by ~z
Assume towards a contradiction that d < low. Then, there is a tuple (~c, ~t′) such
that (~c, ~t′) ∈ QConsistent(I) and (~c, ~t′) ∉ q′(I). Then, (~c, ~t′) ∉ consistentΣ(q′, I). By
Theorem 3.5, we conclude that (~c, ~t′) ∉ QConsistent(I); contradiction.
Upper Bound. Since (~t, low, up) ∈ Q(I), the upper bound up of ~t is computed with
the following query:
Let QLub(~z, lub) = select ~z, count(*)
from q′(~x, ~z′) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~z
Assume towards a contradiction that d > up. Then, there is a valuation ν and a tuple
(~c, ~t′) such that ν(~x) = ~c, ν(~z′) = ~t′, ν(~z) = ~t, (~c, ~t′) ∈ q′(I), and either (1) (~c, ~t′) ∉ q′(I);
or (2) I ⊭ (∃~x′.QConsistent(~x, ~z′)).
Assume that (1) (~c, ~t′) ∉ q′(I). Since I is a repair of I, by Proposition 3.6, I ⊆ I.
Thus, (~c, ~t′) ∉ q′(I); contradiction. Assume that (2) I ⊭ (∃~x′.QConsistent(~x, ~z′)).
Recall that ~x′ = ~x − ~z. By Theorem 3.5, (~c′, ~t′) ∉ consistentΣ(q′, I), for every ~c′. In
particular, (~c, ~t′) ∉ consistentΣ(q′, I). Recall that there is a valuation ν for the variables
of ~x and ~z′ such that ν(~x) = ~c, ν(~z′) = ~t′ and ν(~z) = ~t. Thus, ~t ∉ consistentΣ(q∗, I);
contradiction.
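To see the soundness claim of Lemma 4.7 on a concrete input, the following Python sketch computes analogues of QGlb and QLub for the simplest possible case, a single relation R(k, g) with key k and the query "select g, count(*) from R group by g", and compares them with a brute-force enumeration of all repairs. The relation, the function names and the data are hypothetical; this is not the RewriteCount algorithm itself, only a hand-specialized instance of the two queries used in the proof.

```python
from itertools import product

def repairs(instance):
    """Yield all repairs of R(k, g) under the key k (one tuple per key value)."""
    groups = {}
    for t in instance:
        groups.setdefault(t[0], []).append(t)
    for choice in product(*groups.values()):
        yield set(choice)

def count_range_rewriting(instance):
    """Range of count(*) per group, following the shape of QGlb and QLub."""
    groups_of_key = {}
    for k, g in instance:
        groups_of_key.setdefault(k, set()).add(g)
    # QConsistent(k, g): every tuple with key k carries group g.
    q_consistent = {(k, g) for (k, g) in instance if groups_of_key[k] == {g}}
    # Only groups that are consistent answers get a range.
    consistent_groups = {g for _, g in q_consistent}
    ranges = {}
    for g in consistent_groups:
        low = sum(1 for _, g2 in q_consistent if g2 == g)   # QGlb
        up = sum(1 for _, g2 in set(instance) if g2 == g)   # QLub
        ranges[g] = (low, up)
    return ranges

def count_range_brute_force(instance, groups):
    """Minimum and maximum of count(*) per group over all repairs."""
    out = {}
    for g in groups:
        counts = [sum(1 for _, g2 in r if g2 == g) for r in repairs(instance)]
        out[g] = (min(counts), max(counts))
    return out

# Hypothetical Orders(order_id, customer) with key order_id.
orders = {(1, "c1"), (2, "c1"), (2, "c2"), (3, "c2"), (4, "c1"), (4, "c3")}
ranges = count_range_rewriting(orders)
brute = count_range_brute_force(orders, ranges.keys())
print(ranges == brute)  # True
```

On this input the two computations agree, so the value of count(*) in every repair falls inside the reported range. Customer "c3" receives no range because it is not a consistent answer of the underlying query.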
The next lemma shows that the rewritings for queries with the sum operator compute
sound ranges.
Lemma 4.8. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z, v) be a query of the following form:
select ~z, sum(u)
from q∗(~z, u)
group by ~z
where q∗(~z, u) is a conjunctive query in Cforest.
Let Q be the first-order aggregate query returned by RewriteAgg(q, Σ). Let I be a
database instance over R. Let I be a repair of I wrt Σ. Let ~t be a tuple, and low and up
be a pair of real numbers such that (~t, low, up) ∈ Q(I) and ~t ∈ consistentΣ(q∗, I). Let d
be such that (~t, d) ∈ q(I). Then, low ≤ d ≤ up.
Proof. Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at
the roots of all trees of G. Let φ(~w, ~z, u) be the conjunction of literals of q∗. Let
~x = ⋃_{i=1...m} ~xi, let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let ~x′ = ~x − ~z − u. Let
q′(~x, ~z′) = ∃~w′, u.φ(~x, ~w′, ~z′, u). Let QConsistent(~x, ~z′) be the query obtained by
invoking RewriteForest(q′, Σ). Let q′′ be the query q′′(~x, ~z′, u) = ∃~w′.φ(~x, ~w′, ~z′, u).
Lower Bound. Since (~t, low, up) ∈ Q(I), the lower bound low of ~t is computed with
the following query:
QGlb(~z, glb) = select ~z, sum(v)
from QContribConsistent(~x, ~z′, v) ∨ QContribNonConsistent(~x, ~z′, v)
group by ~z
where QContribConsistent is the following query:
QContribConsistent(~x, ~z′, bottom) =
select ~x, ~z′, min(u) as bottom
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
having bottom ≥ 0
and QContribNonConsistent is the following query:
QContribNonConsistent(~x, ~z′, bottom) =
select ~x, ~z′, min(u) as bottom
from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~x, ~z′
having bottom < 0
Assume towards a contradiction that d < low. Since (~t, d) ∈ q(I), we must consider
the following cases.
First, assume that there is a valuation ν for the variables in ~z, ~x such that ν(~z) = ~t,
ν(~x) = ~c , ν(~z′) = ~t′, and
• (~c, ~t′) ∉ q′(I); and
• there is some e such that e > 0; and
• (~c, ~t′, e) ∈ QContribConsistent ∨ QContribNonConsistent(I).
Since e > 0, (~c, ~t′, e) ∈ QContribConsistent(I). Since (~c, ~t′) ∉ q′(I), (~c, ~t′) ∉
consistentΣ(q′, I). By Theorem 3.5, we conclude that (~c, ~t′) ∉ QConsistent(I). There-
and (~c, ~t′, e) ∈ q′′(I). Since I ⊆ I, and (~c, ~t′, e′) ∈ q′′(I), we have that (~c, ~t′, e′) ∈
q′′(I). Notice that e and e′ correspond to the attribute bottom of QContribConsistent.
This attribute is computed as min(u), that is, the minimum of the values of u for the
tuples of (~c, ~t′). Since (~c, ~t′, e) and (~c, ~t′, e′) satisfy the conditions of the from clause of
QContribConsistent, e < e′; contradiction.
Now, assume that (~c, ~t′, e) ∈ QContribNonConsistent(I). Since I ⊆ I, (~c, ~t′, e′) ∈
q′′(I). Since e corresponds to the attribute bottom of QContribNonConsistent, e < e′;
contradiction.
Upper Bound. The proof for the lowest upper bound is analogous to the proof for
the greatest lower bound.
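The lower-bound query for sum can be checked on a small example in the same way. The sketch below specializes QGlb, QContribConsistent and QContribNonConsistent by hand to a single hypothetical relation R(k, g, u) with key k and the query "select g, sum(u) from R group by g", and compares the result with the minimum over all repairs. The function names and data are assumptions for the illustration, and the brute-force enumeration is only feasible for toy instances.

```python
from itertools import product

def repairs(instance):
    """Yield all repairs of R(k, g, u) under the key k (one tuple per key value)."""
    groups = {}
    for t in instance:
        groups.setdefault(t[0], []).append(t)
    for choice in product(*groups.values()):
        yield set(choice)

def sum_glb_rewriting(instance):
    """Lower bound of sum(u) per group g, following the shape of QGlb."""
    groups_of_key = {}
    bottom = {}  # (k, g) -> min(u), as in `min(u) as bottom`
    for k, g, u in instance:
        groups_of_key.setdefault(k, set()).add(g)
        bottom[(k, g)] = min(u, bottom.get((k, g), u))
    # QConsistent(k, g): every tuple with key k has group g.
    q_consistent = {(k, g) for (k, g) in bottom if groups_of_key[k] == {g}}
    consistent_groups = {g for _, g in q_consistent}
    glb = {g: 0 for g in consistent_groups}
    for (k, g), b in bottom.items():
        if g not in consistent_groups:
            continue
        in_contrib_consistent = (k, g) in q_consistent and b >= 0
        in_contrib_nonconsistent = b < 0
        if in_contrib_consistent or in_contrib_nonconsistent:
            glb[g] += b
    return glb

def sum_glb_brute_force(instance, groups):
    """Minimum of sum(u) per group over all repairs."""
    return {
        g: min(sum(u for _, g2, u in r if g2 == g) for r in repairs(instance))
        for g in groups
    }

r = {(1, "a", 5), (1, "a", 3), (2, "a", 4), (2, "b", 1),
     (3, "a", -2), (3, "b", 7), (4, "b", 6)}
rewritten = sum_glb_rewriting(r)
brute = sum_glb_brute_force(r, rewritten.keys())
print(rewritten == brute)  # True
```

The example shows why the two contribution queries are needed: key 2 is inconsistent for group "a" and has a non-negative minimum, so it contributes nothing to the lower bound, while key 3 is also inconsistent but contributes its negative minimum, since some repair can pick that tuple and lower the sum.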
The next lemma shows that the rewritings for queries with the min and max aggrega-
tion functions compute sound ranges.
Lemma 4.9. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z, v) be a query of the following form:
select ~z, [min(u) | max(u)]
from q∗(~z, u)
group by ~z
where q∗(~z, u) is a conjunctive query in Cforest.
Let Q be the first-order aggregate query returned by RewriteAgg(q, Σ). Let I be a
database instance over R. Let I be a repair of I wrt Σ. Let ~t be a tuple, and low and up
be a pair of real numbers such that (~t, low, up) ∈ Q(I) and ~t ∈ consistentΣ(q∗, I). Let d
be such that (~t, d) ∈ q(I). Then, low ≤ d ≤ up.
Proof. Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at
the roots of all trees of G. Let φ(~w, ~z, u) be the conjunction of literals of q∗. Let
~x = ⋃_{i=1...m} ~xi, let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let ~x′ = ~x − ~z − u. Let
q′(~x, ~z′) = ∃~w′, u.φ(~x, ~w′, ~z′, u). Let QConsistent(~x, ~z′) be the query obtained by
invoking RewriteForest(q′, Σ). Let q′′ be the query q′′(~x, ~z′, u) = ∃~w′.φ(~x, ~w′, ~z′, u).
Lower Bound. Suppose that the aggregate function of q is max. Since (~t, low, up) ∈ Q(I),
the lower bound low of ~t is computed with the following query:
QGlb(~z, glb) = select ~z, max(u)
from QContribConsistent(~x, ~z′, u)
group by ~z
where QContribConsistent is the following query:
QContribConsistent(~x, ~z′, bottom) =
select ~x, ~z′, min(u) as bottom
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
Assume towards a contradiction that d < low. Then, there is a valuation ν for the
variables in ~z, ~x such that ν(~z) = ~t, ν(~x) = ~c , ν(~z′) = ~t′, and
• there is some e such that (~c, ~t′, e) ∈ QContribConsistent(I); and
• there is some e′ such that e′ < e; and
• (~c, ~t′, e′) ∈ q′′(I).
We arrive at a contradiction by the same arguments used in Lemma 4.8 for queries
with the sum operator.
Now, suppose that the aggregate function of q is min. Since (~t, low, up) ∈ Q(I), the
lower bound low of ~t is computed with the following query:
QGlb(~x, ~z, bottom) =
select ~z, min(bottom)
from QContribNonConsistent(~x, ~z′, u)
group by ~z
where QContribNonConsistent is the following query:
select ~x, ~z′, min(u) as bottom
from q′′(~x, ~z′, u) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~x, ~z′
Assume towards a contradiction that d < low. Then, there is a valuation ν for the
variables in ~z, ~x such that ν(~z) = ~t, ν(~x) = ~c , ν(~z′) = ~t′, and
• there is some e such that (~c, ~t′, e) ∈ QContribNonConsistent(I); and
• there is some e′ such that e′ < e; and
• (~c, ~t′, e′) ∈ q′′(I).
We arrive at a contradiction by the same arguments used in Lemma 4.8 for queries
with the sum operator.
Upper Bound. For the max operator, we can give an argument analogous to the
argument given for the lower bound of the min operator. For the min operator, we
can give an argument analogous to the argument given for the lower bound of the max
operator.
4.3.4 Tight Ranges
In this section, we show that the ranges produced by the query rewritings are tight. For
this, we must exhibit two repairs, where the result of the aggregation function corresponds
to the greatest lower bound in one repair, and to the lowest upper bound in the other. For
example, if the query has the count(*) operator, the repair that we need for the greatest
lower bound turns out to be the “pessimistic” repair M used in the correctness proof of
the first-order rewritings of Section 3.3.3. For the lowest upper bound, the needed repair
is the “optimistic” repair N that we introduced in Section 4.3.2.
We start by showing that the rewritings produced by RewriteCount give tight bounds.
In the next lemma, we show that the greatest lower bound of count(*) can be obtained
by executing the query on the pessimistic repair M. We also show that the query
rewriting that we obtain correctly returns such bound.
Lemma 4.10. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z) be a query of the following form:
select ~z, count(*)
from q∗(~z)
group by ~z
where q∗(~z) is a query in Cforest.
Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots
of each tree of G. Let φ(~w, ~z) be the conjunction of literals of q∗. Let ~x = ⋃_{i=1...m} ~xi,
let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let q′(~x, ~z′) = ∃~w′.φ(~x, ~w′, ~z′). Let Q(~z, l, u) be the
first-order aggregate query returned by RewriteCount(q, Σ). Let I be an instance over
R. Let ~t be a tuple and low and up be a pair of real numbers.
Then, there is a repair M of I wrt Σ and a bag B such that B = q(M), and the
following conditions hold:
1. for every valuation ν such that ν(~x) = ~c, ν(~z′) = ~t′, and ν(~z) = ~t, if (~c, ~t′) ∈ q′(M),
then ~c ∈ consistentΣ(q′[~z/~t], I), and
2. if (~t, low, up) ∈ aggconsistentΣ(q, I), then |~t|B = low, and
3. if (~t, low, up) ∈ Q(I), then |~t|B = low.
Proof. Let M be the pessimistic repair obtained by invoking the algorithm
BuildPessimisticRepair(q, Σ, I). Condition (1) holds by Lemma 3.10. We must now prove
Conditions (2) and (3).
In order to prove Condition 2, let ~t be a tuple, and low, and up be a pair of real
numbers such that (~t, low, up) ∈ aggconsistentΣ(q, I). Then, there is a repair I of I
wrt Σ and a bag B′ such that B′ = q(I) and |~t|B′ = low. Furthermore, by Lemma
4.7, since M is a repair of I wrt Σ, |~t|B ≥ low. Assume towards a contradiction that
|~t|B > low. Then, there is a valuation ν for the variables of ~x and ~z such that ν(~x) = ~c,
ν(~z) = ~t and ν(~z′) = ~t′, and one of the following conditions holds:
• (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1; or
• (~c, ~t′) ∈ q′[~z/~t](M) and (~c, ~t′) ∉ q′[~z/~t](I).
Assume that (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1. This contradicts Lemma 4.4. Now,
assume that (~c, ~t′) ∈ q′[~z/~t](M) and (~c, ~t′) ∉ q′[~z/~t](I). Then, (~c, ~t′) ∉ consistentΣ(q′[~z/~t], I).
By Condition 1, we have that (~c, ~t′) ∉ q′[~z/~t](M); contradiction.
In order to prove Condition 3, let ~t, low, and up be such that (~t, low, up) ∈ Q(I).
Since M is a repair of I, by Lemma 4.7, |~t|B ≥ low. Let QConsistent(~x, ~z′) be the query
obtained by invoking RewriteForest(q′, Σ). Then, the lower bound low of ~t is computed
with the following query:
QGlb(~z, low) = select ~z, count(*)
from QConsistent(~x, ~z′)
group by ~z
Assume towards a contradiction that |~t|B > low. Then, there is a valuation ν for the
variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t and ν(~z′) = ~t′, and one of the following
conditions holds:
• (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1; or
• (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I).
Assume that (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1. This contradicts Lemma 4.4. Now,
assume that (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I). Since (~c, ~t′) ∉ QConsistent(I),
by Theorem 3.5, (~c, ~t′) ∉ consistentΣ(q′, I). Then, by Condition 1, we have that
(~c, ~t′) ∉ q′(M); contradiction.
In the next lemma, we show that the lowest upper bound of count(*) can be obtained
by executing q on the optimistic repair N . We also show that the query rewriting of q
correctly returns such bound.
Lemma 4.11. Let R be a schema. Let Σ be a set of integrity constraints, consisting
of one key dependency per relation of R. Let q(~z) be a query in Cforest of the following
form:
select ~z, count(*)
from q∗(~z)
group by ~z
where q∗(~z) is a query in Cforest.
Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots
of each tree of G. Let φ(~w, ~z) be the conjunction of literals of q∗. Let ~x = ⋃_{i=1...m} ~xi,
let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let q′(~x, ~z′) = ∃~w′.φ(~x, ~w′, ~z′). Let Q(~z, l, u) be the
first-order aggregate query returned by RewriteCount(q, Σ). Let I be an instance over
R. Let ~t be a tuple and low and up be a pair of real numbers.
Then, there is a repair N of I wrt Σ and a bag B such that B = q(N ), and the
following conditions hold:
1. for every valuation ν such that ν(~x) = ~c and ν(~z) = ~t, if ~c ∈ possibleΣ(q′[~z/~t], I),
then ~c ∈ q′[~z/~t](N ), and
2. if (~t, low, up) ∈ aggconsistentΣ(q, I), then |~t|B = up, and
3. if (~t, low, up) ∈ Q(I), then |~t|B = up.
Proof. Let N be the optimistic repair obtained by invoking the algorithm
BuildOptimisticRepair(q, Σ, I). Condition (1) holds by Lemma 4.6. We must now prove
Conditions (2) and (3).
In order to prove Condition 2, let ~t be a tuple, and low and up be real numbers such
that (~t, low, up) ∈ aggconsistentΣ(q, I). Then, there is a repair I of I wrt Σ and a bag
B′ such that B′ = q(I) and |~t|B′ = up. Furthermore, since N is a repair of I wrt Σ, by
Lemma 4.7, |~t|B ≤ up. Assume towards a contradiction that |~t|B < up. Then, there is a
valuation ν for the variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t and ν(~z′) = ~t′, and
one of the following conditions holds:
• (~c, ~t′) ∈ q′(I) and |(~c, ~t′)|B′ > 1; or
• (~c, ~t′) ∉ q′(N) and (~c, ~t′) ∈ q′(I).
Assume that (~c, ~t′) ∈ q′(I) and |(~c, ~t′)|B′ > 1. This contradicts Lemma 4.4. Now,
assume that (~c, ~t′) ∉ q′(N) and (~c, ~t′) ∈ q′(I). Then, ~c ∈ possibleΣ(q′[~z/~t], I). By
Condition 1, we have that ~c ∈ q′[~z/~t](N); contradiction.
In order to prove Condition 3, let ~t, low, and up be such that (~t, low, up) ∈ Q(I).
Since N is a repair of I, by Lemma 4.7, |~t|B ≤ up. Let ~x′ = ~x−~z. Let QConsistent(~x, ~z′)
be the query obtained by invoking RewriteForest(q′, Σ). Since (~t, low, up) ∈ Q(I), the
upper bound up of ~t is computed with the following query:
Let QLub(~z, up) = select ~z, count(*)
from q′(~x, ~z′) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~z
Assume towards a contradiction that |~t|B < up. Then, there is a valuation ν for the
variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t and ν(~z′) = ~t′, and either:
• (~c, ~t′) is accounted for more than once in the from clause of QLub; or
• (~c, ~t′) ∉ q′(N), (~c, ~t′) ∈ q′(I), and I |= (∃~x′.QConsistent[~z/~t]).
Assume that (~c, ~t′) is accounted for more than once in the from clause of QLub. This
is a contradiction since by definition the from clause of a first-order aggregate query is
computed using set semantics. Now, assume that (~c, ~t′) ∉ q′(N), (~c, ~t′) ∈ q′(I), and
I |= (∃~x′.QConsistent[~z/~t]). Since (~c, ~t′) ∈ q′(I), we have that ~c ∈ possibleΣ(q′[~z/~t], I).
Thus, by Condition 1, ~c ∈ q′[~z/~t](N); contradiction.
For the unary operators, the proof of tightness proceeds in an analogous way, except
that the optimistic and pessimistic repairs have to be modified to ensure every tuple has
the minimum (or maximum, depending on the case) for attribute u. We next show how
to obtain a pessimistic repair for queries with the sum operator.
Algorithm BuildPessimisticRepairForSum(q, I, M∗)
Input: A query q of the form
         select ~z, sum(u)
         from q∗(~z)
         group by ~z
       where q∗ is a conjunctive query in Cforest
       I, an instance
       M∗, a pessimistic repair
Output: M, a pessimistic repair
Initialize M as M∗
Let R(~x, ~y) be the literal of q where u appears
for each tuple R(~c, ~d) of M do
  Let ν be a valuation for the variables of R such that ν(~x) = ~c and ν(~y) = ~d
  for every valuation ν′ for the variables of R such that ν′(~x) = ~c′, ν′(~y) = ~d′,
      R(~c′, ~d′) ∈ I, and ν(z) = ν′(z) for every z such that z ≠ u do
    if ν′(u) < ν(u) then
      Replace R(~c, ~d) with R(~c′, ~d′) in M
    end if
  end for
end for
Notice in the algorithm that a tuple R(~c, ~d) is replaced only if there is another tuple
with the same values, except for the attribute u, and the other tuple has a smaller value
on u (condition ν ′(u) < ν(u) in the algorithm). In the rewriting for the lower bound of
the sum operator, this corresponds to the fact that for positive values we aggregate over
the minimum value of u for all tuples in the intermediate result. In contrast, for the upper
bound, we aggregate over the maximum value of u. Thus, for the upper bound, a similar
algorithm can be used, where we replace tuples for which the condition ν ′(u) > ν(u)
is satisfied. Since we choose the conditions that correspond to positive numbers in the
rewriting given in RewriteAgg, the tightness results for the sum operator need to restrict
the domain of the aggregated value to range over positive numbers (for min and max we
do not have this restriction). In Figure 4.4, we summarize the repairs that must be
modified in order to obtain the tight bounds of each aggregation function, and which
condition must be checked.
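The replacement step described above can be sketched in Python as follows. This is a simplified reading of BuildPessimisticRepairForSum under stated assumptions: only the relation R that carries u is considered, tuples are plain Python tuples, u is a non-key attribute at position u_index, and the starting repair is a subset of the instance. The function name adjust_repair_for_u and the flag pick_min are hypothetical.

```python
def adjust_repair_for_u(repair, instance, u_index, pick_min=True):
    """Swap each tuple of `repair` for the tuple of `instance` that agrees with it
    on every attribute except u and has the smallest (or largest) value of u.

    pick_min=True mirrors the condition nu'(u) < nu(u); pick_min=False mirrors
    nu'(u) > nu(u), the variant used for the other bound.
    """
    def rest(t):
        # All attributes of t except u.
        return t[:u_index] + t[u_index + 1:]

    choose = min if pick_min else max
    adjusted = set()
    for t in repair:
        # `t` itself is a candidate because the repair is a subset of the instance.
        candidates = [s for s in instance if rest(s) == rest(t)]
        adjusted.add(choose(candidates, key=lambda s: s[u_index]))
    return adjusted

# Hypothetical R(k, g, u) with key k; key 1 has two tuples differing only in u.
instance = {(1, "a", 5), (1, "a", 3), (2, "b", 4)}
m_star = {(1, "a", 5), (2, "b", 4)}
m = adjust_repair_for_u(m_star, instance, u_index=2)
print(sorted(m))  # [(1, 'a', 3), (2, 'b', 4)]
```

Because the swapped tuples agree on every attribute other than u, and u is assumed not to be part of the key, the result still contains exactly one tuple per key value and so remains a repair.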
The following lemma shows that the greatest lower bound computed for the sum
operator can be obtained from the pessimistic repair computed with the procedure given
above. We also show that our query rewriting correctly returns such bound.
Function   Bound   Repair        Condition
max        glb     pessimistic   ν′(u) < ν(u)
max        lub     optimistic    ν′(u) > ν(u)
sum        glb     pessimistic   ν′(u) < ν(u)
sum        lub     optimistic    ν′(u) > ν(u)
min        glb     optimistic    ν′(u) < ν(u)
min        lub     pessimistic   ν′(u) > ν(u)
Figure 4.4: Repairs that must be used to obtain the tight bounds of unary operators
Lemma 4.12. Let R be a schema. Let Σ be a set of integrity constraints, consisting of
one key dependency per relation of R. Let q(~z) be a query of the following form:
select ~z, sum(u)
from q∗(~z, u)
group by ~z
where q∗(~z, u) is a conjunctive query in Cforest and u ranges over the positive numbers.
Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at the roots
of each tree of G. Let φ(~w, ~z, u) be the conjunction of literals of q∗. Let ~x = ⋃_{i=1...m} ~xi,
let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let q′(~x, ~z′) = ∃~w′, u.φ(~x, ~w′, ~z′, u). Let Q(~z, l, u)
be the first-order aggregate query returned by RewriteAgg(q, Σ). Let I be an instance
over R. Let ~t be a tuple and low and up be a pair of real numbers. Let q′′(~x, ~z′, u) =
∃ ~w′.φ(~x, ~w′, ~z′, u).
Then, there is a repair M of I wrt Σ and some value d such that (~t, d) ∈ q(M), and
the following conditions hold:
1. for every valuation ν such that ν(~x) = ~c, ν(~z′) = ~t′, and ν(~z) = ~t, if (~c, ~t′) ∈ q′(M),
then ~c ∈ consistentΣ(q′[~z/~t], I), and
2. if (~t, low, up) ∈ aggconsistentΣ(q, I), then d = low, and
3. if (~t, low, up) ∈ Q(I), then d = low.
Proof. Let M∗ be the repair obtained by invoking the algorithm
BuildPessimisticRepair(q, Σ, I). Let M be the repair obtained by invoking the algorithm
BuildPessimisticRepairForSum(q, I, M∗). Condition (1) holds by Lemma 3.10. We must now
prove Conditions (2) and (3).
In order to prove Condition 2, let ~t be a tuple, and low, and up be a pair of real
numbers such that (~t, low, up) ∈ aggconsistentΣ(q, I). Then, there is a repair I of I
wrt Σ such that (~t, low) ∈ q(I). Furthermore, by Lemma 4.8, since M is a repair of I wrt
Σ, d ≥ low. Assume towards a contradiction that d > low. Let B = q′(M). Then, there
is a valuation ν for the variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t and ν(~z′) = ~t′,
and one of the following conditions holds:
• (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1; or
• there are e and e′ such that e > e′, (~c, ~t′, e) ∈ q′(M) and (~c, ~t′, e′) ∈ q′(I); or
• (~c, ~t′) ∈ q′[~z/~t](M) and (~c, ~t′) ∉ q′[~z/~t](I).
Assume that (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1. This contradicts Lemma 4.4. Now,
assume that there are e and e′ such that e > e′, (~c, ~t′, e) ∈ q′(M) and (~c, ~t′, e′) ∈
q′(I). Let ν′ and ν′′ be valuations such that for every w ≠ u, ν(w) = ν′(w) and
ν(w) = ν′′(w); ν′(u) = e; and ν′′(u) = e′. Since M is constructed using the algorithm
BuildPessimisticRepairForSum and I ⊆ I, ν′(u) < ν′′(u). Thus, e < e′;
contradiction. Finally, assume that (~c, ~t′) ∈ q′[~z/~t](M) and (~c, ~t′) ∉ q′[~z/~t](I). Then,
(~c, ~t′) ∉ consistentΣ(q′[~z/~t], I). By Condition 1, we have that (~c, ~t′) ∉ q′[~z/~t](M);
contradiction.
In order to prove Condition 3, let ~t, low, and up be such that (~t, low, up) ∈ Q(I).
Since M is a repair of I, by Lemma 4.8, d ≥ low. Let QConsistent(~x, ~z′) be the query
obtained by invoking RewriteForest(q′, Σ). Since u ranges only over positive numbers,
the lower bound low of ~t is computed with the following query:
QGlb(~z, glb) = select ~z, sum(v)
from QContribConsistent(~x, ~z′, v)
group by ~z
where QContribConsistent is the following query:
QContribConsistent(~x, ~z′, bottom) =
select ~x, ~z′, min(u) as bottom
from QConsistent(~x, ~z′) ∧ q′′(~x, ~z′, u)
group by ~x, ~z′
having bottom ≥ 0
Assume towards a contradiction that d > low. Then, there is a valuation ν for the
variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t and ν(~z′) = ~t′, and one of the following
conditions holds:
• (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1; or
• there are e and e′ such that e > e′, (~c, ~t′, e) ∈ q′(M) and
(~c, ~t′, e′) ∈ QContribConsistent(I); or
• (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I).
Assume that (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1. This contradicts Lemma 4.4. Now,
assume that there are e and e′ such that e > e′, (~c, ~t′, e) ∈ q′(M) and (~c, ~t′, e′) ∈
QContribConsistent(I). Since e′ is computed as min(u) in QContribConsistent,
and M ⊆ I, e′ < e; contradiction. Finally, assume that (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉
QConsistent(I). Since (~c, ~t′) ∉ QConsistent(I), by Theorem 3.5, we have that (~c, ~t′) ∉
consistentΣ(q′, I). Then, by Condition 1, we have that (~c, ~t′) ∉ q′(M); contradiction.
Notice that the proof above is similar to the one for Lemma 4.10, except that we need
to account for the fact that each tuple may contribute a value greater than one. A proof
similar to Lemma 4.11 can be given for the lowest upper bound.
4.3.5 Putting It All Together
The next lemma states the correctness of the algorithm RewriteCount. The correctness
for the unary operators can be obtained analogously by employing the optimistic and
pessimistic repairs as shown in Figure 4.4.
Lemma 4.13. Let R be a schema. Let Σ be a set of integrity constraints, consisting
of one key dependency per relation of R. Let q(~z) be a query in Cforest of the following
form:
select ~z, count(*)
from q∗(~w, ~z)
group by ~z
Let Q(~z, l, u) be the first-order aggregate query returned by RewriteCount(q, Σ). Let
I be an instance over R. Then, for every tuple ~t, and pair of real numbers low and up,
we have that (~t, low, up) ∈ aggconsistentΣ(q, I) iff (~t, low, up) ∈ Q(I).
Proof. Let G be the join graph of q. Let R1(~x1, ~y1), . . . , Rm(~xm, ~ym) be the literals at
the roots of all trees of G. Let φ(~w, ~z) be the conjunction of literals of q∗. Following the
algorithm RewriteCount, let ~x = ⋃_{i=1...m} ~xi, let ~z′ = ~z − ~x, and let ~w′ = ~w − ~x. Let
~x′ = ~x − ~z. Let q′(~x, ~z′) = ∃~w′.φ(~x, ~w′, ~z′). Let QConsistent(~x, ~z′) be the query obtained
by invoking RewriteForest(q′, Σ).
(⇒) Let ~t be a tuple and low and up be real numbers such that (~t, low, up) ∈
aggconsistentΣ(q, I). By Lemma 4.10, there is a “pessimistic” repair M of I wrt Σ
and a bag B such that B = q(M), and the following conditions hold:
1. for every valuation ν such that ν(~x) = ~c, ν(~z′) = ~t′, and ν(~z) = ~t, if (~c, ~t′) ∈ q′(M),
then ~c ∈ consistentΣ(q′[~z/~t], I), and
2. if (~t, low, up) ∈ aggconsistentΣ(q, I), then |~t|B = low.
Assume towards a contradiction that (~t, low, up) ∉ Q(I). Let low′ be a value computed as follows:
QGlb(~z, low′) = select ~z, count(*)
from QConsistent(~x, ~z′)
group by ~z
Assume that low′ < low. Then, there is a valuation ν for the variables of ~x and ~z
such that ν(~x) = ~c, ν(~z) = ~t, ν(~z′) = ~t′, and one of the following conditions holds:
• (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1; or
• (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I).
Assume that (~c, ~t′) ∈ q′(M) and |(~c, ~t′)|B > 1. This contradicts Lemma 4.4. Now,
assume that (~c, ~t′) ∈ q′(M) and (~c, ~t′) ∉ QConsistent(I). By Theorem 3.5, (~c, ~t′) ∉
consistentΣ(q′, I). By Condition 1 above, (~c, ~t′) ∉ q′(M); contradiction.
Assume towards a contradiction that low′ > low. Then, there is a valuation ν for
the variables of ~x and ~z such that ν(~x) = ~c, ν(~z) = ~t, ν(~z′) = ~t′, (~c, ~t′) ∉ q′(M) and
(~c, ~t′) ∈ QConsistent(I). Since (~c, ~t′) ∈ QConsistent(I), by Theorem 3.5, (~c, ~t′) ∈
consistentΣ(q′, I). Then, since M is a repair of I wrt Σ, we have that (~c, ~t′) ∈ q′(M);
contradiction.
By Lemma 4.11, there is an “optimistic” repair N of I wrt Σ and a bag B such that
B = q(N ), and the following conditions hold:
1. for every valuation ν such that ν(~x) = ~c and ν(~z) = ~t, if ~c ∈ possibleΣ(q′[~z/~t], I),
then ~c ∈ q′[~z/~t](N ), and
2. if (~t, low, up) ∈ aggconsistentΣ(q, I), then |~t|B = up.
Assume towards a contradiction that (~t, low, up) ∉ Q(I). Let up′ be a value computed as follows:
QLub(~z, up′) = select ~z, count(*)
from q′(~x, ~z′) ∧ (∃~x′.QConsistent(~x, ~z′))
group by ~z
Assume that up′ > up. Then, there is a valuation ν for the variables of ~x and
~z such that ν(~x) = ~c, ν(~z) = ~t, ν(~z′) = ~t′, (~c, ~t′) ∉ q′(N ), (~c, ~t′) ∈ q′(I), and I |=
∃~x′.QConsistent(~x, ~t′). Since (~c, ~t′) ∈ q′(I), (~c, ~t′) ∈ possibleΣ(q′, I). Thus, by Lemma
4.6, (~c, ~t′) ∈ q′(N ); contradiction.
Assume that up′ < up. Then, there is a valuation ν for the variables of ~x and ~z such
that ν(~x) = ~c, ν(~z) = ~t, ν(~z′) = ~t′, and one of the following two cases holds. First,
(~c, ~t′) ∈ q′(N ) and |(~c, ~t′)|B > 1. But this contradicts Lemma 4.4. Second, (~c, ~t′) ∈ q′(N )
and either (1) (~c, ~t′) ∉ q′(I), or (2) I ⊭ ∃~x′.QConsistent(~x, ~t′). Assume that (1) (~c, ~t′) ∉
q′(I). Since N is a repair of I wrt Σ, N ⊆ I. Thus, (~c, ~t′) ∉ q′(N ); contradiction.
Assume that (2) I ⊭ ∃~x′.QConsistent(~x, ~t′). Recall that ~x′ = ~x − ~z. By Theorem 3.5,
(~c′, ~t′) ∉ consistentΣ(q′, I), for every ~c′. In particular, (~c, ~t′) ∉ consistentΣ(q′, I). Thus,
(~c, ~t′) ∉ q′(N ); contradiction.
(⇐) Let ~t be a tuple and low and up be real numbers such that (~t, low, up) ∈ Q(I). In
order to prove that (~t, low, up) ∈ aggconsistentΣ(q, I), we must show that:
1. For every repair I of I wrt Σ, if B = q(I), then low ≤ |~t|B ≤ up.
2. There is a repair I of I wrt Σ, and a bag B such that B = q(I) and |~t|B = low.
3. There is a repair I of I wrt Σ, and a bag B such that B = q(I) and |~t|B = up.
Claim 1 follows by Lemma 4.7. Claim 2 follows by Lemma 4.10. Claim 3 follows by
Lemma 4.11.
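The range semantics that Lemma 4.13 refers to can be illustrated on a toy instance. The following Python sketch is not the RewriteCount algorithm; it enumerates every repair by brute force (exponential in the number of key violations) and reports, for each group, the smallest and largest value of count(*) across repairs. The function names, the single-relation setting, and the choice to report only groups that appear in every repair are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import product


def repairs(relation, key_len):
    """Yield every repair of a relation under one key constraint.

    A repair keeps exactly one tuple from each set of tuples that agree on
    the key, taken here to be the first key_len positions (an assumption).
    """
    groups = defaultdict(list)
    for tup in sorted(set(relation)):
        groups[tup[:key_len]].append(tup)
    for choice in product(*groups.values()):
        yield list(choice)


def count_ranges(relation, key_len, group_pos):
    """Brute-force (low, up) bounds of count(*) per group over all repairs.

    Only group values that occur in every repair are reported.
    """
    per_repair = [Counter(t[group_pos] for t in r)
                  for r in repairs(relation, key_len)]
    always_present = set(per_repair[0])
    for counts in per_repair[1:]:
        always_present &= set(counts)
    return {g: (min(c[g] for c in per_repair), max(c[g] for c in per_repair))
            for g in always_present}


# Hypothetical relation Emp(name, dept) with key "name"; bob violates the key.
emp = [("ann", "sales"), ("bob", "sales"), ("bob", "hr"), ("cid", "hr")]
print(count_ranges(emp, key_len=1, group_pos=1))
```

On this instance there are two repairs, and both groups receive the range [1, 2]. A rewriting such as the one produced by RewriteCount obtains the same bounds without enumerating repairs.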
4.4 Related Work
Our work on aggregation is inspired by Arenas et al. [ABC+03b], who were the first to
propose the use of ranges in a semantics for consistent query answering. The work of
Arenas et al. is restricted to queries of the following form:
select F (A)
from r
where F is an aggregation function, r is a single relation, and A is an attribute from
r. Notice that such queries have no grouping and no selection or join conditions (i.e., no
where clause). In this chapter, we consider a much richer class of queries. For the class
of queries considered by Arenas et al., the semantics proposed in their paper and our
semantics for aggregate queries coincide. However, we need to extend their semantics in
order to be able to deal with queries that perform grouping.
In their paper, Arenas et al. [ABC+03b] consider functional dependencies. If there
is exactly one functional dependency on the (only) relation of the query, they show that
the problem of obtaining the lowest upper and greatest lower bounds is tractable for the
count(*), min, max, sum, and avg functions. Except for avg, we considered all these
functions in our class Caggforest. Arenas et al. also show the intractability of queries with
the count(distinct) operator and exactly one functional dependency. If the relation
of the query has more than one functional dependency, they show that the problem
of obtaining tight bounds is intractable for all the aggregate functions they consider
(count(*), min, max, sum, avg, and count(distinct)). This gives further evidence of
the maximality of the class considered in this chapter: going from one to two functional
dependencies may lead to intractability even for queries on just one relation and with no
grouping.
Chapter 5
Complexity-Theoretic Analysis
In the previous chapters, we presented query rewriting algorithms that work on a broad
class of queries. In this chapter, we show the maximality of this class based on complexity-
theoretic arguments. In Section 5.1, we show that minimal relaxations of the conditions of
the class lead to intractability. Then, in Section 5.2, we embark on a more ambitious goal:
for a large class of conjunctive queries, we show that the conditions of the class Cforest
presented in Chapter 3 are not only sufficient, but they are also necessary conditions for
a query to be first-order rewritable.
5.1 Minimal Relaxations of Cforest
In this section, we show that minimal relaxations of the conditions of Cforest lead to
intractability. In particular, we show the intractability of the problem of computing
consistent answers for: (1) a conjunctive query whose join graph is a cycle of length
two; and (2) a conjunctive query whose join graph is a forest, but the query has some
nonkey-to-key joins that are not full.
Chomicki and Marcinkowski [CM05] proved that the problem of computing consistent
answers for a query with a single nonkey-to-nonkey join is coNP-complete. Their result
used a query with repeated relation symbols (specifically, a query with only two literals
both for a single relation R). We can use their insight to show that the problem of
computing consistent answers for the following query without repeated relation symbols,
but with a single nonkey-to-nonkey join is also coNP-complete.
qnk = ∃x, x′, y.S1(x, y) ∧ S2(x′, y)
Notice that qnk has a cycle of length two (actually, a nonkey-to-nonkey join), and
no nonkey-to-key joins. Our proof of hardness is a simple modification to the re-
sults of Chomicki and Marcinkowski [CM05] and uses a reduction from the problem
MONOTONE-3SAT, which is well known to be NP-complete. The only difference between
the MONOTONE-3SAT and 3SAT problems is that the former assumes that the input 3CNF
propositional formula is monotone. That is, each clause Φi contains either positive or
negative atoms, but not both. We shall say that a clause that contains only positive
(negative) atoms is a positive (negative) clause.
Lemma 5.1. Let q be the query ∃x, x′, y.S1(x, y)∧ S2(x′, y). Then, CONSISTENT(q, Σ) is
coNP-hard.
Proof. We will prove hardness by reduction from MONOTONE-3SAT. Let Φ = Φ1∧ · · · ∧Φm
be a 3CNF formula such that each clause Φi contains either positive or negative atoms,
but not both. We shall build an instance I as follows:
• For each positive clause Φi and each atom z that occurs in Φi, we add a tuple
S1(i, z) to I.
• For each negative clause Φi and each atom z that occurs in Φi, we add a tuple
S2(i, z) to I.
We now show that consistentΣ(q, I) = false iff Φ is satisfiable.
(⇒) Since consistentΣ(q, I) = false, there exists a repair I of I such that I ⊭ q.
We now build a valuation v for the variables of Φ as follows. For each variable z, we let
v(z) = true if there is some i such that S1(i, z) ∈ I; and we let v(z) = false if there is
some i such that S2(i, z) ∈ I. It is easy to see that v is a truth valuation that satisfies
Φ.
(⇐) Assume that Φ is satisfiable. Let v be a truth assignment for the variables of Φ that satisfies Φ.
We shall build a repair I as follows. For each positive clause Φi, select a variable z that
appears in Φi and such that v(z) = true. Let S1(i, z) ∈ I. For each negative clause Φi,
select a variable z that appears in Φi and such that v(z) = false. Let S2(i, z) ∈ I. It is
easy to see that I ⊭ q.
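The reduction in the proof of Lemma 5.1 can be exercised on small formulas. The Python sketch below is illustrative only: it builds S1 and S2 from a monotone formula as described above, enumerates repairs by brute force, and compares the result with a brute-force satisfiability test. The encoding of a clause as a pair (is_positive, variables) and all function names are assumptions, and the first attribute of each relation is taken to be its key.

```python
from itertools import product


def build_instance(clauses):
    """Map clause i with atom z to S1(i, z) if positive, S2(i, z) if negative."""
    s1 = {(i, z) for i, (pos, vs) in enumerate(clauses) if pos for z in vs}
    s2 = {(i, z) for i, (pos, vs) in enumerate(clauses) if not pos for z in vs}
    return s1, s2


def repairs(relation):
    """Yield all repairs of a binary relation whose first attribute is the key."""
    groups = {}
    for k, v in sorted(relation):
        groups.setdefault(k, []).append((k, v))
    for choice in product(*groups.values()):
        yield set(choice)


def q_holds(s1, s2):
    """q = exists x, x', y . S1(x, y) and S2(x', y): a shared second value."""
    return bool({y for _, y in s1} & {y for _, y in s2})


def consistent_answer(s1, s2):
    """True iff q holds in every repair of the instance."""
    return all(q_holds(r1, r2) for r1 in repairs(s1) for r2 in repairs(s2))


def satisfiable(clauses):
    """Brute-force satisfiability test for a monotone CNF formula."""
    variables = sorted({z for _, vs in clauses for z in vs})
    for bits in product([False, True], repeat=len(variables)):
        v = dict(zip(variables, bits))
        if all(any(v[z] == pos for z in vs) for pos, vs in clauses):
            return True
    return False


# (a or b or c) and (not a or not b or not d)
phi = [(True, ["a", "b", "c"]), (False, ["a", "b", "d"])]
s1, s2 = build_instance(phi)
print(consistent_answer(s1, s2), satisfiable(phi))
```

For every formula, the two printed values should be negations of each other, which is the statement "consistentΣ(q, I) = false iff Φ is satisfiable".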
Now, we show the intractability of the problem for a conjunctive query whose join
graph is a forest, but the query has nonkey-to-key joins that are not full. In particular,
Recall that the problem of computing consistent answers is intractable for the query
qnk = ∃x, x′, y.R1(x, y)∧R2(x′, y). Notice that qnk and q have exactly the same join graph.
The only difference between them is that in qnk, the two literals are related exclusively
by a nonkey-to-nonkey join; whereas in q, they are related by both a key-to-key and a
nonkey-to-nonkey join. Our intuition is that a query with a cyclic join graph may be
tractable only if there are literals related by more than one type of join (e.g., nonkey-
to-nonkey and key-to-key). We formalize this intuition with the definition of a class C∗,
which essentially “separates” the different types of joins of the query. In C∗, every pair of
literals can be related by at most one type of join (i.e., key-to-key, nonkey-to-nonkey,
and nonkey-to-key).
Definition 5.3. Let q be a conjunctive query without repeated relation symbols and all
of whose nonkey-to-key joins are full. We say that q is in class C∗ if for every pair R
and R′ of literals of q at most one of the following conditions holds:
• there is a key-to-key join between R and R′.
• there is a nonkey-to-nonkey join between R and R′.
• there are literals R1 . . . Rm in q such that there is a nonkey-to-key join from R to
R1, from Rm to R′, and from Ri to Ri+1, for every i such that 1 ≤ i < m.
Notice that C∗ is a fairly broad class of queries. For example, it includes the class
of queries that have exclusively nonkey-to-key joins. In general, the only queries that
are outside C∗ are the ones that have a pair of literals related by more than one type of
join. As anecdotal evidence of the practicality of the class, the only query in the TPC-H
benchmark [TPC03] that has nonkey-to-nonkey joins (Query 5) is in C∗. From the results
of this chapter, we can immediately conclude that the problem of computing consistent
answers for this query is not first-order rewritable.
We will consider a class, called Chard, of all queries of C∗ that are not in Cforest. The
main result of this chapter, Theorem 5.5, proves that the problem of computing the
consistent answers for every query of Chard is coNP-complete.
Definition 5.4. We say that a query q is in class Chard if q ∈ C∗ and q ∉ Cforest.
Theorem 5.5. Let q be a query such that q ∈ Chard. Then, CONSISTENT(q, Σ) is coNP-
complete in data complexity.
Our motivation to provide a dichotomy for C∗ is the following. First, for a fairly broad
class of queries we can test in polynomial time if the problem of computing consistent
answers is tractable. Second, our results are an initial step towards proving a dichotomy
for the larger class of all conjunctive queries. Indeed, as a result of our work, future
efforts for finding dichotomy results for conjunctive queries need to focus only on queries
whose literals are related by more than one type of join.¹
In general, by Ladner’s Theorem [Lad75], there are classes of coNP problems for
which there is no dichotomy between P and coNP-complete problems. However, this
is not the case for the class of queries that is the focus of this section. In fact, as a
corollary of Theorems 3.5 and 5.5, we get a dichotomy between membership in P and
coNP-completeness. Notice that, given a query q such that q ∈ C∗, it can be decided in
polynomial time on which side of the dichotomy the query q falls.
Corollary 5.6. Let q be a query such that q ∈ C∗. Then, CONSISTENT(q, Σ) is either in
P, or it is coNP-complete.
Under a complexity-theoretic assumption, we also get a dichotomy between first-order
rewritability and first-order inexpressibility for the class C∗. That is, for all the queries
of C∗ that are not in Chard, we can produce a first-order rewriting using our algorithm
¹ Since C∗ intersects, but does not contain, Cforest, we know that there are queries outside C∗ for which the problem of computing consistent answers is tractable.
RewriteForest. For the queries of Chard, since the problem of obtaining consistent an-
swers is coNP-complete, there is no first-order rewriting, unless P=NP (which is unlikely).
Corollary 5.7. Let q be a query such that q ∈ C∗. Assuming P ≠ NP, the problem
CONSISTENT(q, Σ) is first-order rewritable iff q ∈ Cforest.
Tractable but not First-Order Rewritable Queries
An interesting question is whether there are queries for which the problem of computing
consistent answers is tractable, yet not first-order rewritable. Although this remains
open for conjunctive queries without inequalities, we now show that there are tractable
conjunctive queries with inequalities that are not first-order rewritable.
Consider a schema with one binary relation R(E, S). Assume that E is the key of
the relation. Consider the following query q:
q = ∃e1, e2, s : R(e1, s) ∧ R(e2, s) ∧ e1 ≠ e2
In order to find the consistent answers for q, we construct a graph of the inconsistent
database instance as follows.² Let I be a database instance with one binary relation
R(E, S). The graph G of I is a bipartite graph with partitions E and S. Partitions
E and S have one vertex for each value in the active domain of attributes E and S,
respectively. The set of edges of G consists of all tuples (e, s) of R.
We use the graph of I to introduce the following necessary and sufficient condition
for consistentΣ(q, I) = false.
Lemma 5.8. Let I be a database with one binary relation R(E, S), possibly inconsistent
wrt a functional dependency Σ = {E → S}. Then, consistentΣ(q, I) = false iff the
graph G of I has a perfect matching.
Proof. (⇐) Assume that G has a perfect matching M . We can build an instance I by
creating a tuple in I for each edge in M . Since M is a matching, each vertex from
partition S is incident to at most one edge. Therefore, I ⊭ q. Also, since the matching
is perfect, every key appears in I. Consequently, I is minimal, and therefore it is a repair
of I wrt Σ.
² Notice that, unlike the join graph of a query, this graph is constructed from a database instance, not a query.
(⇒) Assume that consistentΣ(q, I) = false. Then, there must exist a repair I of I
wrt Σ such that I ⊭ q. We can construct a graph G′ by selecting the edges of G that
correspond to tuples of I. It is easy to see that G′ is a perfect matching of G.
There are a number of algorithms in the literature for deciding the existence of a
perfect bipartite matching. For example, one of the best known is given by Hopcroft and
Karp [HK75], and runs in O(n^2.5) time. Therefore, q is a tractable query. We now show
that no approach based on query-rewriting works for q.
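The characterization in Lemma 5.8 can be checked on small instances. The following Python sketch is illustrative and not part of the original development: it compares a brute-force enumeration of repairs with an augmenting-path matching test. It assumes that "perfect matching" is read as a matching that covers every vertex of partition E (one tuple per key value), and all function names are hypothetical. The matching routine is the simple textbook augmenting-path method, not the Hopcroft-Karp algorithm.

```python
from itertools import product


def has_key_saturating_matching(relation):
    """Return True iff every E-value can be matched to a distinct S-value."""
    adj = {}
    for e, s in relation:
        adj.setdefault(e, []).append(s)
    match = {}  # S-value -> E-value currently matched to it

    def augment(e, seen):
        # Try to match e, re-routing previously matched E-values if needed.
        for s in adj[e]:
            if s in seen:
                continue
            seen.add(s)
            if s not in match or augment(match[s], seen):
                match[s] = e
                return True
        return False

    return all(augment(e, set()) for e in adj)


def consistent_answer_bruteforce(relation):
    """True iff q (two distinct keys sharing an S-value) holds in every repair."""
    groups = {}
    for e, s in relation:
        groups.setdefault(e, []).append(s)
    for choice in product(*groups.values()):
        if len(set(choice)) == len(choice):  # all chosen S-values distinct
            return False                     # this repair falsifies q
    return True


r = {(1, "a"), (1, "b"), (2, "a")}
print(consistent_answer_bruteforce(r), has_key_saturating_matching(r))
```

Under the stated reading of the lemma, the two functions should always return opposite values; the brute-force version is exponential, while the matching test runs in polynomial time.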
Theorem 5.9. There is no first-order rewriting Q of q such that consistentΣ(q, I) =
Q(I) for every instance I.
Proof. Let A1, . . . , An be a family of sets. A system of distinct representatives
[Ost70] of A1, . . . , An is a sequence of n distinct elements a1, . . . , an with
ai ∈ Ai, 1 ≤ i ≤ n. Let R be a binary relation that encodes A1, . . . , An as follows:
R(i, x) iff x ∈ Ai. Let G be the graph of R as constructed above. Clearly, G has a
perfect matching iff A1, ..., An has a system of distinct representatives. By Lemma 5.8,
consistentΣ(q, I) = false iff G has a perfect matching.
Let I be the database instance that consists of relation R. Assume that there is
a first-order query Q such that I ⊭ Q iff consistentΣ(q, I) = false. Then, Q can
test whether A1, ..., An has a system of distinct representatives. But it is known in the
literature [LW95] that relational algebra, with an appropriate encoding of sets, cannot
test whether a family of sets has a system of distinct representatives; contradiction.
5.2.2 Basic Intractable Cases
The intractability of all queries in Chard will be shown as follows. First, we show in
Lemma 5.10 that the problem of computing consistent answers for conjunctive queries
is in coNP. This is a result known in the literature, but we briefly give a proof for our
setting. For hardness, we will use a reduction from the problem of computing consistent
answers for one of two particular queries to the problem of computing consistent answers
for q. One of these specific queries is the query qnk = ∃x, x′, y.S1(x, y) ∧ S2(x′, y). This
query has a nonkey-to-nonkey join, and was shown to be intractable in Lemma 5.1. The
other query has a cycle of nonkey-to-key joins, and is shown to be intractable in Lemma
5.11.
The next lemma shows that the problem of computing consistent answers for con-
junctive queries is in coNP.
Lemma 5.10. Let q be a conjunctive query. The problem CONSISTENT(q, Σ) is in coNP.
Proof. Let I be an instance. In order to decide whether ~t ∉ consistentΣ(q, I), it suffices
to show a repair I of I such that I ⊭ q[~t]. The size of I is polynomially bounded by the
size of I. In particular, by Proposition 3.6, I ⊆ I. Furthermore, I ⊭ q[~t] can be checked
in polynomial time, since q is a conjunctive query.
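The certificate check behind Lemma 5.10 can be sketched in Python. The code below is a minimal illustration under stated assumptions, not the thesis's formalization: an instance is a dictionary from relation names to sets of tuples, the key of each relation is a prefix of its attributes, a Boolean conjunctive query is a list of atoms (relation name, tuple of variable names), and all function names are hypothetical. A repair is checked as a subset of the instance that keeps exactly one tuple per key value.

```python
def is_repair(instance, candidate, key_len):
    """Check that candidate keeps exactly one tuple per key value of instance."""
    for rel, tuples in instance.items():
        kept = candidate.get(rel, set())
        if not kept <= tuples:
            return False                    # not a subset of the instance
        k = key_len[rel]
        keys_kept = [t[:k] for t in kept]
        if len(keys_kept) != len(set(keys_kept)):
            return False                    # key constraint violated
        if set(keys_kept) != {t[:k] for t in tuples}:
            return False                    # some key value was dropped
    return True


def satisfies(instance, atoms, binding=None):
    """Backtracking evaluation of a Boolean conjunctive query."""
    binding = binding or {}
    if not atoms:
        return True
    (rel, variables), rest = atoms[0], atoms[1:]
    for tup in instance.get(rel, ()):
        new = dict(binding)
        # Bind each variable, or check agreement with an earlier binding.
        if all(new.setdefault(v, c) == c for v, c in zip(variables, tup)):
            if satisfies(instance, rest, new):
                return True
    return False


def is_counterexample(instance, candidate, key_len, atoms):
    """A valid certificate: a repair of instance that falsifies the query."""
    return (is_repair(instance, candidate, key_len)
            and not satisfies(candidate, atoms))


inst = {"T1": {(1, 2), (1, 3)}, "T2": {(2, 1)}}
keys = {"T1": 1, "T2": 1}
q = [("T1", ("x", "y")), ("T2", ("y", "x"))]
print(is_counterexample(inst, {"T1": {(1, 3)}, "T2": {(2, 1)}}, keys, q))
```

For a fixed query, the backtracking evaluation examines at most |I|^k combinations of tuples, where k is the number of atoms, so the check is polynomial in the size of the data.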
In the next lemma, we show the coNP hardness of computing consistent answers for
one of the two particular queries that will be used in Lemma 5.14. The coNP hardness
of the other query was proven in Lemma 5.1.
Lemma 5.11. Let q = ∃x, y.T1(x, y) ∧ T2(y, x). Then, the problem CONSISTENT(q, Σ) is
coNP-hard.
Proof. We will prove hardness by reduction from MONOTONE-3SAT. Let Φ = Φ1∧ · · · ∧Φm
be a monotone 3CNF formula. We shall build an instance I as follows:
• For each atom z, let Φi1 , . . . , Φin be the positive clauses where z occurs. Add tuples
T1(⟨Φi1 , . . . , Φin⟩, z) and T2(z, ⟨Φi1 , . . . , Φin⟩) to I.
• For each atom z, let Φi1 , . . . , Φin be the negative clauses where z occurs. Add tuples
T1(⟨Φi1 , . . . , Φin⟩, z) and T2(z, ⟨Φi1 , . . . , Φin⟩) to I.
We now show that consistentΣ(q, I) = false iff Φ is satisfiable.
(⇒) Since consistentΣ(q, I) = false, there exists a repair I of I such that I ⊭ q.
Assume towards a contradiction that there are tuples T1(c, z) ∈ I and T1(c′, z) ∈ I such
that c ≠ c′. By construction of I, if T2(z, d) ∈ I, then d = c or d = c′. By Propositions
3.6 and 3.7, either T2(z, c) ∈ I or T2(z, c′) ∈ I. Thus, I |= q; contradiction.
We now build a valuation v for the variables of Φ as follows. For each variable z,
we let v(z) = true if there is some c such that T1(c, z) ∈ I and c is a list of positive
clauses; and we let v(z) = false if there is some c such that T1(c, z) ∈ I, and c is a list
of negative clauses. It is easy to see that v is a truth valuation that satisfies Φ.
(⇐) Assume that Φ is satisfiable. Let v be a truth assignment for the variables of Φ that satisfies Φ.
We shall build a repair I as follows. For each positive clause Φi, select a variable z that
appears in Φi and such that v(z) = true. Add T1(c, z) to I, where c is a list of positive
clauses. For each negative clause Φi, select a variable z that appears in Φi and such that
v(z) = false. Add T1(c, z) to I, where c is a list of negative clauses. For each variable
z, if v(z) = false, add T2(z, c) to I, where c is a list of positive clauses; if v(z) = true,
add T2(z, c) to I, where c is a list of negative clauses. It is easy to see that I ⊭ q.
We now give some auxiliary results before proving Lemma 5.14. The next lemma
generalizes Lemma 5.11 from cycles of length two to the case of cycles of arbitrary length.
Lemma 5.12. Let q be the query ∃w1, . . . , wm.S1(wm, w1) ∧ S2(w1, w2) ∧ · · · ∧ Sm(wm−1, wm).
Let q′ = ∃x, y.T1(x, y) ∧ T2(y, x). Then, there is a polynomial-time reduction from the
problem CONSISTENT(q′, Σ′) to the problem CONSISTENT(q, Σ).
Proof. Let I ′ be an instance over the schema of q′. We shall build an instance I over the
schema of q as follows:
for each valuation νq′ for the variables of q′ such that I ′ |= T1(x, y) ∧ T2(y, x)[νq′ ] do
Let νq(wm) = νq′(x)
Let νq(w1) = νq′(y)
Create a new constant cnew
for i := 2 to m− 1 do
Let νq(wi) = cnew
end for
Add the tuples of S1(wm, w1) ∧ S2(w1, w2) ∧ · · · ∧ Sm(wm−1, wm)[νq] to I
end for
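The construction above can be sketched in Python as follows. This is an illustrative rendering under assumptions, not a definitive implementation: relations are sets of pairs, the output is a dictionary indexed by i for Si, fresh constants are modeled as tagged tuples ("new", n), and the function name lengthen_cycle is hypothetical.

```python
from itertools import count


def lengthen_cycle(t1, t2, m):
    """Build an instance over S1..Sm from an instance over T1, T2 (m >= 2)."""
    fresh = count()
    out = {i: set() for i in range(1, m + 1)}
    for (x, y) in sorted(t1):
        if (y, x) not in t2:
            continue  # this valuation does not satisfy T1(x, y) and T2(y, x)
        c_new = ("new", next(fresh))  # one fresh constant per valuation
        w = {m: x, 1: y}
        for i in range(2, m):
            w[i] = c_new
        out[1].add((w[m], w[1]))          # S1(wm, w1)
        for i in range(2, m + 1):
            out[i].add((w[i - 1], w[i]))  # Si(w(i-1), wi)
    return out


t1 = {(1, 2), (3, 4)}
t2 = {(2, 1)}
print(lengthen_cycle(t1, t2, m=4))
```

In this example only the pair (1, 2) has a matching T2 tuple, so the output contains a single cycle of length four that passes through one fresh constant. Sorting t1 is only for reproducible numbering of fresh constants and assumes the values are mutually comparable.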
We claim that consistentΣ(q′, I ′) = true iff consistentΣ(q, I) = true.
(⇒) Let I be a repair of I over the schema of q. We shall build a repair I ′ over the
schema of q′ as follows:
for each tuple S1(cm, c1) of I do
Add a tuple T1(cm, c1) to I ′
for each cnew such that S2(c1, cnew) ∈ I and Sm(cnew, cm) ∈ I do
Add a tuple T2(c1, cm) to I ′
end for
end for
Since consistentΣ(q′, I ′) = true, I ′ |= q′. Thus, there is a valuation νq′ such that
I ′ |= T1(x, y) ∧ T2(y, x)[νq′ ]. Let cm = νq′(x), c1 = νq′(y). Since T2(c1, cm) ∈ I ′, there
exists cnew such that S2(c1, cnew) ∈ I and Sm(cnew, cm) ∈ I. Let νq be a valuation for the
variables of q such that:
• νq(wm) = cm
• νq(w1) = c1
• νq(wi) = cnew, for 1 < i < m
Since T1(cm, c1) ∈ I ′, S1(cm, c1) ∈ I. By construction of νq, S2(c1, cnew) ∈ I and
Sm(cnew, cm) ∈ I. For 2 < i ≤ m, notice that by construction of I, there are no tuples
Si(ci, di) and Si(ci, d′i) in I such that di ≠ d′i. Therefore, by Propositions 3.6 and 3.7,
every tuple in the extension of Si in I appears in the extension of Si in I. By construction
of I, Si(cnew, cnew) ∈ I, for 3 ≤ i ≤ m − 1. Thus, Si(cnew, cnew) ∈ I. We conclude that
I |= S1(wm, w1) ∧ S2(w1, w2) ∧ · · · ∧ Sm(wm−1, wm)[νq]. Thus, I |= q.
(⇐) Let I ′ be a repair of I ′. We shall build an instance I as follows.
for each tuple T1(cm, c1) of I ′ do
Add a tuple S1(cm, c1) to I
Let cnew be a constant such that S2(c1, cnew) ∈ I and Sm(cnew, cm) ∈ I
Add a tuple S2(c1, cnew) to I
for i := 3 to m − 1 do
Add a tuple Si(cnew, cnew) to I
end for
Add a tuple Sm(cnew, cm) to I
end for
It is easy to see that I is a repair of I. Since consistentΣ(q, I) = true, I |= q.
Thus, there exists some valuation νq such that I |= S1(wm, w1) ∧ S2(w1, w2) ∧ · · · ∧
Sm(wm−1, wm)[νq]. Let νq′ be such that:
• νq′(x) = νq(wm)
• νq′(y) = νq(w1)
It is easy to see that I ′ |= T1(x, y) ∧ T2(y, x)[νq′ ]. Thus, I ′ |= q′.
5.2.3 Generalizing the Basic Cases
Our strategy for proving the dichotomy will be to show that if q has a subquery q′ that
is known to be intractable (in particular, a cycle), then q is not tractable. This does not
hold in general, but as we show with the next auxiliary result, it holds for the queries in
C∗.
Lemma 5.13. Let q be a Boolean query such that q ∈ C∗. Let R1(~x1, ~y1), . . . ,
Rn(~xn, ~yn) be the literals of q. Let q′ be a Boolean query. Let S1(x1, y1), . . . ,
Sm(xm, ym) be the literals of q′, where m ≤ n. Assume that the join graph of q′ is a cycle.
Let L = {x1, y1, . . . , xm, ym}. Assume that:
• xi occurs in ~xi, for 1 ≤ i ≤ m, and
• yi occurs in ~yi, for 1 ≤ i ≤ m, and
• for 1 ≤ i ≤ m, if w ∈ L and w occurs in Ri, then w occurs in Si.
Then, there is a polynomial-time reduction from the problem CONSISTENT(q′, Σ′) to
CONSISTENT(q, Σ).
Proof. Let F = {w : w occurs in Ri, and 1 ≤ i ≤ m} − L. Let U = {w : w occurs in q} − F − L.
Let I ′ be an instance over the schema of q′. We shall build an instance I over the
schema of q as follows:
for each variable w such that w ∈ F do
Create a new constant cnew
Let νF (w) = cnew
end for
for each valuation νq′ for the variables of q′ such that I ′ |= S1(x1, y1) ∧ · · · ∧ Sm(xm, ym)[νq′ ] do
for each variable w such that w ∈ F do
Let νq(w) = νF (w)
end for
for each variable w such that w ∈ U do
Create a new constant cnew
Let νq(w) = cnew
end for
for i := 1 to m do
Let νq(xi) = νq′(xi)
Let νq(yi) = νq′(yi)
end for
Add the tuples of R1(~x1, ~y1) ∧ · · · ∧Rn(~xn, ~yn)[νq] to I
end for
We claim that consistentΣ(q′, I ′) = true iff consistentΣ(q, I) = true.
(⇒) Let I be a repair of I over the schema of q. We shall build an instance I ′ over
the schema of q′ as follows.
for i := 1 to m do
for each tuple Ri(~ci,~di) of I do
Let ci be the constant that appears in ~ci at the position of one of the occurrences
of xi in ~xi.
Let di be the constant that appears in ~di at the position of yi in ~yi
Add Si(ci, di) to I ′
end for
end for
We make the following observations with respect to the construction of I ′. By con-
struction of I, if Ri(~ci,~di) ∈ I, the same constant appears in ~ci at all the positions where
xi appears in ~xi. By Proposition 3.6, I ⊆ I. Thus, in the construction of I ′, it suffices
to choose the constant that occurs in ~ci at any of the positions where xi occurs in ~xi.
Assume that I ′ is not a repair of I ′. Then, there are constants ci, di and d′i such
that di ≠ d′i, Si(ci, di) ∈ I ′ and Si(ci, d′i) ∈ I ′. By construction of I ′, there are tuples
Ri(~ci, ~di) ∈ I and Ri(~c′i, ~d′i) ∈ I such that ci appears in ~ci and ~c′i at all the positions
where xi appears in ~xi; and di and d′i appear in ~di and ~d′i, respectively, at the position
of yi in ~yi. Clearly, ~di ≠ ~d′i. By construction of I, if w is a variable such that w ∉ L,
w is assigned the value νF (w) in every tuple of I. By Proposition 3.6, I ⊆ I. Thus,
~ci = ~c′i. Since ~di ≠ ~d′i, I does not satisfy the key constraints of Σ. Thus I is not a repair;
contradiction. We conclude that I ′ is a repair of I ′.
Since consistentΣ(q′, I ′) = true, I ′ |= q′. Thus, there is some valuation νq′ such
that I ′ |= S1(x1, y1) ∧ · · · ∧ Sm(xm, ym)[νq′ ]. Let νm be a valuation for the variables of
R1, . . . , Rm such that:
• νm(xi) = νq′(xi), for 1 ≤ i ≤ m
• νm(yi) = νq′(yi), for 1 ≤ i ≤ m
• νm(w) = νF (w) if w ∈ F
Let w be a variable that appears in Ri, for 1 ≤ i ≤ m. If w ∈ L and w occurs in
Ri, by hypothesis, w occurs in Si. If w ∉ L, then w ∈ F , by definition of F . Since
I ′ |= S1(x1, y1) ∧ · · · ∧ Sm(xm, ym)[νq′ ], and νm(w) = νF (w) if w ∈ F , we conclude that
I |= R1(~x1, ~y1) ∧ · · · ∧Rm(~xm, ~ym)[νm].
By construction of I, there is a valuation νq for the variables of q such that:
• νm(w) = νq(w) if w appears in Ri, for 1 ≤ i ≤ m; and
• I |= R1(~x1, ~y1) ∧ · · · ∧Rn(~xn, ~yn)[νq].
Let Ri(~xi, ~yi) be a literal of q such that i > m. Notice that we assume that the join
graph of q′ is a cycle. Since q is in C∗, there exists some variable w such that w occurs in
~xi and w does not occur in any of R1, . . . , Rm. Thus, w ∈ U . Since the variables of U are
assigned a distinct constant in every iteration of the algorithm that constructs I, if two
tuples Ri(~ci, ~di) and Ri(~c′i, ~d′i) are added at different iterations, then ~ci ≠ ~c′i. Therefore,
by Propositions 3.6 and 3.7, every tuple in the extension of Ri in I is in the extension of
Ri in I. Therefore, I |= R1(~x1, ~y1) ∧ · · · ∧Rn(~xn, ~yn)[νq].
(⇐) Let I ′ be a repair of I ′. We shall build an instance I as follows.
for i := 1 to m do
for each tuple Si(ci, di) of I ′ do
Let Ri(~ci,~di) be a tuple of I such that ci appears in ~ci at all the positions of xi in
~xi, and di appears in ~di at the position of yi in ~yi
Add Ri(~ci,~di) to I
end for
end for
for i := m + 1 to n do
for each tuple Ri(~ci,~di) in I do
Add Ri(~ci,~di) to I
end for
end for
We will now show that I is a repair of I. Towards a contradiction, assume that I is
not a repair of I. Then, there are values ~ci, ~di, and ~d′i such that ~di ≠ ~d′i, Ri(~ci, ~di) ∈ I,
and Ri(~ci,~d′i) ∈ I.
First, assume that 1 ≤ i ≤ m. For every variable w such that w ∉ L and w occurs
in Ri, w ∈ F . Thus, w is assigned the same constant νF (w) in every tuple of I. By
Proposition 3.6, I ⊆ I. Therefore, there are constants ci, di and d′i such that di ≠ d′i, ci
appears in ~ci at the positions of xi in ~xi, and di and d′i appear in ~di and ~d′i, respectively,
at the position of yi in ~yi. By construction of I, there are tuples Si(ci, di) and Si(ci, d′i)
in I ′. Since di ≠ d′i, I ′ does not satisfy the key constraints of Σ′. Thus, I ′ is not a repair;
contradiction.
Now, assume that m < i ≤ n. Notice that we assume that the join graph of q′ is
a cycle. Since q is in C∗, there exists some variable w such that w occurs in ~xi and
w does not occur in any of R1, . . . , Rm. Thus, w ∈ U . Since the variables of U are
assigned a different constant in every iteration of the algorithm that constructs I, if two
tuples Ri(~ci, ~di) and Ri(~c′i, ~d′i) are added at different iterations, then ~ci ≠ ~c′i. Therefore,
the extension of Ri in I satisfies the key dependencies of Σ. Thus, by construction of
I, the extension of Ri in I satisfies the key constraints of Σ. Thus, I is a repair of I;
contradiction.
We conclude that I is a repair of I. Since consistentΣ(q, I) = true, I |= q. Thus,
there exists some valuation νq such that I |= R1(~x1, ~y1) ∧ · · · ∧Rn(~xn, ~yn)[νq]. Let νq′ be
a valuation for the variables of q′ such that, for 1 ≤ i ≤ m:
• νq′(xi) = νq(xi)
• νq′(yi) = νq(yi)
It is easy to see that I ′ |= S1(x1, y1) ∧ · · · ∧ Sm(xm, ym)[νq′ ]. Thus, I ′ |= q′.
We are now ready to prove Lemma 5.14, which gives a polynomial-time reduction
from the problem of computing consistent answers for the queries of Lemmas 5.1 or 5.11
to every query in Chard. From this, Theorem 5.5 follows directly.
Lemma 5.14. Let q be a query such that q ∈ Chard. Then, there is a polynomial-time
reduction from CONSISTENT(q′, Σ′) to CONSISTENT(q, Σ), where q′ is one of the following
queries:
• ∃x, x′, y.S1(x, y) ∧ S2(x′, y)
• ∃x, y.T1(x, y) ∧ T2(y, x)
Proof. Let G be the join graph of q. Let G′ be an induced subgraph of G such that:
• G′ is connected, and
• G′ is not a tree, and
• if G′′ is a proper induced subgraph of G′, and G′′ is connected, then G′′ is a tree.
Let P = 〈R1, R2, R1〉 be a cycle of G′. Let R1(~x1, ~y1) and R2(~x2, ~y2) be the literals in
G′. Assume that there is some variable y such that y occurs in ~y1 and ~y2. By the definition
of C∗, there is no key-to-key join between R1 and R2. Therefore, there exists a variable
x such that x occurs in ~x1, and x does not occur in ~x2; and a variable x′ such that x′
occurs in ~x2 and x′ does not occur in ~x1. Let q′ = S1(x, y) ∧ S2(x′, y). By Lemma 5.13,
there is a polynomial-time reduction from CONSISTENT(q′, Σ′) to CONSISTENT(q, Σ).
Let P = 〈R1, . . . , Rm, R1〉 be a cycle of G′. Let R1(~x1, ~y1),. . . , Rm(~xm, ~ym) be the
literals of P. Let w1, w2, . . . , wm be variables such that wi occurs in ~yi and in R(i mod m)+1,
for every 1 ≤ i ≤ m. Assume that there is some wi such that 1 ≤ i ≤ m and wi occurs in
some literal Rj of q such that j ≠ i and j ≠ (i mod m)+1. Then {R1, . . . , Ri, Rj, . . . , R1}
is a cycle. Therefore G′ contains a proper induced subgraph G′′ such that G′′ is connected,
and G′′ is not a tree; contradiction. Let q′′ = S1(wm, w1) ∧ S2(w1, w2) ∧ · · · ∧ Sm(wm−1, wm).
It can be checked that q and q′′ satisfy the conditions of Lemma 5.13. Consequently,
there is a polynomial-time reduction from CONSISTENT(q′′, Σ′′) to CONSISTENT(q, Σ). Let
q′ = ∃x, y.T1(x, y)∧T2(y, x). By Lemma 5.12, there is a polynomial-time reduction from
CONSISTENT(q′, Σ′) to CONSISTENT(q′′, Σ′′).
Finally, we give the proof for Theorem 5.5, the main result of this chapter.
Theorem 5.5. Let q be a query such that q ∈ Chard. Then, CONSISTENT(q, Σ) is coNP-
complete in data complexity.
Proof. By Lemma 5.10, CONSISTENT(q, Σ) is in coNP. In order to prove hardness, let q′
be one of the following queries:
• ∃x, x′, y.S1(x, y) ∧ S2(x′, y)
• ∃x, y.T1(x, y) ∧ T2(y, x)
By Lemma 5.14, there is a polynomial-time reduction from CONSISTENT(q′, Σ′) to
CONSISTENT(q, Σ). By Lemmas 5.1 and 5.11, CONSISTENT(q′, Σ′) is coNP-hard. Thus,
CONSISTENT(q, Σ) is coNP-hard.
5.3 Related Work
Chomicki and Marcinkowski [CM05] and Calì, Lembo and Rosati [CLR03a] thoroughly
study the decidability and complexity of consistent query answering for several classes
of queries and integrity constraints. In order to show intractability of a class, they
take the usual approach of exhibiting one query of the class for which the problem is
intractable. To the best of our knowledge, the result that we present in Section 5.2 is the
first dichotomy result in the area of consistent query answering.
Both Chomicki and Marcinkowski and Calì, Lembo and Rosati show that the problem
of obtaining consistent answers for conjunctive queries under primary key constraints is
coNP-complete. Chomicki and Marcinkowski also show an example of a query with just
one literal but two key dependencies for which the problem is coNP-complete. This gives
further support for our decision of considering exactly one key dependency per relation.
Calì, Lembo and Rosati show the undecidability of the problem of obtaining consistent
answers when the set of constraints contains primary keys and arbitrary inclusion
dependencies. They also show the problem becomes decidable for foreign key constraints
(it is coNP-complete). Chomicki and Marcinkowski study the same problem but under
a semantics where only tuple deletion is allowed (i.e., repairs are always subsets of the
inconsistent database). In this case, the problem is Π^p_2-complete, and becomes
coNP-complete if the inclusion dependencies are restricted to be acyclic.
Chapter 6
ConQuer: System Implementation
and SQL Rewritings
In this chapter, we present ConQuer, a system for querying inconsistent databases.
We demonstrated this system at the International Conference on Very Large Databases
(VLDB) [FFM05b]. In Section 6.1, we describe the system implementation and a typical
scenario where it can be used. Then, in Sections 6.2 and 6.3, we present the SQL rewrit-
ings that are at the core of ConQuer’s approach. In Section 6.4, we show how, if desired,
ConQuer can process the database offline in order to improve the performance of the
queries. Finally, in Section 6.5, we review other systems that are related to ConQuer.
6.1 System Implementation
ConQuer is implemented in Java and follows a modular architecture. It consists of the
following components:
• Query Rewriting Module. It rewrites an input SQL query into another SQL
query that computes the consistent answers. The details of the rewritings are
presented in Sections 6.2 to 6.4. The SQL queries are parsed using javacc.
• Query Execution Engine. The rewritten queries are executed using IBM DB2
UDB Version 8.2. The connection with the database is done through JDBC.
• Conflict Resolution Module. Provides a tracing facility to find the data that
leads to differences between the answer to the original query and the consistent
answer. This module also permits a user to update the database to correct errors.
• User Interface. Query results are displayed using a Web-accessible interface that
is implemented in PHP.
Figure 6.1: Interface for entering hypothetical primary key constraints in ConQuer
We illustrate a typical use case of ConQuer on a database with information about
airports. The user first specifies a set of primary key constraints using the interface shown
in Figure 6.1. These are the constraints that should hold on a consistent database, but
may be violated by the actual database that is being queried. Notice that for the same
schema and database, there is the flexibility of running queries under different sets of
potentially violated primary key constraints. Then, the user writes a SQL query within
the interface. In Figure 6.2, we show a query where the user is asking for all the countries
that have airports located north of parallel 63N. The result of the query is shown in Figure
6.3. The consistent answers are shown in bold, and the “potential answers” (i.e., possible
answers that are not consistent answers) are shown in italics. For example, in this case
“Italy” is a potential answer.
While consistent answers are best suited for decision making, potential answers can be
used to understand the reasons why a database is inconsistent. In this case, the user could
click on “Italy” and obtain an explanation, which is shown in Figure 6.4. The explanation
is the lineage (or why-provenance) [BKT01, CW03] of the result, i.e., the tuples in the
database that contribute to the answer. According to the explanation, Italy is a potential
answer because it has one airport that appears as satisfying the query (parallel 63) in
one tuple, and violating it (parallel 45) in another. Notice that in the comment to the
query, the user wrote “select countries that are located north of Trondheim”. Trondheim
is a Norwegian city, and the user may have background knowledge that all Italian
cities are south of Norwegian cities. Thus, the user could use the explanation obtained
from ConQuer in order to remove the tuple for the Italian airport located on parallel 63.
Figure 6.2: Interface for entering queries in ConQuer
6.2 ConQuer Rewritings for Queries without Aggregation
In this section, we present the SQL rewritings produced by ConQuer for a class of Select-
Project-Join (SPJ) queries with set semantics. We delay the treatment of conjunctive
queries that return duplicates until the next section, where the number of duplicates
returned by the queries can be counted with the count(*) aggregate function. We first
give the query rewriting algorithm, and then we illustrate it with a number of examples.
6.2.1 Rewriting Algorithm
We now present a SQL rewriting algorithm for SPJ queries that are equivalent to a
conjunctive query in the class Cforest, introduced in Definition 3.4, which we repeat next.
Figure 6.3: Query results in ConQuer
Figure 6.4: Query explanation in ConQuer
Definition 3.4. Let q be a conjunctive query without repeated relation symbols and all of
whose nonkey-to-key joins are full. Let G be the join graph of q. We say that q ∈ Cforest
if G is a forest (i.e., every connected component of G is a tree).
The above definition imposes three conditions on the conjunctive query. First, the
query has no repeated relation symbols. For an SPJ SQL query, this means that each
relation can be used at most once in the from clause. Second, all its nonkey-to-key
joins must be full. For an SPJ query, this means that if an attribute of a key of a relation
r1 is equated in the where clause with a nonkey attribute of another relation r2, then all
the attributes of the key of r1 are equated to nonkey attributes of r2. Finally, the join
graph of q must be a forest. The notion of a join graph is introduced in Definition 3.1,
and we repeat it next.
Definition 3.1 (join graph). Let q be a conjunctive query. The join graph G of q is a
directed graph such that:
• the vertices of G are the literals of q;
• there is an arc from literal Ri to literal Rj if i ≠ j, and there is some variable w
such that w is existentially-quantified in q, w occurs at the position of a nonkey
attribute in Ri, and w occurs in Rj.
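As an illustration of Definition 3.1, the following Python sketch builds the arcs of the join graph from a list of literals. The `Literal` type, the `join_graph` function, and the two toy queries are hypothetical names introduced only for this example; the sketch assumes that no relation symbol is repeated, so each literal can be identified by its relation name.

```python
from typing import NamedTuple, Tuple


class Literal(NamedTuple):
    """A literal R(key_vars; nonkey_vars) of a conjunctive query (illustrative type)."""
    name: str
    key_vars: Tuple[str, ...]
    nonkey_vars: Tuple[str, ...]


def join_graph(literals, existential_vars):
    """Return the arcs (Ri, Rj) of the join graph as a set of name pairs.

    There is an arc from Ri to Rj (i != j) when some existentially
    quantified variable occurs at a nonkey position of Ri and anywhere in Rj.
    """
    arcs = set()
    for ri in literals:
        for rj in literals:
            if ri is rj:
                continue  # the definition requires i != j
            rj_vars = set(rj.key_vars) | set(rj.nonkey_vars)
            if any(w in existential_vars and w in rj_vars for w in ri.nonkey_vars):
                arcs.add((ri.name, rj.name))
    return arcs


# q(x) = exists y, z. R1(x; y) and R2(y; z): a single nonkey-to-key join.
chain = [Literal("R1", ("x",), ("y",)), Literal("R2", ("y",), ("z",))]

# exists x, y. T1(x; y) and T2(y; x): the second query of Lemma 5.14.
cyclic = [Literal("T1", ("x",), ("y",)), Literal("T2", ("y",), ("x",))]
```

Under these assumptions, `join_graph(chain, {"y", "z"})` contains only the arc from R1 to R2, whereas the graph of the second query has arcs in both directions and therefore contains a cycle.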
An analogous definition can be given for the join graph of an SPJ SQL query. The
vertices of the graph are the relation symbols in the from clause of the query. There
is an arc from relation ri to relation rj if there is an attribute A in ri such that
(1) A is not in the key of ri (it is a nonkey attribute); (2) A does not appear in the
select clause of the query, and A is not equated to any attribute B that appears in
the select clause (this corresponds to the notion of an existentially-quantified variable
for conjunctive queries); and (3) there is some equality in the where clause relating A
to some attribute B of rj (i.e., a nonkey-to-key or nonkey-to-nonkey join).1
We can now give a definition analogous to Cforest for SPJ SQL queries. A query q is
in class Csqlforest if no relation appears twice in the from clause of q, all the nonkey-to-key
joins of q are full, and the join graph of q is a forest.
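The forest condition can be checked mechanically once the join graph is available. The sketch below is one possible reading, not the thesis's own procedure: it assumes that "tree" means a rooted directed tree, so that every vertex has at most one incoming arc and no directed cycle exists. The function name `is_forest` and the relation names are hypothetical.

```python
def is_forest(vertices, arcs):
    """Check whether a directed graph is a forest of rooted trees.

    Assumption of this sketch: a tree is rooted, so each vertex has at most
    one incoming arc and following parent links never revisits a vertex.
    """
    parent = {}
    for src, dst in set(arcs):
        if dst in parent:
            return False  # a second incoming arc: not a rooted tree
        parent[dst] = src
    for start in vertices:
        seen = {start}
        node = start
        while node in parent:
            node = parent[node]
            if node in seen:
                return False  # directed cycle
            seen.add(node)
    return True
```

For instance, `is_forest({"R1", "R2", "R3"}, {("R1", "R2"), ("R1", "R3")})` holds, while a graph with arcs in both directions between two relations is rejected.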
1 This definition works for repeated relation symbols as well. In such a case, we assume that if a relation appears more than once in the from clause, then it is aliased to a new name using the as operator.
We are now ready to give ConQuer’s rewriting algorithm for SPJ queries in Csqlforest.
The algorithm is called RewriteForestSQL and is shown in Figure 6.5. The algorithm
takes as input a SQL query q in Csqlforest and a set of key constraints (one per relation of
the schema), and returns a SQL rewriting Q of q.
In the rewriting Q, the attributes of the relations in q play different roles. In par-
ticular, we will distinguish the attributes that the query projects on (i.e., that appear
in the select clause), and the attributes that appear in the key of a relation that is
at the root of some tree in the join graph of q. In the rest of the discussion, we will
call these attributes projecting attributes, and key-root attributes, respectively. The for-
mer are denoted in Figure 6.5 with the symbols S1, . . . , Sl; the latter are denoted with
K1, . . . , Kn.
The rewriting Q has three subqueries, specified using a with clause: candidatesSubQuery,
countViolSubQuery and countProjSubQuery. The purpose of candidatesSubQuery is to
prune the number of values for the key-root attributes that should be
considered by the other subqueries. In particular, candidatesSubQuery applies the
selection conditions of the original query q, and projects on its key-root attributes. These
attributes are used to perform an inner join in the next subquery (countViolSubQuery).
If the selectivity of q is low (i.e., few tuples satisfy its conditions), and the query optimizer
pushes down the selection conditions of candidatesSubQuery in the query plan, we would
expect the rewriting to have a low overhead with respect to the original query. We validate
this conjecture in Section 7.2.
Let CONDS be the list of conditions in the where clause of q. In the select clause
of countViolSubQuery, we count the number of tuples that violate the conditions of
CONDS for each value of the key-root attributes, and keep the result in an attribute called
countViol as follows:
sum(case when CONDS then 0 else 1 end)
over (partition by K1, . . . , Kn)
as countViol
Notice the use of the partition by clause. This clause (introduced in the OLAP
Amendment to SQL [ISO01]) differs from the typical group by clause in that it permits
grouping by a set of attributes that may not include all the attributes in the select
clause. This is useful here because we “partition by” the key-root attributes, but the
select clause of countViolSubQuery also includes the projecting attributes of the query.
In the main body of the query, we filter out the tuples whose key-root attributes are
involved in a violation of CONDS by checking the condition countViol=0.
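The effect of the partition by clause can be observed on a small instance. The following Python sketch runs the countViol expression with the sqlite3 module; it assumes an SQLite build with window-function support (version 3.25 or later), whereas ConQuer itself targets DB2. The table employee(emplKey, salary, age), the condition salary <= 1000, and the function name `count_violations` are illustrative choices.

```python
import sqlite3


def count_violations(rows):
    """Return (emplKey, age, countViol) for every tuple of a toy employee table.

    countViol is computed per emplKey partition, yet every input row is kept,
    so the projecting attribute age stays available next to the count.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("create table employee (emplKey text, salary integer, age integer)")
    conn.executemany("insert into employee values (?, ?, ?)", rows)
    result = conn.execute(
        """
        select emplKey, age,
               sum(case when salary <= 1000 then 0 else 1 end)
                   over (partition by emplKey) as countViol
        from employee
        order by emplKey, age, salary
        """
    ).fetchall()
    conn.close()
    return result


toy_rows = [("e1", 900, 30), ("e2", 800, 30), ("e2", 1200, 31)]
```

With `toy_rows`, both tuples of e2 carry countViol = 1 because one of them violates the condition, while the single tuple of e1 carries countViol = 0. A group by on emplKey alone could not return age in the same result.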
The from clause of subquery countViolSubQuery is obtained by calling a procedure
called GetJoinsExpression (shown in Figure 6.6), with the join graph of q and the list
of conditions CONDS as parameters. This procedure consists of two parts. In the first
part, an inner join is computed for the key-to-key joins of relations that are at the root
of some tree of the join graph. Notice that since these relations are in distinct connected
components of the join graph, they are not related by a nonkey-to-key join. In the second
part, the procedure produces a left outer join expression for each tree of the join graph.
This is done by recursively calling the procedure GetTreeJoinsExpression for the nodes
of each tree (also shown in Figure 6.6). The expression returned by GetTreeJoinsExpression
is a left outer join of all relations in the input tree, listed in an order corresponding
to a preorder traversal of the tree.
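The preorder construction can be sketched as follows. This is not the procedure of Figure 6.6, which is not reproduced here; it is a minimal illustration that assumes the tree is given as a mapping from each relation to its children and that the join condition of each arc is available as a string. The function name, the relation names and the join conditions are hypothetical.

```python
def tree_joins_expression(root, children, join_conds):
    """Build a left outer join expression by a preorder traversal of one tree.

    children maps a relation to the list of its children in the join graph;
    join_conds maps an arc (parent, child) to its SQL join condition.
    """
    parts = [root]

    def visit(node):
        for child in children.get(node, []):
            parts.append(f"left outer join {child} on {join_conds[(node, child)]}")
            visit(child)  # descend before moving on to the next sibling

    visit(root)
    return " ".join(parts)


toy_children = {"orders": ["customer"], "customer": ["nation"]}
toy_conds = {
    ("orders", "customer"): "orders.custfk = customer.custkey",
    ("customer", "nation"): "customer.nationfk = nation.nationkey",
}
```

Calling `tree_joins_expression("orders", toy_children, toy_conds)` lists orders first, then customer, then nation, each introduced by a left outer join with the condition of the arc that reaches it.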
We will illustrate shortly (in Example 6.4) the rewriting for queries where some of
the key-root attributes do not appear in the select clause (that is, some key-root
attributes are not projecting attributes). We will argue that in such cases, we would
like to count the number of distinct values for the projecting attributes, grouping by
the key-root attributes. We will also show how to do this by using the max aggregate
function (with a partition by clause) and the rank OLAP function. In the algorithm
RewriteForestSQL of Figure 6.5, the rank function is used in the subquery
countViolSubQuery, and the max function is used in the subquery countProjSubQuery.
The result of this aggregation is kept in an attribute called countProjection, which
keeps the count of distinct values for each instantiation of the key-root attributes. This
attribute is used in the main body of the rewriting, where we check countProjection=1.
In the subqueries, we project not only on the projecting attributes S1, . . . , Sl, but
also on the key-root attributes K1, . . . , Kn. However, in the main query of the rewriting
we project only on the attributes S1, . . . , Sl. In this way, the rewritten query Q and the
input query q return tuples for the same set of attributes.
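To make the interplay of the three subqueries concrete, the following Python sketch assembles a rewriting for the toy query "select age from employee where salary <= 1000" and runs it with sqlite3 (assuming SQLite 3.25 or later for window functions). The main body, including the use of select distinct, is put together from the prose description above rather than copied from Figure 6.5, so it should be read as an approximation; the schema, the data and the function name `consistent_ages` are hypothetical.

```python
import sqlite3

# Sketch of a rewriting in the style of RewriteForestSQL for
#   select age from employee where salary <= 1000
# with key emplKey. The main body is assembled from the prose description.
REWRITING = """
with candidatesSubQuery as (
    select emplKey as cK
    from employee
    where salary <= 1000 ),
countViolSubQuery as (
    select emplKey, age,
           rank() over (partition by emplKey order by age) as rankProjection,
           sum(case when salary <= 1000 then 0 else 1 end)
               over (partition by emplKey) as countViol
    from employee
    where exists (select * from candidatesSubQuery c
                  where c.cK = employee.emplKey) ),
countProjSubQuery as (
    select emplKey, age, countViol,
           max(rankProjection) over (partition by emplKey) as countProjection
    from countViolSubQuery )
select distinct age
from countProjSubQuery
where countViol = 0 and countProjection = 1
order by age
"""


def consistent_ages(rows):
    """Run the sketched rewriting on a toy employee(emplKey, salary, age) table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table employee (emplKey text, salary integer, age integer)")
    conn.executemany("insert into employee values (?, ?, ?)", rows)
    ages = [row[0] for row in conn.execute(REWRITING)]
    conn.close()
    return ages


toy_employees = [
    ("e1", 900, 30),                     # consistent, satisfies the condition
    ("e2", 800, 30), ("e2", 1200, 30),   # one tuple violates the condition
    ("e3", 500, 40), ("e3", 600, 45),    # two different ages for the same key
    ("e4", 700, 40),                     # consistent, satisfies the condition
    ("e5", 2000, 50),                    # never satisfies the condition
]
```

On `toy_employees`, only the ages 30 and 40 are returned: e2 is discarded by countViol = 0, e3 by countProjection = 1, and e5 is never a candidate.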
For the sake of clarity, we omitted the order by clause in the query q. However,
dealing with ordering in the rewriting is quite easy. We just need to add the attributes
of the order by of q to the select clause of the subqueries, and include them in the
order by clause of the main body of the rewriting.
Algorithm RewriteForestSQL(q, Σ)
Input: q, a SQL query in Csqlforest of the form
         select <list of attributes>
         from <list of relations>
         where <list of conditions>
       Σ, a set of key constraints (one per relation)
Output: Q, a SQL query that computes consistentΣ(q, I), for every database I
Let S1, . . . , Sl be the attributes in the select clause of q
Let G be the join graph (forest) of q
Let r1, . . . , rm be the relations at the root of all trees of G
Let K1, . . . , Kn be the attributes in the keys of r1, . . . , rm
Let CONDS be the list of conditions in the where clause of q
Let JOINS be the expression obtained by calling the procedure
    GetJoinsExpression(G, CONDS) of Figure 6.6
Let Q be the following SQL query:
  with candidatesSubQuery as (
    select K1 as cK1, . . . , Kn as cKn
    from <list of relations in q>
    where CONDS ),
  countViolSubQuery as (
    select K1, . . . , Kn, S1, . . . , Sl,
           rank() over (partition by K1, . . . , Kn
                        order by S1, . . . , Sl) as rankProjection,
           sum(case when CONDS then 0 else 1 end)
               over (partition by K1, . . . , Kn) as countViol
    from JOINS
    where exists (select * from candidatesSubQuery
                  where K1 = cK1 and . . . and Kn = cKn) ),
  countProjSubQuery as (
where S1, . . . , Sl, A1, . . . , Au are attributes of the relations in the from clause, and
F1, . . . , Fu may be any of the aggregation functions min, max, and sum.
We are now ready to give ConQuer’s rewriting for queries in Csqlaggforest. The algorithm
is called RewriteAggSQL, and is shown in Figure 6.8. It takes as input a SQL query q in
class Csqlaggforest and a set of key constraints (one per relation of the schema), and returns
a SQL rewriting Q of q.
In the rewriting Q, the attributes of the relations in q play different roles. As in
the algorithm RewriteForestSQL for queries without aggregation, we have projecting
and key-root attributes. The former are the attributes that q projects on (i.e., that
appear in its select clause), and the latter are the attributes that appear in the key
of a relation that is at the root of some tree in the join graph of q. In addition, in
RewriteAggSQL, we have aggregation attributes, that is, the attributes that appear as
arguments of some aggregation function of q. In Figure 6.8, we denote the projecting
attributes with the symbols S1, . . . , Sl; the key-root attributes with K1, . . . , Kn; and the
aggregation attributes with A1, . . . , Au.
We denote the aggregation functions of q with F1, . . . , Fu. In the figure, we assume
that the 0-ary function count(*) is present in the query (but during the explanation it
will be easy to see what can be dropped if count(*) is not present).
The rewriting Q has five subqueries, specified using a with clause: candidatesSubQuery,
countViolSubQuery, contribAllSubQuery, contribConsistentSubQuery, and
contribNonConsistentSubQuery.
As in the algorithm RewriteForestSQL, the purpose of candidatesSubQuery is to
determine the values for the key-root attributes that should be considered by the other
subqueries. The subquery countViolSubQuery has the same purpose (counting the number
of violations per key-root value) as the subquery of the same name in the rewriting
RewriteForestSQL. One difference is that here we need to compute the attribute
satConds which keeps track of whether each tuple satisfies the conditions of the query
(denoted as CONDS). The other difference is that in the select clause of the subquery,
we must project on the aggregation attributes since their values are needed to perform
aggregation in the rest of the rewriting.
The other three subqueries are used to compute the “contributions” to the lower and
upper bounds of each aggregate result. The subquery contribAllSubQuery computes,
for each instantiation of the key-root and projecting attributes, the minimum and max-
imum value for each aggregation attribute. In particular, in the subquery we group by
K1, . . . , Kn, S1, . . . , Sl (the key-root and projecting attributes), and for each aggregation
Fi(Ai) in the select clause of q, we compute attributes bottomAi and topAi as min(Ai)
and max(Ai), respectively. We also compute an attribute countProjection, to keep
track of the projection on nonkey attributes.
The subqueries contribConsistentSubQuery and contribNonConsistentSubQuery
compute the contribution of the “consistent” and “nonconsistent” tuples to the aggre-
gation. The former are the tuples whose key-root values satisfy the following two con-
ditions. First, they have the same value for the projecting attributes in every tuple
where they appear (checked with condition countProjection = 1). Second, they are
not involved in a violation of the selection conditions CONDS in any of the tuples where
they appear (checked with condition countViol=0). The tuples that violate at least
one of these conditions are considered “nonconsistent” and dealt with in the subquery
contribNonConsistentSubQuery.
For the “consistent” tuples, the contributions computed in contribConsistentSubQuery
correspond to the bottom and top values from contribAllSubQuery. That is,
the attributes bottomAi and topAi of contribAllSubQuery appear in the select clause
of contribConsistentSubQuery. The computation of the contributions of the “nonconsistent”
tuples is more involved. In contribNonConsistentSubQuery, the expression of
the select clause that handles the contributions is obtained by calling the procedure
GetBoundsNonConsistent given in Figure 6.9. Notice in the figure that the contributions
are different depending on the aggregation function. The rationale and correctness proof
for these contributions were given in Chapter 4. In the figure, we do not include the 0-ary
operator count(*). For this operator, we need to return the attributes bottomCount and
topCount with values of zero and one, respectively.
In the subqueries, we project not only on the projecting attributes S1, . . . , Sl but
also on the key-root attributes K1, . . . , Kn. However, in the main query of the rewriting
we project and group by only the attributes S1, . . . , Sl (i.e., we project out the key-root
attributes). In this way, the rewritten query Q and the input query q return tuples
for the same set of attributes. We also compute the greatest lower bound (glbAi) and
lowest upper bound (lubAi) for each tuple of values for the projecting attributes. This
is obtained by performing the corresponding aggregation function (min, max, or sum) on
the top and bottom values computed in the previous subqueries. For the 0-ary func-
tion count(*), the bounds are computed by summing up the values of the attributes
bottomCount and topCount from the previous subqueries. Notice that there is also a
condition having sum(bottomCount) > 0. This is included in order to ensure that the
tuples for the projecting attributes are consistent answers.
For the sake of clarity, we omitted the order by clause in the query q. However,
dealing with ordering in the rewriting is quite easy. We just need to add the attributes
of the order by clause of q to the select clause of the subqueries, and finally add an
order by clause to the main subquery. The only special case that must be considered
is when an aggregate attribute appears in the order by clause. Since for each aggregate
attribute of q we have two attributes in the rewritten query (one for each bound), we
must (arbitrarily) decide whether the ordering will be by either the greatest lower or the
lowest upper bound.
Algorithm RewriteAggSQL(q, Σ)
Input: q, a SQL query in Csqlaggforest of the form
         select <list of attributes>, <list of aggregation functions>
         from <list of relations>
         where <list of conditions>
         group by <list of attributes>
       Σ, a set of key constraints (one per relation)
Output: Q, a SQL query that computes aggconsistentΣ(q, I) for every database I
Let F1(A1), . . . , Fu(Au) be the aggregation function applications in the select clause
    of the query, where each Fi is an aggregation function, and each Ai is an attribute
    from a relation that appears in the from clause
Let S1, . . . , Sl be the attributes in the select clause of q (by definition of Csqlaggforest,
    these are the attributes in the group by clause as well)
Let G be the join graph (forest) of q
Let r1, . . . , rm be the relations at the root of some tree of G
Let K1, . . . , Kn be the attributes in the keys of r1, . . . , rm
Let CONDS be the list of conditions in the where clause
Let JOINS be the expression obtained by calling the procedure
    GetJoinsExpression(G, CONDS) of Figure 6.6
Let Q be the following SQL query:
  select S1, . . . , Sl,
         F1(bottomA1) as glbA1, F1(topA1) as lubA1, . . . ,
         Fu(bottomAu) as glbAu, Fu(topAu) as lubAu,
         sum(bottomCount) as glbCount, sum(topCount) as lubCount
  from
    ( select * from contribConsistentSubQuery
      union all
      select * from contribNonConsistentSubQuery ) q
  group by S1, . . . , Sl
  having sum(bottomCount)>0
return Q
Figure 6.8: SQL query rewriting algorithm for SPJ queries in Csqlaggforest
Algorithm GetBoundsNonConsistent
Input: Fi, one of the aggregation functions sum, min, max
       Ai, an attribute
Output: a subexpression of a SQL query
if Fi = sum then
    return “case when
              bottomAi < 0 then bottomAi
              else 0 end as bottomAi,
            case when
              topAi > 0 then topAi
              else 0 end as topAi”
end if
if Fi = min then
    return “bottomAi, 0 as topAi”
end if
if Fi = max then
    return “0 as bottomAi, topAi”
end if
Figure 6.9: Algorithm to obtain the bottom and top contributions of “nonconsistent”
tuples
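The procedure of Figure 6.9 translates directly into a small string-building function. The Python sketch below mirrors the three cases of the figure; the function name `get_bounds_non_consistent` and the choice to raise an error for other aggregation functions are assumptions of this example.

```python
def get_bounds_non_consistent(func, attr):
    """Return the select-clause fragment for the "nonconsistent" tuples.

    func is one of "sum", "min", "max"; attr is the aggregation attribute name,
    used to form the column names bottom<attr> and top<attr>.
    """
    bottom, top = f"bottom{attr}", f"top{attr}"
    if func == "sum":
        # A nonconsistent tuple may be absent from a repair, so it can only
        # lower the sum when negative and only raise it when positive.
        return (
            f"case when {bottom} < 0 then {bottom} else 0 end as {bottom}, "
            f"case when {top} > 0 then {top} else 0 end as {top}"
        )
    if func == "min":
        return f"{bottom}, 0 as {top}"
    if func == "max":
        return f"0 as {bottom}, {top}"
    raise ValueError(f"unsupported aggregation function: {func}")
```

For example, `get_bounds_non_consistent("min", "Salary")` yields the fragment `bottomSalary, 0 as topSalary`. The 0-ary function count(*) is not covered, in line with the figure.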
6.3.2 Examples
We next illustrate the rewriting for a query that uses the count aggregation function.
Example 6.6. Let R be a schema with relation employee(emplKey, salary, age). Con-
sider a SQL query q5 that, for each age in the database, gives the number of occurrences
of the age on tuples for employees whose salary is less than or equal to 1000.
q5: select age, count(*)
from employee
where salary <= 1000
group by age
In the aggregate conjunctive query notation introduced in Chapter 4, q5 can be written
as follows.
q5(a, cnt) = select a, count(*)
from employee(e, s, a) ∧ s ≤ 1000
group by a
The above query is in the class Caggforest for which we gave a query rewriting algorithm
in Chapter 4. A key idea of that algorithm is to first produce a first-order rewriting for
a conjunctive query, and then perform aggregation on the result of the first-order query.
For our example, this conjunctive query is q′(e, a) = ∃s.employee(e, s, a) ∧ s ≤ 1000. Let
QConsistent(e, a) denote the result of invoking RewriteForest(q′, Σ) (the algorithm
introduced in Chapter 3).
Let Q5 be the query rewriting for q5 obtained by invoking RewriteCount(q5, Σ) (the
algorithm of Figure 4.1 of Chapter 4). In that rewriting, the greatest lower bound is
obtained as follows:
QGlb(a, glb) = select a, count(*)
from QConsistent(e, a)
group by a
Notice that aggregation is performed on the result of the first-order query QConsistent(e, a).
Thus, for computing the greatest lower bound in the SQL rewriting, we can reuse the al-
gorithm RewriteForestSQL introduced in Section 6.2. In particular, we will use the next
two subqueries, which are similar to those that would be produced by RewriteForestSQL(q′, Σ)
(we will show the differences next).
with candidatesSubQuery as (
  select emplKey
  from employee
  where salary <= 1000 )
with countViolSubQuery as (
  select emplKey, age,
         rank() over (partition by emplKey
                      order by age) as rankProjection,
         sum(case when salary <= 1000 then 0 else 1 end)
             over (partition by emplKey) as countViol,
         case when salary <= 1000 then 'yes' else 'no' end
             as satConds
  from employee
  where exists (select *
                from candidatesSubQuery C where
                C.emplKey=employee.emplKey) )
with contribAllSubQuery as (
  select emplKey, age,
         max(rankProjection) over (partition by emplKey)
             as countProjection,
         countViol
  from countViolSubQuery
  where satConds='yes'
  group by emplKey, age, countViol, rankProjection )
The above subqueries differ from the ones that would be produced by RewriteForestSQL
in the following aspects. In countViolSubQuery, we compute an attribute
satConds that keeps track of whether each tuple satisfies or violates the selection
condition of q5 (i.e., that the salary is less than or equal to 1000). This is different from
the attribute countViol because countViol counts the violations for all tuples where a
key value (employee key, in this case) appears, whereas satConds may take different
values on different tuples of the same employee, depending on the salary that appears in
the tuple. The third subquery corresponds to the subquery countProjSubQuery of the
algorithm RewriteForestSQL, but it has a different name here (contribAllSubQuery)
because, as we will show shortly, it is used to compute the “contribution” of each tuple
to the lower and upper bounds of count(*). In this subquery, we check the condition
satConds='yes'. The intuitive reason is that the tuples that do not satisfy the
conditions of q5 (and hence have satConds = 'no') contribute neither to the lower nor
to the upper bound of count(*), and should thus be filtered out.
Let us now consider the computation of the lowest upper bound. In the query Q5
returned by RewriteCount, this bound is obtained as follows:
QLub(a, lub) = select a, count(*)
from q′(e, a) ∧ (∃e.QConsistent(e, a))
group by a
In this case, aggregation is done on the result of the following first-order expression:
q′(e, a)∧(∃e.QConsistent(e, a)). The naive way of writing this expression in SQL may be
inefficient because QConsistent already contains q′ as a subexpression. A more efficient
way of writing Q5 in SQL involves computing the “contributions” of each tuple to the
value of count(*), with the two subqueries shown next.
One of the subqueries (called contribConsistentSubQuery) computes the contribu-
tion of the “consistent” tuples. These are the tuples for employees that (1) have the
same age (the attribute in the select clause of q5) in every tuple where they appear;
and (2) are not involved in a violation of the conditions of q5 in any of the tuples where
they appear (i.e., their salary is always less than or equal to 1000). This can be checked
with the condition countProjection = 1 and countViol=0. In addition, the subquery
has attributes bottomCount and topCount that are used in the main body of the query
to combine the contributions of the “consistent” and “nonconsistent” tuples. For the
consistent tuples, the contribution is one to both the lower and upper bounds.
with contribConsistentSubQuery as (
  select emplKey, age,
         1 as bottomCount,
         1 as topCount
  from contribAllSubQuery
  where countProjection = 1 and countViol=0 )
The other subquery (called contribNonConsistentSubQuery) computes the contri-
butions of the “nonconsistent” tuples. We give this name to the tuples that are not
in the consistent answer of q′, but do satisfy q′. These tuples do not contribute to
the greatest lower bound of count(*), but they may contribute to the lowest upper
bound. In the SQL rewriting, the nonconsistent tuples are captured with the condition
countProjection > 1 or countViol >= 1. In addition, the subquery has attributes
bottomCount and topCount that are used in the main body of the query to combine
the contributions of the “consistent” and “nonconsistent” tuples. For the nonconsistent
tuples, the contribution is zero to the lower bound and one to the upper bound (compare
this to the consistent tuples, which contribute one to both bounds).
with contribNonConsistentSubQuery as (
  select emplKey, age,
         0 as bottomCount,
         1 as topCount
  from contribAllSubQuery
  where countProjection > 1 or countViol >= 1 )
Finally, the main body of the rewriting sums up the contributions of each tuple to the
lower and upper bounds, and projects out the attribute emplKey. The condition having
sum(bottomCount)>0 is used to ensure that we return ages that are consistent answers.
As we mentioned before, this corresponds to checking the condition ∃e.QConsistent(e, a).
select age,
       sum(bottomCount) as glb,
       sum(topCount) as lub
from
  ( select * from contribConsistentSubQuery
    union all
    select * from contribNonConsistentSubQuery ) q
group by age
having sum(bottomCount)>0
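The pieces of this example can be combined into a single statement and tried on a small inconsistent instance. The Python sketch below does so with sqlite3, assuming SQLite 3.25 or later for window functions. It departs from the listing above in one respect, which is a choice of this sketch: contribAllSubQuery uses select distinct instead of group by to remove duplicates. The data and the function name `count_bounds` are hypothetical.

```python
import sqlite3

# Sketch of the full rewriting of q5 (count(*) per age, salary <= 1000).
# contribAllSubQuery uses "select distinct" in place of the group by of the text.
Q5_REWRITING = """
with candidatesSubQuery as (
    select emplKey from employee where salary <= 1000 ),
countViolSubQuery as (
    select emplKey, age,
           rank() over (partition by emplKey order by age) as rankProjection,
           sum(case when salary <= 1000 then 0 else 1 end)
               over (partition by emplKey) as countViol,
           case when salary <= 1000 then 'yes' else 'no' end as satConds
    from employee
    where exists (select * from candidatesSubQuery C
                  where C.emplKey = employee.emplKey) ),
contribAllSubQuery as (
    select distinct emplKey, age,
           max(rankProjection) over (partition by emplKey) as countProjection,
           countViol
    from countViolSubQuery
    where satConds = 'yes' ),
contribConsistentSubQuery as (
    select emplKey, age, 1 as bottomCount, 1 as topCount
    from contribAllSubQuery
    where countProjection = 1 and countViol = 0 ),
contribNonConsistentSubQuery as (
    select emplKey, age, 0 as bottomCount, 1 as topCount
    from contribAllSubQuery
    where countProjection > 1 or countViol >= 1 )
select age, sum(bottomCount) as glb, sum(topCount) as lub
from ( select * from contribConsistentSubQuery
       union all
       select * from contribNonConsistentSubQuery ) q
group by age
having sum(bottomCount) > 0
order by age
"""


def count_bounds(rows):
    """Return (age, glb, lub) rows for a toy employee(emplKey, salary, age) table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table employee (emplKey text, salary integer, age integer)")
    conn.executemany("insert into employee values (?, ?, ?)", rows)
    result = conn.execute(Q5_REWRITING).fetchall()
    conn.close()
    return result


toy_instance = [
    ("e1", 900, 30),                     # consistent tuple
    ("e2", 800, 30), ("e2", 1200, 30),   # salary conflict for e2
    ("e3", 500, 40), ("e3", 600, 45),    # age conflict for e3
    ("e4", 700, 40),                     # consistent tuple
]
```

On `toy_instance`, the result is the range [1, 2] for age 30 and the range [1, 2] for age 40. Age 45 is not returned, since its lower bound is zero: a repair that keeps the tuple of e3 with age 40 has no group for age 45.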
In the next example, we illustrate the rewriting for a query that has the sum aggre-
gation function. The rewritings for the min and max aggregation functions are similar.
Example 6.7. Consider the same schema as in the previous example. Let q6 be a SQL
query that, for each age in the database, gives the sum of all salaries in the database
that are less than or equal to 1000.
q6: select age, sum(salary)
from employee
where salary <= 1000
group by age
The SQL rewriting of q6 is computed by ConQuer along the same lines as the rewriting
for query q5 of the previous example. As in that example, the rewriting starts with three
subqueries: candidatesSubQuery, countViolSubQuery and contribAllSubQuery. The
subquery countViolSubQuery counts the number of violations of the selection condition
for each key value (emplKey), and is the same as in the previous example, except that it
includes the attribute salary in its select clause. The subquery contribAllSubQuery
computes the contribution of all key values to the final result. The only difference with
the previous example is that here we compute the minimum and maximum salary for
each employee (attributes bottomSalary and topSalary). This was not necessary in
the previous example since count(*) is a 0-ary function, whereas sum is a unary function
(in this case, taking the argument salary).
with candidatesSubQuery as (
select emplKey
from employee
where salary <= 1000 )
with countViolSubQuery as (
select emplKey,age,salary,
rank() over (partition by emplKey
order by age) as rankProjection,
sum(case when salary <= 1000 then 0 else 1 end)
over (partition by emplKey) as countViol,
case when salary <= 1000 then 'yes' else 'no' end
as satConds
from employee
where exists (select *
from candidatesSubQuery C where
C.emplKey=employee.emplKey) )
with contribAllSubQuery as (
select emplKey,age,
min(salary) as bottomSalary,
max(salary) as topSalary,
max(rankProjection) over (partition by emplKey)
as countProjection,
countViol
from countViolSubQuery
where satConds='yes'
group by emplKey,age,countViol,rankProjection )
Then, as in the previous example, the rewriting computes the contributions from the
“consistent” and “nonconsistent” tuples. For clarity of presentation, we will assume that
all salaries are positive values (but in the general algorithm we deal with the case of
negative values as well). For the “consistent tuples” (whose contributions are computed
in contribConsistentSubQuery), the bottom and top salaries computed in contribAll-
SubQuery contribute to the greatest lower bounds and lowest upper bounds, respectively.
The top salary also contributes to the lowest upper bound of the “nonconsistent” tuples
(whose contributions are computed in contribNonConsistentSubQuery). However, as
we explained in Chapter 4, the bottom salary does not contribute to the greatest lower
bound. Therefore, the attribute bottomSalary of contribNonConsistentSubQuery gets
a value of zero.
with contribConsistentSubQuery as (
select emplKey,age,
bottomSalary,
topSalary,
1 as bottomCount
from contribAllSubQuery
where countProjection = 1 and countViol=0 )
with contribNonConsistentSubQuery as (
select emplKey,age,
0 as bottomSalary,
topSalary,
0 as bottomCount
from contribAllSubQuery
where countProjection > 1 or countViol >= 1 )
Finally, the main body of the rewriting sums up the contributions of each tuple
to the lower and upper bounds, and projects out the emplKey attribute. Notice that
as in the rewriting for query q5 of the previous example, we have a condition having
sum(bottomCount)>0. This is done because, again, we want to report only the ages that
appear for sure in every repair.
select age,
sum(bottomSalary) as glbSalary,
sum(topSalary) as lubSalary
from
( select * from contribConsistentSubQuery
union all
select * from contribNonConsistentSubQuery ) q
group by age
having sum(bottomCount)>0
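To try the rewriting of q6 end to end, the subqueries can be assembled into a single statement and run on a small instance. The sketch below does this with Python's sqlite3 module, under several assumptions that are not part of the thesis: SQLite (version 3.25 or later, for window functions) stands in for DB2, the separate with blocks are merged into one WITH clause, contribAllSubQuery reads from countViolSubQuery, an order by is added for a deterministic output, and the sample rows and the name run_q6_rewriting are made up.

```python
import sqlite3

# Hypothetical inconsistent instance: (emplKey, age, salary).
SAMPLE_EMPLOYEE = [
    ("e1", 30, 500),
    ("e2", 30, 400), ("e2", 30, 700),
    ("e3", 30, 900), ("e3", 40, 800),
    ("e4", 40, 300), ("e4", 40, 2000),
    ("e5", 40, 600),
    ("e6", 50, 100), ("e6", 50, 5000),
]

# The subqueries of Example 6.7, merged into one WITH clause.
REWRITING_Q6 = """
with candidatesSubQuery as (
  select emplKey from employee where salary <= 1000 ),
countViolSubQuery as (
  select emplKey, age, salary,
         rank() over (partition by emplKey order by age) as rankProjection,
         sum(case when salary <= 1000 then 0 else 1 end)
           over (partition by emplKey) as countViol,
         case when salary <= 1000 then 'yes' else 'no' end as satConds
  from employee
  where exists (select * from candidatesSubQuery C
                where C.emplKey = employee.emplKey) ),
contribAllSubQuery as (
  select emplKey, age,
         min(salary) as bottomSalary,
         max(salary) as topSalary,
         max(rankProjection) over (partition by emplKey) as countProjection,
         countViol
  from countViolSubQuery
  where satConds = 'yes'
  group by emplKey, age, countViol, rankProjection ),
contribConsistentSubQuery as (
  select emplKey, age, bottomSalary, topSalary, 1 as bottomCount
  from contribAllSubQuery
  where countProjection = 1 and countViol = 0 ),
contribNonConsistentSubQuery as (
  select emplKey, age, 0 as bottomSalary, topSalary, 0 as bottomCount
  from contribAllSubQuery
  where countProjection > 1 or countViol >= 1 )
select age, sum(bottomSalary) as glbSalary, sum(topSalary) as lubSalary
from ( select * from contribConsistentSubQuery
       union all
       select * from contribNonConsistentSubQuery ) q
group by age
having sum(bottomCount) > 0
order by age  -- added only to make the output order deterministic
"""


def run_q6_rewriting(rows):
    """Load rows into an in-memory employee table and run the rewriting."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table employee (emplKey text, age integer, salary integer)")
    conn.executemany("insert into employee values (?, ?, ?)", rows)
    result = conn.execute(REWRITING_Q6).fetchall()
    conn.close()
    return result
```

On the sample rows, the sketch should report the range [900, 2100] for age 30 and [600, 1700] for age 40, which agrees with enumerating the repairs by hand. Age 50 is not reported, since its only employee has a tuple with a salary above 1000.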
6.4 Exploiting Precomputed Annotations
The main focus of the thesis is on query processing directly on the inconsistent database.
However, in some circumstances, it may be advantageous to process the database offline
in order to materialize data structures with information about constraint violations. This
precomputed data could then be exploited during online query answering to improve the
performance of the queries.
In this section, we will present a simple offline precomputation scheme, and show the
rewritings that ConQuer produces in order to exploit it. The scheme is based on annota-
tions attached to each tuple. The annotation consists of just one bit that states whether
the tuple satisfies or violates a given key constraint. If annotations are present, then
ConQuer can produce a rewriting that exploits them. We call such a rewriting annotation-
aware. In the next example, we illustrate the annotation-aware rewritings. In the next
section, we will identify the scenarios where it is desirable to exploit the annotations, and
we will empirically validate the effectiveness of the annotation-aware rewritings.
Example 6.8. Let R be a schema with relations employee(emplKey, deptFKey) and
dept(deptKey,mgrName). We will give an example based on an SPJ query without ag-
gregation. However, the example shows all the ingredients of the rewritings on annotated
databases, and extending it to queries with aggregation is straightforward.
Consider a SQL query q7 that retrieves the names of all employees whose department
manager is Peter:
q7: select distinct emplKey
from employee,dept
where employee.deptFKey= dept.deptKey and dept.mgrName='Peter'
Consider the database I = {employee(John, Sales), employee(Mary,Engineering),
dept(Sales, Peter), dept(Sales, Tom), dept(Engineering, Peter)}. Suppose that we in-
struct ConQuer to process the database offline and annotate each tuple with a bit stating
whether it satisfies or violates the constraints of Σ. Assume that ConQuer augments the
set of attributes of each relation with an attribute called cons that stores the annotation.
The “annotated database” produced by ConQuer would then be the following.
employee                           dept
emplKey  deptFKey     cons         deptKey      mgrName  cons
John     Sales        y            Sales        Peter    n
Mary     Engineering  y            Sales        Tom      n
                                   Engineering  Peter    y
Note that the tuple for Mary in relation employee, and the tuple for Engineering in
relation dept have a value of ‘‘y’’ in their cons attributes, meaning that they do not
violate any constraint. If we join these tuples, we get a tuple that satisfies query q7.
Furthermore, it is easy to see that this will be the only tuple in the result for Mary.
Thus, it must be a consistent answer.
In general, the join of consistent tuples (i.e., tuples where cons = ‘‘y’’) produces
a consistent answer. For such tuples, it suffices to check whether the conditions of the
original query are satisfied (in this example, check that they satisfy q7). In this way, we
can avoid the possibly costly operations of the rewritings produced by the algorithms
RewriteForestSQL and RewriteAggSQL. In the rewriting, we capture these tuples in a
subquery called allConsistentSubQuery (allConsistent because they come from the
join of tuples all of which are consistent). The subquery consists of the input query and a
filter that requires every tuple in the join to have a value of ‘‘y’’ in the cons attribute.
with allConsistentSubQuery as (
select distinct emplKey
from employee,dept
where employee.deptFKey= dept.deptKey and dept.mgrName='Peter'
and employee.cons='y' and dept.cons='y' )
Now, note that the tuple for John also satisfies the constraints and has a value of
‘‘y’’ in its cons attribute. However, this tuple joins with the tuples for the Sales
department, which violate the key constraint of their relation (they are annotated with
‘‘n’’). If we join the tuple for John with the tuple dept(Sales, Peter), the result satisfies
q7. But if we join with dept(Sales, Tom), the result does not satisfy the query. Thus,
John is not a consistent answer to q7.
To keep track of the join of tuples that may violate a constraint, we produce a rewrit-
ing that is similar to the one that would be produced by RewriteForestSQL, the only
difference being that we augment the candidatesSubQuery subquery of the rewriting
with a condition checking whether the cons attribute of at least one of the joined tu-
ples is set to ‘‘n’’. In our example, we check the condition employee.cons=‘‘n’’ or
dept.cons=‘‘n’’. The result obtained from these tuples is kept in a subquery called
someNonConsistentSubQuery (the name comes from the fact that some of the tuples of
the join may not be consistent).
with candidatesSubQuery as (
select distinct emplKey
from employee,dept
where employee.deptFKey= dept.deptKey and dept.mgrName='Peter'
and (employee.cons='n' or dept.cons='n') )
with countViolSubQuery as (
select emplKey,
sum(case
when employee.deptFKey=dept.deptKey
and dept.mgrName='Peter' then 0 else 1 end)
as countViol
from employee left outer join dept
on employee.deptFKey=dept.deptKey
where exists (select * from candidatesSubQuery C where C.emplKey=employee.emplKey)
group by emplKey )
with someNonConsistentSubQuery as (
select emplKey
from countViolSubQuery
where countViol = 0)
Finally, the main body of the query takes the union of the tuples obtained with the
subqueries allConsistentSubQuery and someNonConsistentSubQuery.
select emplKey from
(select emplKey from allConsistentSubQuery)
union all
(select emplKey from someNonConsistentSubQuery)
Notice that this rewriting is correct even if annotations incorrectly mark a consistent
tuple as inconsistent. Hence, when deleting or updating a tuple, it is not mandatory to
update annotations.
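The annotation-aware rewriting of q7 can be checked on the annotated database of this example. The sketch below uses Python's sqlite3 module in place of DB2, merges the subqueries into one WITH clause, and joins dept on deptKey. These choices, together with the names REWRITING_Q7 and consistent_answers_q7, are assumptions made for the illustration rather than part of ConQuer.

```python
import sqlite3

# Annotated database of Example 6.8: cons is 'y' (consistent) or 'n'.
EMPLOYEE_I = [("John", "Sales", "y"), ("Mary", "Engineering", "y")]
DEPT_I = [("Sales", "Peter", "n"), ("Sales", "Tom", "n"),
          ("Engineering", "Peter", "y")]

REWRITING_Q7 = """
with allConsistentSubQuery as (
  select distinct emplKey
  from employee, dept
  where employee.deptFKey = dept.deptKey and dept.mgrName = 'Peter'
    and employee.cons = 'y' and dept.cons = 'y' ),
candidatesSubQuery as (
  select distinct emplKey
  from employee, dept
  where employee.deptFKey = dept.deptKey and dept.mgrName = 'Peter'
    and (employee.cons = 'n' or dept.cons = 'n') ),
countViolSubQuery as (
  select emplKey,
         sum(case when employee.deptFKey = dept.deptKey
                   and dept.mgrName = 'Peter' then 0 else 1 end) as countViol
  from employee left outer join dept
       on employee.deptFKey = dept.deptKey
  where exists (select * from candidatesSubQuery C
                where C.emplKey = employee.emplKey)
  group by emplKey ),
someNonConsistentSubQuery as (
  select emplKey from countViolSubQuery where countViol = 0 )
select emplKey from allConsistentSubQuery
union all
select emplKey from someNonConsistentSubQuery
"""


def consistent_answers_q7(employee_rows, dept_rows):
    """Run the annotation-aware rewriting on an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table employee (emplKey text, deptFKey text, cons text)")
    conn.execute("create table dept (deptKey text, mgrName text, cons text)")
    conn.executemany("insert into employee values (?, ?, ?)", employee_rows)
    conn.executemany("insert into dept values (?, ?, ?)", dept_rows)
    answers = sorted(row[0] for row in conn.execute(REWRITING_Q7))
    conn.close()
    return answers
```

With EMPLOYEE_I and DEPT_I the only answer is Mary. If the tuple for Mary is (wrongly) annotated with 'n', she is no longer found by allConsistentSubQuery but is still returned through someNonConsistentSubQuery, which illustrates why stale pessimistic annotations do not affect correctness.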
6.5 Related Work
In this section, we review systems for managing inconsistent databases that are related
to ConQuer. Hippo [CMS04b, CMS04a] is a system that produces consistent answers
for unions of quantifier-free conjunctive queries (that is, unions of queries in the class
presented by Arenas, Bertossi, and Chomicki [ABC99]). Hippo does not consider queries
with aggregation, grouping or bag semantics. Apart from the class of queries that it can
handle, Hippo differs from ConQuer in the fact that it is not based on query rewriting.
Rather, Hippo takes the more procedural approach of producing a Java program which
computes the consistent answers. Although the program does interact with an RDBMS
back-end, most of the processing is done by processing an (in-memory) conflict graph
data structure that contains all the tuples that violate the constraints. The system may
not be able to operate on databases where this data structure does not fit in memory.
Hippo has been shown to scale to databases of up to 300,000 tuples [CMS04b].
There are a number of systems for consistent query answering that rewrite queries into
powerful logics [CB00, LLR02, EFGL03, CB05]. Infomix [EFGL03] is a notable example
of such an approach. In Infomix, queries are rewritten into disjunctive logic programs.
Such programs are computationally more expensive than SQL, but also more expressive
and permit rewritings over a very rich class of query constraints. For example, Infomix
considers general functional, inclusion, and exclusion query constraints. These systems
focus on expressiveness, more than efficiency and scalability, and therefore address a
different design point than the one we are considering. To give an idea of the scale of
the difference, one of the few experimental studies available in the literature [EFGL03]
reports results for databases with at most 100 tuples violating primary key constraints
(over a database of 50,000 tuples). In contrast, the largest database that we used in the
experiments reported in the next chapter has 8.6 million inconsistent tuples (over a total
of 172 million tuples).
Chapter 7
Experimental Analysis
In this chapter, we validate the efficiency of ConQuer’s rewritings using IBM DB2 UDB
Version 8.2 (from now on, referred to as just DB2). In Section 7.1, we give a detailed
description of the experimental framework. Then, in Section 7.2, we report and analyze
the experimental results obtained within this framework.
7.1 Experimental Framework
7.1.1 System and Database Manager Configuration
The experiments were performed on a Sun v40z server class computer with 4 processors
and 8 GB of RAM, running RedHat Linux AS 4 kernel Version 2.6.9. The relational
database management system used to run the queries was IBM DB2 UDB Version 8.2.
We now describe some important parameters in the database configuration. The
buffer pool size was deliberately kept considerably below the system’s available memory.
This is because our aim is to test the overhead of the queries in environments where the
amount of primary memory is small compared to database size. In particular, the buffer
pool size was restricted to 400 MB (whereas the size of the largest database reported
here is 20 GB).
In order to reduce the number of variables to consider when comparing running
times, the query optimizer was set to use a degree of intra-parallelism (parameter
DFT_DEGREE) of 1, meaning that the query plan always uses one processor, even
though there are four available in the system. The query optimization level, which dic-
tates the amount of time that the query optimizer may spend to produce a query plan,
was set to its highest value (parameter DFT_QUERYOPT was set to 9) since the time to
produce a plan is always negligible with respect to the time to execute the fairly complex
queries that we use in our experiments.
For all databases, statistics were created by running the DB2 RUNSTATS command.
The parameters for statistics gathering were set as follows: the number of “most frequent”
values to be collected from each table (parameter NUM_FREQVALUES) was set to 10;
and the number of quantiles for the distributions (parameter NUM_QUANTILES) was
set to 20.
We created clustered indices for the (potentially violated) primary key attributes.
Notice that these indices cannot be declared as “unique” since the database may be
inconsistent. With respect to the annotations introduced in Section 6.4, we added an
attribute called cons to each table, and used it to keep track of whether each tuple satisfies
or violates the primary key constraints. For each relation, we declared a secondary index
on the attributes of the key plus the cons attribute. The values for the cons attributes
are computed offline. However, it is important to point out that in the experimental
results that we report here, this attribute is used only where we explicitly say that the
rewritings are annotation-aware. By default, we assume that the rewritings work on the
inconsistent database without exploiting precomputed information.
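As an illustration of how such an offline annotation pass might look, the following Python sketch computes the cons attribute with SQLite. It is not the procedure used in the experiments: it assumes a single-attribute key, treats a tuple as violating the key constraint exactly when another tuple shares its key value, and uses a made-up function name (annotate_key_violations) and index name.

```python
import sqlite3


def annotate_key_violations(conn, table, key):
    """Add a cons column to `table`: 'n' if the tuple shares its key value
    with another tuple, 'y' otherwise.

    Assumes a single-attribute key and trusted identifiers, since the table
    and column names are spliced directly into the SQL text.
    """
    conn.execute(f"alter table {table} add column cons text")
    conn.execute(
        f"update {table} set cons = case when "
        f"(select count(*) from {table} t2 where t2.{key} = {table}.{key}) > 1 "
        f"then 'n' else 'y' end"
    )
    # Secondary index on the key attribute plus the annotation.
    conn.execute(f"create index {table}_{key}_cons on {table} ({key}, cons)")


def annotate_dept_example():
    """Annotate the dept relation of Example 6.8 and return its tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table dept (deptKey text, mgrName text)")
    conn.executemany(
        "insert into dept values (?, ?)",
        [("Sales", "Peter"), ("Sales", "Tom"), ("Engineering", "Peter")],
    )
    annotate_key_violations(conn, "dept", "deptKey")
    rows = conn.execute(
        "select deptKey, mgrName, cons from dept order by deptKey, mgrName"
    ).fetchall()
    conn.close()
    return rows
```

On the dept relation of Example 6.8, the two Sales tuples are annotated with 'n' and the Engineering tuple with 'y'.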
Regarding the indices of the database, we considered a worst-case and a typical sce-
nario. In the worst-case scenario, the only indices in the database are those for the key
attributes and the annotations. We also considered a more typical scenario, where several
indices are declared. In particular, we created all indices suggested by DB2’s Configu-
ration Advisor. In each database, the size of the indices proposed by the Configuration
Advisor corresponds to a third of the size of the database. The indices are shown in
Appendix B.
7.1.2 Inconsistent Database Instances
For the inconsistent databases, we employed the schema and data of TPC-H, the standard
benchmark for decision support systems. The schema is shown in Figure 7.1. The sizes
of the tables are also shown in Figure 7.1 (under their names), and are given in number
of tuples for a 1 GB instance. For example, the relation lineitem has 6 million tuples on
a 1 GB instance. As per the TPC-H standard, all tables except nation and region are
scaled proportionally to the size of the database (this is indicated with SF in the figure).
Figure 7.1: Schema specified in the TPC-H standard (taken from [TPC03])
The parameters used to build the databases are the following:
• The size s of the database. We considered databases of various sizes, up to 20
GB (172 million tuples). Notice that this size is 50 times larger than the size of the
buffer pool of the database (whose size is 400 MB).
• The percentage p of the database that is inconsistent. For example, on a 1 GB
instance (8.6 million tuples) where p is 25%, there are 2.15 million tuples that
violate the key constraints of the schema. We created the databases in such a way
that every relation has the same value of p as the entire database. We experimented
with values of p ranging from 0% (totally consistent database) to 25%.
• The number of tuples n that share a common key value (and hence violate a key
constraint), for every key value in the inconsistent portion of the database. For
example, if n = 2, then every key value in the inconsistent portion of the database
appears in exactly two tuples. The value is fixed for every tuple of the inconsistent
portion (i.e., every key value of the database appears exactly once or exactly n times). We
experimented with values of n ranging from 2 to 7.
The TPC Consortium provides a data generator called dbgen that produces database
instances compliant with the standard.1 Since the TPC-H standard does not consider
inconsistent databases, dbgen creates instances that do not violate the primary key con-
straints of the schema. For this reason, we modified the source code of dbgen in order to
produce a generator that creates inconsistent databases. The database generator creates
each table as follows. Let l be the total number of tuples to be generated in the table. First,
we generate $l \cdot \left(1 - \frac{p}{100} + \frac{p}{100\,n}\right)$ tuples. Second, we randomly select $l \cdot \frac{p}{100\,n}$ tuples from them.
Third, for each selected tuple $\vec{t}$, we generate n−1 additional tuples by invoking the tuple
generation functions of dbgen. We replace the key values of the n − 1 generated tuples
with the key value of $\vec{t}$.
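The arithmetic of the three steps can be checked with a few lines of Python. The sketch below is not part of the modified dbgen. The function name generation_plan is hypothetical, and the function assumes that l, p and n are integers for which the step sizes come out exact.

```python
def generation_plan(l, p, n):
    """Tuple counts for the three generation steps.

    l: target number of tuples in the table
    p: percentage of tuples that violate the key constraint
    n: number of tuples that share each inconsistent key value
    """
    if (l * p) % (100 * n) != 0:
        raise ValueError("l*p must be divisible by 100*n for exact counts")
    selected = l * p // (100 * n)            # step 2: l * p / (100 n)
    initial = l - l * p // 100 + selected    # step 1: l (1 - p/100 + p/(100 n))
    additional = selected * (n - 1)          # step 3: n-1 new tuples per pick
    return {
        "initial": initial,
        "selected": selected,
        "additional": additional,
        "total": initial + additional,       # always equals l
        "inconsistent": selected * n,        # always equals l * p / 100
    }
```

For instance, treating the 8.6 million tuples of a 1 GB instance as a single table with p = 25 and n = 2, the plan generates 7,525,000 tuples first, selects 1,075,000 of them, and adds 1,075,000 more, for a total of 8.6 million tuples of which 2.15 million violate the key constraint.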
7.1.3 Workload
The experiments were performed using queries specified in the TPC-H standard. There
are twenty-two queries in the standard, twelve of which are aggregate conjunctive queries,
the type of queries that we handle in this work. The other ten queries have features
that are beyond aggregate conjunctive queries, such as aggregation in nested subqueries
(Queries 2, 11, 15, 17, 18 and 20 of the specification), left outer joins (Query 13), and
negation (Queries 16, 21, and 22).
1 The database generator can be obtained from the TPC Consortium’s website at http://www.tpc.org
In our experiments, we will focus on eleven queries from the TPC-H specification
(Queries 1, 3, 4, 6, 7, 8, 9, 10, 12, 14, and 19). The original TPC-H queries together with
their rewritings are given in Appendix A. Notice that, of the twelve aggregate conjunctive
queries, we rule out only one query. This is Query 5 of the specification, which contains
a nonkey-to-nonkey join, which we cannot handle with our query rewriting algorithm.
(Following the results of Chapter 5, Query 5 is in class C∗ and thus has no query rewriting
into SQL). Of the eleven queries that we consider, six are strictly in class Csqlaggforest
(Queries 3, 4, 6, 9, 10, and 12), and the other five can be handled with our rewriting
algorithm RewriteAggSQL with little or no modification for the following reasons. First,
Queries 7 and 8 have repeated relation symbols that appear at leaf nodes of the join
graph. The algorithm RewriteAggSQL can handle this case, since the nonkey variables of
these repeated relation symbols are not involved in any join. Second, Queries 7 and 19
have disjunctions involving equalities of attributes to constants. We showed in Chapter
3 that it is quite easy to extend the algorithm that produces a first-order rewriting to
handle this case, and the SQL rewriting algorithm RewriteAggSQL of Chapter 6 can
be used for such cases without modification (the disjunction is considered part of the
selection conditions in the expression CONDS of Figure 6.8). Finally, Queries 8 and
14 perform an arithmetic operation (division) on the result of two aggregate operators,
and Query 1 computes an average. In such cases, we give bounds that are sound, but
not tight.2
In Figure 7.2, we summarize the main characteristics of the eleven queries used in
the experiments. For each query, we give the number of relations in the from clause,
the number of selection conditions in the where clause (this excludes join conditions),
the selectivity (as the percentage of joined tuples that satisfy the selection conditions of
the query), the number of projecting attributes in the select clause, and the number of
aggregate functions in the select clause. The queries in the TPC-H specification are pa-
rameterized, and the standard suggests values for these parameters. In the experiments,
we used the suggested values in all the queries. The selectivities reported in Figure 7.2
are based on these parameters.
2 For the queries with the sum operator, all ranges are tight since the queries in the TPC-H standard only aggregate over attributes with positive values.
In this section, we report the results of the experiments that we performed in order to
quantify the overhead of the rewritings produced by ConQuer.
7.2.1 Scalability
In this subsection, we study the scalability of ConQuer’s approach. In particular, we
show the effect of the size of the inconsistent databases on the overhead of the rewritten
queries. In Figure 7.3, we report the overhead of the eleven rewritten queries on a number
of databases where we fix the degree of inconsistency to 5% of the database (p = 5%), and
2 conflicts per inconsistent key value (n = 2). The size of the databases (reported on the
x-axis) ranges from 1 GB to 20 GB (that is, from 8.6 million tuples to 172 million tuples).
The databases are generated independently of each other, and correspond to the scenario
where indices are created only for the key attributes. On the y-axis, we report the
overhead of the rewritten queries, computed as the ratio of the running time of
the rewritten query to the running time of the original (non-rewritten) query. The
rewritings reported in the figure do not exploit annotations (i.e., they are unaware of
annotations, if any, computed as explained in Section 6.4).
For presentation purposes, we split the queries into three graphs. The queries are
grouped based on the behaviour of the overhead as the size of the databases increases.
The graph at the top shows queries where the overhead initially increases, but then
remains constant or decreases (Queries 1, 7, 12, 14). The graph in the middle shows
queries where the overhead increases monotonically with the size of the database (Queries
3, 8, 10). The rest of the queries are shown in the graph at the bottom (Queries 4, 6, 9,
19).
We identified two factors that have a significant impact on the overhead of the rewrit-
ings: the selectivity of the original queries, and the query plans chosen by DB2’s opti-
mizer. Let us start with the selectivity of the queries. To understand its effect, recall
that in the SQL rewriting algorithm RewriteAggSQL of Figure 6.8, there is a subquery
called candidatesSubQuery that is designed to exploit the selectivity of the original
queries. In particular, this subquery returns only the values for the root-key attributes
that satisfy the conditions of the original query. More specifically, let q be a query,
K1, . . . , Kn be the attributes that appear at some root of the join graph of q, and CONDS
be the selection conditions of q. Then, the rewriting produced by RewriteAggSQL(q, Σ)
has a subquery of the following form:
with candidatesSubQuery as (
select K1 as cK1,. . . ,Kn as cKn
from <list of relations in q>
where CONDS )
Clearly, the lower the selectivity of the original query q, the fewer tuples are returned
by candidatesSubQuery. The rest of the rewriting operates on the result of the following
CREATE SUMMARY TABLE "DB2ADMIN"."MQT609161518000000" AS (SELECT Q4.C0 AS "C0", Q4.C1 AS "C1",
Q4.C2 AS "C2", Q4.C5 AS "C3", Q4.C4 AS "C4", Q4.C3 AS "C5", Q4.C6 AS "C6"
FROM TABLE(SELECT Q3.C0 AS "C0", SUM(Q3.C1) AS "C1", SUM(Q3.C2) AS "C2", Q3.C5 AS "C3", Q3.C4 AS "C4", Q3.C3 AS "C5", COUNT(* ) AS "C6" FROM TABLE(SELECT Q1.L_SHIPMODE AS "C0", CASE WHEN ((Q2.O_ORDERPRIORITY = ’1-URGENT ’) OR (Q2.O_ORDERPRIORITY = ’2-HIGH ’)) THEN 1 ELSE 0 END AS "C1", CASE WHEN ((Q2.O_ORDERPRIORITY <> ’1-URGENT ’) AND (Q2.O_ORDERPRIORITY <> ’2-HIGH ’)) THEN 1 ELSE 0 END AS "C2", Q1.L_RECEIPTDATE AS "C3", Q1.L_SHIPDATE AS "C4", Q1.L_COMMITDATE AS "C5" FROM DB2ADMIN.LINEITEM AS Q1, DB2ADMIN.ORDERS AS Q2 WHERE (Q2.O_ORDERKEY = Q1.L_ORDERKEY)) AS Q3 GROUP BY Q3.C3, Q3.C4, Q3.C5, Q3.C0) AS Q4) DATA INITIALLY DEFERRED REFRESH IMMEDIATE IN USERSPACE1 ;
-- index[1], 990.099MB
CREATE INDEX "DB2ADMIN"."IDX609161532510000" ON "DB2ADMIN"."LINEITEM" ("L_PARTKEY" ASC,