– Static analysis: consistency and implication, axiom system
Conditional inclusion dependencies (CINDs; Chapter 3)
– Syntax and semantics
– Static analysis: consistency and implication
Matching dependencies for record matching (MDs; Chapter 4)
– Syntax and semantics
– Relative candidate keys
TDD: Topics in Distributed Databases
Characterizing the consistency of data
One of the central technical problems for data consistency is how
to tell whether the data is dirty or clean
Integrity constraints (data dependencies) as data quality rules
Inconsistencies emerge as violations of constraints
Traditional dependencies:
– functional dependencies
– inclusion dependencies
– denial constraints (a special case of full dependencies)
– . . .
Question: are these traditional dependencies sufficient?
Example: customer relation
Schema: Cust(country, area-code, phone, street, city, zip)
Instance:
country  area-code  phone    street        city  zip
44       131        1234567  Mayfield      NYC   EH4 8LE
44       131        3456789  Crichton      NYC   EH4 8LE
01       908        3456789  Mountain Ave  NYC   07974
Functional dependencies (FDs):
cust[country, area-code, phone] → cust[street, city, zip]
cust[country, area-code] → cust[city]
The database satisfies the FDs. Is the data consistent?
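The claim that the instance satisfies both FDs can be checked mechanically. The following is a minimal sketch, assuming rows are held as Python dictionaries; the helper name fd_violations and the attribute spelling area_code are illustrative choices, not part of the slides.

```python
from itertools import combinations

ATTRS = ("country", "area_code", "phone", "street", "city", "zip")
ROWS = [
    ("44", "131", "1234567", "Mayfield", "NYC", "EH4 8LE"),
    ("44", "131", "3456789", "Crichton", "NYC", "EH4 8LE"),
    ("01", "908", "3456789", "Mountain Ave", "NYC", "07974"),
]
cust = [dict(zip(ATTRS, r)) for r in ROWS]


def fd_violations(rows, lhs, rhs):
    """Return index pairs (i, j) of rows that agree on lhs but differ on rhs."""
    bad = []
    for (i, t1), (j, t2) in combinations(enumerate(rows), 2):
        same_lhs = all(t1[a] == t2[a] for a in lhs)
        same_rhs = all(t1[a] == t2[a] for a in rhs)
        if same_lhs and not same_rhs:
            bad.append((i, j))
    return bad


# FD 1: [country, area-code, phone] -> [street, city, zip]
fd1 = fd_violations(cust, ["country", "area_code", "phone"], ["street", "city", "zip"])
# FD 2: [country, area-code] -> [city]
fd2 = fd_violations(cust, ["country", "area_code"], ["city"])
print(fd1, fd2)  # both lists are empty: no pair of rows violates either FD
```

Both checks come back empty, so the FDs alone do not flag anything in this instance.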
Capturing inconsistencies in the data
cust([country = 44, zip] → [street])
In the UK, zip code uniquely determines the street
The constraint may not hold for other countries
It expresses a fundamental part of the semantics of the data
It can NOT be expressed as a traditional FD
– It does not hold on the entire relation; instead, it holds on
cust([country, area-code, phone] → [street, city, zip])
as a SINGLE CFD: (cust(country, area-code, phone → street, city, zip), Tp)
pattern tableau Tp: one tuple for each constraint
Example CFDs
country area-code phone street city zip
44 131 _ _ Edi _
01 908 _ _ MH _
_ _ _ _ _ _
CFDs subsume traditional FDs. Why?
Express
cust[country, area-code] → cust[city]
as a CFD: (cust(country, area-code → city), Tp)
pattern tableau Tp: a single tuple consisting of _ only
Traditional FDs as a special case
country area-code city
_ _ _
a ≍ b (a matches b) if
– either a or b is _
– both a and b are constants and a = b
tuple t1 matches t2: t1 ≍ t2
(a, b) ≍ (a, _), but (a, b) does not match (a, c)
DB satisfies (R: X → Y, Tp) iff for any tuple tp in the pattern tableau Tp and for any tuples t1, t2 in DB, if t1[X] = t2[X] ≍ tp[X], then t1[Y] = t2[Y] ≍ tp[Y]
– tp[X]: identifying the set of tuples on which the constraint tp applies, i.e., { t | t[X] ≍ tp[X] }
– t1[Y] = t2[Y] ≍ tp[Y]: enforcing the embedded FD, and the pattern of tp
Semantics of CFDs
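The semantics above translates almost line by line into a checker. The sketch below is illustrative only: it assumes rows and pattern tuples are dictionaries, writes the wildcard as the string "_", and uses the hypothetical names matches and cfd_violations. Pairs are enumerated with i <= j so that a single tuple that matches tp[X] but not tp[Y] is reported as (i, i, k).

```python
from itertools import combinations_with_replacement

WILDCARD = "_"


def matches(a, b):
    """a ≍ b: one side is the wildcard, or both are the same constant."""
    return a == WILDCARD or b == WILDCARD or a == b


def cfd_violations(rows, lhs, rhs, tableau):
    """Return triples (i, j, k): rows i and j (i <= j) violate pattern tuple k."""
    found = []
    for k, tp in enumerate(tableau):
        for i, j in combinations_with_replacement(range(len(rows)), 2):
            t1, t2 = rows[i], rows[j]
            # Premise: t1[X] = t2[X] ≍ tp[X]
            if any(t1[a] != t2[a] for a in lhs):
                continue
            if not all(matches(t1[a], tp[a]) for a in lhs):
                continue
            # Conclusion: t1[Y] = t2[Y] ≍ tp[Y]
            agree = all(t1[a] == t2[a] for a in rhs)
            fit = all(matches(t1[a], tp[a]) for a in rhs)
            if not (agree and fit):
                found.append((i, j, k))
    return found


# The Cust instance from the earlier slide (phone omitted; it is not used here).
cust = [
    {"country": "44", "area_code": "131", "street": "Mayfield", "city": "NYC", "zip": "EH4 8LE"},
    {"country": "44", "area_code": "131", "street": "Crichton", "city": "NYC", "zip": "EH4 8LE"},
    {"country": "01", "area_code": "908", "street": "Mountain Ave", "city": "NYC", "zip": "07974"},
]

# cust([country = 44, zip] -> [street])
uk_zip = cfd_violations(
    cust, ["country", "zip"], ["street"],
    [{"country": "44", "zip": "_", "street": "_"}],
)

# (cust(country, area-code -> city), Tp) with the Edi / MH / wildcard tableau
city = cfd_violations(
    cust, ["country", "area_code"], ["city"],
    [
        {"country": "44", "area_code": "131", "city": "Edi"},
        {"country": "01", "area_code": "908", "city": "MH"},
        {"country": "_", "area_code": "_", "city": "_"},
    ],
)
print(uk_zip)  # [(0, 1, 0)]
print(city)    # [(0, 0, 0), (0, 1, 0), (1, 1, 0), (2, 2, 1)]
```

The quadratic pair enumeration is chosen for readability; it mirrors the definition rather than aiming at efficiency.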
cust([country = 44, zip] → [street])
Tuples t1 and t2 violate the CFD:
– t1[country, zip] = t2[country, zip] ≍ tp[country, zip]
– t1[street] ≠ t2[street]
The CFD applies to t1 and t2 since they match tp[country, zip]
Example: violation of CFDs
id  country  area-code  phone    street        city  zip
t1  44       131        1234567  Mayfield      NYC   EH8 8LE
t2  44       131        3456789  Crichton      NYC   EH8 8LE
t3  01       908        3456789  Mountain Ave  NYC   07974
country zip street
44 _ _
CFDs: enforcing binding of semantically related data values
(cust(country, area-code → city), Tp)
Tuple t1 does not satisfy the CFD:
– t1[country, area-code] = t1[country, area-code] ≍ tp1[country, area-code]
– t1[city] = t1[city]; however, t1[city] does not match tp1[city]
In contrast to traditional FDs, a single tuple may violate a CFD
Violation of CFDs by a single tuple
id country area-code city
tp1 44 131 Edi
tp2 01 908 MH
tp3 _ _ _
id  country  area-code  phone    street        city  zip
t1  44       131        1234567  Mayfield      NYC   EH8 8LE
t2  44       131        3456789  Crichton      NYC   EH8 8LE
t3  01       908        3456789  Mountain Ave  NYC   07974
(cust(country, area-code, phone → street, city, zip), Tp)
Tuple t3 violates the CFD. Why?
Tuples t1 and t4 violate the CFD. Why?
Exercise
id  country  area-code  phone    street        city  zip
t1  44       131        1234567  Mayfield      Edi   EH4 8LE
t2  44       131        3456789  Mayfield      NYC   19082
t3  01       908        3456789  Mountain Ave  NYC   19082
t4  44       131        1234567  Crichton      EDI   EH8 9LE
id   country  area-code  phone  street  city  zip
tp1  44       131        _      _       Edi   _
tp2  01       908        _      _       MH    _
tp3  _        _          _      _       _     _
“Dirty” constraints?
A set of CFDs may be inconsistent!
Inconsistent: (R(A → B), Tp)
In any nonempty database DB and for any tuple t in DB,
– tp1: t[B] must be b
– tp2: t[B] must be c
– Inconsistent if b and c are different
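One way to see the inconsistency concretely is a brute-force search for a witness. CFD satisfaction is quantified over the tuples of the instance, so any nonempty subset of a satisfying instance also satisfies the CFDs; it is therefore enough to look for a single satisfying tuple. The sketch below makes two assumptions that are not stated on the slide: attribute domains are infinite, so one placeholder value (FRESH) can stand for every constant not mentioned in a tableau, and the two pattern tuples are (_, b) and (_, c), which is what the bullets imply. The names find_witness and tuple_ok are hypothetical.

```python
from itertools import product

WILDCARD = "_"
FRESH = "<other>"  # assumption: stands for any value not mentioned in a tableau


def matches(a, b):
    return a == WILDCARD or b == WILDCARD or a == b


def tuple_ok(t, cfds):
    """True if the single tuple t satisfies every pattern tuple of every CFD."""
    for lhs, rhs, tableau in cfds:
        for tp in tableau:
            applies = all(matches(t[a], tp[a]) for a in lhs)
            if applies and not all(matches(t[a], tp[a]) for a in rhs):
                return False
    return True


def find_witness(attrs, cfds):
    """Return one tuple satisfying all CFDs, or None if no such tuple exists."""
    domains = {a: {FRESH} for a in attrs}
    for lhs, rhs, tableau in cfds:
        for tp in tableau:
            for a in lhs + rhs:
                if tp[a] != WILDCARD:
                    domains[a].add(tp[a])
    # Exhaustive search: exponential in the number of attributes.
    for values in product(*(sorted(domains[a]) for a in attrs)):
        t = dict(zip(attrs, values))
        if tuple_ok(t, cfds):
            return t
    return None


# The slide's example: every tuple must have B = b and B = c at the same time.
dirty = [(["A"], ["B"], [{"A": "_", "B": "b"}, {"A": "_", "B": "c"}])]
# A consistent variant: the two patterns apply to different A values.
clean = [(["A"], ["B"], [{"A": "a1", "B": "b"}, {"A": "a2", "B": "c"}])]

print(find_witness(["A", "B"], dirty))  # None
print(find_witness(["A", "B"], clean) is not None)  # True
```

With finite domains the FRESH placeholder is no longer justified, and the candidate values would have to be enumerated from the actual domains instead.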
Two instances are needed to cope with the dynamic semantics
(D1, D2) satisfies φ iff for all (t1, t2) ∈ D1,
– if t1[A1] ≈1 t2[B1] ∧ . . . ∧ t1[Ak] ≈k t2[Bk] in D1
– then (t1, t2) ∈ D2, and t1[Z1] = t2[Z2] in D2
If (t1, t2) match the LHS, then their RHS are updated and equalized
D1:
phn      post …
3256777  PO Box 25, EDI
tel      address …
3256777  10 Oak St, EDI
D2:
phn      post …
3256777  10 Oak St, EDI, EH8 9LE
tel      address …
3256777  10 Oak St, EDI, EH8 9LE
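The two-instance semantics can be sketched as follows: the LHS is always evaluated on D1, and the RHS values are equalized in a copy that plays the role of D2. Everything named here is an assumption made for illustration: the functions enforce_md and satisfies_md, plain equality as the similarity operator on tel and phn, and the resolve placeholder that simply keeps the lexicographically smaller value. The MD itself does not say which value to keep, and the enriched address shown in the slide's D2 cannot be produced by this placeholder.

```python
import copy
import operator


def lhs_holds(t1, t2, lhs):
    """All similarity tests (attr of R1, attr of R2, operator) succeed."""
    return all(sim(t1[a], t2[b]) for a, b, sim in lhs)


def enforce_md(r1, r2, lhs, rhs, resolve):
    """Return updated copies of r1 and r2 in which matching pairs agree on rhs."""
    z1, z2 = rhs
    s1, s2 = copy.deepcopy(r1), copy.deepcopy(r2)
    changed = True
    while changed:  # repeat until stable; terminates for a monotone resolve such as min
        changed = False
        for i, t1 in enumerate(r1):  # LHS is tested on the original instance D1
            for j, t2 in enumerate(r2):
                if lhs_holds(t1, t2, lhs):
                    v = resolve(s1[i][z1], s2[j][z2])
                    if s1[i][z1] != v or s2[j][z2] != v:
                        s1[i][z1] = s2[j][z2] = v
                        changed = True
    return s1, s2


def satisfies_md(d1, d2, lhs, rhs):
    """(D1, D2) satisfies the MD: pairs matching in D1 have equal rhs values in D2."""
    (r1, r2), (s1, s2) = d1, d2
    z1, z2 = rhs
    for i, t1 in enumerate(r1):
        for j, t2 in enumerate(r2):
            if lhs_holds(t1, t2, lhs) and s1[i][z1] != s2[j][z2]:
                return False
    return True


r1 = [{"tel": "3256777", "address": "10 Oak St, EDI"}]
r2 = [{"phn": "3256777", "post": "PO Box 25, EDI"}]
md_lhs = [("tel", "phn", operator.eq)]  # R1[tel] = R2[phn]
md_rhs = ("address", "post")            # R1[address] ⇌ R2[post]

s1, s2 = enforce_md(r1, r2, md_lhs, md_rhs, resolve=min)
print(s1[0]["address"], "|", s2[0]["post"])
print(satisfies_md((r1, r2), (s1, s2), md_lhs, md_rhs))  # True
print(satisfies_md((r1, r2), (r1, r2), md_lhs, md_rhs))  # False
```

The last line shows why one instance is not enough: read statically on D1 alone, the matching pair still disagrees on the address.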
Different from FDs?
An extension of functional dependencies (FDs)?
A departure from traditional dependency theory
D1:
tel      address …
3256777  10 Oak St, EDI
3256777  PO Box 25, EDI
D2:
tel      address …
3256777  10 Oak St, EDI, EH8 9LE
3256777  10 Oak St, EDI, EH8 9LE
– similarity operators vs. equality (=) only
– across different relations (R1, R2) vs. on a single relation
– dynamic semantics (matching operator ⇌) vs. static semantics
FD: tel → address
– developed for schema design for “clean” data
MD: (R1[A1] ≈1 R2[B1] ∧ . . . ∧ R1[Ak] ≈k R2[Bk]) → R1[Z1] ⇌ R2[Z2]
– to accommodate unreliable data
The example: a violation of the FD, yet satisfying the MD
Deduction of new MDs from given MDs
Given a set Σ of MDs and a single MD φ, can φ be deduced from Σ?
Different from our familiar notion of implication
For all (D1, D2), if (D1, D2) satisfies Σ and (D2, D2) satisfies Σ, then (D1, D2) satisfies φ
D1 → D2: a “fixpoint” reached by enforcing Σ
No matter how ⇌ is interpreted, if Σ is enforced, so is φ
A departure from candidate keys: similarity, different sources
Ultimate goal: to decide whether R1[X] and R2[Y] refer to the same object
– keys relative to R1[X] and R2[Y]
– what to compare and how to compare
What is special about RCKs?
Matching rules: identify records from unreliable data sources
The match quality is highly dependent on the choices of keys
Optimization: efficiency is a big issue for record matching
– blocking: only records in the same block are compared
– windowing (sorted neighborhood): window of a fixed size; only records in the same window are compared
[Figure: blocking partitions D into blocks B1, B2, B3 via discriminating attributes; windowing sorts D via keys and moves a sliding window over it]
RCKs can be deduced from matching dependencies
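Both optimizations only decide which pairs of records get compared at all. A minimal sketch of the candidate-pair generation, assuming records are dictionaries and the key is a caller-supplied function; the function names and the sample records are illustrative, not taken from the slides.

```python
from collections import defaultdict
from itertools import combinations


def blocking_pairs(records, key):
    """Blocking: compare only records that fall into the same block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(ids, 2))  # ids are ascending, so pairs are (small, large)
    return pairs


def windowing_pairs(records, key, w):
    """Sorted neighborhood: sort by key, compare records within a window of size w."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + w]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical records; zip plays the role of the discriminating attribute.
records = [
    {"name": "A", "zip": "EH8 9LE"},
    {"name": "B", "zip": "EH8 9LE"},
    {"name": "C", "zip": "07974"},
    {"name": "D", "zip": "EH4 8LE"},
]
by_zip = lambda r: r["zip"]

print(sorted(blocking_pairs(records, by_zip)))         # [(0, 1)]
print(sorted(windowing_pairs(records, by_zip, w=2)))   # [(0, 1), (0, 3), (2, 3)]
```

With four records there are six possible pairs; blocking keeps one and a window of size 2 keeps three, which illustrates why the choice of key drives both cost and match quality.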
Summary and review
What are CFDs? CINDs? Why do we need new constraints?
What is the consistency problem? Complexity?
What is the implication problem? Inference system? Sound and complete?
What is record matching? Why bother?
What are matching rules?
Why matching dependencies and relative candidate keys? Why are they different from functional dependencies and relational keys?
What are blocking and windowing?
Problems to think about:
– CFDs across different relations? With numeric values?
– Negative rules for MDs: if condition then NO match
– Conditions for MDs: if then match
Project (1)
Given a relational database D and a set S of data quality rules, error detection is to find all pairs of tuples in D that violate at least one dependency in S. Develop two MapReduce algorithms for error detection, when S consists of
Prove the correctness of your algorithms, give their complexity analyses, and show that they are scalable with both D and S
Experimentally evaluate your algorithms, by randomly generating dependencies in S and large datasets D
A development project
Project (2)
Given a relation D and a set S of matching dependencies, record matching is to identify all pairs of tuples in D based on dependencies in S. For matching dependencies, read
W. Fan, J. Li, X. Jia, and S. Ma. Dynamic constraints for record matching, VLDB, 2009.
Develop a MapReduce algorithm for record matching.
Prove the correctness of your algorithm, give its complexity analyses, and show that it is scalable with both D and S
Develop optimization techniques to improve its response time
Experimentally evaluate your algorithm, by randomly generating dependencies in S and large datasets D
Development project
Project (3)
Write a survey on integrity constraints studied for improving data quality, including
• Traditional functional and inclusion dependencies
• Conditional functional and inclusion dependencies
• Denial constraints, and
• at least two other forms of constraints
Develop a set of criteria for evaluating these data quality rules, in terms of expressive power, complexity, and practical impact, etc.
Critically evaluate each of these classes based on your criteria
Propose possible extensions to these constraints as data quality rules for big data.
A survey project
Reading for the next week
http://homepages.inf.ed.ac.uk/wenfei/publication.html
1. P. Bohannon, W. Fan, M. Flaster, and R. Rastogi, A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification, SIGMOD 2005.
2. W. Fan, F. Geerts, S. Ma, and H. Muller, Detecting inconsistencies in distributed data, ICDE 2010.
3. W. Fan, J. Li, N. Tang, and W. Yu, Incremental detection of inconsistencies in distributed data, TKDE 26(6), 2014.
4. W. Fan, J. Li, S. Ma, N. Tang and W. Yu, Interaction between Record Matching and Data Repairing, SIGMOD, 2011.
5. W. Fan, J. Li, S. Ma, N. Tang, and W. Yu, Towards certain fixes with editing rules and master data, VLDB 2010.
6. M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB 2011. https://cs.uwaterloo.ca/~ilyas/papers/YakoutVLDB2011.pdf