1 Schema Refinement and Normal Forms UMass Amherst Feb 14, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke, Dan Suciu
1
Schema Refinement and Normal Forms
UMass AmherstFeb 14, 2007
Slides Courtesy of R. Ramakrishnan and J. Gehrke, Dan Suciu
2
Relational Schema Design
PersonbuysProduct
name
price name ssn
ER Model
Relational Schemaplus Integrity Constraints
Normalized schema
Conceptual Design
Logical design
SchemaRefinement
3
The Evils of Redundancy
Redundancy is at the root of several problems associated with relational schemas: redundant storage insert anomaly delete anomaly update anomaly
Integrity constraints, in particular functional dependencies, can be used to identify schemas with such problems and to suggest refinements.
4
Schema Refinement
Main refinement technique: decomposition E.g., replacing ABCD with AB and BCD, or ACD and
ABD.
Decomposition should be used judiciously: Is there reason to decompose a relation? Theory on
normal forms. What problems (if any) does the decomposition cause?
Properties of decomposition include lossless-join and dependency-preserving.
Decomposition can cause performance problems.
5
Functional Dependencies
Functional Dependency:
A1, A2, …, An B1, B2, …, Bm
Meaning:
If two tuples agree on the attributes
then they must also agree on the attributes
A1, A2, …, An
B1, B2, …, Bm
Table R(.... A1, A2, …, An… B1, B2, …, Bm … )
Example
ssn name lot rating wages hours
123-22-3666 Attishoo 48 8 10 40
231-31-5386 Smiley 22 8 10 30
131-24-3650 Smethurst 35 5 7 30
434-26-3751 Guldu 35 5 7 32
612-67-4134 Madayan 35 8 10 40
Hourly_Emps (ssn, name, lot, rating, hrly_wages, hrs_worked)
Notation: We will denote this relation schema by listing the attributes: SNLRWH
� This is really the set of attributes {S,N,L,R,W,H}.
Some FDs on Hourly_Emps:� S SNLRWH � rating determines hrly_wages: R W
8
Functional Dependencies (FDs)
A functional dependency X Y holds over relation R if ∀ allowable instance r of R: t1 r, t2 r, (t1) = (t2) implies (t1) = (t2), X and Y are sets of attributes.
An FD is a statement about all allowable relations. Must be identified based on semantics of application. Given an allowable instance r1 of R, we can check if r1
violates some FD f, but we cannot tell if f holds over R! A superkey is a set of attributes K such that K B
for all attributes B. A key is a minimal superkey
Example (Contd.)
Problems due to R W : Redundant storage Update anomaly: Can
we change W in just the 1st tuple of SNLRWH?
Insertion anomaly: What if we want to insert an employee and don’t know the hourly wage for his rating?
Deletion anomaly: If we delete all employees with rating 5, we lose the information about the wage for rating 5!
Will 2 smaller tables be better?
S N L R W H
123-22-3666
Attishoo 48 8 10 40
231-31-5386
Smiley 22 8 10 30
131-24-3650
Smethurst 35 5 7 30
434-26-3751
Guldu 35 5 7 32
612-67-4134
Madayan 35 8 10 40
Hourly_Emps2Wages
R W8 10
5 7
S N L R H
123-22-3666
Attishoo 48 8 40
231-31-5386
Smiley 22 8 30
131-24-3650
Smethurst 35 5 30
434-26-3751
Guldu 35 5 32
612-67-4134
Madayan 35 8 40
10
Reasoning About FDs
Given some FDs, we can usually infer additional FDs: ssn did, did lot implies ssn lot
An FD f is implied by a set of FDs F, if f holds for every reln instance that satisfies all FDs in F. F+ = Closure of F is the set of all FDs that are implied by F.
Armstrong’s Axioms (X, Y, Z are sets of attributes): Reflexivity: If X ⊆ Y, then Y X Augmentation: If X Y, then XZ YZ for any Z Transitivity: If X Y and Y Z, then X Z
11
Reasoning About FDs (Contd.)
Additional rules (that follow from AA): Union: If X Y and X Z, then X YZ Decomposition: If X YZ, then X Y and X Z
These are sound and complete inference rules for FDs! Soundness: when applied to a set F of FDs, the axioms
generate only FDs in F+. Completeness: repeated application of these axioms will
generate all FDs in F+.
Example (continued)
From:1. name color2. category department3. color, category price
Inferred FD Which Ruledid we apply ?
4. name, category name5. name, category color6. name, category category7. name, category color, category8. name, category price
name, category priceTo:
ReflexivityTransitivity on 4, 1ReflexivityUnion on 5, 6Transitivity on 3, 7
13
Reasoning About FDs (Contd.)
Computing the closure F+ can be expensive: computes for all FD’s; size of closure is exponential in # attrs!
Typically, we just want to check if a given FD X Y is in F+. An efficient check: Compute attribute closure of X (denoted X+) w.r.t. F, i.e., the
largest attribute set A such that X A is in F+. Simple algorithm: DO if there is U V in F s.t. U ⊆ X+, then X+= X+∪V UNTIL no change Check if Y is in X+.
Does F = {A B, B C, C D E } imply A E? i.e., is A E in the closure F+? Equivalently, is E in A+?
Computing Keys
Compute X+ for all sets X If X+ = all attributes, then X is a superkey Consider only the minimal superkeys
Enrollment(student, address, course, room, time)student addressroom, time coursestudent, course room, time
Please compute all keys.
15
Normal Forms
Returning to the issue of schema refinement, the first question to ask is whether any refinement is needed!
If a relation is in a certain normal form (BCNF, 3NF etc.), it is known that certain redundancy related problems are avoided/minimized. This helps us decide if decomposing the relation will help.
Role of FDs in detecting redundancy: Consider a relation R with 3 attributes, ABC.
• No FDs hold: There is no redundancy here.• Given A B: Several tuples could have the same A
value, and if so, they’ll all have the same B value!
16
Boyce-Codd Normal Form (BCNF) Given a relation R, and set of FD’s F on R R is in BCNF if:
For each FD X A, one of the following is true:• A ∈ X (called a trivial FD), or• X is a superkey (i.e., contains a key) for R.
“The only non-trivial FDs that hold over R are key constraints.”
Equivalently: for any set of attributes X, either X+ = X or X+ = all attributes
Example
Is the following table in BCNF? R(A,B,C,D) FDs: B AD
Key is BC, so B is not a superkey Not in BCNF
17
18
Boyce-Codd Normal Form (BCNF)
Can we infer the value marked by ‘?’ ? Is the relation in BCNF?
If a reln is in BCNF, every field of every tuple records a piece of information that can’t be inferred (using only FD’s) from values in other fields.
BCNF ensures that no redundancy can be detected using FDs!
Third Normal Form (3NF)
R is in 3NF if: For each X A one of the following is true:
• A ∈ X (called a trivial FD), or• X is a superkey for R, or• A is part of some key for R.
Minimality of a key is crucial in third condition above! If R is in BCNF, obviously in 3NF. If R is in 3NF, some redundancy is possible.
E.g., Reserves {Sailor, Boat, Date, Credit_card} with S C, C S is in 3NF. But for each reservation of sailor S, same (S, C) pair is stored.
Lossless-join, dependency-preserving decomposition of R into a collection of 3NF relations always possible. (not true for BCNF!)
20
Hierarchy of Normal Forms 1st normal form 1NF: no set-valued attributes. 2nd normal form 2NF: [historical interest only] 3rd normal form 3NF Boyce-Codd normal form BCNF: 3NF, and no
XA s.t. A is part of a key. No redundancy detected by FDs.
4th normal form 4NF: BCNF and no multi-valued dependencies (MVD). No redundancy detected by FDs and MVD. We won’t discuss in detail in this class.
21
Decomposition of a Relation Scheme
A decomposition of R replaces R by two or more relations such that: Each new relation scheme contains a subset of the attributes
of R (and no attributes that do not appear in R), and Every attribute of R appears as an attribute of at least one
new relation. As a result, we will store instances of the relation
schemes produced by the decomposition, instead of instances of R.
22
Decompositions in General
R1 = projection of R on A1, ..., An, B1, ..., Bm R2 = projection of R on A1, ..., An, C1, ..., Cp
R(A1, ..., An, B1, ..., Bm, C1, ..., Cp)
R1(A1, ..., An, B1, ..., Bm) R2(A1, ..., An, C1, ..., Cp)
23
Example Decomposition
Decompositions should be used only when needed. SNLRWH has FDs S SNLRWH and R W. R W causes violation of 3NF; W values repeatedly
associated with R values. Easiest way to fix this is to create a relation RW to store
these associations, and to remove W from the main schema: • i.e., we decompose SNLRWH into SNLRH and RW.
If we just store the projections of SNLRWH tuples onto SNLRH and RW, are there any potential problems that we should be aware of?
24
Problems with Decompositions Three potential problems to consider:
1. Some queries become more expensive. • e.g., How much did sailor Joe earn? (salary = W*H)
2. Given instances of the decomposed relations, we may not be able to reconstruct the corresponding instance of the original relation! • Fortunately, not in the SNLRWH example.
3. Checking some dependencies may require joining the instances of the decomposed relations.• Fortunately, not in the SNLRWH example.
Tradeoff: Must consider these issues vs. redundancy.
25
Lossless Join Decompositions
Decomposition of R into X and Y is lossless-join w.r.t. a set of FDs F if ∀ instance r that satisfies F: (r) (r) = r
It is always true that r (r) (r) In general, the other direction does not hold! If it does, the
decomposition is lossless-join.
Definition extended to decomposition into 3 or more relations in a straightforward way.
It is essential that all decompositions used to deal with redundancy be lossless! (Avoids Problem (2).)
26
More on Lossless Join
The decomposition of R into X and Y is lossless-join wrt F if and only if the closure of F contains: X ∩ Y X, or X ∩ Y Y
In particular, if U V holds over R, the decomposition of R into UV and R - V is lossless-join.
27
Dependency Preserving Decomposition
Consider CSJDPQV, C is key, JP C and SD P. BCNF decomposition: CSJDQV and SDP Problem: Checking JP C requires a join!
Dependency preserving decomposition: Intuitively, if R is decomposed into X and Y, and we enforce
the FDs that hold on X and Y respectively, all FDs that were given to hold on R must also hold. (Avoids Problem (3).)
Projection of set of FDs F: If R is decomposed into X, ..., projection of F onto X
(denoted FX ) is the set of FDs U V in closure F+ such that U, V are both in X.
28
Dependency Preserving Decompositions (Contd.)
Formally, decomposition of R into X and Y is dependency preserving if (FX union FY ) + = F +
Important to consider F +, not F, in this definition: ABC, A B, B C, C A, decomposed into AB and BC. Is this dependency preserving? Is C A preserved?
Dependency preserving does not imply lossless join: ABC, A B, decomposed into AB and BC.
And vice-versa! (Example?)
29
Decomposition into BCNF
Consider relation R with FDs F. If X Y violates BCNF, decompose R into XY and R - Y. Repeated application of this idea will give us a collection of
relations that are in BCNF; lossless join decomposition, and guaranteed to terminate.
e.g., CSJDPQV, key C, JP C, SD P, J S To deal with SD P, decompose into SDP, CSJDQV. To deal with J S, decompose CSJDQV into JS and CJDQV
Several dependencies may cause violation of BCNF. The order in which we ``deal with’’ them could lead to very different sets of relations!
30
BCNF and Dependency Preservation
In general, there may not be a dependency preserving decomposition into BCNF. e.g., CSZ, CS Z, Z C Not in BCNF; can’t decompose while preserving 1st FD.
Similarly, decomposition of CSJDQV into SDP, JS and CJDQV is not dependency preserving (w.r.t. the FDs JP C, SD P and J S). However, it is a lossless join decomposition. Adding JPC to the collection of relations gives a
dependency preserving decomposition. JPC tuples stored only for checking FD! (Redundancy!)
31
3NF Discussion
3NF decomposition v.s. BCNF decomposition: Use same decomposition steps, for a while 3NF may stop decomposing, while BCNF continues
Tradeoffs BCNF = no anomalies, but may lose some FDs 3NF = keeps all FDs, but may have some anomalies
32
Summary of Schema Refinement
If a relation is in BCNF, it is free of redundancies that can be detected using FDs. Thus, trying to ensure that all relations are in BCNF is a good heuristic.
If a relation is not in BCNF, we can try to decompose it into a collection of BCNF relations. Must consider whether all FDs are preserved. If a lossless-
join, dependency preserving decomposition into BCNF is not possible (or unsuitable, given typical queries), should consider decomposition into 3NF.
Decompositions should be carried out and/or re-examined while keeping performance requirements in mind.