Database Design & Schema Refinement


Professor Navneet Goyal, Department of Computer Science & Information Systems, BITS, Pilani

© Prof. Navneet Goyal, BITS, Pilani

Topics
• Database Design Steps
• Redundancy
• Schema Refinement
  • Minimizing Redundancy
  • Functional Dependencies (FDs)
  • Normalization using FDs
    • First Normal Form (1NF)
    • Second Normal Form (2NF)
    • Third Normal Form (3NF)
    • Boyce-Codd Normal Form (BCNF)


Database Design Steps
• Requirements Analysis
• Conceptual Modeling (ER Model)
• Logical Modeling (Relational Model)
• Schema Refinement (Normalization)


Redundancy
• The same information is stored at many places in the DB
• Problems:
  • Wastage of space
  • Update anomalies
    • Update Anomaly
    • Insert Anomaly
    • Delete Anomaly
• Normalization is used for “minimizing” redundancy


Update Anomalies
Consider the relation:
EMP_PROJ (Emp#, Proj#, Ename, Pname, No_hours)
• Update Anomaly: Changing the name of project number P1 from “Billing” to “Customer-Accounting” may cause this update to be made for all 100 employees working on project P1
• Insert Anomaly: Cannot insert a project unless an employee is assigned to it
  • Conversely, cannot insert an employee unless he/she is assigned to a project


Update Anomalies
Consider the relation:
EMP_PROJ (Emp#, Proj#, Ename, Pname, No_hours)
• Delete Anomaly: When a project is deleted, all the employees who work on that project are deleted with it. Alternatively, if an employee is the sole employee on a project, deleting that employee also deletes the corresponding project


Solution
Decompose the relation:
EMP_PROJ (Emp#, Proj#, Ename, Pname, No_hours)
into the following smaller relations:
EMP (Emp#, Ename)
PROJ (Proj#, Pname)
EMP_PROJ (Emp#, Proj#, No_hours)
• What happened to the update anomalies?
• We need to find out the basis for decomposing a relation to get rid of update anomalies


Redundancy
• Integrity constraints, in particular functional dependencies, can be used to identify schemas with such problems and to suggest refinements
• Main refinement technique: decomposition (replacing ABCD with, say, AB and BCD, or ACD and ABD)
• Decomposition should be used judiciously:
  • Is there a reason to decompose a relation?
  • What problems (if any) does the decomposition cause?


Functional Dependencies
• Constraints on the set of legal relations
• Require that the value for a certain set of attributes determines uniquely the value for another set of attributes
• A functional dependency is a generalization of the notion of a key


Functional Dependencies
• A functional dependency X → Y holds over relation R if, for every allowable instance r of R:
  ∀ t1 ∈ r, t2 ∈ r: π_X(t1) = π_X(t2) implies π_Y(t1) = π_Y(t2)
  i.e., given two tuples in r, if the X values agree, then the Y values must also agree. (X and Y are sets of attributes.)
• An FD is a statement about all allowable instances of a relation
  • Must be identified based on the semantics of the application
  • Given some allowable instance r1 of R, we can check if it violates some FD f, but we cannot tell if f holds over R!
• K is a candidate key for R means that K → R
  • However, K → R does not require K to be minimal!


Functional Dependencies
Let R be a relation schema, α ⊆ R and β ⊆ R.

The functional dependency α → β holds on R if and only if, for any legal relation r(R), whenever any two tuples t1 and t2 of r agree on the attributes α, they also agree on the attributes β. That is,
  t1[α] = t2[α] ⇒ t1[β] = t2[β]

Example: Consider r(A, B) with the following instance of r:
  A  B
  1  4
  1  5
  3  7

On this instance, A → B does NOT hold, but B → A does hold.
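The check on the example instance can be sketched in a few lines of Python. This is only an illustration: the function name `satisfies_fd` and the list-of-dicts representation of the instance are assumptions made here, not part of the slides. Note that it can only show that one instance violates or satisfies an FD; it cannot establish that the FD holds over R.

```python
def satisfies_fd(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}  # maps each lhs value to the first rhs value seen with it
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

# The instance of r(A, B) from the example above
r = [{"A": 1, "B": 4}, {"A": 1, "B": 5}, {"A": 3, "B": 7}]

print(satisfies_fd(r, ["A"], ["B"]))  # False: A = 1 appears with B = 4 and B = 5
print(satisfies_fd(r, ["B"], ["A"]))  # True: every B value appears once
```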



[Figure: two tuples t and u over the columns A’s and B’s. If t & u agree on the A’s, then they must agree on the B’s (A → B).]


• K is a superkey for relation schema R if and only if K → R
• K is a candidate key for R if and only if
  • K → R, and
  • for no α ⊂ K, α → R
• Functional dependencies allow us to express constraints that cannot be expressed using superkeys. Consider the schema:
  bor_loan = (customer_id, loan_number, amount)
  We expect this functional dependency to hold:
  loan_number → amount
  but would not expect the following to hold:
  amount → customer_name



• A functional dependency is trivial if it is satisfied by all instances of a relation
• Example:
  • customer_name, loan_number → customer_name
  • customer_name → customer_name
• In general, α → β is trivial if β ⊆ α



Consider the relation:
PLOTS (Prop#, State, Plot#, Area, Price, Tax_rate)

Information about plots available in India. The constraints on the relation are:
• Prop# is unique throughout India
• Plot# values are unique within a given state
• For a given state, tax_rate is fixed
• Plots having the same area have the same price, irrespective of the state in which they are located

Write all the FDs on the relation PLOTS



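Functional Dependencies
PLOTS (Prop#, State, Plot#, Area, Price, Tax_rate)

FD1 (PK): Prop# → State, Plot#, Area, Price, Tax_rate
FD2 (CK): State, Plot# → Prop#, Area, Price, Tax_rate
FD3: State → Tax_rate
FD4: Area → Price

• Identify redundancy in PLOTS
• Identify update anomalies in PLOTS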


Functional Dependencies
[Figure: dependency diagram of PLOTS showing FD1 (PK) from Prop#, FD2 (CK) from State and Plot#, FD3: State → Tax_rate, and FD4: Area → Price.]


Normal Forms
• Normal Forms based on PK
  • 2 NF
  • 3 NF
• Normal Forms based on CKs
  • Boyce-Codd Normal Form (BCNF)
• Other Normal Forms
  • 4 NF (Multivalued Dependencies)
  • 5 NF (Join Dependencies)
  • Deal with very rare practical situations


2 NF
• Based on the concept of Full FDs (FFD)
• If A & B are sets of attributes of R, B is said to be FFD on A if A → B, but no proper subset of A determines B
• No partial dependencies on the PK
• Is PLOTS in 2NF? YES
  • Single attribute PK
  • All relations with a single attribute PK are in 2 NF!!
  • 2 NF applies to relations with composite keys


2 NF
• A relation that is in 1NF, and in which every non-PK attribute is fully functionally dependent on the PK, is said to be in 2 NF

[Figure: 1 NF → 2 NF: remove all partial dependencies]


3 NF
• Based on the concept of transitive dependency
• No non-PK attribute should be transitively dependent on the PK
• Transitive Dependency: If A → B & B → C, then A transitively determines C through B, provided B & C do not determine A
• Is PLOTS in 3NF? NO


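3 NF
PLOTS (Prop#, State, Plot#, Area, Price, Tax_rate)

FD1 (PK): Prop# → State, Plot#, Area, Price, Tax_rate
FD2 (CK): State, Plot# → Prop#, Area, Price, Tax_rate
FD3: State → Tax_rate
FD4: Area → Price

• Prop# transitively determines Tax_rate through State
• Prop# transitively determines Price through Area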


3 NF
• A relation that is in 1NF & 2 NF, and in which no non-PK attribute is transitively dependent on the PK, is said to be in 3 NF

[Figure: 2 NF → 3 NF: remove all transitive dependencies]


BCNF
• Based on FDs that take into account all candidate keys of a relation
• For a relation with only 1 CK, 3NF & BCNF are equivalent
• A relation is said to be in BCNF if every determinant is a CK
• Is PLOTS in BCNF? NO
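A rough way to see why PLOTS fails BCNF is to test each determinant mechanically. The sketch below assumes the four FDs of PLOTS as read from the stated constraints, and uses attribute closure (introduced later in these slides) to test whether a determinant determines every attribute of the relation. The helper names `closure` and `violations` are illustrative only.

```python
def closure(attrs, fds):
    """Set of attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

R = frozenset({"Prop#", "State", "Plot#", "Area", "Price", "Tax_rate"})
fds = [
    (frozenset({"Prop#"}), R - {"Prop#"}),                    # FD1 (PK)
    (frozenset({"State", "Plot#"}), R - {"State", "Plot#"}),  # FD2 (CK)
    (frozenset({"State"}), frozenset({"Tax_rate"})),          # FD3
    (frozenset({"Area"}), frozenset({"Price"})),              # FD4
]

# A determinant that does not determine all of R cannot be a candidate key.
violations = [sorted(lhs) for lhs, _ in fds if closure(lhs, fds) != R]
print(violations)  # [['State'], ['Area']]
```

FD3 and FD4 are the offending dependencies, which matches the "NO" on the slide.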


Problem 1
Consider the relation R(A,B,C) with functional dependencies AB → C and C → B.
• Is R in 2NF?
• Is R in 3NF?
• Is R in BCNF?


Problem 2
For the relation R (A,B,C,D), the functional dependencies are A → B, A → C, A → D, & B → A.
• Find the candidate keys of R
• List the transitive dependencies in R (assume any CK as PK)
• Find the highest current normal form of R


Closure of a set of FDs
• Given a set of FDs F on a relation R, it may be possible that several other FDs must also hold for R
• For example, if R = (A,B,C) & the FDs A → B & B → C hold in R, then the FD A → C also holds on R
  • For a given value of A, there can be only one corresponding value of B, & for that value of B, there can be only one corresponding value of C
• The closure of F is the set of all FDs that can be inferred from F, & is denoted by F+


Equivalent Set of FDs
• Two sets of FDs, S & T, are equivalent if the set of relation instances satisfying S is exactly the same as the set of relation instances satisfying T
• S follows from T if every relation instance that satisfies T also satisfies all FDs in S
• S & T are equivalent iff S follows from T, & T follows from S


Trivial, Non-trivial & Completely Non-trivial FDs
A → B is:
• Trivial: if the B’s are a subset of the A’s
• Non-trivial: if at least one of the B’s is not among the A’s
• Completely Non-trivial: if none of the B’s is also one of the A’s


Trivial Dependency Rule
The FD A1A2A3…An → B1B2B3…Bm
is equivalent to A1A2A3…An → C1C2C3…Ck
where the C’s are all those B’s that are not A’s


Closure of a set of FDs
• It is not sufficient to consider just the given set of FDs
• We need to consider all FDs that hold
• Given F, more FDs can be inferred
• Such FDs are said to be logically implied by F
• F+ is the set of all FDs logically implied by F
• We can compute F+ using the formal definition of FD
• If F were large, this process would be lengthy & cumbersome
• Axioms or Rules of Inference provide a simpler technique
• Armstrong’s Axioms


Inference Rules for FDs
Armstrong's inference rules:
IR1. (Reflexive) If Y ⊆ X, then X → Y
IR2. (Augmentation) If X → Y, then XZ → YZ
  (Notation: XZ stands for X ∪ Z)
IR3. (Transitive) If X → Y and Y → Z, then X → Z

IR1, IR2, IR3 form a sound & complete set of inference rules
• Sound: never generates any wrong FD
• Complete: generates all FDs that hold


Inference Rules for FDs
Some additional inference rules that are useful:
IR4: Decomposition: If X → YZ, then X → Y & X → Z
IR5: Union: If X → Y & X → Z, then X → YZ
IR6: Pseudotransitivity: If X → Y & WY → Z, then WX → Z
• The above three inference rules, as well as any other inference rules, can be deduced from IR1, IR2, and IR3 (completeness property)
• Prove all six rules (IR1 – IR6): use the definition of FD, either by direct proof or by proof by contradiction


Inference Rules for FDs
IR1. (Reflexive) If Y ⊆ X, then X → Y
Proof: Suppose Y ⊆ X and t1 & t2 are tuples in some instance r of R such that t1[X] = t2[X]. Then t1[Y] = t2[Y], because Y ⊆ X.

IR2. (Augmentation) If X → Y, then XZ → YZ
Proof by contradiction: Assume X → Y holds but XZ → YZ does not. Then there must exist 2 tuples t1 & t2 such that
  1. t1[X] = t2[X],
  2. t1[Y] = t2[Y],
  3. t1[XZ] = t2[XZ], &
  4. t1[YZ] ≠ t2[YZ]
Not possible, because from 1 & 3 we deduce 5. t1[Z] = t2[Z], & from 2 & 5 we deduce 6. t1[YZ] = t2[YZ], contradicting 4


Example
R = (A, B, C, G, H, I)
F = { A → B
      A → C
      CG → H
      CG → I
      B → H }

Some members of F+:
• A → H
  • by transitivity from A → B and B → H
• AG → I
  • by augmenting A → C with G, to get AG → CG, and then transitivity with CG → I
• CG → HI
  • by the union rule on CG → H and CG → I


Procedure for Computing F+

To compute the closure of a set of functional dependencies F:

  F+ = F
  repeat
    for each functional dependency f in F+
      apply reflexivity and augmentation rules on f
      add the resulting functional dependencies to F+
    for each pair of functional dependencies f1 and f2 in F+
      if f1 and f2 can be combined using transitivity
        then add the resulting functional dependency to F+
  until F+ does not change any further

NOTE: We shall see an alternative procedure for this task later


Closure of Attribute Sets
• Set of attributes functionally determined by X
• The closure of a set of attributes X with respect to F is the set X+ of all attributes that are functionally determined by X
• Algorithm for computing the closure: compute F+, take all FDs with X on the LHS, & take the union of the RHS of all such FDs
• X+ can also be calculated by repeatedly applying IR1, IR2, IR3 using the FDs in F
• Both these approaches become cumbersome if F is large & consequently F+ is larger


Closure of Attribute Sets
• Given a set of attributes α, define the closure of α under F (denoted by α+) as the set of attributes that are functionally determined by α under F
• Algorithm to compute α+, the closure of α under F:

  result := α;
  while (changes to result) do
    for each β → γ in F do
      begin
        if β ⊆ result then result := result ∪ γ
      end

• Try to find out why this algorithm works!
• Complexity of this algorithm?
• Can you do any better?
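The loop above translates almost line for line into Python. The sketch below assumes, purely for brevity, that attributes are single letters, so that an attribute set can be written as a string such as "CG" and an FD as a pair of strings; the function name `closure` is likewise an illustrative choice.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds.

    attrs: iterable of single-letter attribute names, e.g. "AG"
    fds:   list of (lhs, rhs) pairs, e.g. ("CG", "H") for CG -> H
    """
    result = set(attrs)
    changed = True
    while changed:                      # "while (changes to result)"
        changed = False
        for lhs, rhs in fds:            # "for each beta -> gamma in F"
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)      # "result := result U gamma"
                changed = True
    return result

# R = (A, B, C) with A -> B and B -> C
print(sorted(closure("A", [("A", "B"), ("B", "C")])))  # ['A', 'B', 'C']
```

Each pass over F either adds at least one attribute or stops, so the outer loop runs at most once per attribute of R, which is one way to approach the complexity question on the slide.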


Example of Attribute Set Closure
R = (A, B, C, G, H, I)
F = {A → B, A → C, CG → H, CG → I, B → H}

(AG)+
  1. result = AG
  2. result = ABCG (A → C and A → B)
  3. result = ABCGH (CG → H and CG ⊆ AGBC)
  4. result = ABCGHI (CG → I and CG ⊆ AGBCH)

Is AG a candidate key?
  1. Is AG a superkey?
     1. Does AG → R? == Is (AG)+ ⊇ R
  2. Is any subset of AG a superkey?
     1. Does A → R? == Is (A)+ ⊇ R
     2. Does G → R? == Is (G)+ ⊇ R
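The candidate-key questions above can be answered with the same closure computation. This is a minimal sketch under the assumption of single-letter attributes; `is_superkey` and `is_candidate_key` are hypothetical helper names. Because adding attributes can never shrink a closure, it is enough to test the subsets obtained by dropping one attribute at a time.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_superkey(attrs, schema, fds):
    return closure(attrs, fds) >= set(schema)

def is_candidate_key(attrs, schema, fds):
    """A superkey from which no single attribute can be dropped."""
    if not is_superkey(attrs, schema, fds):
        return False
    return not any(is_superkey(set(attrs) - {a}, schema, fds) for a in attrs)

R = "ABCGHI"
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]

print(sorted(closure("AG", F)))        # ['A', 'B', 'C', 'G', 'H', 'I']
print(is_candidate_key("AG", R, F))    # True
print(is_candidate_key("ACG", R, F))   # False: a superkey, but C can be dropped
```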


Uses of Attribute Closure
There are several uses of the attribute closure algorithm:
• Testing for superkey:
  • To test if α is a superkey, we compute α+ and check if α+ contains all attributes of R
• Testing functional dependencies:
  • To check if a functional dependency α → β holds (or, in other words, is in F+), just check if β ⊆ α+
  • That is, we compute α+ by using attribute closure, and then check if it contains β
  • This is a simple and cheap test, and very useful
• Computing the closure of F:
  • For each γ ⊆ R, we find the closure γ+, and for each S ⊆ γ+, we output a functional dependency γ → S
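As an illustration of the second use, the members of F+ derived earlier with Armstrong's rules (A → H, AG → I, CG → HI) can be confirmed with one closure computation each. The sketch assumes single-letter attributes, and `implies` is a hypothetical helper name.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def implies(fds, lhs, rhs):
    """True if lhs -> rhs is in F+, i.e. rhs is contained in lhs+."""
    return set(rhs) <= closure(lhs, fds)

F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]

print(implies(F, "A", "H"))    # True  (transitivity)
print(implies(F, "AG", "I"))   # True  (augmentation, then transitivity)
print(implies(F, "CG", "HI"))  # True  (union)
print(implies(F, "G", "A"))    # False
```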


Canonical Cover
• Sets of functional dependencies may have redundant dependencies that can be inferred from the others
  • For example: A → C is redundant in {A → B, B → C, A → C}
  • Parts of a functional dependency may be redundant
    • E.g., on the RHS: {A → B, B → C, A → CD} can be simplified to {A → B, B → C, A → D}
    • E.g., on the LHS: {A → B, B → C, AC → D} can be simplified to {A → B, B → C, A → D}
• Intuitively, a canonical cover of F is a “minimal” set of functional dependencies equivalent to F, having no redundant dependencies or redundant parts of dependencies


Equivalence of Sets of FDs
• Two sets of FDs F and G are equivalent if:
  - every FD in F can be inferred from G, &
  - every FD in G can be inferred from F
• Hence, F and G are equivalent if F+ = G+
• Definition: F covers G if every FD in G can be inferred from F (i.e., if G+ ⊆ F+)
• F and G are equivalent if F covers G and G covers F
• There is an algorithm for checking equivalence of sets of FDs


Extraneous Attributes
• Consider a set F of functional dependencies and the functional dependency α → β in F.
  • Attribute A is extraneous in α if A ∈ α and F logically implies (F – {α → β}) ∪ {(α – A) → β}.
  • Attribute A is extraneous in β if A ∈ β and the set of functional dependencies (F – {α → β}) ∪ {α → (β – A)} logically implies F.
• Note: implication in the opposite direction is trivial in each of the cases above, since a “stronger” functional dependency always implies a weaker one
• Example: Given F = {A → C, AB → C}
  • B is extraneous in AB → C because {A → C, AB → C} logically implies A → C (i.e., the result of dropping B from AB → C).
• Example: Given F = {A → C, AB → CD}
  • C is extraneous in AB → CD since AB → C can be inferred even after deleting C


Testing if an Attribute is Extraneous

Consider a set F of functional dependencies and the functional dependency α → β in F.

• To test if attribute A ∈ α is extraneous in α:
  1. compute ({α} – A)+ using the dependencies in F
  2. check that ({α} – A)+ contains β; if it does, A is extraneous in α
• To test if attribute A ∈ β is extraneous in β:
  1. compute α+ using only the dependencies in F’ = (F – {α → β}) ∪ {α → (β – A)},
  2. check that α+ contains A; if it does, A is extraneous in β
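Both tests reduce to one attribute-closure computation each. The following is a sketch under the assumption of single-letter attributes; the names `extraneous_in_lhs` and `extraneous_in_rhs` are hypothetical. The left-side test checks that the reduced left side still determines the whole right side.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def extraneous_in_lhs(a, lhs, rhs, fds):
    """A is extraneous in lhs if (lhs - A)+ under F contains rhs."""
    return set(rhs) <= closure(set(lhs) - {a}, fds)

def extraneous_in_rhs(a, lhs, rhs, fds):
    """A is extraneous in rhs if lhs+ under F' still contains A,
    where F' replaces lhs -> rhs by lhs -> (rhs - A)."""
    reduced = [fd for fd in fds if fd != (lhs, rhs)]
    reduced.append((lhs, "".join(sorted(set(rhs) - {a}))))
    return a in closure(lhs, reduced)

# The two examples from the Extraneous Attributes slide
print(extraneous_in_lhs("B", "AB", "C", [("A", "C"), ("AB", "C")]))    # True
print(extraneous_in_rhs("C", "AB", "CD", [("A", "C"), ("AB", "CD")]))  # True
print(extraneous_in_rhs("D", "AB", "CD", [("A", "C"), ("AB", "CD")]))  # False
```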


Canonical Cover
• A canonical cover for F is a set of dependencies Fc such that
  • F logically implies all dependencies in Fc, and
  • Fc logically implies all dependencies in F, and
  • No functional dependency in Fc contains an extraneous attribute, and
  • Each left side of a functional dependency in Fc is unique.
• To compute a canonical cover for F:

  repeat
    Use the union rule to replace any dependencies in F of the form
      α1 → β1 and α1 → β2 with α1 → β1β2
    Find a functional dependency α → β with an extraneous attribute either in α or in β
    If an extraneous attribute is found, delete it from α → β
  until F does not change

• Note: The union rule may become applicable after some extraneous attributes have been deleted, so it has to be re-applied


Computing Canonical Cover
R = (A, B, C)
F = {A → BC, B → C, A → B, AB → C}

• Combine A → BC and A → B into A → BC
  • Set is now {A → BC, B → C, AB → C}
• A is extraneous in AB → C
  • Check if the result of deleting A from AB → C is implied by the other dependencies
    • Yes: in fact, B → C is already present!
  • Set is now {A → BC, B → C}
• C is extraneous in A → BC
  • Check if A → C is logically implied by A → B and the other dependencies
    • Yes: using transitivity on A → B and B → C.
    • Can use attribute closure of A in more complex cases
• The canonical cover is: A → B, B → C
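The repeat loop can be sketched in Python as follows. This is one possible arrangement, not a definitive implementation: attributes are assumed to be single letters, the helper names are illustrative, and the order in which extraneous attributes are removed may differ from the worked example above (canonical covers are not unique in general).

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def merge(fds):
    """Union rule: combine dependencies that share a left side."""
    merged = {}
    for lhs, rhs in fds:
        merged.setdefault(frozenset(lhs), set()).update(rhs)
    return [(lhs, frozenset(rhs)) for lhs, rhs in merged.items() if rhs]

def remove_one_extraneous(fds):
    """Return fds with one extraneous attribute removed, or None if none exists."""
    for i, (lhs, rhs) in enumerate(fds):
        rest = fds[:i] + fds[i + 1:]
        for a in sorted(lhs):   # extraneous on the left side?
            if len(lhs) > 1 and rhs <= closure(lhs - {a}, fds):
                return rest + [(lhs - {a}, rhs)]
        for a in sorted(rhs):   # extraneous on the right side?
            if a in closure(lhs, rest + [(lhs, rhs - {a})]):
                return rest + [(lhs, rhs - {a})]
    return None

def canonical_cover(fds):
    fc = merge(fds)
    while True:
        reduced = remove_one_extraneous(fc)
        if reduced is None:
            return fc
        fc = merge(reduced)     # re-apply the union rule after each deletion

F = [("A", "BC"), ("B", "C"), ("A", "B"), ("AB", "C")]
for lhs, rhs in canonical_cover(F):
    print("".join(sorted(lhs)), "->", "".join(sorted(rhs)))
```

For the F of the worked example, this sketch ends with the same two dependencies, A → B and B → C.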


Problems with Decompositions
There are three potential problems to consider:
• Some queries become more expensive
  • e.g., What is the price of prop# 1?
• Given instances of the decomposed relations, we may not be able to reconstruct the corresponding instance of the original relation!
  • Fortunately, not in the PLOTS example
  • How can we say this?
• Checking some dependencies may require joining the instances of the decomposed relations
  • Fortunately, not in the PLOTS example
  • How can we say this?
• Tradeoff: Must consider these issues vs. redundancy


Lossy Decomposition

Original relation:
  A  B  C
  1  2  3
  4  5  6
  7  2  8

Projections:
  A  B        B  C
  1  2        2  3
  4  5        5  6
  7  2        2  8

JOIN of the projections:
  A  B  C
  1  2  3
  4  5  6
  7  2  8
  1  2  8   (spurious tuple)
  7  2  3   (spurious tuple)

• Note that we can never get anything less than the original relation
• Since we don’t know which tuples are spurious and which are genuine, we have indeed lost information
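The spurious tuples can be reproduced with a few set comprehensions. This is a small sketch in which the relation is modelled as a Python set of (A, B, C) tuples; that representation is an assumption made for the illustration.

```python
# Original relation r(A, B, C)
r = {(1, 2, 3), (4, 5, 6), (7, 2, 8)}

# Projections on (A, B) and (B, C)
ab = {(a, b) for a, b, c in r}
bc = {(b, c) for a, b, c in r}

# Natural join of the two projections on the common attribute B
joined = {(a, b, c) for a, b in ab for b2, c in bc if b == b2}

print(r <= joined)          # True: the join never loses original tuples
print(sorted(joined - r))   # [(1, 2, 8), (7, 2, 3)]: the spurious tuples
```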


Lossy Decomposition

S:
  S#  Status  City
  S3  30      Paris
  S5  30      Athens

Decomposition 1:
  S#  Status      S#  City
  S3  30          S3  Paris
  S5  30          S5  Athens

Decomposition 2:
  S#  Status      Status  City
  S3  30          30      Paris
  S5  30          30      Athens


Lossless Decomposition
• Observe that S satisfies the FDs: S# → Status & S# → City
• It cannot be a coincidence that S is equal to the join of its projections on {S#, Status} & {S#, City}
• Heath’s Theorem: Let R{A,B,C} be a relation, where A, B, & C are sets of attributes. If R satisfies A → B & A → C, then R is equal to the join of its projections on {A,B} & {A,C}
• Observe that in decomposition 2 the FD S# → City is lost


Lossless Decomposition

The decomposition of R into R1, R2, …, Rn is lossless if for any instance r of R
  r = π_R1(r) ⋈ π_R2(r) ⋈ … ⋈ π_Rn(r)

We can replace R by R1 & R2, knowing that the instance of R can be recovered from the instances of R1 & R2

We can use FDs to show that decompositions are lossless


Lossless Decomposition

Theorem
A decomposition of R into R1 and R2 is a lossless join w.r.t. FDs F if and only if at least one of the following dependencies is in F+:
• R1 ∩ R2 → R1
• R1 ∩ R2 → R2

In other words, R1 ∩ R2 forms a superkey of either R1 or R2
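Since membership in F+ can be tested with attribute closure, the theorem yields a short test for binary decompositions. The sketch assumes single-letter attributes and uses R = (A, B, C) with F = {A → B, B → C} as a sample schema; `is_lossless` is a hypothetical name.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_lossless(r1, r2, fds):
    """True if (R1 ∩ R2) -> R1 or (R1 ∩ R2) -> R2 is in F+."""
    common_closure = closure(set(r1) & set(r2), fds)
    return set(r1) <= common_closure or set(r2) <= common_closure

F = [("A", "B"), ("B", "C")]
print(is_lossless("AB", "BC", F))  # True:  B -> BC
print(is_lossless("AB", "AC", F))  # True:  A -> AB
print(is_lossless("AC", "BC", F))  # False: C determines only itself
```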


Dependency Preservation
• Let Fi be the set of dependencies in F+ that include only attributes in Ri.
• A decomposition is dependency preserving if
  (F1 ∪ F2 ∪ … ∪ Fn)+ = F+
• If it is not, then checking updates for violation of functional dependencies may require computing joins, which is expensive.


Testing for Dependency Preservation

• To check if a dependency α → β is preserved in a decomposition of R into R1, R2, …, Rn we apply the following test (with attribute closure done with respect to F):

  result = α
  while (changes to result) do
    for each Ri in the decomposition
      t = (result ∩ Ri)+ ∩ Ri
      result = result ∪ t

  • If result contains all attributes in β, then the functional dependency α → β is preserved.
• We apply the test on all dependencies in F to check if a decomposition is dependency preserving
• This procedure takes polynomial time, instead of the exponential time required to compute F+ and (F1 ∪ F2 ∪ … ∪ Fn)+
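A direct transcription of this test into Python might look like the sketch below. It assumes single-letter attributes and uses R = (A, B, C) with F = {A → B, B → C} as sample input; `preserved` is a hypothetical name.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def preserved(lhs, rhs, decomposition, fds):
    """Test whether lhs -> rhs is preserved by the decomposition."""
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for ri in decomposition:
            # t = (result ∩ Ri)+ ∩ Ri, closure taken with respect to F
            t = closure(result & set(ri), fds) & set(ri)
            if not t <= result:
                result |= t
                changed = True
    return set(rhs) <= result

F = [("A", "B"), ("B", "C")]
print(all(preserved(l, r, ["AB", "BC"], F) for l, r in F))  # True
print(preserved("B", "C", ["AB", "AC"], F))                 # False: B -> C is lost
```

The second decomposition, (A, B) and (A, C), is lossless but does not preserve B → C, which shows that the two properties are independent.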


Example
R = (A, B, C)
F = {A → B, B → C}
Key = {A}
• R is not in BCNF
• Decomposition R1 = (A, B), R2 = (B, C)
  • R1 and R2 in BCNF
  • Lossless-join decomposition
  • Dependency preserving


4 NF
• BCNF removes any anomalies due to FDs
• Further research has led to the identification of another type of dependency called Multi-valued Dependency (MVD)
• Proposed by R Fagin* in 1977
• MVDs can also cause data redundancy
• MVDs are a generalization of FDs

* R Fagin: “Multi-valued Dependencies & a new normal form for relational databases,” ACM TODS 2, No. 3 (Sept. 1977)


4 NF
Consider the following relation HCTX:

  Course   Teacher     Texts
  DBS      N Goyal     Garcia
           J P Misra   Raghu
  ADBS     J P Misra   Connolly
                       Garcia

In relational databases, repeating groups are not allowed


4 NF
1 NF Version: CTX

  COURSE   TEACHER     TEXTS
  DBS      N GOYAL     GARCIA
  DBS      N GOYAL     RAGHU R
  DBS      J P MISRA   GARCIA
  DBS      J P MISRA   RAGHU R
  ADBS     J P MISRA   GARCIA
  ADBS     J P MISRA   CONNOLLY

NO FDs in this relation


4 NF
• Highest Normal Form of CTX? BCNF?
• Anomalies in CTX? MANY!!


4 NF
Anomalies
• New Teacher for DBS
• New Text for ADBS
• Teacher teaching DBS leaves


4 NF
Points to note:
• If (c,t1,x1), (c,t2,x2) both appear, then (c,t1,x2), (c,t2,x1) will also appear
• Teachers and texts are completely independent of one another
• CTX has no FDs at all
• CTX is in BCNF
• Any all-key relation must necessarily be in BCNF!!
• But still there is a need to normalize CTX


4 NF
Decompose CTX into CT & TX

CT:
  COURSE   TEACHER
  DBS      N GOYAL
  DBS      J P MISRA
  ADBS     J P MISRA

TX:
  COURSE   TEXT
  DBS      GARCIA
  DBS      RAGHU R
  ADBS     GARCIA
  ADBS     CONNOLLY


4 NF
• The decomposition of CTX into CT & TX is not done on the basis of FDs (as there are no FDs)
• The decomposition of CTX into CT & TX is done on the basis of MVDs
• MVDs
  • An MVD represents a dependency between attributes of a relation, such that for every value of A, there is a set of values of B & a set of values of C, and the sets of values for B & C are independent of each other
  • course →→ teacher (course multi-determines teacher)
  • course →→ text (text is multi-dependent on course)


4 NF
Interpretation of course →→ teacher
• A course does not have a single corresponding teacher, i.e., the FD course → teacher does not hold
• Still, each course must have a ‘well defined’ set of teachers
• For a given course c and a given text x, the set of teachers t matching the pair (c,x) depends on the value of c alone
• It makes no difference which particular value of x we choose
• Interpret course →→ text analogously


4 NF
Formal Definition
• Let R be a relation and A, B, C be subsets of the attributes of R. Then we say that A →→ B iff, in every possible legal value of R, the set of B values matching a given (A,C) pair depends only on the value of A and is independent of the C value.
• It can easily be shown that for R(A,B,C), the MVD A →→ B holds iff the MVD A →→ C also holds.
• MVDs always go together in pairs, and we write them as
  A →→ B | C
  course →→ teacher | text


4 NF
Fagin’s Theorem
• Let R(A,B,C) be a relation, where A, B, and C are sets of attributes of R. Then R is equal to the join of its projections on {A,B} and {A,C} iff R satisfies the MVD
  A →→ B | C


4 NF
• An MVD A →→ B is trivial if
  (a) B ⊆ A, or
  (b) A ∪ B = R
• A relation that is in BCNF & contains no non-trivial MVDs is said to be in 4NF
• CTX is not in 4NF because course →→ teacher is a non-trivial MVD


Multi-Valued Dependencies
• The most common source of redundancy in BCNF schemas is putting 2 or more M:M relationships in a single relation


Formal Definition of MVD
The MVD
  A1A2…An →→ B1B2…Bm
holds for a relation R if, for each pair of tuples t & u that agree on the As, we can find a tuple v that agrees
  1. With t & u on the As
  2. With t on the Bs
  3. With u on all attributes of R that are not among the As & Bs
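The definition can be checked on a concrete instance by brute force over all pairs (t, u). The sketch below uses the six CTX rows from the 1 NF version of the course/teacher/text relation; the function name `satisfies_mvd` and the tuple-based representation are assumptions made for illustration, and, as with FDs, passing the check on one instance does not prove the MVD holds for the schema.

```python
def satisfies_mvd(rows, attrs, lhs, mid):
    """Check lhs ->> mid on one instance (rows are tuples ordered as attrs)."""
    rows = set(rows)
    pos = {a: i for i, a in enumerate(attrs)}
    for t in rows:
        for u in rows:
            if any(t[pos[a]] != u[pos[a]] for a in lhs):
                continue  # t and u do not agree on the As
            # v takes the As and Bs from t and all other attributes from u
            v = tuple(t[pos[a]] if a in lhs or a in mid else u[pos[a]]
                      for a in attrs)
            if v not in rows:
                return False
    return True

attrs = ("course", "teacher", "text")
ctx = [
    ("DBS", "N GOYAL", "GARCIA"), ("DBS", "N GOYAL", "RAGHU R"),
    ("DBS", "J P MISRA", "GARCIA"), ("DBS", "J P MISRA", "RAGHU R"),
    ("ADBS", "J P MISRA", "GARCIA"), ("ADBS", "J P MISRA", "CONNOLLY"),
]

print(satisfies_mvd(ctx, attrs, ["course"], ["teacher"]))      # True
print(satisfies_mvd(ctx[1:], attrs, ["course"], ["teacher"]))  # False: one combination is missing
```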


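MVD
[Figure: tuples t, v, and u over the columns A’s, B’s, and Others, illustrating A →→ B: v agrees with t and u on the A’s, with t on the B’s, and with u on the other attributes.]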


Problem Solving
Consider a relation R (A,B,C,D,E,F) with the following FDs:
F = {AB → C, BC → AD, D → E, CF → B}

(a) Find out whether AB is a key of R or not.
(b) Use the result of part (a) to find out whether AB → D is implied by F

AB+ = {A, B, C, D, E}
• The attribute F is not in AB+, so AB is not a key of R
• If D is in AB+, then AB → D is implied by F; D is in AB+, so AB → D is implied

Q & A

Thank You
