DENORMALISATION: PROS AND CONS
Presented by Aliya Saldanha
Dr. Chen, Business Database Systems
OBJECTIVES
• Definition of terms.
• Describe the denormalization design process.
• Denormalization Strategies
• A Comparative Case Study
• Know the pros and cons of denormalization
• The Dangerous Illusion
• Conclude
Introduction
• RDBMS design proceeds at two modeling levels: conceptual and physical.
• Conceptual diagrams are the precursor to designing relational tables.
• Critical issue: the level of system performance, as reflected by system response time.
Normalization
• The normalized model is a cornerstone for every database system.
• Process of decomposing large, inefficiently structured tables into smaller, more structured tables without losing any data in the process.
• There are still times when we denormalize a database to enhance performance.
What is normalization?
• A series of steps followed to obtain a database design that is consistent and avoids duplication.
• The process passes through a sequence of normal forms.
• A table is said to be in a certain normal form if it satisfies certain constraints.
KEY POINTS
• Each table represents a single subject.
• Keeps redundancy to a minimum.
• All attributes are dependent on the primary key.
• Checks the stability and integrity of the E-R diagram.
• Removes insert, update, and delete anomalies.
[Figure: ladder of normal forms leading from the relational db model to the normalized relational db model: 1st Normal Form, 2nd Normal Form, 3rd Normal Form, BCNF, 4th Normal Form, 5th Normal Form]
As normalization progresses…
• The number of relations required to represent the application's data increases.
• The increased number of tables requires multiple JOINs to combine data from different tables (the more joins, the worse it gets).
• Queries with many complex joins need more CPU time, which can adversely affect performance.
Practically speaking
• Queries run slowly.
• Reports take too long to print.
• On-screen forms take time to populate.
• Web pages take too long to populate.
• More complicated SQL is required for multi-table queries and joins.
• In short, extra work for the DBMS can mean slower applications.
Other issues…
• No calculated values. Calculated values (CVs) are a fact of life for all applications, but a normalized database does not store them.
• Non-reproducible calculations. The application must generate them on the fly as needed. If your application changes over time, you risk not being able to reproduce prior results.
• Join jungles. When each fact is stored in exactly one place, it is daunting to pull together everything needed for a given query. This makes queries hard to code, hard to debug, and dangerous to alter.
• Performance. When you face a JOIN jungle, you almost always face performance problems.
Questions to ask before denormalizing
• Can the system achieve acceptable performance without denormalizing?
• Will the performance of the system after denormalizing still be unacceptable?
• Will the system be unreliable due to denormalization?
• If the answer to any of these is "yes," avoid denormalization, because any benefit that is accrued will not exceed the cost.
Denormalization and Why?
• Frequently, performance needs dictate very quick retrieval capability for data stored in relational databases.
• To accomplish this, sometimes the decision is made to denormalize the physical implementation.
• Denormalization is the process of putting one fact in numerous places. This speeds data retrieval at the expense of data modification.
Does it mean un-normalization?
• "Denormalization" does not mean that anything goes. Denormalization does not mean chaos.
• An un-normalized data model is one on which little or no analysis has been performed.
• In short, seek to denormalize only a data model that has already been normalized.
DENORMALIZATION PROCESS
1. Develop the E-R model
2. Refine and normalize
3. Identify candidates for denormalization
4. Determine integrity effects
5. Identify the form of the denormalized entity
6. Map to the physical schema
Step 1: Development of the conceptual data model
• E-R modeling aims at identifying the entities that are part of the system, the attributes that make up these entities, and the dependencies between entities.
• The E-R model does not capture dependencies among attributes; normalization resolves the functional dependencies between attributes.
• The E-R model shows data at rest; denormalization also considers the types of queries and their frequency.
Step 2: Refinement and normalization
• The ERD is further refined in order to resolve the functional dependencies between the attributes of an entity.
• May lead to splitting of tables to reduce data redundancy.
Step 3: Identifying candidates for denormalization
• Application performance criteria.
• Type of queries to be executed (update/retrieve).
• Frequency of queries.
• Number of rows accessed by each transaction.
• Cardinality: 1:1, 1:M.
• Derived data, lookup data.
Step 4: Determine the effect on data integrity
• The effect of denormalization is reviewed.
• Denormalizing may lead to performance degradation or to unacceptable consistency issues.
• In such a case, the denormalization decision must be reconsidered.
Step 5: Form for the denormalized entity
• Identify what form the denormalized entity may take.
• We move down the ladder of normal forms.
Step 6: Map the conceptual schema to the physical schema
• Once the schema is tested and verified, it is implemented.
DENORMALIZATION STRATEGIES
• Pre joined Tables
• Report Tables
• Mirror Tables
• Split Tables
• Redundant Data
• Repeating Groups
• Derivable Data
• Speed Tables
Pre-joined tables
• Two or more tables are joined and the result is stored as another table.
• Use when the cost of joining is prohibitive.
• Example: retail store databases.
• The pre-joined table should contain only those columns absolutely necessary for the application to meet its processing needs.
• The pre-joined table must be re-created periodically, using SQL to join the normalized tables.
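A minimal sketch of a pre-joined table and its periodic rebuild. It uses SQLite through Python's sqlite3 module purely so the example can be run as-is. The tables product and order_line, their columns, and the helper refresh_prejoined are hypothetical names chosen for illustration; they do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT,
                      list_price REAL, supplier_notes TEXT);
CREATE TABLE order_line (line_id INTEGER PRIMARY KEY, product_id INTEGER, qty INTEGER);
INSERT INTO product VALUES (1, 'Hammer', 10.0, 'n/a'), (2, 'Saw', 40.0, 'n/a');
INSERT INTO order_line VALUES (100, 1, 2), (101, 2, 5);
""")

def refresh_prejoined(conn):
    """Rebuild the pre-joined table from the normalized tables (hypothetical helper)."""
    conn.executescript("""
    DROP TABLE IF EXISTS order_line_product;
    -- Keep only the columns the application needs (supplier_notes is left out).
    CREATE TABLE order_line_product AS
    SELECT ol.line_id, ol.qty, p.name, p.list_price
    FROM order_line AS ol
    JOIN product AS p ON p.product_id = ol.product_id;
    """)

def prejoined_count(conn):
    return conn.execute("SELECT COUNT(*) FROM order_line_product").fetchone()[0]

refresh_prejoined(conn)
rows = conn.execute(
    "SELECT line_id, name, qty FROM order_line_product ORDER BY line_id"
).fetchall()

# New normalized data is invisible in the pre-joined table until the next rebuild.
conn.execute("INSERT INTO order_line VALUES (102, 1, 1)")
stale_count = prejoined_count(conn)
refresh_prejoined(conn)
fresh_count = prejoined_count(conn)
print(rows, stale_count, fresh_count)
```

The gap between stale_count and fresh_count is the price of this strategy: reads avoid the join, but the copy is only as current as its last rebuild.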
[Figure: 1:1 relationships]
[Figure: M:M relationship]
[Figure: the normalised tables]
[Figure: the denormalised tables]
Report Tables
• Use when specialized, critical reports are too costly to generate on demand.
• Create a table that contains the report.
• Intended for reports that are viewed in online environments.
• Suited to reports that need a lot of formatting and data manipulation.
Mirror tables
When tables are required concurrently by two different types of environments.
• If online processing and decision support access the same table
• Can duplicate table, use second table for read-only use
• Example: Heavy Online Traffic
• Care must be taken to periodically migrate the foreground data to background tables.
• Performance bottlenecks are resolved.
Split tables
• Use when distinct groups use different parts of a table.
• A table can be split vertically or horizontally.
• The original table must remain available for certain transactions.
Vertical Split
• Attributes are divided between the two tables, primary key put into both tables
• Particularly useful if one group of applications accesses some columns and another group accesses different columns
• Example: many columns of the customer table contain data specific to credit limit assessment, whereas others contain more general contact and customer profiling information. Split the table vertically, with one partition containing the credit limit information and the other containing the more general customer details.
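The customer example can be sketched as follows, assuming SQLite via Python's sqlite3 module. The column names (customer_no, name, phone, credit_limit, credit_rating) and the partition names customer_profile and customer_credit are illustrative assumptions. The sketch puts the primary key in both partitions and checks that joining them on that key reproduces the original rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_no INTEGER PRIMARY KEY, name TEXT, phone TEXT,
                       credit_limit REAL, credit_rating TEXT);
INSERT INTO customer VALUES (1, 'Ada', '555-0100', 5000.0, 'A'),
                            (2, 'Ben', '555-0101', 1200.0, 'C');

-- Vertical split: the primary key is carried into both partitions.
CREATE TABLE customer_profile AS SELECT customer_no, name, phone FROM customer;
CREATE TABLE customer_credit  AS SELECT customer_no, credit_limit, credit_rating FROM customer;
""")

original = conn.execute("SELECT * FROM customer ORDER BY customer_no").fetchall()

# Joining the partitions on the shared key gives back the unsplit rows.
rejoined = conn.execute("""
SELECT p.customer_no, p.name, p.phone, c.credit_limit, c.credit_rating
FROM customer_profile AS p
JOIN customer_credit AS c ON c.customer_no = p.customer_no
ORDER BY p.customer_no
""").fetchall()
print(rejoined == original)
```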
Horizontal Split
• Rows are divided between two tables.
• Usually rows are divided by range of key values.
– A later UNION ALL of the split tables should not return more rows than the original, un-split table contained.
• Example: For a large customer table, we might split it into two tables, one for home-based customers, and the other for overseas customers.
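A sketch of the home/overseas split, again using SQLite through Python's sqlite3 module. The region column and the table names customer_home and customer_overseas are assumptions made for illustration. It checks the UNION ALL condition: the two partitions together return exactly as many rows as the original table, and no key appears in both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_no INTEGER PRIMARY KEY, name TEXT, region TEXT NOT NULL);
INSERT INTO customer VALUES (1, 'Ada', 'HOME'), (2, 'Ben', 'OVERSEAS'), (3, 'Cy', 'HOME');

-- Horizontal split: complementary predicates, so every row lands in exactly one table.
CREATE TABLE customer_home     AS SELECT * FROM customer WHERE region =  'HOME';
CREATE TABLE customer_overseas AS SELECT * FROM customer WHERE region <> 'HOME';
""")

def scalar(sql):
    return conn.execute(sql).fetchone()[0]

original_count = scalar("SELECT COUNT(*) FROM customer")
union_count = scalar("""
SELECT COUNT(*) FROM (SELECT * FROM customer_home
                      UNION ALL
                      SELECT * FROM customer_overseas)
""")
overlap = scalar("""
SELECT COUNT(*) FROM customer_home AS h
JOIN customer_overseas AS o ON o.customer_no = h.customer_no
""")
print(original_count, union_count, overlap)
```

The region column is declared NOT NULL here on purpose: with NULLs present, neither predicate would match those rows and the split would silently lose them.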
Redundant Data
• Some columns of another table are carried redundantly in a given table, to reduce the number of table joins.
• Use when one or more columns from one table are accessed whenever data from another table is accessed.
• The original column must not be removed from its own table.
• Best for data that is not updated often.
• Example: consider the DEPARTMENT and EMPLOYEE tables. If queries often require the name of the employee's department, the department name column could be carried as redundant data in the EMP table.
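A sketch of the DEPARTMENT/EMP example, using SQLite via Python's sqlite3 module. The column names (dept_no, dept_name, emp_no, emp_name) are assumptions, since the slide names only the tables. It shows the join-free read and also the cost: a rename in department leaves the redundant copy stale until it is propagated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (dept_no INTEGER PRIMARY KEY, dept_name TEXT);
-- emp.dept_name is the redundant copy; department.dept_name stays in place.
CREATE TABLE emp (emp_no INTEGER PRIMARY KEY, emp_name TEXT, dept_no INTEGER, dept_name TEXT);
INSERT INTO department VALUES (10, 'Accounting'), (20, 'Research');
INSERT INTO emp (emp_no, emp_name, dept_no) VALUES (1, 'Smith', 20), (2, 'Clark', 10);
""")

PROPAGATE = """
UPDATE emp SET dept_name =
    (SELECT d.dept_name FROM department AS d WHERE d.dept_no = emp.dept_no)
"""
conn.execute(PROPAGATE)

no_join = conn.execute("SELECT emp_name, dept_name FROM emp ORDER BY emp_no").fetchall()
with_join = conn.execute("""
SELECT e.emp_name, d.dept_name FROM emp AS e
JOIN department AS d ON d.dept_no = e.dept_no ORDER BY e.emp_no
""").fetchall()

# Renaming a department now requires a second write to keep the copy in step.
conn.execute("UPDATE department SET dept_name = 'R and D' WHERE dept_no = 20")
stale = conn.execute("SELECT dept_name FROM emp WHERE emp_no = 1").fetchone()[0]
conn.execute(PROPAGATE)
fresh = conn.execute("SELECT dept_name FROM emp WHERE emp_no = 1").fetchone()[0]
print(no_join == with_join, stale, fresh)
```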
Repeating Groups
• Another table is created that contains one column for every element of the group.
• Example:
A (Customer_No, Balance_period, Balance)
B (Customer_No, Balance_period1, Balance_period2, Balance_period3, Balance_period4, Balance_period5)
Points to remember:
• The data is rarely or never aggregated, averaged, or compared within the row.
• The data has a stable number of occurrences.
• The data is usually accessed collectively.
• The data has a predictable pattern of insertion and deletion.
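One way to build table B from table A is a pivot with conditional aggregation. The sketch below assumes SQLite via Python's sqlite3 module and exactly five balance periods per customer; the sample balances are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (customer_no INTEGER, balance_period INTEGER, balance REAL,
                PRIMARY KEY (customer_no, balance_period));
INSERT INTO a VALUES (1, 1, 10.0), (1, 2, 20.0), (1, 3, 30.0), (1, 4, 40.0), (1, 5, 50.0),
                     (2, 1,  5.0), (2, 2,  5.0), (2, 3,  5.0), (2, 4,  5.0), (2, 5,  5.0);

-- One row per customer, one column per period (the repeating group).
CREATE TABLE b AS
SELECT customer_no,
       MAX(CASE WHEN balance_period = 1 THEN balance END) AS balance_period1,
       MAX(CASE WHEN balance_period = 2 THEN balance END) AS balance_period2,
       MAX(CASE WHEN balance_period = 3 THEN balance END) AS balance_period3,
       MAX(CASE WHEN balance_period = 4 THEN balance END) AS balance_period4,
       MAX(CASE WHEN balance_period = 5 THEN balance END) AS balance_period5
FROM a
GROUP BY customer_no;
""")

# All five balances for a customer now come back in a single row.
row = conn.execute("SELECT * FROM b WHERE customer_no = 1").fetchone()
row_count = conn.execute("SELECT COUNT(*) FROM b").fetchone()[0]
print(row, row_count)
```

The fixed column list is why the slide asks for a stable number of occurrences: a sixth period would require a schema change, not just an insert.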
Derivable data
• Derived data is data that is not directly stored in the database but is instead calculated from data that is stored in the database.
• If the cost of deriving data using complicated formulae is prohibitive, consider storing the derived data in a column instead of calculating it.
• Example: score calculation.
– The stored derived data must be updated whenever the underlying data it is based on is changed.
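A sketch of the score example in which triggers keep the stored value in step with the underlying data. It assumes SQLite via Python's sqlite3 module; the student and answer tables, their columns, and the trigger names are hypothetical. Only inserts and updates of points are handled; a delete trigger would be needed as well and is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY,
                      total_score INTEGER NOT NULL DEFAULT 0);  -- stored derived column
CREATE TABLE answer (student_id INTEGER, question_no INTEGER, points INTEGER,
                     PRIMARY KEY (student_id, question_no));

-- Recompute the stored total whenever an answer row is added ...
CREATE TRIGGER answer_after_insert AFTER INSERT ON answer
BEGIN
    UPDATE student
    SET total_score = (SELECT COALESCE(SUM(points), 0) FROM answer
                       WHERE student_id = NEW.student_id)
    WHERE student_id = NEW.student_id;
END;

-- ... or its points are changed.
CREATE TRIGGER answer_after_update AFTER UPDATE OF points ON answer
BEGIN
    UPDATE student
    SET total_score = (SELECT COALESCE(SUM(points), 0) FROM answer
                       WHERE student_id = NEW.student_id)
    WHERE student_id = NEW.student_id;
END;

INSERT INTO student (student_id) VALUES (1);
INSERT INTO answer VALUES (1, 1, 3), (1, 2, 4);
""")

def total(conn):
    return conn.execute(
        "SELECT total_score FROM student WHERE student_id = 1").fetchone()[0]

total_after_insert = total(conn)
conn.execute("UPDATE answer SET points = 5 WHERE student_id = 1 AND question_no = 1")
total_after_update = total(conn)
print(total_after_insert, total_after_update)
```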
Speed tables
• A speed table is a denormalized version of a hierarchy.
• Every parent has a row for every child that reports to it at any level, either directly or indirectly.
• A speed table optionally carries information such as the level within the hierarchy and whether or not the child is at the most detailed level of the hierarchy (bottom of the tree).
• Used when a tree-like hierarchy is to be stored in the database.
• Data is replicated within a speed table to increase the speed of data retrieval.
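A sketch of building a speed table from a parent/child hierarchy with a recursive common table expression. It assumes SQLite via Python's sqlite3 module; the org and speed tables and the sample reporting lines are invented for illustration. Each ancestor gets one row per direct or indirect descendant, plus the optional depth and leaf flag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE org (child TEXT PRIMARY KEY, parent TEXT);   -- normalized hierarchy
INSERT INTO org VALUES ('CEO', NULL), ('VP1', 'CEO'), ('VP2', 'CEO'),
                       ('Mgr', 'VP1'), ('Eng', 'Mgr');

CREATE TABLE speed (ancestor TEXT, descendant TEXT, depth INTEGER, is_leaf INTEGER);

WITH RECURSIVE closure(ancestor, descendant, depth) AS (
    SELECT parent, child, 1 FROM org WHERE parent IS NOT NULL      -- direct reports
    UNION ALL
    SELECT c.ancestor, o.child, c.depth + 1                        -- indirect reports
    FROM closure AS c JOIN org AS o ON o.parent = c.descendant
)
INSERT INTO speed
SELECT c.ancestor, c.descendant, c.depth,
       NOT EXISTS (SELECT 1 FROM org AS x WHERE x.parent = c.descendant)
FROM closure AS c;
""")

# Everyone under the CEO, at any level, without walking the tree at query time.
descendants = [r[0] for r in conn.execute(
    "SELECT descendant FROM speed WHERE ancestor = 'CEO' ORDER BY depth, descendant")]
pair_count = conn.execute("SELECT COUNT(*) FROM speed").fetchone()[0]
leaves = [r[0] for r in conn.execute(
    "SELECT DISTINCT descendant FROM speed WHERE is_leaf = 1 ORDER BY descendant")]
print(descendants, pair_count, leaves)
```

Five rows in the hierarchy become seven in the speed table; the replication grows with the depth of the tree.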
[Figure: NORMALISED versus DENORMALIZED tables]
CASE STUDY: Prejoin
A simplified retail example
Before denormalization:
SALES (sale_id, store_id, sale_dt, …)
SALES_DETAIL (tx_id, sale_id, item_id, …, item_qty, sale$)
Relationship: SALES 1 : M SALES_DETAIL
Prejoin Denormalization
A simplified retail example...
After denormalization:
SALES_AND_DETAILS (tx_id, sale_id, store_id, sale_dt, product_id, …, product_qty, $)
SAMPLE QUERY
• Q) What was my total volume between '06-AUG-08' and '06-AUG-09'?
• BEFORE denormalization:
select sum(sales_detail.product_qty)
from sales, sales_detail
where sales.sale_id = sales_detail.sale_id
  and sales.sale_date between TO_DATE('06-AUG-08','DD-MON-YY')
                          and TO_DATE('06-AUG-09','DD-MON-YY');
Sample Query 2
• Q) What was my total volume between '06-AUG-08' and '06-AUG-09'?
• AFTER denormalization:
select sum(product_qty)
from sales_and_details
where sales_and_details.sale_date between TO_DATE('06-AUG-08','DD-MON-YY')
                                      and TO_DATE('06-AUG-09','DD-MON-YY');
Sample Query 3
• What happens if we ask about the number of "sales" rather than the quantity transacted?
• BEFORE denormalization:
select count(*)
from sales
where sales.sale_date between TO_DATE('06-AUG-08','DD-MON-YY')
                          and TO_DATE('06-AUG-09','DD-MON-YY');
• What happens if we ask about the number of "sales" rather than the quantity transacted?
• AFTER denormalization:
select count(distinct sale_id)
from sales_and_details
where sales_and_details.sale_date between TO_DATE('06-AUG-08','DD-MON-YY')
                                      and TO_DATE('06-AUG-09','DD-MON-YY');
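The four sample queries can be checked against each other on a small data set. This sketch runs on SQLite via Python's sqlite3 module rather than Oracle, so dates are stored as ISO-8601 text and compared as strings instead of going through TO_DATE. Column names follow the queries (sale_date, product_qty), and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, store_id INTEGER, sale_date TEXT);
CREATE TABLE sales_detail (tx_id INTEGER PRIMARY KEY, sale_id INTEGER,
                           product_id INTEGER, product_qty INTEGER);
INSERT INTO sales VALUES (1, 10, '2008-09-01'), (2, 10, '2009-01-15'), (3, 11, '2010-01-01');
INSERT INTO sales_detail VALUES (100, 1, 501, 2), (101, 1, 502, 3),
                                (102, 2, 501, 1), (103, 3, 503, 7);

-- The pre-joined (denormalized) table: one row per detail line.
CREATE TABLE sales_and_details AS
SELECT d.tx_id, s.sale_id, s.store_id, s.sale_date, d.product_id, d.product_qty
FROM sales AS s JOIN sales_detail AS d ON d.sale_id = s.sale_id;
""")

period = ("2008-08-06", "2009-08-06")

def scalar(sql):
    return conn.execute(sql, period).fetchone()[0]

volume_before = scalar("""SELECT SUM(d.product_qty) FROM sales AS s
                          JOIN sales_detail AS d ON d.sale_id = s.sale_id
                          WHERE s.sale_date BETWEEN ? AND ?""")
volume_after = scalar("""SELECT SUM(product_qty) FROM sales_and_details
                         WHERE sale_date BETWEEN ? AND ?""")
sales_before = scalar("SELECT COUNT(*) FROM sales WHERE sale_date BETWEEN ? AND ?")
sales_after = scalar("""SELECT COUNT(DISTINCT sale_id) FROM sales_and_details
                        WHERE sale_date BETWEEN ? AND ?""")
# Without DISTINCT the denormalized table counts detail lines, not sales.
naive_after = scalar("SELECT COUNT(*) FROM sales_and_details WHERE sale_date BETWEEN ? AND ?")
print(volume_before, volume_after, sales_before, sales_after, naive_after)
```

The volume query gets simpler after denormalization, while the count query needs DISTINCT to stay correct, because each sale is now repeated once per detail line.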
PROS
Convenience
• With stored calculated values, it is far easier for programmers to generate reports without having to write code to calculate them.
• Saves CPU time.
Simple queries
• Each eliminated JOIN leaves a simpler query that is easier to get right the first time, easier to debug, and easier to keep correct when changed.
PROS
The performance argument
• Performance (speed) improves because fewer JOINs are needed to retrieve the same number of facts.
The storage argument
• Data is available at the locations where it will be used.
• The number of foreign keys (which define how separate tables are related) is reduced, and so is the number of indexes (foreign keys are frequently indexed).
CONS
• Leads to data duplication and increases the storage requirement of the database.
• Adds work: documenting decisions, ensuring valid data, and handling data migration.
• Having multiple copies leads to synchronization issues.
• Increases update time.
Physically speaking…
Performance is determined entirely at the physical database level:
– Storage and access methods
– Hardware
– Physical design
– DBMS implementation details
– Degree of concurrent access
AN ILLUSION
The case for denormalization rests on the understanding that:
1. The higher the level of normalization, the greater the number of tables.
2. A greater number of tables requires more joins.
3. Joins slow performance.
4. Denormalization reduces the number of tables and hence the number of joins, which improves performance.
The problem is that points 2 and 3 are not necessarily true, in which case point 4 does not hold. Even if they do hold true, denormalization still complicates integrity enforcement.
• It is claimed that, from the integrity perspective, there are two database design options:
– Fully normalize the database, thereby maximizing the simplicity of integrity enforcement.
– Denormalize the database and complicate integrity enforcement.
• According to the illusion argument, the first choice is the better option.
• Why, then, the prevailing insistence on the second choice? The argument for denormalization is, of course, based on performance considerations.
Conclusion
• In a real-life project, you often have to bring back some data redundancy for performance reasons.
• Database design is about efficient data engineering: making tradeoffs between design choices and choosing the right design for the performance requirements.
• As stated by most database practitioners, denormalization may or may not result in better performance or a more flexible data structure for users.
• Selective denormalization is usually required. Weigh and decide whether the perceived benefits are worth the effort needed to maintain the database properly.
• Weighing the pros and cons of denormalization is therefore of vital importance.
References
• [1] G. Lawrence Sanders and Seung Kyoon Shin, Denormalization Effects on Performance of RDBMS, Proceedings of the 34th Hawaii International Conference on System Sciences, 2001.
• [2] Seung Kyoon Shin and G. Lawrence Sanders, Denormalization strategies for data retrieval from data warehouses.
• [3] Marsha Hanus, To Normalize or Denormalize, That is the Question, Candle Corporation.
• [4] Craig S. Mullins, Denormalization Guidelines, PLATINUM technology, inc., June 1, 1997.
• [5] Douglas B. Bock and John F. Schrage, Department of Computer Management and Information Systems, Southern Illinois University Edwardsville, published in the 1996 Proceedings of the Decision Sciences Institute, Orlando, Florida, November 1996.
• [6] Fabian Pascal, The Dangerous Illusion: Denormalization, Performance and Integrity, Part 1 and Part 2, DM Review Magazine, July 2002.
• [7] Zhou Wei (Tsinghua University, Beijing, China), Jiang Dejun (Tsinghua University), Guillaume Pierre (Vrije Universiteit Amsterdam), Chi-Hung Chi (Tsinghua University), Maarten van Steen (Vrije Universiteit Amsterdam), Service-Oriented Data Denormalization for Scalable Web Applications, April 21-25, 2008, Beijing, China.
• [8] Michael J. Hernandez, Understanding Normalisation, 2001-2003.
• [9] Morteza Zaker, Somnuk Phon-Amnuaisuk, Su-Cheng Haw, Hierarchical Denormalizing: A Possibility to Optimize the Data Warehouse Design.
• [10] Eghosa Ugboma, How Valuable is Planned Data Redundancy in Maintaining the Integrity of an Information System through its Database, Florida Memorial University.
• [11] Zornitsa Zaharieva, Introduction to Databases, Database Design and SQL, CERN.
• [12] The Data Administration Newsletter, TDAN.com.
THANK YOU
Anomalies
• Anomalies are inconsistencies in data that occur due to unnecessary redundancy.
• Update anomaly
– Some copies of a data item are updated, but others are not.
• Insertion anomaly
– Can't insert "real" data without also inserting unrelated or "made up" data.
• Deletion anomaly
– Can't delete some data without also deleting other, unrelated data.
First Normal Form (1NF)
If a table of data meets the definition of a relation, it is in first normal form.
– Every relation has a unique name.
– Every attribute value is atomic (single-valued).
– Every row is unique.
– Attributes in tables have unique names.
– The order of the columns is irrelevant.
– The order of the rows is irrelevant.
Second Normal Form (2NF)
• 1NF and no partial functional dependencies.
• Partial functional dependency: when one or more non-key attributes are functionally dependent on part of the primary key.
• Every non-key attribute must be defined by the entire key, not just by part of the key.
• If a relation has a single attribute as its key, then it is automatically in 2NF.
Second normal form (2NF)
A relation that is not in 2NF
ACTIVITY (Student_ID, Activity, Fee)
Student_ID    Activity   Fee
222-22-2020   Swimming   30
232-22-2111   Golf       100
222-22-2020   Golf       100
255-24-2332   Hiking     50
Key: Student_ID, Activity
Activity → Fee (Fee is determined by Activity)
Divide the relation into two relations that now meet 2NF
STUDENT_ACTIVITY (Student_ID, Activity)
Student_ID    Activity
222-22-2020   Swimming
232-22-2111   Golf
222-22-2020   Golf
255-24-2332   Hiking
Key: Student_ID and Activity
ACTIVITY_COST (Activity, Fee)
Activity   Fee
Swimming   30
Golf       100
Hiking     50
Key: Activity
Activity → Fee
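The decomposition above can be checked mechanically. This sketch assumes SQLite via Python's sqlite3 module and lower-cased table and column names. It projects ACTIVITY into the two 2NF relations and confirms that joining them on Activity returns exactly the original rows, with each fee now stored once.

```python
import sqlite3

rows = [
    ("222-22-2020", "Swimming", 30),
    ("232-22-2111", "Golf", 100),
    ("222-22-2020", "Golf", 100),
    ("255-24-2332", "Hiking", 50),
]

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE activity (student_id TEXT, activity TEXT, fee INTEGER,
                                       PRIMARY KEY (student_id, activity))""")
conn.executemany("INSERT INTO activity VALUES (?, ?, ?)", rows)

conn.executescript("""
-- Projection on the full key.
CREATE TABLE student_activity AS SELECT student_id, activity FROM activity;
-- Projection on the partial dependency Activity -> Fee; DISTINCT removes the repeats.
CREATE TABLE activity_cost AS SELECT DISTINCT activity, fee FROM activity;
""")

rejoined = set(conn.execute("""
SELECT sa.student_id, sa.activity, ac.fee
FROM student_activity AS sa
JOIN activity_cost AS ac ON ac.activity = sa.activity
"""))
cost_rows = conn.execute("SELECT COUNT(*) FROM activity_cost").fetchone()[0]
print(rejoined == set(rows), cost_rows)
```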
Third Normal Form (3NF)
• 2NF and no transitive dependencies.
• Transitive dependency: a functional dependency between two or more non-key attributes.
A relation with a transitive dependency
HOUSING (Student_ID, Building, Fee)
Student_ID    Building    Fee
222-22-2020   Dabney      1200
232-22-2111   Liles       1000
222-22-5554   The Range   1100
255-24-2332   Dabney      1200
330-25-7789   The Range   1100
Key: Student_ID
Building → Fee
Student_ID → Building → Fee
Divide the relation into two relations that now meet 3NF
STUDENT_HOUSING (Student_ID, Building)
Key: Student_ID
Student_ID → Building
BUILDING_COST (Building, Fee)
Building    Fee
Dabney      1200
Liles       1000
The Range   1100
Key: Building
Building → Fee
Third Normal Form (3NF)
• In 2NF, and every non-key column is mutually independent. This rules out stored calculations.
• Solution: put calculations in queries and forms.
OrderDetails (OrderID, Item, Quantity, Price)
Put the expression in a text control or in a query: =Quantity * Price
Item     Quantity   Price   Total
Hammer   2          $10     $20
Saw      5          $40     $200
Nails    8          $1      $8
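Putting the calculation in the query can be sketched as follows, using SQLite via Python's sqlite3 module. The OrderID values and the lower-cased table and column names are assumptions; item, quantity, and price come from the table above. Total is computed when the query runs and is never stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_details (order_id INTEGER, item TEXT, quantity INTEGER, price INTEGER);
INSERT INTO order_details VALUES (1, 'Hammer', 2, 10), (1, 'Saw', 5, 40), (1, 'Nails', 8, 1);
""")

# Total is derived in the SELECT list, so it cannot drift out of step with its inputs.
totals = conn.execute("""
SELECT item, quantity, price, quantity * price AS total
FROM order_details
ORDER BY rowid
""").fetchall()
print(totals)
```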
BCNF
• 3NF and every determinant is a candidate key.
A relation where a determinant is not a candidate key
STUDENT_ADVISOR (Student_ID, Major, Advisor)
Student_ID    Major        Advisor
222-22-2020   MIS          Leigh
232-22-2111   Management   Gowan
222-22-2020   MIS          Roberts
222-22-2111   Marketing    Reynolds
255-24-2332   Marketing    Reynolds
Note: Students can have a double major and have an advisor for each major. An advisor works only with students in their assigned area.
Primary Key: Student_ID, Major
Candidate Key: Student_ID, Advisor
Advisor → Major
Divide the relation into two relations that meet BCNF
STUDENT_ADVISOR (Student_ID, Advisor)
Student_ID    Advisor
222-22-2020   Leigh
232-22-2111   Gowan
222-22-2020   Roberts
222-22-2111   Reynolds
255-24-2332   Reynolds
Key: Student_ID, Advisor
ADVISOR_MAJOR (Advisor, Major)
Advisor    Major
Leigh      MIS
Gowan      Management
Roberts    MIS
Reynolds   Marketing
Key: Advisor
Advisor → Major
[Figure: Speed Tables]