Data Warehouse Concepts
Dec 22, 2015
DWH-Training Material 2
Chapter 1
• Data, Information, Knowledge, Decision
• Analysis
• Report
Chapter 2
• Normalization
• OLTP Systems
• Characteristics of OLTP
Chapter 3
• Data Warehouse
• Advantages of Data Warehouse
• Goals of Data Warehouse
Chapter 4
• Characteristics of Data Warehouse
• Difference between OLTP/DW
• OLAP
• Data Warehouse/Data Mart
• Data Warehouse Strategies
Chapter 5
• Dimensional Modeling
• Star Schema
• Snowflake Schema
• Dimension Table
• Conformed Dimension
• Degenerate Dimension
Chapter 6
• Fact Table
• Types of Fact
• Metadata Management
Chapter 7
• Grain Level
• Surrogate Key
• Time Dimension
• Staging Area
• Slowly Changing Dimensions
Chapter 8
• Project Overview
• Phases of Project
Data >> Decision
• Data - Raw observations, no meaning
• Information - Meaning by relational connection
• Knowledge - Appropriate collection of information; the intent is to be useful and to change the business process
What is Knowledge?
• Data - Raw facts, numbers, readily captured
• Information - Data in context
• Knowledge - Information + experience
• Action - Knowledge applied to decision making
[Diagram axis: Strategic Value]
Analysis
• Comparison of sales (fact) of a product (dimension) over years (dimension) in the same region (dimension).
• What is the total sales value (fact) of a particular product (dimension) in a store (dimension) over three months (dimension)?
• What is the amount spent (fact) on a particular product promotion (dimension) in a particular branch (dimension), in a particular city (dimension), in a year (dimension)?
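The second question above can be sketched in a few lines of Python. This is only an illustration: the rows, the product and store names, and the `total_sales` helper are all made up for the example and do not come from any real system.

```python
# Hypothetical sales rows: (product, store, month, sales_value).
# product, store and month are dimensions; sales_value is the fact.
sales = [
    ("Tea", "Store A", "2015-01", 120.0),
    ("Tea", "Store A", "2015-02", 80.0),
    ("Tea", "Store A", "2015-03", 100.0),
    ("Tea", "Store B", "2015-01", 50.0),
    ("Coffee", "Store A", "2015-01", 200.0),
    ("Tea", "Store A", "2015-04", 70.0),
]


def total_sales(rows, product, store, months):
    """Sum the fact for one product, one store and a set of months."""
    return sum(
        value
        for p, s, m, value in rows
        if p == product and s == store and m in months
    )


first_quarter = {"2015-01", "2015-02", "2015-03"}
q1_tea_store_a = total_sales(sales, "Tea", "Store A", first_quarter)
print(q1_tea_store_a)  # 300.0
```

The dimensions only select which rows take part; the fact is the value that gets aggregated.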
• Report: Collection of data
• Purpose: Analysis - comparative study of data, historical data
• Final: Improve decisions
Normalization
• Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process:
– Eliminating redundant data
– Ensuring data dependencies make sense
• First Normal Form
– First Normal Form (1NF) sets the very basic rules for an organized database
– Eliminate duplicate columns from the same table
– Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key)
• Second Normal Form
– Second Normal Form (2NF) further addresses the concept of removing duplicative data
– Meet all the requirements of the first normal form
– Remove subsets of data that apply to multiple rows of a table and place them in separate tables
– Create relationships between these new tables and their predecessors through the use of foreign keys
• Third Normal Form
– Meet all the requirements of the second normal form
– Remove columns that are not dependent upon the primary key
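As a rough illustration of the idea, the sketch below splits a flat list of order rows, where customer details repeat on every order, into a customer table and an order table linked by a foreign key. The row layout and the `normalize` helper are assumptions made for this example only.

```python
# Hypothetical denormalized rows: customer details repeat on every order.
flat_orders = [
    {"order_id": 1, "customer": "Asha", "city": "Pune", "amount": 250},
    {"order_id": 2, "customer": "Asha", "city": "Pune", "amount": 100},
    {"order_id": 3, "customer": "Ravi", "city": "Chennai", "amount": 75},
]


def normalize(rows):
    """Split flat rows into a customer table and an order table."""
    customer_ids = {}  # (customer, city) -> customer_id (primary key)
    orders = []
    for row in rows:
        key = (row["customer"], row["city"])
        if key not in customer_ids:
            customer_ids[key] = len(customer_ids) + 1
        orders.append({
            "order_id": row["order_id"],
            "customer_id": customer_ids[key],  # foreign key
            "amount": row["amount"],
        })
    customers = [
        {"customer_id": cid, "customer": name, "city": city}
        for (name, city), cid in customer_ids.items()
    ]
    return customers, orders


customer_table, order_table = normalize(flat_orders)
print(customer_table)
print(order_table)
```

After the split, each customer's city is stored once, and the order rows carry only the key that points to it.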
Information System/OLTP Systems
• OLTP systems - highly normalized databases
• Purpose of OLTP systems is to capture data
• Perform DML activities
• Purpose of a Data Warehouse is multidimensional analysis
• OLTP applications include Equity Plans, Shares, Insurance, Loans, Savings
Characteristics - OLTP
Characteristics            OLTP
Operation                  Insert/Update
Analytical Requirements    Low
Data per Transaction       Small
Data Level                 Detailed
Orientation                Records
Business Intelligence
• From an information systems standpoint, BI provides users with online analytical processing or data analysis capabilities to predict trends, evaluate business questions, and so on.
• From a business analyst viewpoint, it is the process of gathering high-quality, meaningful information about a subject, which enables the analyst to draw conclusions.
Data Warehouse
• Data warehousing is the entire process of extracting, transforming, and loading data into the warehouse, together with the access to that data by end users and applications.
Advantages through DW
• Acquire new customers
• Retain existing customers
• Improve customer satisfaction
• Sell more products
Goals of Data Warehouse
• Easy access to organization information
• Data Warehouse must be adaptive and resilient to change
• Secure environment to protect information assets
• Foundation for improved decision making
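Data Warehouse Characteristics
• Subject-Oriented - organized around business subjects such as customers, products, or sales
• Integrated - data from multiple sources is brought into a consistent format
• Non-Volatile - data is loaded and read, not updated by day-to-day transactions
• Time-Variant - data is kept with a time element so that history can be analyzed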
Difference - OLTP and DW
• They are both databases
• They both hold data
• But they have been designed for different scopes: running the business (OLTP systems) v/s managing the business (DWH):
– Operational systems focus on present data; DWHs focus on historical data (present and past)
– OLTP systems are optimized to insert/update and store data; DWHs are optimized to select/analyze data
OLTP v/s Data Warehouse
Property        OLTP                             OLAP (DW)
Access          Read/Write                       Read - lots of scans
Unit of Work    Short, simple transaction        Query
# Users         Thousands                        Hundreds
DB Size         100 MB - GB                      100 GB - Terabytes
Function        Day-to-day operations            Decision support
DB Design       Application oriented             Subject oriented
Data            Current, up to date, detailed    Historical, summarized
OLAP
• OLAP is an acronym for Online Analytical Processing. OLAP performs multidimensional analysis of business data and provides the capability for complex calculations and trend analysis. OLAP enables end users to perform ad hoc analysis of data in multiple dimensions, thereby providing the insight and understanding they need for better decision making.
• OLAP operations
– Roll-up
– Drill-down
– Slice and dice
– Pivot (rotate)
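The four operations can be illustrated on a tiny in-memory cube. The sketch below is not how an OLAP engine is implemented; the cube contents, the dimension names, and the helper functions are all assumptions chosen to keep the example short.

```python
from collections import defaultdict

# Hypothetical cube: (year, quarter, region, product) -> sales value.
DIMS = ("year", "quarter", "region", "product")
cells = {
    (2014, "Q1", "North", "Tea"): 10,
    (2014, "Q2", "North", "Tea"): 20,
    (2014, "Q1", "South", "Tea"): 5,
    (2015, "Q1", "North", "Coffee"): 30,
    (2015, "Q1", "South", "Tea"): 15,
}


def roll_up(cube, keep):
    """Aggregate to a coarser level by keeping only the named dimensions."""
    positions = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[tuple(key[i] for i in positions)] += value
    return dict(totals)


def slice_cube(cube, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}


def dice(cube, allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return {
        k: v
        for k, v in cube.items()
        if all(k[DIMS.index(d)] in values for d, values in allowed.items())
    }


def pivot(table):
    """Pivot: swap the two axes of a two-dimensional result."""
    return {(b, a): v for (a, b), v in table.items()}


by_year = roll_up(cells, ["year"])                     # roll-up to year
by_year_quarter = roll_up(cells, ["year", "quarter"])  # drill-down to quarter
north_only = slice_cube(cells, "region", "North")
north_2014 = dice(cells, {"region": {"North"}, "year": {2014}})
product_by_region = pivot(roll_up(cells, ["region", "product"]))

print(by_year)          # {(2014,): 35, (2015,): 45}
print(by_year_quarter)  # {(2014, 'Q1'): 15, (2014, 'Q2'): 20, (2015, 'Q1'): 45}
```

Drill-down is shown here simply as a roll-up that keeps one more dimension, which is the reverse direction of the first call.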
Data Mart – Data Warehouse
• A Data Mart stores data for a limited number of subject areas, such as marketing or sales data.
• A Data Warehouse deals with multiple subject areas and is typically implemented and controlled by a central organizational unit such as the corporate IT group. It is often called a central or enterprise data warehouse.
Data Warehouse / Data Marts
Property               Data Warehouse     Data Mart
Scope                  Enterprise         Department
Subjects               Multiple           Single
Data Source            Many               Few
Implementation time    Months to Years    Months
Data Warehousing Strategies
• Enterprise-wide warehouse, top down: the Inmon methodology
• Data mart, bottom up: the Kimball methodology
• When properly executed, both result in an enterprise-wide data warehouse, but with different architectures
Top Down Approach
[Diagram: Operational Systems and External Data feed the Data Warehouse, which in turn feeds the Data Marts (Marketing, Sales, Finance)]
Bottom Up Approach
[Diagram: Legacy Data, Operations Data, and External data sources feed the Data Marts (Marketing, Finance, Sales), which are then combined into the Data Warehouse]
Dimensional Modeling
• Dimensional Modeling provides users the ability to view data based on the organization of the business and the important characteristics of the data.
• There are two major components of dimensional analysis: Dimensions, which determine how data will be presented; and Facts, which determine what data will be presented.
Dimension Table Examples
• Retail - store name, zip code, product name, product category, day of the week
• Telecommunications - call origin, call destination
• Banking - customer name, account number, branch, account officer
• Insurance - policy type, insured party
Dimension Table Characteristics
Dimension tables have the following characteristics:
• Contain textual information that represents the attributes of the business
• Contain relatively static data
• Are joined to a fact table through a foreign key reference
• Are hierarchical in nature and provide the ability to view data at varying levels of detail
Fact Table Examples
• Retail - number of units sold, sales amount
• Telecommunications - length of the call in minutes, average number of calls
• Banking - average monthly balance
• Insurance - claims amount
Fact Table Characteristics
• Fact tables have the following characteristics:
– Contain numerical metrics of the business
– Can hold large volumes of data
– Can grow quickly
– Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables
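A small star schema makes the fact/dimension relationship concrete. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table names, column names, and sample rows are assumptions made up for the retail example, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: textual attributes, one primary key each.
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    product_category TEXT
);
CREATE TABLE store_dim (
    store_key INTEGER PRIMARY KEY,
    store_name TEXT,
    zip_code TEXT
);
-- Fact table: numeric measures plus foreign keys to the dimensions.
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim(product_key),
    store_key INTEGER REFERENCES store_dim(store_key),
    units_sold INTEGER,
    sales_amount REAL
);
INSERT INTO product_dim VALUES (1, 'Green Tea', 'Beverages');
INSERT INTO product_dim VALUES (2, 'Notebook', 'Stationery');
INSERT INTO store_dim VALUES (1, 'Downtown', '10001');
INSERT INTO store_dim VALUES (2, 'Airport', '10002');
INSERT INTO sales_fact VALUES (1, 1, 3, 30.0);
INSERT INTO sales_fact VALUES (1, 2, 2, 20.0);
INSERT INTO sales_fact VALUES (2, 1, 5, 12.5);
""")

# Join the fact table to a dimension and aggregate by a dimension attribute.
rows = conn.execute("""
    SELECT p.product_category, SUM(f.units_sold), SUM(f.sales_amount)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_key = f.product_key
    GROUP BY p.product_category
    ORDER BY p.product_category
""").fetchall()
print(rows)  # [('Beverages', 5, 50.0), ('Stationery', 5, 12.5)]
```

The query reads measures from the fact table and uses the dimension only to decide how they are grouped and labelled.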
Conformed Dimensions
• A dimension table that is shared across data marts or by more than one fact table
• Example:
– Calendar/Date/Time Dimension
– Customer Dimension
– Product Dimension
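One way to see why sharing matters: when two fact tables use the same date dimension, their results can be lined up month by month. The sketch below assumes a hypothetical sales fact and returns fact that both carry a `date_key`; all names and figures are invented for illustration.

```python
# One date dimension shared by two fact tables (hypothetical data).
date_dim = {
    1: {"date": "2015-12-01", "month": "2015-12"},
    2: {"date": "2015-12-02", "month": "2015-12"},
    3: {"date": "2016-01-05", "month": "2016-01"},
}
sales_fact = [
    {"date_key": 1, "amount": 100},
    {"date_key": 2, "amount": 50},
    {"date_key": 3, "amount": 70},
]
returns_fact = [
    {"date_key": 2, "amount": 20},
    {"date_key": 3, "amount": 10},
]


def by_month(fact_rows):
    """Total a fact table by the month attribute of the shared dimension."""
    totals = {}
    for row in fact_rows:
        month = date_dim[row["date_key"]]["month"]
        totals[month] = totals.get(month, 0) + row["amount"]
    return totals


sales_by_month = by_month(sales_fact)
returns_by_month = by_month(returns_fact)

# Because both facts use the same dimension, the months match exactly.
net_by_month = {
    month: sales_by_month[month] - returns_by_month.get(month, 0)
    for month in sales_by_month
}
print(net_by_month)  # {'2015-12': 130, '2016-01': 60}
```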
Degenerate Dimension
• A degenerate dimension is something dimensional in nature that exists only in the fact table, with no dimension table of its own (for example, an order or invoice number)
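A short sketch of how such a column is used, assuming a hypothetical line-item fact table in which `order_number` is stored directly on each row:

```python
# Hypothetical line-item fact rows. order_number is dimensional in nature
# (it groups rows) but has no dimension table behind it.
line_item_fact = [
    {"order_number": "SO-1001", "product_key": 1, "sales_amount": 30.0},
    {"order_number": "SO-1001", "product_key": 2, "sales_amount": 12.5},
    {"order_number": "SO-1002", "product_key": 1, "sales_amount": 20.0},
]


def total_per_order(rows):
    """Group fact rows by the degenerate dimension and sum the measure."""
    totals = {}
    for row in rows:
        number = row["order_number"]
        totals[number] = totals.get(number, 0.0) + row["sales_amount"]
    return totals


order_totals = total_per_order(line_item_fact)
print(order_totals)  # {'SO-1001': 42.5, 'SO-1002': 20.0}
```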
Fact Tables
• Types of Measures
– Additive facts
– Non-additive facts
– Semi-additive facts
Fact Tables
• Additive Facts
– Additive facts are facts that can be summed up through all of the dimensions in the fact table.
– Example: Dollar value is an additive fact. If we want to find out the amount for a particular place for a particular period of time, we can add the dollar amounts and come up with the total amount.
• Non-Additive Facts
– Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
– Example: Height measured for ‘citizens by geographical location’. When we roll up ‘city’ data to ‘state’ level data, we should not add the heights of the citizens; rather, we may want to use it to derive a ‘count’.
– Example: Percentage (%)
• Semi-Additive Facts
– Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
– Example: An account balance can be summed across accounts, but not across time periods.
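The three kinds of measure behave differently under aggregation, which a few lines of Python can show. The account snapshots and margin figures below are invented for the example; the point is only which sums are meaningful.

```python
# Hypothetical daily account snapshots: (day, account, balance, deposits).
snapshots = [
    ("Mon", "A", 100, 100),
    ("Mon", "B", 200, 200),
    ("Tue", "A", 150, 50),
    ("Tue", "B", 200, 0),
]

# Additive: deposits can be summed across every dimension.
total_deposits = sum(dep for _, _, _, dep in snapshots)

# Semi-additive: balances can be summed across accounts for one day...
tuesday_balance = sum(bal for day, _, bal, _ in snapshots if day == "Tue")

# ...but not across days. Summing account A over Mon and Tue gives a number
# that is not a balance; use the closing value or an average instead.
account_a = [bal for _, acct, bal, _ in snapshots if acct == "A"]
misleading_sum_a = sum(account_a)
closing_balance_a = account_a[-1]
average_balance_a = sum(account_a) / len(account_a)

# Non-additive: percentages. Recompute them from additive components.
stores = [
    {"revenue": 100, "cost": 50},   # margin 50%
    {"revenue": 300, "cost": 270},  # margin 10%
]
total_revenue = sum(s["revenue"] for s in stores)
total_cost = sum(s["cost"] for s in stores)
overall_margin = (total_revenue - total_cost) / total_revenue

print(total_deposits, tuesday_balance, closing_balance_a, average_balance_a)
print(round(overall_margin, 2))  # 0.2, not 0.5 + 0.1
```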
Factless Fact Table
• A factless fact table is a fact table that does not have any measures.
• Fact table: Teacher_FK, Course_FK, Student_FK, Location_FK
• Student Dimension: Student_PK
• Course Dimension: Course_PK
• Location Dimension: Location_PK
• Teacher Dimension: Teacher_PK
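Although a factless fact table stores no measures, counting its rows still answers questions. The sketch below reuses the four foreign keys from the slide; the key values and the attendance interpretation are assumptions for illustration.

```python
from collections import Counter

# Each row records that an event happened (a student attended a course
# taught by a teacher at a location). There is no measure column.
attendance_fact = [
    {"teacher_fk": 1, "course_fk": 10, "student_fk": 100, "location_fk": 5},
    {"teacher_fk": 1, "course_fk": 10, "student_fk": 101, "location_fk": 5},
    {"teacher_fk": 2, "course_fk": 11, "student_fk": 100, "location_fk": 6},
]

# The analysis is a count of rows per dimension key.
students_per_course = Counter(row["course_fk"] for row in attendance_fact)
courses_per_student = Counter(row["student_fk"] for row in attendance_fact)

print(students_per_course[10])   # 2
print(courses_per_student[100])  # 2
```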
Metadata
• It is data about data
• Vital to the warehouse
• Used by everyone
• The key to understanding warehouse information
Grain Level
• Level of detail at which the data has to be captured in the fact table
• Example:
– Each sales transaction
– Insurance claim transaction
– Monthly account
Surrogate Keys
• A surrogate key has no meaning other than uniquely identifying each record; it is used to implement the primary keys of almost all dimension tables and is stored in the fact table as a foreign key
• It is just a sequence number
• Advantages of surrogate keys include:
– Control over data
– Avoid using the OLTP keys as data warehouse keys
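A minimal sketch of surrogate key assignment, assuming keys are handed out from a simple in-memory sequence. The `SurrogateKeyGenerator` class and the source system names are hypothetical; a real warehouse would typically rely on a database sequence or identity column.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Map OLTP (natural) keys to meaningless warehouse sequence numbers."""

    def __init__(self):
        self._sequence = count(1)
        self._assigned = {}  # (source_system, natural_key) -> surrogate key

    def key_for(self, source_system, natural_key):
        lookup = (source_system, natural_key)
        if lookup not in self._assigned:
            self._assigned[lookup] = next(self._sequence)
        return self._assigned[lookup]


keys = SurrogateKeyGenerator()
crm_key = keys.key_for("crm", "CUST-7")
billing_key = keys.key_for("billing", "CUST-7")  # same text, other system
crm_key_again = keys.key_for("crm", "CUST-7")

print(crm_key, billing_key, crm_key_again)  # 1 2 1
```

Because the warehouse owns the sequence, two source systems that reuse the same OLTP key do not collide.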
Data Staging
• Often used as an interim step between data extraction and later steps
• No end user access to staging
Source >> Staging >> Target
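The Source >> Staging >> Target flow can be sketched as three small functions. Everything here is an assumption made for illustration: the row layout, the cleaning rules, and the use of plain Python lists in place of real source, staging, and target databases.

```python
# Hypothetical source rows, as raw text from an operational system.
source_rows = [
    {"id": "1", "name": " asha ", "amount": "250.50"},
    {"id": "2", "name": "Ravi", "amount": "not-a-number"},
    {"id": "3", "name": "Mina", "amount": "75"},
]


def extract(rows):
    """Copy source rows into staging unchanged."""
    return [dict(row) for row in rows]


def transform(staged_rows):
    """Clean and type-convert staged rows; set aside rows that fail."""
    clean, rejected = [], []
    for row in staged_rows:
        try:
            clean.append({
                "id": int(row["id"]),
                "name": row["name"].strip().title(),
                "amount": float(row["amount"]),
            })
        except ValueError:
            rejected.append(row)
    return clean, rejected


def load(target_table, rows):
    """Append cleaned rows to the target table."""
    target_table.extend(rows)


staging = extract(source_rows)
clean_rows, rejected_rows = transform(staging)
target = []
load(target, clean_rows)

print(len(target), len(rejected_rows))  # 2 1
```

End users would query only `target`; the staging copy and the rejected rows stay internal to the load process.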
Slowly Changing Dimensions (SCD)
• Slowly changing dimensions change gradually and occasionally over time.
• Example: Employees change their address, name, or marital status
SCD Approach and Results
• Type 1
– Approach: Overwrite the old values in the dimension record
– Results: Only current; loses the ability to track the old history
• Type 2
– Approach: Create an additional dimension record (with a time stamp) at the time of the change, with the new attribute values
– Results: History + current; segments history very accurately between the old description and the new description
• Type 3
– Approach: Create new ‘current’ fields and move the old attribute value into a ‘previous’ field
– Results: Previous + current; describes both the historical and the current view
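The three approaches can be compared side by side on an employee who moves from one city to another. This is a sketch under assumptions: the column names (`start_date`, `end_date`, `previous_<attribute>`), the use of a new surrogate key for Type 2, and the helper functions are illustrative choices, not a standard API.

```python
from datetime import date


def scd_type1(dimension, key, **changes):
    """Type 1: overwrite in place; the old values are lost."""
    dimension[key].update(changes)


def scd_type2(rows, natural_key, change_date, **changes):
    """Type 2: close the current row and add a new time-stamped row."""
    current = next(
        r for r in rows
        if r["natural_key"] == natural_key and r["end_date"] is None
    )
    current["end_date"] = change_date
    new_row = {
        **current,
        **changes,
        "surrogate_key": max(r["surrogate_key"] for r in rows) + 1,
        "start_date": change_date,
        "end_date": None,
    }
    rows.append(new_row)
    return new_row


def scd_type3(dimension, key, attribute, new_value):
    """Type 3: keep the prior value in a 'previous_<attribute>' field."""
    row = dimension[key]
    row["previous_" + attribute] = row[attribute]
    row[attribute] = new_value


type1_dim = {1: {"name": "Asha", "city": "Pune"}}
scd_type1(type1_dim, 1, city="Mumbai")

type2_dim = [{
    "surrogate_key": 1, "natural_key": "E-7", "city": "Pune",
    "start_date": date(2014, 1, 1), "end_date": None,
}]
scd_type2(type2_dim, "E-7", date(2015, 12, 22), city="Mumbai")

type3_dim = {1: {"name": "Asha", "city": "Pune"}}
scd_type3(type3_dim, 1, "city", "Mumbai")

print(type1_dim[1])    # only the current city remains
print(len(type2_dim))  # 2 rows: history + current
print(type3_dim[1])    # current city plus previous_city
```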
[Diagram: project team roles - Project Manager, Business Analyst, Architect, Source System Study, Data Modeler, ETL Lead, ETL Devs/Cons, OLAP Lead, OLAP Devs/Cons, DBA, Test Lead, Tester]
Phases of Project
Phase 1 - Define
Phase 2 - Analysis
Phase 3 - Design
Phase 4 - Build
Phase 5 - Test
Phase 6 - Production
The Define Phase
Sol ID Hand off
Revisit Effort Estimation
Business Vision/Goal
Project Plan
Resource Plan
Analyze Risk
Communication Plan
Escalation Plan
CTS’s or CTQ’s
Sample Weekly Report
The Analysis Phase
Sample Report Requirement
Source System Study
Business Requirement
Gap Analysis
Fact Dimension Matrix
Initiate Capacity Planning
Evaluate ETL Tools
Evaluate OLAP Tools
Loading Strategy-ETL
Availability of reusable components
Technical Architecture Strategy
The Design Phase
Design Technical Architecture
Logical Model
Design Alternate Solution
Physical Model
Set up Dev/Test Environment
Design ETL Architecture
ETL Specification
ETL Test Plan
Design OLAP Architecture
Reporting Specifications
Reporting Test Plan
The Build Phase
Create Database
Build ETL Mappings
Test ETL Mappings
Build OLAP Reports
Test OLAP Reports
The Test Phase
Train End users
Report Validation
Data Load Testing
UAT-User Acceptance Testing
Production Readiness Checklist