04/21/23 TCS Confidential 1
Course Roadmap
• Why we use Data Warehousing
• Difference between Operational System and Data Warehouse
• Introduction to Data Warehousing
• Emergence of Decision Support Systems
• Data Warehousing Approaches
• Data Warehouse Technical Architecture
• Data Modelling concepts
• Operational Data Store
• Schema Design of Data warehouse
• Data Acquisition
Why We Need Data Warehousing?
• Better business intelligence for end-users
• Reduction in time to locate, access, and analyze information
• Consolidation of disparate information sources
• To store large volumes of historical detail data from mission-critical applications
• Strategic advantage over competitors
• Faster time-to-market for products and services
• Replacement of older, less-responsive decision support systems
• Reduction in demand on IS to generate reports
OPERATIONAL DATABASE:
Online Transaction Processing
Designed for running the business. It is not suitable for analyzing the business from the perspective of business executives, because the data is volatile (it keeps changing).
It does not maintain historical data.
It contains only current data.
If you insert a new value, it overwrites the existing one.
E.g.: Acnthno 1072, Acnthsal 13,000 is overwritten by 20,000.
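The overwrite behaviour of an operational database can be contrasted with a history-keeping store in a minimal sketch. It assumes the example above is a key (1072) whose value changes from 13,000 to 20,000; the names `operational` and `warehouse` and the load dates are hypothetical and only stand in for real tables.

```python
from datetime import date

# Hypothetical operational (OLTP) table: one row per key, current value only.
operational = {1072: 13_000}

# Hypothetical warehouse table: every version is kept with an assumed load date.
warehouse = [(1072, 13_000, date(2023, 1, 1))]


def update_value(key, new_value, load_date):
    """Apply the same change to both stores to contrast their behaviour."""
    operational[key] = new_value                    # old value is overwritten
    warehouse.append((key, new_value, load_date))   # old value is retained


update_value(1072, 20_000, date(2023, 2, 1))

current = operational[1072]
history = [value for key, value, _ in warehouse if key == 1072]
print(current)   # 20000
print(history)   # [13000, 20000]
```

After the update the operational store can only answer "what is the value now?", while the history list can still answer "what was it before?".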
OLTP Systems Vs Data Warehouse
users are different
data content is different,
data structures are different
hardware is different
Understanding The Differences Is The Key
OLTP Vs Data Warehouse
Operational System | Data Warehouse
Transaction Processing | Query Processing
Predictable CPU Usage | Random CPU Usage
Time Sensitive | History Oriented
Operator View | Managerial View
Normalized Efficient Design for TP | Denormalized Design for Query Processing
OLTP Vs Warehouse
Operational System | Data Warehouse
Designed for Atomicity, Consistency, Isolation and Durability | Designed for quiet or static database
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile Data | Non Volatile Data
Operational System | Data Warehouse
Stores all data | Stores relevant data
Performance Sensitive | Less Sensitive to performance
Not Flexible | Flexible
Efficiency | Effectiveness
What is a Data Warehouse ?
A Data Warehouse is a
• Subject-Oriented
• Integrated
• Time-Variant
• Non-volatile
W. H. Inmon - Regarded As Father Of Data Warehousing
Subject Oriented Analysis
Transactional Storage (Process Oriented): Entry, Sales Rep, Quantity Sold, Part Number, Date, Customer Name, Product Description, Unit Price, Mail Address
Data Warehouse Storage (Subject Oriented): Sales, Customers, Products
Integration of Data
Issue | Transactional Storage | Data Warehouse Storage (after Integration)
Encoding | Appl. A - M, F; Appl. B - 1, 0; Appl. C - X, Y | M, F
Unit of Attributes | Appl. A - pipeline cm; Appl. B - pipeline inches; Appl. C - pipeline mcf | pipeline cm
Physical Attributes | Appl. A - balance dec(13,2); Appl. B - balance PIC 9(9)V99; Appl. C - balance float | balance dec(13, 2)
Naming Conventions | Appl. A - bal-on-hand; Appl. B - current_balance; Appl. C - balance | balance
Data Consistency | Appl. A - date (Julian); Appl. B - date (yymmdd); Appl. C - date (absolute) | date (Julian)
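The encoding and unit examples on this slide can be sketched as a small integration step. This is an illustration only: the slide lists the source codes (M/F, 1/0, X/Y) but not which code means what, so the mappings below are assumptions, as are the function and variable names. Only the cm and inches cases are handled.

```python
# Assumed code mappings: which source code corresponds to M or F is not
# stated on the slide, so these dictionaries are illustrative only.
GENDER_MAPS = {
    "A": {"M": "M", "F": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"X": "M", "Y": "F"},
}

CM_PER_INCH = 2.54


def integrate(app, gender, pipeline, unit):
    """Return (gender, pipeline_cm) in the warehouse's single convention."""
    if unit == "cm":
        pipeline_cm = pipeline
    elif unit == "inches":
        pipeline_cm = pipeline * CM_PER_INCH
    else:
        raise ValueError(f"no conversion defined for unit {unit!r}")
    return GENDER_MAPS[app][gender], round(pipeline_cm, 2)


rows = [
    integrate("A", "F", 100.0, "cm"),
    integrate("B", "1", 10.0, "inches"),
]
print(rows)  # [('F', 100.0), ('M', 25.4)]
```

Whatever the source application, the warehouse ends up with one encoding and one unit per attribute.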
Volatility of Data
Transactional Storage (Volatile): Record-by-Record Data Manipulation - Insert, Access, Change, Delete
Data Warehouse Storage (Non-Volatile): Mass Load / Access of Data - Load, Access
Time Variant Data Analysis
Transactional Storage: Current Data
Data Warehouse Storage: Historical Data
[Chart: Sales (in lakhs) by Region (East, West, North) for January, February, March; Year 97, 1st Qtr]
Decision Support Systems (DSS)
What is DSS?
Need for DSS
Comparison of OLTP & DSS
Transition from Data Processing to Information Processing
What is DSS?
Decision Support Systems (DSS) are interactive computer-based systems intended to help decision makers utilize data and models to identify and solve problems and make decisions. The Data Warehouse is the foundation of the DSS process. It is a strategy and a process for staging corporate data.
Enable users to get a “Business View” of the data
Facilitate data-based decision making that would drive and improve the business
Discover “Hidden Trends”
Why DSS?: How to answer these Business Queries?
What is the sales distribution region wise?
What is Defaulter’s Profile?
What are the slow movers in my product line?
How did my revenue improve in the past 5 years?
Which of my Sales Agents are doing better?
Who are my profitable customers?
Currency Risk, Interest Rate Risk, Liquidity Risk
Strategic Planning / Budgeting
Which channel costs me more and pays less?
OLTP v/s DSS Environment
OLTP Environment
• get data IN
• large volumes of simple transaction queries
• continuous data changes
• low processing time
• mode of processing
• transaction details
• data inconsistency
• mostly current data
DSS Environment
• get information OUT
• small number of diverse queries
• periodic updates only
• high processing time
• mode of discovery
• subject oriented - summaries
• data consistency
• historical data is relevant
OLTP v/s DSS Environment
OLTP Environment
• high concurrent usage
• highly normalized data structure
• static applications
• automates routines
DSS Environment
• low concurrent usage
• fewer tables, but more columns per table
• dynamic applications
• facilitates creativity
DW Implementation Approaches
• Top Down
• Bottom-up
• Combination of both
• Choices depend on:
– current infrastructure
– resources
– architecture
– ROI
– implementation speed
Top Down Implementation
Bottom Up Implementation
DW Implementation Approaches
Top Down
• More planning and design initially
• Involve people from different work-groups, departments
• Data marts may be built later from Global DW
• Overall data model to be decided up-front
Bottom Up
• Can plan initially without waiting for global infrastructure
• Built incrementally
• Can be built before or in parallel with Global DW
• Less complexity in design
DW Implementation Approaches
Top Down
• Consistent data definition and enforcement of business rules across enterprise
• High cost, lengthy process, time consuming
• Works well when there is centralized IS department responsible for all H/W and resources
Bottom Up
• Data redundancy and inconsistency between data marts may occur
• Integration requires great planning
• Less cost of H/W and other resources
• Faster pay-back
DW Architectures
Data warehousing Architecture
[Figure: data warehousing architecture. Components shown:]
Sources: Source 1, Source 2, Source 3, ... Source n
Extract - Push/Pull
Staging Layer: Cleansing, Transformation & Loading
ODS
Data Warehouse: Detail Data; Summaries / Aggregations (Transformation, Summarization, Aggregation)
Data Marts: Cubes - Conformed Dimensions
Reporting Layer: Canned Reports, Ad-hoc analysis
Metadata
Benefits of DWH
To formulate effective business, marketing and sales strategies.
To precisely target promotional activity.
To discover and penetrate new markets.
To successfully compete in the marketplace from a position of informed strength.
To build predictive rather than retrospective models.
Data Modeling
WHAT IS A DATA MODEL?
A data model is an abstraction of some aspect of the real world (system).
WHY A DATA MODEL?
• Helps to visualize the business
• A model is a means of communication.
• Models help elicit and document requirements.
• Models reduce the cost of change.
• Model is the essence of DW architecture based on which DW will be implemented
STEPS in DATA MODELING
Problem & scope definition
Requirement Gathering
Analysis
Logical Database Design
Deciding Database
Physical Database design
Schema Generation
Levels of modeling
• Conceptual modeling
– Describe data requirements from a business point of view without technical details
• Logical modeling
– Refine conceptual models
– Data structure oriented, platform independent
• Physical modeling
– Detailed specification of what is physically implemented using specific technology
Conceptual Model
• A conceptual model shows data through business eyes.
• All entities which have business meaning.
• Important relationships
• Few significant attributes in the entities.
• Few identifiers or candidate keys.
Logical Model
• Replaces many-to-many relationships with associative entities.
• Defines a full population of entity attributes.
• May use non-physical entities for domains and sub-types.
• Establishes entity identifiers.
• Has no specifics for any RDBMS or configuration.
Physical Model
• A Physical data model may include
– Referential Integrity
– Indexes
– Views
– Alternate keys and other constraints
– Tablespaces and physical storage objects
Modeling Techniques
• Entity-Relationship Modeling
– Traditional modeling technique
– Technique of choice for OLTP
– Suited for corporate data warehouse
• Dimensional Modeling
– Analyzing business measures in the specific business context
– Helps visualize very abstract business questions
– End users can easily understand and navigate the data structure
Entity-Relationship Modeling - Basic Concepts
• Relationship
– Relationship between entities - structural interaction and association
– Described by a verb
– Cardinality
• 1-1
• 1-M
• M-M
– Example : Books belong to Printed Media
• Attributes
– Characteristics and properties of entities
– Example : Book Id, Description, book category are attributes of entity “Book”
– Attribute name should be unique and self-explanatory
– Primary Key, Foreign Key, Constraints are defined on Attributes
Examples: ER Model
Limitations of E-R Modeling
• Poor Performance
• Tend to be very complex and difficult to navigate.
Dimensional Modeling
• Dimensional modeling uses three basic concepts : measures, facts, dimensions.
• Is powerful in representing the requirements of the business user in the context of database tables.
• Focuses on numeric data, such as values, counts, weights, balances and occurrences.
• Must identify
– Business process to be supported
– Grain (level of detail)
– Dimensions
– Facts
What is a Fact?
• A fact is a collection of related data items, consisting of measures and context data.
• Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process.
• Facts are measured, “continuously valued”, rapidly changing information. Can be calculated and/or derived.
Types of Facts
• Additive
– Able to add the facts along all the dimensions
– Discrete numerical measures, e.g. retail sales in $
• Semi Additive
– Snapshot, taken at a point in time
– Measures of intensity
– Not additive along time dimension, e.g. account balance, inventory balance
– Added and divided by the number of time periods to get a time-average
• Non Additive
– Numeric measures that cannot be added across any dimensions
– Intensity measure averaged across all dimensions eg. Room temperature
– Textual facts - AVOID THEM
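The semi-additive case can be illustrated with a short sketch. It assumes hypothetical month-end account balances (the account names and figures are made up): balances are summed across accounts within one period, but across time they are summed and divided by the number of periods to give a time-average.

```python
# Hypothetical month-end balances per account; the figures are made up.
balances = {
    "Jan": {"acct_1": 100.0, "acct_2": 300.0},
    "Feb": {"acct_1": 200.0, "acct_2": 300.0},
    "Mar": {"acct_1": 300.0, "acct_2": 600.0},
}

# Across the account dimension a balance can be summed within one period.
total_per_month = {month: sum(accts.values()) for month, accts in balances.items()}

# Across time it is summed and divided by the number of periods instead.
time_average = sum(total_per_month.values()) / len(total_per_month)

print(total_per_month)  # {'Jan': 400.0, 'Feb': 500.0, 'Mar': 900.0}
print(time_average)     # 600.0
```

Summing the three monthly totals (1800.0) would not be a meaningful balance, which is why the time dimension is treated differently.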
Dimensions
• A dimension is a collection of members or units of the same type of views.
• Dimensions determine the contextual background for the facts.
• Dimensions represent the way business people talk about the data resulting from a business process, e.g., who, what, when, where, why, how
Dimensional Hierarchy
Geography Dimension (each node is a Dimension Member / Business Entity; levels are linked by a Parent Relation)
World Level: World
Continent Level: America, Asia, Europe
Country Level: USA, Canada, Argentina
State Level: FL, GA, VA, CA, WA
City Level: Tampa, Miami, Orlando, Naples
Attributes: Population, Tourist’s Place
Dimension Types
• Conformed Dimension
• Junk Dimension
• Dirty Dimension
• Monster Dimension
• Slowly Changing Dimension
• Degenerate Dimension
Data marts (DM)
A data mart is a
• Powerful and natural extension of the data warehouse
• Extends information to the departmental environment from an enterprise environment
• Interprets and structures data to suit departments’ specific needs
Several names for DMs:
• departmental DSS DBs
• OLAP databases
• multi-dimensional DBs (MDDB)
• lightly summarized tables
DM - Types
• Embedded data marts are marts that are stored within the central DW. They can be stored relationally as files or cubes.
• Dependent data marts are marts that are fed directly by the DW, sometimes supplemented with other feeds, such as external data.
• Independent data marts are marts that are fed directly by external sources and do not use the DW.
Operational Data Store (ODS)
An ODS
• pulls together, validates, cleanses and integrates data
• is a foundation for providing an integrated view of enterprise data
• supports tactical decision support, day-to-day operations and management reporting
Characteristics
Integrated
Subject-oriented
Volatile (including update)
Current valued
ODS - Types
Class I – Immediate Load
Class II – Delayed Load
Class III – Overnight Load
Class IV – Data warehouse Load
OLTP Vs ODS Vs DWH
Characteristic | OLTP | ODS | Data Warehouse
Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
Data stability | Dynamic | Somewhat dynamic | Static
Data update | Field by field | Field by field | Controlled batch
Data usage | Highly structured, repetitive | Somewhat structured, some analytical | Highly unstructured, heuristic or analytical
Database size | Moderate | Moderate | Large to very large
Database structure stability | Stable | Somewhat stable | Dynamic
Star Schema Design
– Single fact table surrounded by denormalized dimension tables
– The fact table primary key is the composite of the foreign keys (primary keys of dimension tables)
– Fact table contains transaction type information
– Many star schemas in a data mart
– Easily understood by end users, more disk storage required
Example of Star Schema
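A star schema can be sketched with Python's built-in `sqlite3` module. This is an illustration under stated assumptions, not the example from the original slide: the table names (`dim_product`, `dim_region`, `fact_sales`), columns and rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, region TEXT);
-- Fact table: its primary key is the composite of the dimension foreign keys.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    amount     REAL,
    PRIMARY KEY (product_id, region_id)
);
INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Desk', 'Furniture');
INSERT INTO dim_region  VALUES (1, 'East'), (2, 'West');
INSERT INTO fact_sales  VALUES (1, 1, 10.0), (1, 2, 20.0), (2, 1, 70.0);
""")

# One join per dimension is enough to answer "sales by region".
sales_by_region = conn.execute("""
    SELECT r.region, SUM(f.amount)
    FROM fact_sales f JOIN dim_region r ON r.region_id = f.region_id
    GROUP BY r.region ORDER BY r.region
""").fetchall()
print(sales_by_region)  # [('East', 80.0), ('West', 20.0)]
```

The product category is stored directly in `dim_product` (denormalized), so no further table is needed to group by category.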
Snowflake Schema
– Single fact table surrounded by normalized dimension tables
– Normalizes dimension table to save data storage space
– When dimensions become very very large
– Less intuitive, slower performance due to joins
• May want to use both approaches, especially if supporting multiple end-user tools.
Example of Snowflake schema
Snowflake - Disadvantages
• Normalization of dimension makes it difficult for user to understand
• Decreases the query performance because it involves more joins
• Dimension tables are normally smaller than fact tables - space may not be a major issue to warrant snowflaking
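The extra join cost of snowflaking can be seen in a minimal `sqlite3` sketch. The tables and rows are hypothetical: the product dimension is normalized so that the category lives in its own table, and a query by category then needs two joins instead of one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked product dimension: category moved to its own table.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT,
                           category_id INTEGER REFERENCES dim_category(category_id));
CREATE TABLE fact_sales   (product_id INTEGER REFERENCES dim_product(product_id),
                           amount REAL);
INSERT INTO dim_category VALUES (1, 'Stationery'), (2, 'Furniture');
INSERT INTO dim_product  VALUES (1, 'Pen', 1), (2, 'Pencil', 1), (3, 'Desk', 2);
INSERT INTO fact_sales   VALUES (1, 10.0), (2, 5.0), (3, 70.0);
""")

# Reaching the category now takes two joins: fact -> product -> category.
sales_by_category = conn.execute("""
    SELECT c.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category ORDER BY c.category
""").fetchall()
print(sales_by_category)  # [('Furniture', 70.0), ('Stationery', 15.0)]
```

The category text is stored once per category rather than once per product, which is the storage saving being traded against the additional join.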
On-Line Analytical Processing (OLAP)
OLAP Cubes
OLAP is a category of applications/technology for collecting, managing, processing and presenting multidimensional data for analysis and management purposes.
OLAP Features
• Subject oriented approach to Decision Support
• Calculations applied across dimensions, through hierarchies and/or across members
• Trend analysis over sequential time periods, What If scenarios
• Slicing/Dicing subsets for on-screen viewing
• Drill-down/up along the hierarchy
• Reach-through to underlying detail data
• Rotation to new dimensional comparisons in the viewing area
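Slicing and drilling up or down along a hierarchy can be sketched in a few lines of plain Python. The detail rows, the `LEVELS` lookup and the function names are hypothetical and chosen only for illustration; a real OLAP tool would do this against a cube or a database.

```python
from collections import defaultdict

# Hypothetical detail rows: (country, state, city, month, sales).
rows = [
    ("USA", "FL", "Tampa",   "Jan", 5),
    ("USA", "FL", "Miami",   "Jan", 7),
    ("USA", "FL", "Miami",   "Feb", 3),
    ("USA", "GA", "Atlanta", "Jan", 4),
]
LEVELS = {"country": 0, "state": 1, "city": 2}


def roll_up(data, level):
    """Aggregate sales at one level of the geography hierarchy."""
    totals = defaultdict(int)
    for row in data:
        totals[row[LEVELS[level]]] += row[4]
    return dict(totals)


def slice_by_month(data, month):
    """Slice: keep only the rows for one member of the time dimension."""
    return [row for row in data if row[3] == month]


by_state = roll_up(rows, "state")                           # drill-up from city
by_city_jan = roll_up(slice_by_month(rows, "Jan"), "city")  # slice, then drill-down
print(by_state)     # {'FL': 15, 'GA': 4}
print(by_city_jan)  # {'Tampa': 5, 'Miami': 7, 'Atlanta': 4}
```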
OLAP Categories
Multi-dimensional OLAP (MOLAP)
Relational OLAP (ROLAP)
Hybrid OLAP (HOLAP)
MOLAP
• Use pre-calculated data set – CUBE
• Cube contains all possible answers to given range of questions
Features:
• Very fast response
• Ability to quickly write data into the cube
Downsides:
• Limited Scalability
• Inability to contain detailed data
• Load time
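The idea of a pre-calculated cube can be sketched as follows. This is a simplified illustration, not how any particular MOLAP product stores data: the dimensions, the fact rows and the `build_cube` helper are hypothetical. Totals are computed once for every combination of dimensions, so later questions become lookups.

```python
from itertools import combinations

DIMENSIONS = ("region", "product")
# Hypothetical detail facts; the values are made up.
facts = [
    {"region": "East", "product": "Pen",  "sales": 10},
    {"region": "East", "product": "Desk", "sales": 70},
    {"region": "West", "product": "Pen",  "sales": 20},
]


def build_cube(rows):
    """Pre-calculate totals for every combination of dimensions."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            for row in rows:
                key = (dims, tuple(row[d] for d in dims))
                cube[key] = cube.get(key, 0) + row["sales"]
    return cube


cube = build_cube(facts)
# Queries become dictionary lookups instead of scans over the detail rows.
grand_total = cube[((), ())]
east_total = cube[(("region",), ("East",))]
west_pen = cube[(("region", "product"), ("West", "Pen"))]
print(grand_total, east_total, west_pen)  # 100 80 20
```

The sketch also hints at the downsides listed above: the number of stored cells grows with every added dimension, and the cube has to be rebuilt or updated at load time.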
ROLAP
• Do not use pre-calculated CUBE
• Intercept query & pose it to the Relational DB
Features:
• Ask any question (not limited to the contents of the cube)
• Ability to drill down
Downsides:
• Slow Response
• Some limitations on scalability
HOLAP
• Combines MOLAP & ROLAP
• Utilizes both pre-calculated cubes & relational data sources
Features:
• For summary type info – cube, (Faster response)
• Ability to drill down – relational data sources (drill through detail to underlying data)
• Source of data transparent to end-user
Data Acquisition
• Data Extraction
• Data Transformation
• Data Loading
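The three steps above can be sketched as a tiny pipeline using only the Python standard library. Everything here is an assumption made for illustration: the CSV content, the `sales` table, and the cleaning rules (trim and title-case the region, drop rows with no amount).

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would come from an operational system.
SOURCE_CSV = "order_id,region,amount\n1, east ,10.50\n2,WEST,20\n3,east,\n"


def extract(text):
    """Extraction: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: drop rows with no amount and standardise the rest."""
    return [
        (int(r["order_id"]), r["region"].strip().title(), float(r["amount"]))
        for r in rows
        if r["amount"].strip()
    ]


def load(conn, rows):
    """Loading: write the cleaned rows to the target table in one batch."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id INTEGER, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT * FROM sales ORDER BY order_id").fetchall()
print(loaded)  # [(1, 'East', 10.5), (2, 'West', 20.0)]
```

Keeping the three steps as separate functions mirrors the separation on the slide and makes each step testable on its own.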