7/27/2019 Best Dwh Basics
Dimension: A category of information. For example, the time dimension.
Attribute: A unique level within a dimension. For example, Month is an attribute
in the Time dimension.
or
An attribute represents a single type of information in a dimension. For example, Year
is an attribute in the Time dimension.
Hierarchy: The specification of levels that represents the relationship between
different attributes within a dimension. For example, one possible hierarchy in the
Time dimension is Year → Quarter → Month → Day.
A dimensional data model contains two types of tables:
1) Fact table: The fact table in a dimensional data model contains the measures of
interest, that is, the measurements, metrics or facts of a business process. Take the
example of the sales amount of a business. The amount can be a monthly sales
number or a sales number for a day. This measure is stored in the fact table at the
appropriate granularity.
For sales measures, a simple fact table might contain three columns: a date column, a
store column and a sales amount column. Besides the measurements, the table
also contains foreign keys to the dimension tables.
or
It contains numeric values and a composite key (i.e. a collection of foreign keys).
E.g. sales and profit.
2) Dimension table: The dimension table in a dimensional model represents
the context of the measurements. The context can be understood as the
characteristics of a measurement (subject): who, what, where, when and how.
For example, in the business process Sales, the characteristics of the 'monthly
sales number' measurement would be Location (where), Time (when) and
Product Sold (what). A dimension table contains a number of dimension
attributes or columns. In the Location dimension the attributes can
be Location Code, State, Country and Zip Code. Further, dimension attributes
can form one or more hierarchical relationships.
or
It contains character values.
E.g. Customer_name, Customer_city.
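The fact and dimension tables described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names, column names and sample rows (dim_location, fact_sales and so on) are assumptions made for the example and do not come from any particular warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: descriptive context (the where of a sale)
    CREATE TABLE dim_location (
        location_key  INTEGER PRIMARY KEY,
        location_code TEXT,
        state         TEXT,
        country       TEXT
    );
    -- Fact table: foreign keys to dimensions plus a numeric measure
    CREATE TABLE fact_sales (
        date_key     TEXT,
        location_key INTEGER REFERENCES dim_location(location_key),
        sales_amount REAL
    );
    INSERT INTO dim_location VALUES (1, 'CHI01', 'Illinois', 'USA'),
                                    (2, 'LA01', 'California', 'USA');
    INSERT INTO fact_sales VALUES ('2019-07-01', 1, 100.0),
                                  ('2019-07-02', 1, 50.0),
                                  ('2019-07-01', 2, 75.0);
""")

# Join the fact table to its dimension and total the measure by state
sales_by_state = dict(conn.execute("""
    SELECT d.state, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_location d ON d.location_key = f.location_key
    GROUP BY d.state
""").fetchall())
print(sales_by_state)
```

The measure lives only in the fact table; the dimension supplies the attribute (state) that the report groups by.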
What is dimensional modeling: A data model that maintains all the dimensions
in their own tables and the facts in a separate table (with the necessary relationships
to all dimensions) is called a dimensional model. It is a de-normalized model
because it is used for report generation. Data is fed in only through a
scheduled and structured process (ETL), which in turn fetches data from one or more
relational / transactional data sources.
Ex: Here's a different way to look at dimensional modeling:
There are three basic styles of data models:
1)Conceptual data model: The conceptual data model is sometimes called
the domain model and it is typically used for exploring domain concepts in
an enterprise with stakeholders of the project.
2)Logical data model: The logical model is used for exploring the domain
concepts as well as their relationships. This model depicts the logical entity
types, typically referred to simply as entity types, the data attributes
describing those entities, and the relationships between the entities.
3)Physical data model: The physical data model is used in the design of
the database's internal schema and as such, it depicts the tables, the data columns of
those tables, and the relationships between the tables. This model represents
the data design taking into account the facilities and constraints of a given
database management system. The physical data model is often derived from
the logical data model, although it can also be reverse engineered from an
existing database implementation.
Data/dimension modeling tools:
1.) Oracle Designer
2.) ERWin (Entity Relationship for windows)
3.) Informatica (Cubes/Dimensions)
4.) Embarcadero
5.) PowerDesigner (Sybase)
Factless Fact Table: A fact table that contains only keys (i.e. foreign keys) and no
measures (numerics) is known as a factless fact table.
or
A fact table having no facts is known as a factless fact table.
NOTE: Generally we use a factless fact table when we want to record events at the
information level without including them in calculations, that is, just
the information that an event happened over a period.
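A factless fact table can be sketched as follows. The attendance scenario, the table name fact_attendance and its columns are assumptions chosen for illustration; the point is only that each row holds keys and no measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Only foreign keys, no numeric measure: a row records that an event happened
    CREATE TABLE fact_attendance (
        date_key    TEXT,
        student_key INTEGER,
        class_key   INTEGER
    );
    INSERT INTO fact_attendance VALUES
        ('2019-07-01', 1, 10),
        ('2019-07-01', 2, 10),
        ('2019-07-02', 1, 10);
""")

# With no measure to sum, analysis is done by counting rows (events)
attendance_per_day = dict(conn.execute(
    "SELECT date_key, COUNT(*) FROM fact_attendance GROUP BY date_key"
).fetchall())
print(attendance_per_day)
```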
APPROACHES:
At software integration time the bottom-up approach works well, but at implementation
time the top-down approach works well.
1) Top down: First we build the data warehouse, then we build the data marts.
This needs more cross-functional skills, takes more time and is also costly.
ODS-->ETL-->Data warehouse-->Data mart-->OLAP
2) Bottom up: First we build the data marts, then the data warehouse. The data mart
that is built first serves as a proof of concept for the others. It takes less time
and costs less than the top-down approach.
ODS-->ETL-->Data mart-->Data warehouse-->OLAP
How do we maintain the primary key in a fact table?
In data warehousing we use surrogate keys, so that the value of the natural (business)
key can change without breaking the relationships between tables.
Suppose you have two tables, emp and dept, where deptno is the primary key of the dept table
and is also used in the emp table as a foreign key. In this case we cannot modify the primary key,
because it is referenced as a foreign key in the emp table. That is why we need an extra column
that has no business meaning. Here we add an extra ID column as a
surrogate key in both tables; it has no meaning of its own, but it can be used to perform the
joins between the two tables.
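The idea can be sketched with two small tables. The dept_id column plays the role of the surrogate key; the column names and sample values are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (
        dept_id INTEGER PRIMARY KEY,  -- surrogate key, no business meaning
        deptno  TEXT                  -- natural (business) key, may change
    );
    CREATE TABLE emp (
        empno   TEXT,
        dept_id INTEGER REFERENCES dept(dept_id)  -- join uses the surrogate key
    );
    INSERT INTO dept VALUES (1, 'D10');
    INSERT INTO emp VALUES ('E7369', 1), ('E7499', 1);
""")

# The business renumbers the department; only the dept row changes
conn.execute("UPDATE dept SET deptno = 'D99' WHERE dept_id = 1")

rows = conn.execute("""
    SELECT e.empno, d.deptno
    FROM emp e JOIN dept d ON d.dept_id = e.dept_id
    ORDER BY e.empno
""").fetchall()
print(rows)
```

Because emp refers to dept through dept_id, the change to deptno does not touch any emp row and the join still works.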
What is the difference between an aggregate table and a fact table?
A fact table contains millions of records, and retrieving records from the fact
table takes time. An aggregate table, in contrast, contains summarized data from the
required tables, so retrieving data from it takes less time.
What is the difference between aggregate table and materialized view?
Aggregate tables hold pre-computed totals in the form of a hierarchical
multidimensional structure, whereas a materialized view is a database object
that caches a query result in a concrete table and refreshes it from the
original database tables from time to time. Aggregate tables are used to speed
up query computation, whereas materialized views speed up data retrieval.
What is an aggregate table and an aggregate fact table?
An aggregate table contains summarized data. A materialized view can serve as an aggregate
table.
For example, in sales we have only date-level transactions. If we want to create a report
such as sales by product per year, we aggregate the date values into
week_agg, month_agg, quarter_agg and year_agg tables. To retrieve data from these tables we
use the @aggregate function.
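A rough sketch of building a monthly aggregate table from a date-level fact table is shown here. The names fact_sales and agg_sales_month and the sample rows are assumptions for illustration; a real warehouse would normally refresh such a table inside the ETL process.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (sale_date TEXT, product TEXT, amount REAL);
    INSERT INTO fact_sales VALUES
        ('2019-01-15', 'pen', 10.0),
        ('2019-01-20', 'pen', 5.0),
        ('2019-02-03', 'pen', 7.0),
        ('2019-02-03', 'ink', 2.0);

    -- Aggregate table: one pre-computed row per product per month
    CREATE TABLE agg_sales_month AS
        SELECT substr(sale_date, 1, 7) AS month, product, SUM(amount) AS amount
        FROM fact_sales
        GROUP BY substr(sale_date, 1, 7), product;
""")

# Reports read the small aggregate table instead of scanning the fact table
monthly = conn.execute(
    "SELECT month, product, amount FROM agg_sales_month ORDER BY month, product"
).fetchall()
print(monthly)
```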
Schemas: Depending on the requirement we can choose the schema.
In designing data models for data warehouses / data marts, the most commonly
used schema types are Star Schema and Snowflake Schema.
Whether one uses a star or a snowflake largely depends on personal preference and
business needs.
Some points on star schema:
1) A star schema can be simple or complex. A simple star consists of one fact
table; a complex star can have more than one fact table.
2) In a star schema, the fact table is in normalized format and the dimension tables are in
de-normalized format.
3) If performance is the priority then go for star schema, since here the dimension
tables are de-normalized.
4) We use a star schema when queries should involve few joins, for better
performance. Here the data is de-normalized.
Some points on snowflake schema:
1) In a snowflake schema, both the dimension and fact tables are in normalized format.
It is also known as an extended star schema.
2) A snowflake schema requires more dimension tables and more foreign keys, which reduces
query performance, but it normalizes the records.
3) If storage space is the priority then go for snowflake schema, since here the
dimension tables are normalized.
4) For complex joins we go for snowflake. Performance is a little
slower due to the number of joins. Here the data is normalized.
http://www.1keydata.com/datawarehousing/star-schema.html
http://www.1keydata.com/datawarehousing/snowflake-schema.html
Difference between Snowflake and Star Schema:
Star schema:
1) A star schema is a centralized fact table surrounded by different
dimensions.
2) A star schema contains highly de-normalized data.
3) A star schema cannot have parent tables.
4) Why go for a star schema:
a) it needs fewer joins
b) it is a simpler database
c) it supports drilling up
Snowflake schema:
1) A snowflake schema is a star schema whose dimensions are split into further
dimensions.
2) A snowflake schema contains partially normalized data.
3) A snowflake schema contains parent tables.
4) Why go for a snowflake schema:
Sometimes we need to provide separate dimensions from existing
dimensions; in that case we go for snowflake.
Disadvantage of snowflake:
Query performance is lower because more joins are needed.
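The difference in the number of joins can be illustrated with a small sketch that answers the same question against a star-style and a snowflake-style product dimension. All table names, column names and rows below are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star: one de-normalized product dimension
    CREATE TABLE star_dim_product (product_key INTEGER PRIMARY KEY,
                                   product_name TEXT, category_name TEXT);
    -- Snowflake: category split out into its own (parent) table
    CREATE TABLE snow_dim_category (category_key INTEGER PRIMARY KEY,
                                    category_name TEXT);
    CREATE TABLE snow_dim_product (product_key INTEGER PRIMARY KEY,
                                   product_name TEXT, category_key INTEGER);
    CREATE TABLE fact_sales (product_key INTEGER, amount REAL);

    INSERT INTO star_dim_product VALUES (1, 'pen', 'stationery'),
                                        (2, 'ink', 'stationery');
    INSERT INTO snow_dim_category VALUES (100, 'stationery');
    INSERT INTO snow_dim_product VALUES (1, 'pen', 100), (2, 'ink', 100);
    INSERT INTO fact_sales VALUES (1, 10.0), (2, 2.5);
""")

# Star: one join reaches the category name
star_total = conn.execute("""
    SELECT p.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN star_dim_product p ON p.product_key = f.product_key
    GROUP BY p.category_name
""").fetchall()

# Snowflake: the same question needs a second join to the parent table
snow_total = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN snow_dim_product p ON p.product_key = f.product_key
    JOIN snow_dim_category c ON c.category_key = p.category_key
    GROUP BY c.category_name
""").fetchall()
print(star_total, snow_total)
```

Both queries return the same total. The star version repeats the category name on every product row, while the snowflake version stores it once and pays for that with the extra join.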
Star Schema Definition: The star schema is the simplest data warehouse schema.
It is called a star schema because the diagram resembles a star with points radiating
from a center.
Advantages:
Simplest DW schema
Easy to understand
Easy to navigate between the tables because there are fewer joins.
Most suitable for Query processing
Disadvantages:
Occupies more space
Highly De-normalized
Snowflake schema Definition: A snowflake schema is a data warehouse schema
which consists of a single fact table and multiple dimension tables. These
dimension tables are normalized. It is a variant of the star schema where each
dimension can have its own dimensions.
Advantages:
These tables are easier to maintain
Saves the storage space.
Disadvantages:
Due to large number of joins it is complex to navigate
Types of schemas:
1) Star Schema: A schema in which a central fact table connects to a number of
individual dimension tables is called a star schema.
It contains fewer joins, so performance is better.
A star schema contains de-normalized data.
2) Snowflake Schema: A schema in which one dimension table is split into more than one
dimension table is known as a snowflake schema.
It contains normalized data.
There are more joins in a snowflake schema, so performance degrades.
3) Galaxy Schema: A galaxy schema is also known as a
fact constellation schema. It contains multiple fact tables and dimension tables,
with dimension tables shared between the fact tables.
4) Starflake schema: A hybrid structure that contains a mixture of (de-normalized)
star and (normalized) snowflake schemas.
NOTE: In real-world projects, when we want to use an existing data warehouse
as a source, we mainly go for the snowflake schema.
Types of Facts:
1) Additive: Additive facts are facts that can be summed up through all of the
dimensions in the fact table.
2) Semi-Additive: Semi-additive facts are facts that can be summed up for some of
the dimensions in the fact table, but not the others.
E.g.: Bank balances. An account balance is semi-additive: the current
balance of an account cannot be summed across time periods, but to see the current
balance of the whole bank you can sum the current balances of all accounts.
3) Non-Additive: Non-additive facts are facts that cannot be summed up for any of
the dimensions present in the fact table.
E.g.: Ratios, averages and variances.
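The bank balance case can be made concrete with a few hypothetical daily snapshots. The account names, dates and amounts below are invented for illustration.

```python
# Hypothetical daily balance snapshots: (date, account, balance)
balances = [
    ("2019-07-01", "A", 100),
    ("2019-07-01", "B", 200),
    ("2019-07-02", "A", 150),
    ("2019-07-02", "B", 200),
]

# Valid: sum across the account dimension for a single date
bank_total_on_day2 = sum(b for d, _, b in balances if d == "2019-07-02")

# Not meaningful: sum one account's balance across the time dimension
misleading_sum_for_a = sum(b for _, a, b in balances if a == "A")

# Across time, take the latest balance instead (an average is another common choice)
latest_for_a = max((d, b) for d, a, b in balances if a == "A")[1]

print(bank_total_on_day2, misleading_sum_for_a, latest_for_a)
```

Account A never held 250, which is why the balance is additive across accounts but not across dates.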
Types of Fact Tables:
1)Cumulative: This type of fact table describes what has happened over a period
of time.
For example, this fact table may describe the total sales by product by store by
day. The facts for this type of fact tables are mostly additive facts. The first
example presented here is a cumulative fact table.
2)Snapshot: This type of fact table describes the state of things in a particular
instance of time, and usually includes more semi-additive and non-additive facts.
The second example presented here is a snapshot fact table.
Types of dimension tables:
There are many types of dimension tables. The commonly used ones are:
1) Conformed dimension
2) Junk dimension
3) Degenerate dimension
4) Slowly changing dimension
5) Rapidly changing dimension
The others are:
6) Virtual dimension
7) Regular dimension
8) Causal dimension
9) Shared dimension
10) Monster dimension
11) Inferred dimension
12) Role-playing dimension
13) Shrunken dimension
14) Outriggers
15) Static dimension
Slowly Changing Dimension: A dimension whose attributes change
rarely and only occasionally over time.
Ex: Customer name, sex
or
A slowly changing dimension (SCD) is a dimension that changes slowly with
respect to time.
Ex: The employee with employee id e23321 is presently in Hyderabad; after a
month he is relocated to Bangalore. We can then say the address dimension is an SCD
with respect to time.
Rapidly Changing Dimension: A dimension whose attributes change
frequently.
or
A rapidly changing dimension is one whose attribute values change quickly.
Ex: ATM transactions (banks). The data changes continuously and
concurrently every second, so it is very difficult to capture this dimension.
Conformed Dimension: A dimension table used by two or more fact tables.
Ex: Date dimension
or
A conformed dimension is a dimension which is connected to or shared by more than
one fact table.
E.g.: In a business that handles both sales and orders of products, the product
dimension becomes a conformed dimension for both the sales fact and the order fact.
Degenerate Dimension: A dimension whose value is stored in the fact table instead
of in a dimension table.
or
Data items that are not facts and that do not fit into the existing
dimensions are termed degenerate dimensions. Degenerate dimensions are
used when fact tables represent transactional data. They can be used as the primary
key of the fact table but they cannot act as foreign keys.
For example, in a sales fact table the invoice number is a degenerate dimension. Since
the invoice number is not tied to an order header table, there is no need for
it to join to a dimension table; hence it is referred to as a degenerate
dimension.
Junk Dimension: A table that combines different and unrelated
attributes in order to reduce the number of pk and fk relationships.
Ex: student attendance tracking
or
Data that is not directly useful for report generation is
placed in a separate table; that table is known as a junk dimension. Generally it is
used to provide extra information.
Ex: Any yes/no status flag is an example of a junk dimension attribute.
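One common way to build a junk dimension is to enumerate every combination of a few low-cardinality flags and give each combination a surrogate key. The sketch below assumes three hypothetical flags (payment type, gift wrap, promotion); none of these names come from the notes above.

```python
from itertools import product

# Hypothetical low-cardinality flags that would otherwise sit in the fact table
payment_types = ["cash", "card"]
gift_wrap = ["Y", "N"]
is_promo = ["Y", "N"]

# One junk dimension row per combination, each with a surrogate key
junk_dim = {
    combo: key
    for key, combo in enumerate(product(payment_types, gift_wrap, is_promo), start=1)
}

# A fact row stores one junk key instead of three separate flag columns
fact_row = {"amount": 20.0, "junk_key": junk_dim[("card", "N", "Y")]}
print(len(junk_dim), fact_row)
```

The fact table then carries a single foreign key for all three flags, which is the reduction in pk/fk relationships described above.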
Differences between OLTP and OLAP are:
OLTP: Online Transactional Processing, which deals with transactions.
E.g. withdrawals at ATMs. It involves many transactions. The
databases have to be updated frequently, after the successful completion of each
transaction.
1) Customer-oriented, used for transaction and query processing by clerks, clients and IT
professionals.
2) Manages current data, very detail-oriented.
3) Adopts an entity relationship (ER) model and an application-oriented database
design.
4) Focuses on the current data within an enterprise or department.
5) Uses E-R modeling; there are more concurrent users.
6) It contains normalized tables, so there is minimal redundancy.
7) More tables and joins, fewer indexes.
8) It stores daily transactional data.
9) It stores comparatively little data.
10) It contains mainly current data.
11) INSERT, UPDATE and DELETE can be applied on OLTP.
12) Performance will be high.
13) Users of OLTP: clerks, DBAs.
14) OLTP is a transactional process.
15) Number of OLTP users: about 1000.
OLAP: Online Analytical Processing, which deals with analysis of data. It has to
deal with historical data too (for analysis purposes). It is not updated frequently; if
required, bulk updates are allowed.
1) Market-oriented, used for data analysis by knowledge workers (managers,
executives, analysts).
2) Manages large amounts of historical data, provides facilities for summarization
and aggregation, and stores information at different levels of granularity to support
the decision making process.
3) Adopts a star, snowflake or fact constellation model and a subject-oriented
database design.
4) Spans multiple versions of a database schema due to the evolutionary process of
an organization; integrates information from many organizational locations and
data stores.
5) Uses dimensional modeling.
6) It contains de-normalized tables, so there will be redundancy.
7) Fewer tables and joins, more indexes.
8) It stores data consolidated from operational systems.
9) It contains historical and present data.
10) Mostly only SELECT statements are applied on OLAP.
11) It stores very large amounts of data.
12) Performance will be low compared with OLTP.
13) OLAP is an analytical process.
14) Users of OLAP: knowledge workers
1) Managers
2) Analysts
15) Number of OLAP users: about 100.
Types of OLAP:
OLAP (Online Analytical Processing) is a set of specifications
which allows client applications to retrieve data from the data
warehouse for analytical processing. There are four types of OLAP:
1.) DOLAP (Desktop OLAP): OLAP which communicates with
desktop databases to retrieve the data is called DOLAP.
Ex: Cognos, Business Objects tools.
2.) ROLAP (Relational OLAP): OLAP which communicates with
relational databases to retrieve the data is called ROLAP.
Ex: Cognos ReportNet, Business Objects, MicroStrategy,
Hyperion
3.) MOLAP (Multidimensional OLAP): OLAP which
communicates with multidimensional databases to retrieve the
data is called MOLAP.
Ex: Cognos, Hyperion
4.) HOLAP (Hybrid OLAP): OLAP which uses the combined features
of ROLAP and MOLAP is called HOLAP.
Ex: Cognos
OLAP Query:
Roll-up: displays data at an increasing level of aggregation
Drill-down: displays details by moving down the dimension hierarchy
Pivot: produces a cross tabulation
Slice and dice: makes a selection (a single value or a range) on one or more dimensions
Snapshot: A snapshot is a copy of data. When we create a snapshot it
copies the exact data related to a particular report. We use a snapshot
whenever we want to compare reports (e.g. we want to compare this month's
report with the previous month's).
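The roll-up, drill-down, slice and pivot operations can be sketched in plain Python over a handful of fact rows. The rows and the helper name aggregate are assumptions for this example; an OLAP tool performs the same operations against the warehouse.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, region, amount)
sales = [
    (2018, "Q1", "East", 10),
    (2018, "Q2", "East", 20),
    (2018, "Q1", "West", 5),
    (2019, "Q1", "East", 30),
]

def aggregate(rows, key):
    """Sum the amount for each group produced by key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Drill-down level: year and quarter. Roll-up level: year only.
by_year_quarter = aggregate(sales, lambda r: (r[0], r[1]))
by_year = aggregate(sales, lambda r: r[0])

# Slice: fix one dimension to a single value (region = East)
east_only = [r for r in sales if r[2] == "East"]

# Pivot: cross-tabulate region (rows) against year (columns)
pivot = defaultdict(dict)
for (region, year), total in aggregate(sales, lambda r: (r[2], r[0])).items():
    pivot[region][year] = total

print(by_year, dict(pivot))
```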
Differences between a Data Warehouse and a Data Mart:
Category              Data Warehouse    Data Mart
Scope                 Corporate         Line of Business (LOB)
Subject               Multiple          Single subject
Data Sources          Many              Few
Size (typical)        100 GB-TB+        < 100 GB
Implementation Time   Months to years   Months
Slowly changing dimension: If the data in a dimension table changes
very rarely, it is called a slowly changing dimension.
Ex: a change to the name or address of a person, which happens rarely.
The price of a product, the address of a person and the name of a city are a few examples
of SCD attributes.
This change can be implemented in three ways:
Type I: Replace the old record with a new record containing the updated data, thereby
losing the history.
Type II: Create an additional dimension table record with the new value. In
this way we keep the history. We can determine which record is current
by adding a current-record flag or a time stamp to the dimension row.
Type III: Create a new field in the dimension
table which stores the old value of the attribute. When an attribute of the
dimension changes, we push the updated value to the current field and the old
value to the old field.
In Type 1 Slowly Changing Dimension, the new information simply overwrites the
original information. In other words, no history is kept.
In our example, recall we originally have the following table:
Customer Key   Name        State
1001           Christina   Illinois
After Christina moved from Illinois to California, the new information replaces the
original record, and we have the following table:
Customer Key   Name        State
1001           Christina   California
Advantages:
This is the easiest way to handle the Slowly Changing Dimension problem, since
there is no need to keep track of the old information.
Disadvantages:
All history is lost. By applying this methodology, it is not possible to trace back in
history. For example, in this case, the company would not be able to know that
Christina lived in Illinois before.
Usage:
About 50% of the time.
When to use Type 1: Type 1 slowly changing dimension should be used when it is
not necessary for the data warehouse to keep track of historical changes.
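A Type 1 change can be sketched as a plain in-place overwrite. The dictionary layout and the function name apply_type1 are assumptions for illustration; the customer key and values follow the Christina example.

```python
# Dimension rows keyed by customer key
customer_dim = {1001: {"name": "Christina", "state": "Illinois"}}

def apply_type1(dim, key, **changes):
    """Type 1: overwrite attributes in place; the old values are lost."""
    dim[key].update(changes)

apply_type1(customer_dim, 1001, state="California")
print(customer_dim)
```

After the call nothing in the dimension records that the state was ever Illinois.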
In Type 2 Slowly Changing Dimension, a new record is added to the table to
represent the new information. Therefore, both the original and the new record will
be present. The new record gets its own primary key.
In our example, recall we originally have the following table:
Customer Key   Name        State
1001           Christina   Illinois
After Christina moved from Illinois to California, we add the new information as a
new row into the table:
Customer Key   Name        State
1001           Christina   Illinois
1005           Christina   California
Advantages:
This allows us to accurately keep all historical information.
Disadvantages:
This will cause the size of the table to grow fast. In cases where the number of
rows for the table is very high to start with, storage and performance can become a
concern.
This necessarily complicates the ETL process.
Usage:
About 50% of the time.
When to use Type 2: Type 2 slowly changing dimension should be used when it is
necessary for the data warehouse to track historical changes.
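A Type 2 change can be sketched as expiring the current row and appending a new one. The customer_id natural key, the is_current flag, the effective_date column and the starting date are assumptions added for this sketch; the keys 1001 and 1005 follow the Christina example.

```python
from datetime import date

# Each row has its own surrogate key; customer_id is the natural key.
# The is_current flag and effective_date column are assumptions for this sketch.
customer_dim = [
    {"customer_key": 1001, "customer_id": "C-1", "name": "Christina",
     "state": "Illinois", "effective_date": date(2000, 1, 1), "is_current": True},
]

def apply_type2(dim, customer_id, new_key, effective, **changes):
    """Type 2: expire the current row and append a new row with a new key."""
    current = next(r for r in dim
                   if r["customer_id"] == customer_id and r["is_current"])
    current["is_current"] = False
    # Copy the old row, apply the changes, then set the new key, date and flag
    new_row = {**current, **changes, "customer_key": new_key,
               "effective_date": effective, "is_current": True}
    dim.append(new_row)

apply_type2(customer_dim, "C-1", 1005, date(2003, 1, 15), state="California")
for row in customer_dim:
    print(row["customer_key"], row["state"], row["is_current"])
```

Facts loaded before the move keep pointing at key 1001, and facts loaded afterwards use key 1005, which is how the history is preserved.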
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the
particular attribute of interest, one indicating the original value, and one indicating
the current value. There will also be a column that indicates when the current value
becomes active.
In our example, recall we originally have the following table:
Customer Key   Name        State
1001           Christina   Illinois
To accommodate Type 3 Slowly Changing Dimension, we will now have the
following columns:
Customer Key
Name
Original State
Current State
Effective Date
After Christina moved from Illinois to California, the original information gets
updated, and we have the following table (assuming the effective date of change is
January 15, 2003):
Customer Key   Name        Original State   Current State   Effective Date
1001           Christina   Illinois         California      15-JAN-2003
Advantages:
This does not increase the size of the table, since new information is updated.
This allows us to keep some part of history.
Disadvantages:
Type 3 will not be able to keep all history where an attribute is changed more than
once. For example, if Christina later moves to Texas on December 15, 2003, the
California information will be lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3: Type 3 slowly changing dimension should only be used
when it is necessary for the data warehouse to track historical changes, and when
such changes will only occur a finite number of times.
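A Type 3 change can be sketched as updating the current-value column while leaving the original-value column alone. The dictionary layout and the function name apply_type3 are assumptions for illustration; the values follow the Christina example, including the later move to Texas.

```python
customer = {"customer_key": 1001, "name": "Christina",
            "original_state": "Illinois", "current_state": "Illinois",
            "effective_date": None}

def apply_type3(row, new_state, effective_date):
    """Type 3: overwrite the current value; the original column is kept as is."""
    row["current_state"] = new_state
    row["effective_date"] = effective_date

apply_type3(customer, "California", "15-JAN-2003")
apply_type3(customer, "Texas", "15-DEC-2003")  # California is now lost
print(customer)
```

After the second change the row still shows Illinois as the original state and Texas as the current state, but the intermediate California value is gone, which is the limitation described above.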