Data Warehouse Concepts & Terminology
Vamshi Myana
Contents
• What is a Data Warehouse?
• Why a Separate Data Warehouse?
• Data Granularity
• Difference between OLTP & DW
• Data Warehouse Architecture
• Top-Down Versus Bottom-Up Approach
• Data Warehouses Versus Data Marts
• Dimensional Modeling Fundamentals
• Extraction, Transformation and Load (ETL)
• OLAP
What is a Data Warehouse?
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
Data Warehouse Properties
A data warehouse is:
• Subject-oriented
• Integrated
• Time-variant
• Nonvolatile
-- Bill Inmon, Building the Data Warehouse, 1996
Subject-Oriented
Data is categorized and stored by business subject rather than by application.
[Figure: OLTP applications (Equity Plans, Shares, Insurance, Loans, Savings) feed one data warehouse subject: customer financial information.]
Integrated
Constructed by integrating multiple, heterogeneous data sources
Relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
• E.g. Hotel price: currency, tax, breakfast covered, etc.
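As a rough illustration of the hotel price example, the sketch below reconciles two source records that quote prices differently into one convention (EUR, tax included, breakfast excluded). The field names, the exchange rate, the tax rate, and the breakfast price are all made-up assumptions, not values from any real system.

```python
from dataclasses import dataclass

# Hypothetical conversion table; a real warehouse would load dated rates.
RATES_TO_EUR = {"EUR": 1.0, "USD": 0.5}


@dataclass
class HotelPrice:
    amount: float
    currency: str
    tax_included: bool
    breakfast_included: bool


def to_standard_price(p: HotelPrice, tax_rate: float = 0.10,
                      breakfast_eur: float = 10.0) -> float:
    """Return the nightly price in EUR, tax included, breakfast excluded."""
    eur = p.amount * RATES_TO_EUR[p.currency]
    if not p.tax_included:
        eur *= 1.0 + tax_rate
    if p.breakfast_included:
        # Assumption: breakfast_eur is itself a tax-inclusive amount.
        eur -= breakfast_eur
    return round(eur, 2)


source_a = HotelPrice(100.0, "EUR", tax_included=True, breakfast_included=True)
source_b = HotelPrice(200.0, "USD", tax_included=False, breakfast_included=False)
standardized = [to_standard_price(source_a), to_standard_price(source_b)]
```

Once both records follow the same convention, they can be compared or aggregated directly.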
Time-Variant
Data is stored as a series of snapshots, each representing a period of time.
Time      Data
Jan-97    January
Feb-97    February
Mar-97    March
Nonvolatile
Typically, data in the data warehouse is not updated or deleted.
• Operational: Insert, Update, Delete, Read
• Warehouse: Load, Read
Why a Separate Data Warehouse?
High performance for both systems:
• DBMS, tuned for OLTP: access methods, indexing, concurrency control, recovery
• Warehouse, tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
Different functions and different data:
• Missing data: decision support requires historical data, which operational DBs do not typically maintain
• Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
• Data quality: different sources typically use inconsistent data representations, codes and formats, which have to be reconciled
Data Warehouse Terminology
Enterprise Data Warehouse
Collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organization.
Data Mart
Departmental subsets that focus on selected subjects.
Decision Support System (DSS)
Not a product but an environment in which information technology is used to help the knowledge worker (executive, manager, analyst) make faster and better decisions.
Operational Data Store (ODS)
Stores tactical data from production systems that is subject-oriented and integrated to address operational needs.
Online Analytical Processing (OLAP)
An element of decision support systems (DSS) that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data.
Data Granularity
What is the granularity of your DW?
Granularity is the level of detail we want to store in the data warehouse.
For a retail store, the Point of Sale (POS) transaction is the lowest-granularity information available.
For banking, it is the account-level detail based on daily transactions.
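To make the idea of grain concrete, here is a minimal sketch that starts from POS line items (the lowest grain in the retail example) and aggregates them to one row per store per day. The rows and field names are invented for illustration.

```python
from collections import defaultdict

# Lowest grain: one row per item scanned at the point of sale (made-up data).
pos_lines = [
    {"store": "S1", "date": "2024-01-01", "item": "tea", "amount": 3.0},
    {"store": "S1", "date": "2024-01-01", "item": "coffee", "amount": 4.5},
    {"store": "S1", "date": "2024-01-02", "item": "tea", "amount": 3.0},
    {"store": "S2", "date": "2024-01-01", "item": "tea", "amount": 6.0},
]


def to_daily_grain(lines):
    """Aggregate POS lines to a coarser grain: one row per store per day."""
    totals = defaultdict(float)
    for line in lines:
        totals[(line["store"], line["date"])] += line["amount"]
    return dict(totals)


daily = to_daily_grain(pos_lines)
```

The daily totals take less space, but the item-level detail cannot be rebuilt from them, which is why the grain has to be chosen before the warehouse is loaded.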
Data Warehouse Versus OLTP
Property            Operational               Data Warehouse
Response time       Sub-seconds to seconds    Seconds to hours
Operations          DML                       Primarily read only
Nature of data      30-60 days                Snapshots over time
Data organization   Applications              Subject, time
Size                Small to large            Large to very large
Data source         Operational, internal     Operational, internal, external
Activities          Processes                 Analysis
Data warehouse Architectures
Top-Down Versus Bottom-Up Approach
Here are the two basic approaches:
• Overall data warehouse feeding dependent data marts
• Several departmental or local data marts combining into a data warehouse
In the first approach, you extract data from the operational systems; you then transform, clean, integrate, and keep the data in the data warehouse.
So, which approach is best in your case, the top-down or the bottom-up approach?
Top-Down Approach
The advantages of this approach are:
• A truly corporate effort, an enterprise view of data
• Inherently architected, not a union of disparate data marts
• Single, central storage of data about the content
• Centralized rules and control
The disadvantages are:
• Takes longer to build
• High exposure/risk to failure
• Needs high level of cross-functional skills
• High outlay without proof of concept
Bottom-Up Approach
The advantages of this approach are:
• Faster and easier implementation of manageable pieces
• Favorable return on investment and proof of concept
• Less risk of failure
• Inherently incremental; can schedule important data marts first
The disadvantages are:
• Each data mart has its own narrow view of data
• Spreads redundant data across every data mart
• Perpetuates inconsistent and irreconcilable data
Data Warehouses Versus Data Marts
Property              Data Warehouse     Data Mart
Scope                 Enterprise         Department
Subject               Multiple           Single-subject
Data source           Many               Few
Size (typical)        100 GB to >1 TB    <100 GB
Implementation time   Months to years    Months
Dimensional Model
A dimensional model is a model in which the data is structurally classified as fact or dimension.
General characteristics:
• Query oriented
• Structured around data usage, not business rules
• Organized roughly into base facts and dimensions of those facts
• Based on identification of key grains of data and on characteristics of those grains
• Usually consists of snapshot business data
• Aims to reduce the number and depth of joins
Three general patterns:
• Star schema: a fact table in the middle connected to a set of dimension tables
• Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
• Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema
Example of Star Schema
Sales Fact Table
• Keys: time_key, item_key, branch_key, location_key
• Measures: units_sold, dollars_sold, avg_sales
Dimension tables
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, province_or_state, country
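A cut-down version of this star schema can be sketched with Python's built-in sqlite3 module. Only two of the four dimensions and a few columns are kept; the table names time_dim and item_dim and all of the rows are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Jan', 'Q1', 1997), (2, 'Apr', 'Q2', 1997);
INSERT INTO item_dim VALUES (10, 'PC', 'BrandA'), (20, 'Printer', 'BrandB');
INSERT INTO sales_fact VALUES (1, 10, 2, 2000.0), (1, 20, 1, 150.0), (2, 10, 3, 3000.0);
""")

# A typical star query: join the fact table to each dimension, then group.
rows = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON t.time_key = f.time_key
    JOIN item_dim AS i ON i.item_key = f.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
```

Every dimension is exactly one join away from the fact table, which is what keeps the number and depth of joins low.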
Example of Snowflake Schema
Store Fact Table
• Keys: STORE KEY, PRODUCT KEY, PERIOD KEY
• Measures: Dollars, Units, Price
Store Dimension: STORE KEY, Store Description, City, State, District ID, Region_ID, Regional Mgr.
District table: District_ID, District Desc., Region_ID
Region table: Region_ID, Region Desc., Regional Mgr.
Dimensional Modeling Terminology
A fact table stores measures as well as keys representing relationships to various dimensions. Measures can be:
• Additive: measures that can be added across all dimensions.
• Semi-additive: measures that can be added across some dimensions but not others.
• Non-additive: measures that cannot be added across any dimension.
Dimensions are perspectives with respect to which an organization wants to keep records. They contain textual attributes that describe the facts.
In the example, the sales fact table is connected to the dimensions location, product, time and organization. The measure "Sales Dollar" in the sales fact table can be added across all dimensions independently or in combination, for example:
• Sales Dollar value for a particular product
• Sales Dollar value for a product in a location
• Sales Dollar value for a product in a year within a location
• Sales Dollar value for a product in a year within a location, sold or serviced by an employee
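The difference between measure types can be sketched in a few lines. The sales rows follow the "Sales Dollar" example with the organization dimension left out; the account balance is an assumed example of a semi-additive measure that does not appear in the slides. All numbers are invented.

```python
# Made-up rows: (product, location, year, sales_dollar)
sales = [
    ("tea", "north", 2023, 100.0),
    ("tea", "south", 2023, 50.0),
    ("tea", "north", 2024, 120.0),
    ("coffee", "north", 2024, 80.0),
]


def total_sales(rows, product=None, location=None, year=None):
    """Sum the additive measure over whichever dimensions are left unfixed."""
    return sum(
        amount
        for p, loc, y, amount in rows
        if product in (None, p) and location in (None, loc) and year in (None, y)
    )


# Made-up month-end balances: (account, month, balance)
balances = [("A", "Jan", 100.0), ("B", "Jan", 40.0),
            ("A", "Feb", 110.0), ("B", "Feb", 30.0)]


def total_balance(rows, month):
    """Balances add up across accounts within one month, but not across months."""
    return sum(balance for _, m, balance in rows if m == month)


tea_total = total_sales(sales, product="tea")
tea_north = total_sales(sales, product="tea", location="north")
tea_north_2024 = total_sales(sales, product="tea", location="north", year=2024)
january_total = total_balance(balances, "Jan")
```

Adding the January and February balances together would give a figure with no business meaning, which is what makes the balance only semi-additive.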
Conformed Dimension
Dimension tables that adhere to a common structure, and therefore allow queries to be executed across star schemas.
Sales Schema
• Sales Fact: DATE KEY, ITEM KEY, STORE KEY, PROMO KEY
• Item dimension: Item Key, Item Desc., Brand Desc., Category, ..
Inventory Schema
• Inventory Fact: DATE KEY, ITEM KEY, STORE KEY
• Item dimension (same structure): Item Key, Item Desc., Brand Desc., Category, ..
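A query across the two schemas might look like the sketch below: each fact table is summarized separately by an attribute of the shared item dimension, and the results are then lined up. The measure names units_sold and units_on_hand and all rows are assumptions for illustration.

```python
# Shared (conformed) item dimension used by both schemas; data is made up.
item_dim = {
    1: {"item_desc": "Green tea", "category": "Beverages"},
    2: {"item_desc": "Onions", "category": "Produce"},
}

sales_fact = [
    {"item_key": 1, "units_sold": 5},
    {"item_key": 2, "units_sold": 3},
    {"item_key": 1, "units_sold": 2},
]
inventory_fact = [
    {"item_key": 1, "units_on_hand": 20},
    {"item_key": 2, "units_on_hand": 4},
]


def by_category(fact_rows, measure):
    """Total one measure per category by looking up the shared dimension."""
    totals = {}
    for row in fact_rows:
        category = item_dim[row["item_key"]]["category"]
        totals[category] = totals.get(category, 0) + row[measure]
    return totals


sold = by_category(sales_fact, "units_sold")
on_hand = by_category(inventory_fact, "units_on_hand")
# Both results use the same category values, so they line up row by row.
report = {c: (sold.get(c, 0), on_hand.get(c, 0))
          for c in sorted(set(sold) | set(on_hand))}
```

If the two schemas kept separate item tables with different keys or category names, this final merge step would need a reconciliation mapping first.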
Extraction, Transformation and Load
Purchase specialist tools, or develop programs.
• Extraction: mapping the data between source systems and the target database
• Transformation: validate, clean, integrate, and time stamp data
• Load: loading the transformed data into the target system
Flow: OLTP Databases -> Staging File -> Warehouse Database
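For the "develop programs" route, the three steps and the staging file can be sketched with the standard library alone. This is a toy pipeline under stated assumptions: the source rows, the column names cust and amount, the sales table, and the in-memory staging file are all hypothetical.

```python
import csv
import io
import sqlite3
from datetime import datetime, timezone


def extract(source_rows):
    """Extract: copy source rows into a staging file (an in-memory CSV here)."""
    staging = io.StringIO()
    writer = csv.DictWriter(staging, fieldnames=["cust", "amount"])
    writer.writeheader()
    writer.writerows(source_rows)
    staging.seek(0)
    return staging


def transform(staging, load_time):
    """Transform: validate, clean, and time stamp each staged row."""
    for row in csv.DictReader(staging):
        name = row["cust"].strip().title()
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # validation: skip rows whose amount is not numeric
        yield (name, amount, load_time)


def load(conn, rows):
    """Load: insert the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (cust TEXT, amount REAL, loaded_at TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


oltp_rows = [{"cust": " alice ", "amount": "10.5"}, {"cust": "BOB", "amount": "n/a"}]
warehouse = sqlite3.connect(":memory:")
stamp = datetime.now(timezone.utc).isoformat()
load(warehouse, transform(extract(oltp_rows), stamp))
loaded = warehouse.execute("SELECT cust, amount FROM sales").fetchall()
```

Keeping the steps as separate functions mirrors the flow on the slide: the staging file decouples reading from the OLTP source from writing to the warehouse.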
What is OLAP?
Online Analytical Processing: viewing data in a multidimensional way.
Why OLAP?
• “Slice and dice” for the data warehouse.
• An RDBMS is a two-dimensional way of storing and viewing the data.
• OLAP is a multidimensional way of storing and viewing the data.
OLAP operations
• Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
• Drill down (roll down): the reverse of roll-up; go from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions.
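Both forms of roll-up can be sketched on a small set of cells. The city-to-country hierarchy and the sales figures below are invented for illustration.

```python
from collections import defaultdict

# Made-up detailed data: sales keyed by (city, country, quarter).
detail = {
    ("Lima", "Peru", "Q1"): 10,
    ("Cusco", "Peru", "Q1"): 5,
    ("Rome", "Italy", "Q1"): 7,
    ("Rome", "Italy", "Q2"): 8,
}


def roll_up_to_country(cells):
    """Roll up by climbing the location hierarchy: city -> country."""
    out = defaultdict(int)
    for (_city, country, quarter), value in cells.items():
        out[(country, quarter)] += value
    return dict(out)


def drop_quarter(cells):
    """Roll up by dimension reduction: remove the time dimension entirely."""
    out = defaultdict(int)
    for (country, _quarter), value in cells.items():
        out[country] += value
    return dict(out)


by_country_quarter = roll_up_to_country(detail)
by_country = drop_quarter(by_country_quarter)
```

Drill-down is the reverse direction: moving from by_country back to by_country_quarter or detail. It is only possible because the finer-grained data is still stored; it cannot be computed from the summary alone.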
• Slicing: selecting the dimensions of the cube to be viewed. Example: view “Sales volume” as a function of “Product” by “Country” by “Quarter”.
• Dicing: specifying the values along one or more dimensions. Example: view “Sales volume” for “Product=PC” by “Country” by “Quarter”.
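The two example views above can be sketched with one small helper that picks which dimensions to show and, optionally, which values to fix. The helper name view, the cube contents, and the extra "Phone" product are assumptions for illustration.

```python
from collections import defaultdict

DIMENSIONS = ("product", "country", "quarter")

# Made-up sales volume cells keyed by (product, country, quarter).
cube = {
    ("PC", "Japan", "Q1"): 4,
    ("PC", "Italy", "Q1"): 6,
    ("PC", "Italy", "Q2"): 1,
    ("Phone", "Japan", "Q1"): 9,
}


def view(cells, dims, **fixed):
    """Total sales volume by the chosen dims, keeping only cells matching `fixed`."""
    positions = [DIMENSIONS.index(d) for d in dims]
    out = defaultdict(int)
    for key, volume in cells.items():
        if all(key[DIMENSIONS.index(d)] == v for d, v in fixed.items()):
            out[tuple(key[i] for i in positions)] += volume
    return dict(out)


# "Sales volume" as a function of Product by Country by Quarter.
all_products = view(cube, ("product", "country", "quarter"))
# "Sales volume" for Product=PC by Country by Quarter.
pc_only = view(cube, ("country", "quarter"), product="PC")
```

Calling view(cube, ("country",)) would total the volume per country across every product and quarter, showing that the same helper also covers coarser views.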
Types of OLAP
There are three types of OLAP in the industry:
1. MOLAP: Multidimensional OLAP (e.g. MSOLAP, Essbase, Cognos)
2. ROLAP: Relational OLAP (e.g. Business Objects, MicroStrategy)
3. HOLAP: Hybrid OLAP
Architecture diagram of ROLAP
[Figure: Data Warehouse or Data Mart -> App Server running ROLAP tools (such as BO, Cognos, MicroStrategy) with BI metadata -> OLAP Report 1, OLAP Report 2, ..., OLAP Report n]
When a report is executed by an end user, the actual SQL is issued to the RDBMS to get the data. Some BI tools can also store the result set on the application server and periodically refresh the report based on the data refreshes that happen in the DW.
Architecture diagram of MOLAP
[Figure: Data Warehouse or Data Mart -> Microsoft Analysis Services (BI metadata, cube definitions, etc.) -> MOLAP cubes -> OLAP Report 1, OLAP Report 2, ..., OLAP Report n]
When a report is executed by an end user, the data is retrieved from the MOLAP cubes using MDX queries based on the report. MDX stands for Multidimensional Expressions. SQL is used to get data from an RDBMS; MDX is used to get data from MOLAP. The MOLAP cubes are refreshed periodically based on the data refreshes that happen in the DW.
Terminology
Cube – A cube is a multidimensional structure of data. Cubes are defined by a set of dimensions and measures.
Dimension – A structural attribute of a cube that acts as an index for identifying values within a multidimensional array. If all dimensions have a single member selected, then a single cell is defined.
[Figure: cube with axes Time, Products, and Location]
Measures – Numeric data of interest, e.g. Revenue per Sale, Quantity.
[Figure: cube with axes Time (January, February, March, April), Products (Coffee, Apples, Tea, Onions), and Location (China, Peru, Japan, Italy); one cell is labeled €1.95]
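The idea of dimensions acting as indexes into a multidimensional array can be sketched with a nested list. The member names come from the figure labels; the measure values are placeholders, since the figure does not say which cell holds which value.

```python
# Members along each dimension (taken from the figure labels).
time = ["January", "February", "March", "April"]
products = ["Coffee", "Apples", "Tea", "Onions"]
location = ["China", "Peru", "Japan", "Italy"]

# A dense 4 x 4 x 4 array holding one measure; every value is a placeholder.
cube = [[[0.0 for _ in location] for _ in products] for _ in time]


def set_cell(t, p, loc, value):
    """Store a measure value at the cell picked out by one member per dimension."""
    cube[time.index(t)][products.index(p)][location.index(loc)] = value


def cell(t, p, loc):
    """Selecting a single member on every dimension identifies exactly one cell."""
    return cube[time.index(t)][products.index(p)][location.index(loc)]


set_cell("March", "Tea", "Japan", 2.5)  # made-up measure value
march_tea_japan = cell("March", "Tea", "Japan")
```

Leaving one dimension unspecified would select a row of cells instead of a single one, and leaving two unspecified would select a plane.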
Summary
This session covered the following topics:
• What is a Data Warehouse?
• Difference between OLTP & DW
• Data warehouse architecture and approach
• Dimensional Modeling
• What is OLAP?
Questions ?
Thank You.