7/28/2019 39932886 Conf Dwh Concepts
Course Overview
What is Data Warehouse
OLTP Vs. Data Warehousing
Data Warehousing Architecture
Data Warehousing Schemas & Objects
Physical Design in Data Warehouse
Definition of Data Warehousing
Data Warehousing Basic Design Approaches
Data Warehousing Operational Processes
Technical Problems in Data Warehousing
Representative DSS Tools
Business Intelligence
What is a Data Warehouse?
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data.
A data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, online analytical processing (OLAP) and data mining capabilities, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
It is a series of processes, procedures and tools (hardware and software) that help the enterprise understand more about itself, its products, its customers and the market it serves.
Facts!
A Data Warehouse is NOT a specific technology.
It is NOT possible to purchase a Data Warehouse, but it is possible to build one.
Why Data Warehousing?
Need of Intelligent Information in Competitive Market
Who are the potential customers?
Which products are sold the most?
What are the region-wise preferences?
What are the competitor products?
What are the projected sales?
What if you sell more quantity of a particular product? What will be the impact on revenue?
What are the results of the promotion schemes introduced?
Defining Data Warehouse
A data warehouse is a subject-oriented, integrated, nonvolatile and time-variant collection of data in support of management's decisions.
William Inmon
Subject Oriented
The data in the data warehouse is organized around the major subjects of the enterprise (i.e. the high-level entities).
The orientation around the major subject areas causes the data warehouse design to be data driven.
The operational systems are designed around applications and functions, e.g. loans, savings and credit cards in the case of a bank, whereas the data warehouse is designed around subjects such as Customer, Product, Vendor, etc.
Operational Systems: organized by processes or tasks
Data Warehouse: organized by subject (Customer, Supplier, Product)
Time Variant
[Figure: Data Warehouse Data; Key; Time Data]
Data is stored as a series of snapshots or views which record how it is collected across time.
It helps in business trend analysis.
In contrast to the OLTP environment, data warehouses focus on change over time; that is what we mean by time variant.
Integrated
Data is stored once in a single integrated location.
[Figure: customer data stored in several databases (Auto Policy Processing System; Fire Policy Processing System; FACTS, LIFE Commercial, Accounting Applications) is brought together in the Data Warehouse Database under Subject = Customer]
It is closely related to subject orientation.
Data from disparate sources needs to be put in a consistent format.
This requires resolving problems such as naming conflicts and inconsistencies.
Non-Volatile
Existing data in the warehouse is not overwritten or updated.
[Figure: Production Applications update, insert and delete data in the Production Databases; data from the Production Databases and External Sources is loaded into the Data Warehouse Database, which is read-only within the Data Warehouse Environment]
This is logical because the purpose of a data warehouse is to enable you to analyze what has occurred.
So, what's different between OLTP and Data Warehouse?
OLTP vs. Data Warehouse
OLTP systems are tuned for known transactions and workloads, while the workload is not known in advance in a data warehouse.
Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries), e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December.
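The example query above restricts on place, month and time of day at once. A minimal sketch of it using Python's built-in sqlite3 module follows; the `calls` table, its columns and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical call-detail table; not part of the original material.
conn.execute("CREATE TABLE calls (city TEXT, call_start TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("Pune", "2018-12-03 10:15:00", 12.0),
        ("Pune", "2018-12-10 16:40:00", 8.0),
        ("Pune", "2018-12-10 20:05:00", 30.0),    # outside 9AM-5PM
        ("Pune", "2018-11-21 11:00:00", 50.0),    # not December
        ("Mumbai", "2018-12-05 09:30:00", 99.0),  # not Pune
    ],
)

# Filter on three dimensions (city, month, hour of day), then aggregate.
avg_amount = conn.execute(
    """
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Pune'
      AND strftime('%m', call_start) = '12'
      AND strftime('%H', call_start) BETWEEN '09' AND '16'
    """
).fetchone()[0]
print(avg_amount)
```

Only the first two rows satisfy all three conditions, so the average is taken over those two calls.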
OLTP vs. Data Warehouse
OLTP                     WAREHOUSE (DSS)
Application Oriented     Subject Oriented
Used to run business     Used to analyze business
Detailed data            Summarized and refined
Current up to date       Snapshot data
Isolated Data            Integrated Data
Repetitive access        Ad-hoc access
Clerical User            Knowledge User (Manager)
OLTP vs Data Warehouse
OLTP                                     DATA WAREHOUSE
Performance sensitive                    Performance relaxed
Few records accessed at a time (tens)    Large volumes accessed at a time (millions)
Read/Update access                       Mostly read (batch update)
No data redundancy                       Redundancy present
Database size 100 MB - 100 GB            Database size 100 GB - few terabytes
OLTP vs Data Warehouse
OLTP                                                DATA WAREHOUSE
Transaction throughput is the performance metric    Query throughput is the performance metric
Thousands of users                                  Hundreds of users
Managed in entirety                                 Managed by subsets
To summarize ...
OLTP systems are used to run a business.
The Data Warehouse helps to optimize the business.
Data Warehouse Architectures
Centralized
In a centralized architecture, there exists only one data warehouse, which stores all data necessary for business analysis. As already shown in the previous section, the disadvantage is the loss of performance compared with distributed approaches.
[Figure: Central Architecture]
Data Warehouse Architectures Contd
Federated
In a federated architecture the data is logically consolidated but stored in separate physical databases, at the same or at different physical sites. The local data marts store only the relevant information for a department.
The amount of data is reduced in contrast to a central data warehouse. The level of detail is enhanced.
[Figure: Federated Architecture]
Data Warehouse Architectures Contd
Tiered:
A tiered architecture is a distributed data approach. The integration cannot be done in one step because many sources have to be integrated into the warehouse.
On the first level, the data of all branches in one region is collected; on the second level, the data from the regions is integrated into one data warehouse.
Advantages:
Faster response time, because the data is located closer to the client applications
Reduced volume of data to be searched
[Figure: Tiered Architecture]
Complete Warehouse Solution Architecture
[Figure: three stages, Data Sources, Data Management and Access, with Metadata spanning them. Data Sources: Operational Data, Legacy Data, External Data Sources (The Post, VISA). Data Management: Extract, Transform, Load into the organizationally structured Enterprise Data Warehouse, which feeds departmentally structured Data Marts (Sales, Inventory, Purchase). Bottom labels: Data, Information, Knowledge; Asset Assembly (and Management), Asset Exploitation]
Data Warehouse Architecture Components
Data Sources (disparate data sources):
Legacy data
Operational data
External data resources
Data Management:
Metadata - At all levels of the data warehouse, information is required to support the maintenance and use of the data warehouse.
Data Mart - A data mart is a subject-oriented data warehouse.
Introduction To Data Marts
What is a Data Mart
From the data warehouse, atomic data flows to various departments for their customized needs. If this data is periodically extracted from the data warehouse and loaded into a local database, it becomes a data mart. The data in a data mart has a different level of granularity than that of the data warehouse. Since the data in data marts is highly customized and lightly summarized, the departments can do whatever they want without worrying about resource utilization. Also, the departments can use the analytical software they find convenient. The cost of processing becomes very low.
Data Mart Overview
Data marts satisfy 80% of the local end-users' requests.
[Figure: the Data Warehouse feeds the data marts DM Sales, DM HR, DM Marketing and DM Finance; user groups shown are Sales Representatives and Analysts, Human Resources, and Financial Analysts, Strategic Planners and Executives]
From The Data Warehouse To Data Marts
[Figure: Data Warehouse (organizationally structured), then departmentally structured and individually structured data marts; scale labels: History, Normalized, Detailed (More to Less); Data to Information]
Operational Data Store (ODS)
What is an ODS
An Operational Data Store (ODS) integrates data from multiple business operation sources to address operational problems that span one or more business functions.
An ODS has the following features:
Subject-oriented: Organized around major subjects of an organization (customer, product, etc.), not specific applications (order entry, accounts receivable, etc.).
Integrated: Presents an integrated image of subject-oriented data which is pulled from fragmented operational source systems.
Current: Contains a snapshot of the current content of legacy source systems. History is not kept, and might be moved to the data warehouse for analysis.
Volatile: Since ODS content is kept current, it changes frequently. Identical queries run at different times may yield different results.
Detailed: ODS data is generally more detailed than data warehouse data. Summary data is usually not stored in an ODS; the exact granularity depends on the subject that is being supported.
Operational Data Store (ODS) Contd
The ODS provides an integrated view of data in operational systems.
As the figure below indicates, there is a clear separation between the ODS and the data warehouse.
[Figure: sources A, B, C; Operational Data Store; Data Warehouse; consumers EIS, DSS, Apps, PC]
Operational Data Store: current or near-current data; detailed data; updates allowed
Data Warehouse: historical data; summary and detail; non-volatile snapshots only
Benefits Of ODS
Supports operational reporting needs of the organization.
Provides a complete view of customer relationships, the data for which might be stored in several operational databases. This data can include data from an organization's internal systems, as well as external data from third-party vendors.
Operates as a store for detailed data, updated frequently and used for drill-downs from the data warehouse, which contains summary data.
Reduces the burden placed on other operational or data warehouse platforms by providing an additional data store for reporting.
Provides data that is more current than in a data warehouse and more integrated than in an OLTP system.
Feeds other operational systems in addition to the data warehouse.
Data Warehousing Schemas & Objects
A schema is a collection of database objects, including tables, views, indexes, and synonyms.
There are a variety of ways of arranging schema objects in the schema models designed for data warehousing. They are:
Star Schema
Snowflake Schema
Galaxy Schema
Star Schema: It consists of a fact table connected to a set of dimension tables. Data in the dimension tables is de-normalized.
Snowflake Schema: It is a refinement of the star schema where some dimensional hierarchy is normalized into a set of dimension tables.
Galaxy Schema: Multiple fact tables share dimension tables. It can be viewed as a collection of stars, and is therefore called a galaxy schema.
Star Schema
A star schema is a highly de-normalized, query-centric model where information is broken into two groups: facts and dimensions.
[Figure: Sales_Fact (TimeKey, EmployeeKey, ProductKey, CustomerKey, ShipperKey, required data (business metrics or measures) ...) surrounded by Time_Dim (TimeKey, TheDate, ...), Employee_Dim (EmployeeKey, EmployeeID, ...), Branch_Dim (BranchID, Branchno, ...), Customer_Dim (CustomerKey, CustomerID, ...) and Shipper_Dim (ShipperKey, ShipperID, ...)]
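A minimal sketch of a star-schema query, using Python's built-in sqlite3 module. Only two of the dimensions from the figure are modelled, and the `SalesAmount` measure and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Time_Dim (TimeKey INTEGER PRIMARY KEY, TheDate TEXT);
    CREATE TABLE Customer_Dim (CustomerKey INTEGER PRIMARY KEY, CustomerID TEXT);
    -- SalesAmount is a hypothetical measure standing in for the business metrics.
    CREATE TABLE Sales_Fact (
        TimeKey INTEGER REFERENCES Time_Dim (TimeKey),
        CustomerKey INTEGER REFERENCES Customer_Dim (CustomerKey),
        SalesAmount REAL
    );
    INSERT INTO Time_Dim VALUES (1, '2019-01-15'), (2, '2019-02-20');
    INSERT INTO Customer_Dim VALUES (10, 'C-001'), (11, 'C-002');
    INSERT INTO Sales_Fact VALUES (1, 10, 100.0), (1, 11, 40.0), (2, 10, 60.0);
    """
)

# Every dimension is exactly one join away from the fact table.
rows = conn.execute(
    """
    SELECT c.CustomerID, substr(t.TheDate, 1, 7) AS month, SUM(f.SalesAmount)
    FROM Sales_Fact AS f
    JOIN Time_Dim AS t ON t.TimeKey = f.TimeKey
    JOIN Customer_Dim AS c ON c.CustomerKey = f.CustomerKey
    GROUP BY c.CustomerID, month
    ORDER BY c.CustomerID, month
    """
).fetchall()
print(rows)
```

The query shape is typical for a star: filter or group on dimension attributes, aggregate the measures in the fact table.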
Snowflake Schema
Figure 32.2
Fact Table:
Sales_fact: timeID {FK}, propertyID {FK}, branchID {FK}, clientID {FK}, promotionID {FK}, staffID {FK}, ownerID {FK}, offerPrice, sellingPrice, saleCommission, saleRevenue
Dimension Tables:
Branch_Dim: branchID {PK}, branchNo, branchType, city {FK}
City: city {PK}, region {FK}
Region: region {PK}, country
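A minimal sketch of how the normalized Branch_Dim, City and Region tables above are traversed in a query, using Python's built-in sqlite3 module. Sales_fact is reduced to two of its columns, and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Region (region TEXT PRIMARY KEY, country TEXT);
    CREATE TABLE City (city TEXT PRIMARY KEY, region TEXT REFERENCES Region (region));
    CREATE TABLE Branch_Dim (
        branchID INTEGER PRIMARY KEY,
        branchNo TEXT,
        branchType TEXT,
        city TEXT REFERENCES City (city)
    );
    -- Only two of the Sales_fact columns are kept to keep the sketch short.
    CREATE TABLE Sales_fact (
        branchID INTEGER REFERENCES Branch_Dim (branchID),
        saleRevenue REAL
    );
    INSERT INTO Region VALUES ('West', 'India'), ('South', 'India');
    INSERT INTO City VALUES ('Pune', 'West'), ('Mumbai', 'West'), ('Chennai', 'South');
    INSERT INTO Branch_Dim VALUES
        (1, 'B001', 'Main', 'Pune'),
        (2, 'B002', 'Sub', 'Mumbai'),
        (3, 'B003', 'Main', 'Chennai');
    INSERT INTO Sales_fact VALUES (1, 100.0), (2, 250.0), (3, 75.0), (1, 50.0);
    """
)

# Reaching the region level takes one join per normalized level of the hierarchy.
revenue_by_region = conn.execute(
    """
    SELECT r.region, SUM(f.saleRevenue)
    FROM Sales_fact AS f
    JOIN Branch_Dim AS b ON b.branchID = f.branchID
    JOIN City AS c ON c.city = b.city
    JOIN Region AS r ON r.region = c.region
    GROUP BY r.region
    ORDER BY r.region
    """
).fetchall()
print(revenue_by_region)
```

Compared with a star schema, the dimension data is stored without repetition, at the price of extra joins at query time.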
Galaxy Schema
Multiple groups of facts linked by a few common dimensions.
[Figure: fact tables Fact1, Fact2 and Fact3 sharing dimensions Dimension1 to Dimension7]
Data Warehousing Objects
All the three types of Schemas are described in the Data Modeling section
Various Objects used in Data Warehousing are:
Fact Tables
Dimension Tables
Hierarchies
Unique Identifiers
Relationships
Data Warehousing Objects
Fact Tables:
Represent a business process, i.e., model the business process as an artifact in the data model
Contain the measurements or metrics or facts of business processes, e.g. "monthly sales number" in the Sales business process
Most facts are additive (sales this month), some are semi-additive (balance as of), some are not additive (unit price)
The level of detail is called the grain of the table
Contain foreign keys for the dimension tables
Fact Types:
Additive facts:
Additive facts are facts that can be summed up through all of the dimensions in the fact table.
Semi-additive facts:
Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Non-additive facts:
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Examples to illustrate additive, semi-additive and non-additive facts:
Fact table:
Date
Store
Product
Sales_Amount
The purpose of this table is to record the Sales_Amount for each product in each store on a daily basis. Sales_Amount is the fact.
In this case, Sales_Amount is an additive fact, because we can sum up this fact along any of the 3 dimensions present in the fact table: date, store, and product.
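A minimal sketch of what "additive" means for the table above: Sales_Amount can be grouped and summed along any one of the three dimensions, and the grand total is the same each time. The sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of the fact table: (Date, Store, Product, Sales_Amount).
sales_fact = [
    ("2019-07-01", "Store A", "Pen", 10.0),
    ("2019-07-01", "Store B", "Pen", 5.0),
    ("2019-07-02", "Store A", "Ink", 20.0),
    ("2019-07-02", "Store B", "Pen", 15.0),
]


def total_by(dimension_index):
    """Sum Sales_Amount grouped by one dimension column (0=Date, 1=Store, 2=Product)."""
    totals = defaultdict(float)
    for row in sales_fact:
        totals[row[dimension_index]] += row[3]
    return dict(totals)


by_date = total_by(0)
by_store = total_by(1)
by_product = total_by(2)
print(by_date, by_store, by_product)
```

Whichever dimension is chosen, the partial sums are meaningful and add up to the same overall sales figure.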
Example for semi-additive and non-additive facts:
Fact table:
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day.
Current_Balance and Profit_Margin are the facts.
Current_Balance is a semi-additive fact, as it makes sense to add it up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add it up through time.
Profit_Margin is a non-additive fact, for it does not make sense to add it up at either the account level or the day level.
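A minimal sketch of how a semi-additive fact such as Current_Balance is handled: it is summed across accounts for a single day, but across time the latest snapshot is taken instead of a sum. The sample rows and function names are hypothetical.

```python
# Hypothetical rows: (Date, Account, Current_Balance, Profit_Margin).
balance_fact = [
    ("2019-07-01", "ACC-1", 100.0, 0.10),
    ("2019-07-01", "ACC-2", 300.0, 0.20),
    ("2019-07-02", "ACC-1", 120.0, 0.10),
    ("2019-07-02", "ACC-2", 280.0, 0.20),
]


def bank_balance_on(date):
    """Summing Current_Balance across accounts for one day is meaningful."""
    return sum(balance for d, _, balance, _ in balance_fact if d == date)


def closing_balance(account):
    """Across time, take the latest snapshot instead of summing the days."""
    rows = [row for row in balance_fact if row[1] == account]
    latest = max(rows, key=lambda row: row[0])  # ISO dates sort chronologically
    return latest[2]


total_on_july_2 = bank_balance_on("2019-07-02")
acc1_closing = closing_balance("ACC-1")
print(total_on_july_2, acc1_closing)
```

Profit_Margin gets no summing function at all here: being a ratio, it has to be recomputed from its underlying amounts at whatever level is being reported.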
Types of fact tables:
Based on the above classifications, there are two types of fact tables: cumulative and snapshot.
Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive. The first example is a cumulative fact table.
Snapshot: This type of fact table describes the state of things at a particular instant of time, and usually includes more semi-additive and non-additive facts. The second example presented is a snapshot fact table.
Data Warehousing Objects Contd.
Dimension Tables:
Define business in terms already familiar to users
Wide rows with lots of descriptive text
Small tables (about a million rows)
Joined to fact table by a foreign key
Heavily indexed
Typical dimensions: time periods, geographic region (markets, cities), products, customers, salesperson, etc.
Dimension Table Types
Slowly Changing Dimensions
Junk Dimensions
Conformed Dimensions
Degenerate Dimensions
Slowly Changing Dimensions (SCD):
Various data elements in the dimension undergo changes (e.g. changes in attributes, hierarchical structures) which need to be captured for analysis.
The SCD problem is a common one particular to data warehousing.
In a nutshell, this applies to cases where the attribute for a record varies over time.
For example:
Customer key    Name         State
1001            Christina    Illinois
Christina is a customer who first lived in Chicago, Illinois. At a later date, she moved to Los Angeles, California. How should the table be modified to reflect this change?
This is a Slowly Changing Dimension problem.
Types of SCD
There are in general 3 ways to solve this type of problem, and they are categorized as follows:
Type 1: The new record replaces the original record. No trace of the old record exists.
Type 2: A new record is added to the customer dimension table.
Type 3: The original record is modified to reflect the change.
TYPE 1:
The new record replaces the original record. No trace of the old record exists.
E.g.:
Customer key    Name         State
1001            Christina    Illinois
After Christina moved from Illinois to California, the new information replaces the original record and we have the following table:
Customer key    Name         State
1001            Christina    California
Advantages:
This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.
Disadvantages:
All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in the above case, the company would not be able to know that Christina lived in Illinois before.
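A minimal sketch of a Type 1 change, assuming the customer dimension is held as an in-memory dictionary keyed by customer key. The structure and the `apply_type1` helper are hypothetical.

```python
# Hypothetical in-memory customer dimension, keyed by customer key.
customer_dim = {1001: {"name": "Christina", "state": "Illinois"}}


def apply_type1(dim, customer_key, new_state):
    """Type 1: overwrite the attribute in place; the old value is not kept."""
    dim[customer_key]["state"] = new_state


apply_type1(customer_dim, 1001, "California")
print(customer_dim)
```

After the call there is still exactly one row for key 1001, and nothing in it records that the state was ever Illinois.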
TYPE 2:
In a Type 2 SCD a new record is added to the table to represent the new information. Therefore both the original and the new record will be present.
E.g.:
After Christina moved from Illinois to California, we add the new information as a new row into the table:
Customer key    Name         State
1001            Christina    Illinois
1005            Christina    California
Advantages:
This allows us to accurately keep all historical information.
Disadvantages:
This will cause the size of the table to grow fast. Where the number of rows for the table is very high to start with, storage and performance can become a concern.
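A minimal sketch of a Type 2 change, assuming the dimension is a list of row dictionaries. The `apply_type2` helper is hypothetical, and matching rows by name is a simplification; designs of this kind typically also carry a stable business key and validity dates or a current-row flag, which are left out here.

```python
# Hypothetical in-memory customer dimension: one dict per row.
customer_dim = [{"customer_key": 1001, "name": "Christina", "state": "Illinois"}]


def apply_type2(dim, name, new_state, new_key):
    """Type 2: append a row with a new surrogate key; existing rows stay untouched."""
    dim.append({"customer_key": new_key, "name": name, "state": new_state})


apply_type2(customer_dim, "Christina", "California", new_key=1005)
states = [row["state"] for row in customer_dim if row["name"] == "Christina"]
print(states)
```

Both rows remain queryable, so facts recorded against key 1001 still roll up under Illinois while new facts use key 1005.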
TYPE 3:
In a Type 3 SCD there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.
E.g.:
After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):
Customer key    Name         Original State    Current State    Effective Date
1001            Christina    Illinois          California       15-Jan-03
Advantages:
This does not increase the size of the table, since new information is updated in place.
This allows us to keep some part of history.
Disadvantages:
Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information is lost.
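A minimal sketch of a Type 3 change, assuming one row dictionary with separate original and current columns. The column names and the `apply_type3` helper are hypothetical.

```python
from datetime import date

# Hypothetical single-row view of the customer dimension.
customer = {
    "customer_key": 1001,
    "name": "Christina",
    "original_state": "Illinois",
    "current_state": "Illinois",
    "effective_date": None,
}


def apply_type3(row, new_state, effective):
    """Type 3: keep the original value, overwrite only the current value and its date."""
    row["current_state"] = new_state
    row["effective_date"] = effective


apply_type3(customer, "California", date(2003, 1, 15))
apply_type3(customer, "Texas", date(2003, 12, 15))  # California is no longer recorded
print(customer)
```

After the second change the row holds Illinois and Texas only, which is the loss of intermediate history described under Disadvantages.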
Degenerate Dimension:
A degenerate dimension is a dimension which is derived from the fact table and doesn't have its own dimension table.
Degenerate dimensions are often used when a fact table's grain represents transactional-level data and one wishes to maintain system-specific identifiers such as order numbers, invoice numbers and the like without forcing their inclusion in their own dimension.
Conformed Dimensions:
A conformed dimension is a dimension that is defined once and is reusable: it carries the same keys, attributes and meaning in every fact table and data mart that refers to it.
Ex: if the name of a city is changed from Bombay to Mumbai, the change is made once in the shared dimension, and every fact table that joins to it sees the same value.
Junk Dimensions:
A dimension where one can store random transactional codes, flags and text attributes that are not related to other dimensions, and which provides a simple way for users to easily find those unrelated attributes.
Ex: Marital Status (Yes or No), Gender (M or F), etc.
Data Warehousing Objects Contd.
Hierarchies:
Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A level represents a position in a hierarchy.
Unique Identifiers:
Unique identifiers are specified for one distinct record in a dimension table. Artificial unique identifiers are often used to avoid the potential problem of unique identifiers changing.
Relationships:
Relationships guarantee business integrity. Designing a relationship between the sales information in the fact table and the dimension tables products and customers enforces the business rules in databases.
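The time hierarchy mentioned under Hierarchies (month to quarter to year) can be sketched as a two-step roll-up. The sample figures and names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical month-level sales: (year, month, amount).
monthly_sales = [(2019, 1, 100.0), (2019, 2, 50.0), (2019, 4, 70.0), (2019, 12, 30.0)]


def quarter_of(month):
    """Map a month (1-12) to its quarter (1-4), the next level up in the hierarchy."""
    return (month - 1) // 3 + 1


# Level 1 -> level 2: months roll up into quarters.
by_quarter = defaultdict(float)
for year, month, amount in monthly_sales:
    by_quarter[(year, quarter_of(month))] += amount

# Level 2 -> level 3: quarters roll up into years.
by_year = defaultdict(float)
for (year, _quarter), amount in by_quarter.items():
    by_year[year] += amount

print(dict(by_quarter), dict(by_year))
```

Each level is computed only from the level directly below it, which is what makes the ordering of levels in a hierarchy matter.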
Physical Design In Data Warehouse
Physical design is the creation of the database with SQL statements. During the physical design process, you convert the data gathered during the logical design phase into a description of the physical database structure.
Physical Design Structures:
Tablespaces: A tablespace consists of one or more data files, which are physical structures within the operating system you are using. A data file is associated with only one tablespace. From a design perspective, tablespaces are containers for physical design structures.
Tables and Partitioned Tables: Tables are the basic unit of data storage. They are the container for the expected amount of raw data in your data warehouse. Using partitioned tables instead of non-partitioned ones addresses the key problem of supporting very large data volumes by allowing you to decompose them into smaller and more manageable pieces.
Physical Design In Data Warehouse Contd.
Views:
A view is a tailored presentation of the data contained in one or more tables or other views. A view takes the output of a query and treats it as a table. Views do not require any space in the database.
Integrity Constraints:
Integrity constraints are used to enforce business rules associated with your database and to prevent having invalid information in the tables. Integrity constraints in data warehousing differ from constraints in OLTP environments. In OLTP environments, they primarily prevent the insertion of invalid data into a record, which is not a big problem in data warehousing environments because accuracy has already been guaranteed.
Indexes:
Indexes are optional structures associated with tables or clusters. In addition to the classical B-tree indexes, bitmap indexes are very common in data warehousing environments.
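To illustrate why bitmap indexes suit warehouse queries, here is a minimal sketch of the idea using Python integers as bitmaps. It is a conceptual illustration only, not how any particular database implements them; the columns and sample values are hypothetical.

```python
# Hypothetical low-cardinality columns, one entry per table row.
gender = ["M", "F", "F", "M", "F"]
state = ["IL", "CA", "CA", "CA", "IL"]


def build_bitmap_index(column):
    """Return one bitmap per distinct value; bit i is set when row i has that value."""
    index = {}
    for row_number, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_number)
    return index


gender_index = build_bitmap_index(gender)
state_index = build_bitmap_index(state)

# WHERE gender = 'F' AND state = 'CA' becomes a single bitwise AND.
match_bits = gender_index["F"] & state_index["CA"]
matching_rows = [i for i in range(len(gender)) if (match_bits >> i) & 1]
print(matching_rows)
```

Combining several filters costs only a few bitwise operations, which fits ad-hoc queries that filter on many low-cardinality attributes at once.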
Definition Of Data Warehouse
Ralph Kimball's paradigm:
The data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.
Bill Inmon's paradigm:
The data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.
Basic Design Approaches of Data Warehouse
There are two major types of approaches to building or designing the
Data Warehouse.
The Top-Down Approach
The Bottom-Up Approach
The Top Down Approach
The Dependent Data Mart structure or Hub & Spoke: The Top-Down Approach
Inmon advocated a dependent data mart structure.
The data flow in the top-down OLAP environment begins with data extraction from the operational data sources. This data is loaded into the staging area, validated and consolidated to ensure a level of accuracy, and then transferred to the Operational Data Store (ODS).
Detailed data is regularly extracted from the ODS and temporarily hosted in the staging area for aggregation and summarization, and then extracted and loaded into the data warehouse.
Once the data warehouse aggregation and summarization processes are complete, the data mart refresh cycles will extract the data from the data warehouse into the staging area and perform a new set of transformations on it. This will help organize the data in the particular structures required by the data marts. Then the data marts can be loaded with the data and the OLAP environment becomes available to the users.
The Top Down Approach Contd
Inmon Approach
The data marts are treated as subsets of the data warehouse. Each data mart is built for an individual department and is optimized for the analysis needs of the particular department for which it is created.
The Bottom-Up Approach
The Data Warehouse Bus Structure: The Bottom-Up Approach
Ralph Kimball designed the data warehouse with the data marts connected to it with a bus structure.
The bus structure contains all the common elements that are used by data marts, such as conformed dimensions, measures, etc., defined for the enterprise as a whole.
This architecture makes the data warehouse more of a virtual reality than a physical reality.
All data marts could be located on one server or on different servers across the enterprise, while the data warehouse would be a virtual entity, being nothing more than the sum total of all the data marts.
In this context even the cubes constructed by using OLAP tools could be considered as data marts.
The Bottom-Up Approach Contd
Kimball Approach
The bottom-up approach reverses the positions of the Data warehouse and
the Data marts. Data marts are directly loaded with the data from the operational
systems through the staging area.
The data flow in the bottom up approach starts with extraction of data from
operational databases into the staging area where it is processed and
consolidated and then loaded into the ODS.
The Bottom-Up Approach Contd
The data in the ODS is appended to or replaced by the fresh data being loaded. After the ODS is refreshed, the current data is once again extracted into the staging area and processed to fit into the data mart structure. The data from the data mart is then extracted to the staging area, aggregated, summarized and so on, loaded into the data warehouse, and made available to the end user for analysis.
DW Operational Processes (Overview of
Extraction, Transformation & Loading)
Operational/Source Data
Typically host-based, legacy applications: customized applications, COBOL, 3GL, 4GL
Point of contact devices: POS, ATM, call switches
External sources: Nielsen's, Acxiom, CMIE, vendors, partners
[Figure: source data types: Sequential, Legacy, Relational, External]
DW Operational Processes (Overview of
Extraction, Transformation & Loading) Contd
These tools try to automate or support tasks such as:
Data Extraction (accessing different source databases)
Data Cleansing (finding and resolving inconsistencies in the source data)
Data Transformation (between different data formats, languages, etc.)
Data Loading
Replication (replicating source databases into the data warehouse)
Analyzing & Checking of Data Quality (for correctness and completeness)
Building derived data & views
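The first four tasks in the list above can be sketched as a tiny pipeline. This is an illustrative sketch only, not a description of any particular ETL tool: the two source extracts, their field names, the `STATE_NAMES` lookup and the `customer_dim` target table are all assumptions.

```python
import sqlite3

# Hypothetical source extracts with inconsistent field names and formats.
source_a = [{"cust_name": " christina ", "state": "IL"}]
source_b = [
    {"CustomerName": "ROBERT", "State": "california"},
    {"CustomerName": "", "State": "TX"},  # unusable: no customer name
]

STATE_NAMES = {"IL": "Illinois", "CA": "California", "TX": "Texas"}  # assumed lookup


def extract():
    """Extraction: read both sources and map their field names to one schema."""
    for rec in source_a:
        yield {"name": rec["cust_name"], "state": rec["state"]}
    for rec in source_b:
        yield {"name": rec["CustomerName"], "state": rec["State"]}


def cleanse(records):
    """Cleansing: trim whitespace and drop records without a customer name."""
    for rec in records:
        name = rec["name"].strip()
        if name:
            yield {"name": name, "state": rec["state"].strip()}


def transform(records):
    """Transformation: bring names and states into one consistent format."""
    for rec in records:
        state = STATE_NAMES.get(rec["state"].upper(), rec["state"].title())
        yield {"name": rec["name"].title(), "state": state}


def load(conn, records):
    """Loading: insert the prepared rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS customer_dim (name TEXT, state TEXT)")
    conn.executemany("INSERT INTO customer_dim VALUES (:name, :state)", list(records))
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(cleanse(extract())))
loaded = conn.execute("SELECT name, state FROM customer_dim ORDER BY name").fetchall()
print(loaded)
```

Keeping each task as its own step mirrors the list: a step can be replaced or checked on its own without touching the others.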
DW Operational Processes (Overview of Extraction, Transformation & Loading) Contd
Elements of a Data Warehouse
DW Operational Processes (Overview of Extraction, Transformation & Loading) Contd
Loading the Warehouse
Cleaning the data before it is loaded
DW Operational Processes (Overview of Extraction, Transformation & Loading) Contd
These processes are discussed in detail in the ETL section.
Some important definitions:
Data Scrubbing: http://www.wisegeek.com/what-is-data-scrubbing.htm
Data Cleansing: http://www.wisegeek.com/what-is-data-cleansing.htm
Row level security: http://www.securityfocus.com/infocus/1743
Staging Types: http://esj.com/Columns/article.aspx?EditorialsID=55
Technical Problems in Data Warehouse
Managing large amounts of data:
The explosion of data volume came about because the data warehouse required that both detail and history be mixed in the same environment.
Large amounts of data need to be managed in many ways: through flexibility of addressability of data stored inside the processor and stored inside disk storage, through indexing, through extensions of data, through the efficient management of overflow, and so forth. To be effective, the technology used must satisfy the requirements for both volume and efficiency.
Index/Monitor Data:
If data in the warehouse cannot be easily and efficiently indexed, the data warehouse will not be a success. Monitoring data warehouse data determines such factors as the following:
If a reorganization needs to be done
If an index is poorly structured
If too much or not enough data is in overflow
The statistical composition of the access of the data
Available remaining space
Technical Problems in Data Warehouse Contd
Interfaces to many technologies:
Data passes into the data warehouse from the operational environment and the ODS, and from the data warehouse into data marts, DSS applications, exploration and data mining warehouses, and alternate storage.
This passage must be smooth and easy.
The interface to different technologies requires several considerations:
Does the data pass from one DBMS to another easily?
Does it pass from one operating system to another easily?
Does it change its basic format in passage (EBCDIC, ASCII, etc.)?
Technical Problems in Data Warehouse Contd
Meta Data Management:
The data warehouse operates under a heuristic, iterative development life cycle.
To be effective, the user of the data warehouse must have access to meta data that is accurate and up-to-date.
Several types of meta data need to be managed in the data warehouse: distributed meta data, central meta data, technical meta data, and business meta data.
Technical Problems in Data Warehouse Contd
Efficient Loading of Data
Data is loaded into a data warehouse in two fundamental ways: a record at a time through a language interface or en masse with a utility.
Indexes must be efficiently loaded at the same time the data is loaded. As the burden of the volume of loading becomes an issue, the load is often parallelized.
Another related approach to the efficient loading of very large amounts of data is staging the data prior to loading.
As a rule, large amounts of data are gathered into a buffer area before being processed by extract/transform/load (ETL) software. The staged data is merged, perhaps edited, summarized, and so forth, before it passes into the ETL layer.
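The two loading styles and the staging step described above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under stated assumptions: the table names `staging_sales` and `sales_fact` and the sample rows are hypothetical, and `executemany` merely stands in for a real bulk-load utility.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_sales (sale_id INTEGER, region TEXT, amount REAL)")
conn.execute("CREATE TABLE sales_fact (region TEXT, total_amount REAL)")

rows = [(1, "north", 100.0), (2, "north", 50.0), (3, "south", 75.0)]

# Record at a time through the language interface.
conn.execute("INSERT INTO staging_sales VALUES (?, ?, ?)", rows[0])

# En masse: the remaining rows go in as one batch (a stand-in for a load utility).
conn.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)", rows[1:])

# The staged data is summarized before it moves on to the warehouse table.
conn.execute(
    "INSERT INTO sales_fact "
    "SELECT region, SUM(amount) FROM staging_sales GROUP BY region"
)
conn.commit()

totals = dict(conn.execute("SELECT region, total_amount FROM sales_fact"))
print(totals)
```

Keeping the staging table separate from the fact table means the merge and summarization work happens away from the data that users query.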
Technical Problems in Data Warehouse Contd
Lock Management:
The lock manager ensures that two or more people are not updating the same record at the same time. In the data warehouse, however, updates are not done; instead, data is stored in a series of snapshot records. When a change occurs, a new snapshot record is added rather than an existing record being updated.
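The snapshot approach described above can be sketched as an append-only table. The sketch uses Python's built-in `sqlite3` module; the table `customer_snapshot`, its columns, and the sample values are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_snapshot "
    "(customer_id INTEGER, city TEXT, snapshot_date TEXT)"
)

def record_change(customer_id, city, snapshot_date):
    """Append a new snapshot row; existing rows are never updated."""
    conn.execute(
        "INSERT INTO customer_snapshot VALUES (?, ?, ?)",
        (customer_id, city, snapshot_date),
    )

record_change(7, "Pune", "2019-01-01")
record_change(7, "Mumbai", "2019-06-01")  # a change arrives: add a row, no UPDATE

# The full history stays available, oldest snapshot first.
history = conn.execute(
    "SELECT city FROM customer_snapshot "
    "WHERE customer_id = 7 ORDER BY snapshot_date"
).fetchall()

# The current state is simply the most recent snapshot.
current = conn.execute(
    "SELECT city FROM customer_snapshot "
    "WHERE customer_id = 7 ORDER BY snapshot_date DESC LIMIT 1"
).fetchone()[0]
print(history, current)
```

Because rows are only ever added, two writers never contend for the same record, which is why record-level update locking matters far less here than in an OLTP system.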
Steps in Building a Data Warehouse:
Identify key business drivers, sponsorship, risks and ROI
Survey information needs, identify desired functionality, and define functional requirements for the initial subject area
Design the long-term data warehousing architecture
Evaluate and finalize DW tools and technology
Conduct a proof of concept
Design the target database schema
Build data mapping, extract, transformation, cleansing and aggregation/summarization rules
Build the initial data mart using an exact subset of the enterprise data warehousing architecture, and expand to the enterprise architecture over subsequent phases
Maintain and administer the data warehouse
Representative DSS Tools
Tool Category             Products
ETL Tools                 ETI Extract, Informatica, IBM Visual Warehouse, Oracle Warehouse Builder
OLAP Server               Oracle Express Server, Hyperion Essbase, IBM DB2 OLAP Server, Microsoft SQL Server OLAP Services, Seagate HOLOS, SAS/MDDB
OLAP Tools                Oracle Express Suite, Business Objects, Web Intelligence, SAS, Cognos PowerPlay/Impromptu, KALIDO, MicroStrategy, Brio Query, MetaCube
Data Warehouse            Oracle, Informix, Teradata, DB2/UDB, Sybase, Microsoft SQL Server, Red Brick
Data Mining & Analysis    SAS Enterprise Miner, IBM Intelligent Miner, SPSS/Clementine, TCS Tools
Business Intelligence
How intelligent can you make your business processes?
What insight can you gain into your business?
How integrated can your business processes be?
How much more interactive can your business be with customers, partners, employees and managers?
What is Business Intelligence (BI)?
Business Intelligence is a generalized term applied to a broad category of applications and technologies for gathering, storing, analyzing and providing access to data to help enterprise users make better business decisions.
Business Intelligence applications include the activities of decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining.
An alternative way of describing BI is: the technology required to turn raw data into information to support decision-making within corporations and business processes.
Why BI?
BI technologies help bring decision-makers the data in a form they can quickly digest and apply to their decision making.
BI turns data into information for managers, executives and, in general, people making decisions in a company.
Companies want to use technology tactically to make their operations more effective and more efficient. Business intelligence can be the catalyst for that efficiency and effectiveness.
Benefits
The benefits of a well-planned BI implementation are going to be closely tied to the business objectives driving the project.
Identify trends and anomalies in business operations more quickly, allowing for more accurate and timelier decisions.
Deliver actionable insight and information to the right place with less effort.
Identify and operate based on a single version of the truth, allowing all analysis to be completed on a core foundation with confidence.
Business Intelligence Platform Requirements
Data Warehouse Databases
OLAP
Data Mining
Interfaces
Build and Manage Capabilities
The business intelligence platform should provide good integration across these technologies. It should be a coherent platform, not a set of diverse and heterogeneous technologies.
Business Intelligence Components
[Diagram: Operational Data; Extract, Transform, Load; Data Warehouse; OLAP; Data Mining]
Business Intelligence Architecture
Business Intelligence Technologies
[Diagram: layers listed in order of increasing potential to support business decisions]
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Data Warehouses / Data Marts
Data Exploration: OLAP, DSS, EIS, Querying and Reporting
Data Mining: Information discovery
Data Presentation: Visualization Techniques
Decision Making
Roles shown alongside the layers: End User, Business Analyst, Data Analyst, DB Admin