Designing a Data Warehouse Issues in DW design

Designing a Data Warehouse Issues in DW design. Data Warehouse A read-only database for decision analysis Subject Oriented Integrated Time variant Nonvolatile.

Dec 22, 2015

Transcript
Page 1:

Designing a Data Warehouse

Issues in DW design

Page 2:

Data Warehouse

A read-only database for decision analysis

Subject oriented
Integrated
Time variant
Nonvolatile

consisting of time-stamped operational and external data.
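As a rough illustration of "time variant" and "nonvolatile", the sketch below keeps time-stamped rows in an append-only list. The names load_snapshot, balance_history and the customer/balance fields are hypothetical and chosen only for this example; a real warehouse would use database tables.

```python
from datetime import date

# Hypothetical warehouse table: an append-only list of time-stamped rows.
warehouse_rows = []

def load_snapshot(as_of, customer_id, balance):
    """Append a time-stamped row; existing rows are never updated or deleted."""
    warehouse_rows.append(
        {"as_of": as_of, "customer_id": customer_id, "balance": balance}
    )

def balance_history(customer_id):
    """Return (date, balance) pairs for one customer, oldest first."""
    rows = [r for r in warehouse_rows if r["customer_id"] == customer_id]
    rows.sort(key=lambda r: r["as_of"])
    return [(r["as_of"], r["balance"]) for r in rows]

# Each load adds a new snapshot instead of overwriting the current value.
load_snapshot(date(2015, 2, 28), 42, 250.0)
load_snapshot(date(2015, 1, 31), 42, 100.0)
history = balance_history(42)
```

Because nothing is overwritten, the history of a value remains available for analysis, which an operational system holding only current values could not provide.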

Page 3:

Data Warehouse vs Operational Databases

Operational databases:
Highly tuned
Real-time data
Detailed records
Current values
Access small amounts of data in a predictable manner

Data warehouse:
Flexible access
Consistent timing
Summarized as appropriate
Historical
Accesses large amounts of data in unexpected ways
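The contrast in access patterns can be sketched with Python's built-in sqlite3 module. The orders table and its columns are assumptions made up for this illustration; the point is only the shape of the two queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, order_year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "East", 2014, 120.0),
        (2, "West", 2014, 80.0),
        (3, "East", 2015, 200.0),
        (4, "West", 2015, 50.0),
    ],
)

# Operational-style access: one detailed record, fetched by key.
one_order = conn.execute(
    "SELECT amount FROM orders WHERE order_id = ?", (3,)
).fetchone()

# Warehouse-style access: an ad hoc summary over all historical rows.
by_region_year = conn.execute(
    "SELECT region, order_year, SUM(amount) FROM orders "
    "GROUP BY region, order_year ORDER BY region, order_year"
).fetchall()
```

The first query touches a single row through the primary key; the second reads every row and summarizes it, which is the kind of work a warehouse is built to absorb.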

Page 4:

Data Warehouse Purpose

Identify problems in time to avoid them
Locate opportunities you might otherwise miss

Page 5:
Page 6:

Data Warehouse: New Approach

An old idea attracting new interest because of:

Cheap computing power
Special-purpose hardware
New data structures
Intelligent software

Page 7:

Warehousing Problems

Business Issues: Data Quantity, Data Accuracy, Maintenance, Ownership, Cost

Page 8:

Warehousing Problems

Business Issues

Database Issues: DBMS Software, Technology, Complexity

Page 9:

Warehousing Problems

Business Issues

Data Issues

Analysis Issues: User Interface, Intelligent Processing

Page 10:

Three Approaches

Classical Enterprise Database: Contains operational data from all areas of the organization.

Data Mart: Extracted and managerial support data designed for departmental or end-user computing (EUC) applications.

Data Package: Data required for a specific application.

Page 11:

Classical Warehouse

Source: Archived data
Extraction: Batch extraction programs
Data: Atomic transaction data
Tool: VLDB technology
Analysis: IT-driven software

Page 12:

Mart

Source: Deposit or external sources
Extraction: Batch summary
Data: Designed departmental database
Tool: OLAP, ROLAP, MDBMS
Analysis: IT-driven or trained user

Page 13:

Package

Source: Mart
Extraction: Sample and summary
Data: Problem-specific dataset
Tool: PC tools
Analysis: Trained user
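One way to picture how the three approaches feed each other (atomic warehouse data, a batch-summarized mart, then a sampled package) is the sketch below. The record layout and the functions build_mart and build_package are hypothetical, introduced only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical atomic transaction data held in the classical warehouse.
warehouse = [
    {"dept": "sales", "product": "A", "amount": 10.0},
    {"dept": "sales", "product": "A", "amount": 15.0},
    {"dept": "sales", "product": "B", "amount": 7.5},
    {"dept": "service", "product": "A", "amount": 3.0},
]

def build_mart(rows, dept):
    """Batch summary: total amount per product for one department."""
    totals = defaultdict(float)
    for row in rows:
        if row["dept"] == dept:
            totals[row["product"]] += row["amount"]
    return dict(totals)

def build_package(mart, sample_size, seed=0):
    """Sample and summary: a small, problem-specific subset of the mart."""
    rng = random.Random(seed)
    products = rng.sample(sorted(mart), k=min(sample_size, len(mart)))
    return {product: mart[product] for product in products}

sales_mart = build_mart(warehouse, "sales")
package = build_package(sales_mart, sample_size=1)
```

Each stage is smaller and more specialized than the one before it, which matches the shift in tools from VLDB technology to PC tools.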

Page 14:

Three Fundamental Processes

Data Acquisition
Data Storage
Data Access

Page 15:
Page 16:

Data Acquisition

Handles acquisition of data from legacy systems and outside sources.

Data is identified, copied, formatted and prepared for loading into the warehouse.

Page 17:

Acquisition steps

Catalog the data: develop an inventory of where it is and what it means.
Clean and prepare the data: extract from legacy files and reformat to make it usable.
Transport data from one location to another.
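The three acquisition steps can be sketched in a few lines of Python. The legacy extract, the CATALOG dictionary and the functions clean, acquire and transport are assumptions invented for this example; real acquisition tooling is far more involved.

```python
import csv
import io

# Hypothetical legacy extract with inconsistent formatting.
LEGACY_EXTRACT = "cust_id,name,balance\n001, alice ,100.50\n002,BOB,\n003,Carol,75\n"

# Step 1: catalog the data (where it is and what it means).
CATALOG = {
    "cust_id": "customer identifier, zero-padded string in the legacy file",
    "name": "customer name, free text",
    "balance": "account balance, may be blank",
}

def clean(record):
    """Step 2: reformat one legacy record into a usable row."""
    balance = record["balance"].strip()
    return {
        "cust_id": int(record["cust_id"]),
        "name": record["name"].strip().title(),
        "balance": float(balance) if balance else None,
    }

def acquire(raw_text):
    """Extract rows from the legacy text and clean each one."""
    reader = csv.DictReader(io.StringIO(raw_text))
    if set(reader.fieldnames) != set(CATALOG):
        raise ValueError("extract contains fields missing from the catalog")
    return [clean(record) for record in reader]

def transport(rows, staging_area):
    """Step 3: move prepared rows to the (here, in-memory) staging area."""
    staging_area.extend(rows)
    return len(rows)

staging = []
moved = transport(acquire(LEGACY_EXTRACT), staging)
```

Checking the extract against the catalog before cleaning is a design choice in this sketch: it makes an undocumented field fail loudly instead of slipping into the warehouse.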

Page 18:

Storage

The storage component holds the data so that the many different data mining, executive information and decision support systems can make use of it effectively.

Page 19:

The Storage Area

Managed by relational databases like those from Oracle Corp. or Informix Software Inc.

Specialized hardware: symmetric multiprocessor (SMP) or massively parallel processor (MPP) machines

Page 20:

Storage

The majority of warehouse storage today is being managed by relational databases running on Unix platforms.

Oracle, Sybase Inc., IBM Corp. and Informix control 65 percent of the warehouse storage market (Meta Group Inc., 1996).

Page 21:

Access

Different end-user PCs and workstations draw data from the warehouse with the help of multidimensional analysis products, neural networks, data discovery tools or analysis tools.

These powerful, "smart" software products are the real driving force behind the viability of data warehousing.

Page 22:

Access Tools

Intelligent Agents and Agencies
Query Facilities and Managed Query Environments
Statistical Analysis
Data Discovery (decision support, artificial intelligence and expert systems)
OLAP
Data Visualization

Page 23:

Hardware Budget

A typical startup warehouse project allocates more than 60 percent of its hardware and software budget to the creation of a powerful storage component, spending just 30 percent on data mining and user access technologies.

Page 24:

Systems Analysis Budget

Budgeting for systems analysis and development, however, follows a very different pattern.

More than 50 percent of development dollars are spent on building acquisition capabilities,

30 percent fund the development of user solutions and

20 percent are dedicated to the creation of databases in the storage component.

Page 25:
Page 26:

Design Issues

Relational and Multidimensional Models

Denormalized and indexed relational models are more flexible.

Multidimensional models are simpler to use and more efficient.

Page 27:

Star Schemas in an RDBMS

In most companies doing ROLAP, the DBAs have created countless indexes and summary tables in order to avoid I/O-intensive table scans against large fact tables. As the indexes and summary tables proliferate in order to optimize performance for the known queries and aggregations that the users perform, the build times and disk space needed to create them have grown enormously, often requiring more time than is allotted and more space than the original data!
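A minimal sketch of one such summary table, using Python's built-in sqlite3 module. The sales_fact and sales_by_month tables are hypothetical names chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (product_id INTEGER, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [
        (1, "2015-01", 10.0),
        (1, "2015-01", 5.0),
        (1, "2015-02", 8.0),
        (2, "2015-01", 20.0),
    ],
)

# Precompute one known aggregation so that queries for monthly totals
# read the small summary table instead of scanning the fact table.
conn.execute(
    "CREATE TABLE sales_by_month AS "
    "SELECT month, SUM(amount) AS total, COUNT(*) AS n_rows "
    "FROM sales_fact GROUP BY month"
)

fact_rows = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
summary = conn.execute(
    "SELECT month, total FROM sales_by_month ORDER BY month"
).fetchall()
```

The summary answers only the question it was built for. A total by product, or by product and month, would need yet another table, and every such table has to be rebuilt when the fact table is reloaded, which is how build time and disk space grow.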

Page 28:

Building a Data Warehouse from a Normalized Database

The steps:

Develop a normalized entity-relationship business model of the data warehouse.
Translate this into a dimensional model. This step reflects the information and analytical characteristics of the data warehouse.
Translate this into the physical model. This reflects the changes necessary to reach the stated performance objectives.

Page 29:

The Business Model

Identify the data structure, attributes and constraints for the client’s data warehousing environment.

Stable
Optimized for update
Flexible

Page 30:

Business Model

As always in life, there are some disadvantages to 3NF:

Performance can be truly awful. Most of the work that is performed on denormalizing a data model is an attempt to reach performance objectives.

The structure can be overwhelmingly complex. We may wind up creating many small relations which the user might think of as a single relation or group of data.

Page 31:

Structural Dimensions

The first step is the development of the structural dimensions. This step corresponds very closely to what we normally do in a relational database.

The star architecture that we will develop here depends upon taking the central intersection entities as the fact tables and building the foreign key => primary key relations as dimensions.
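As a small sketch of that idea, assume a normalized model in which a sale is the intersection entity between products and stores. The table and column names below are hypothetical; the example uses Python's built-in sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE store_dim   (store_id   INTEGER PRIMARY KEY, city TEXT);

-- The intersection entity (a sale of a product at a store) becomes the
-- fact table; each foreign key points at the primary key of a dimension.
CREATE TABLE sales_fact (
    product_id INTEGER REFERENCES product_dim(product_id),
    store_id   INTEGER REFERENCES store_dim(store_id),
    units      INTEGER
);

INSERT INTO product_dim VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO store_dim   VALUES (10, 'Austin'), (20, 'Boston');
INSERT INTO sales_fact  VALUES (1, 10, 3), (1, 20, 4), (2, 10, 5);
""")

# Analysis joins the fact table to whichever dimension the question needs.
units_by_city = conn.execute(
    "SELECT s.city, SUM(f.units) FROM sales_fact f "
    "JOIN store_dim s ON s.store_id = f.store_id "
    "GROUP BY s.city ORDER BY s.city"
).fetchall()
```

The fact table holds only keys and measures, while descriptive attributes such as city live in the dimensions, which gives the star its shape.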

Page 32:

Simple DW pattern.

Page 33:

Other Dimensions

Categorical dimensions: generated groups (additional key components)
Partitioning dimensions: subtypes (planned vs. actual)
Informational dimensions: generate different types of data (messy).
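A partitioning dimension such as planned vs. actual might be modeled as in the sketch below. The scenario_dim and revenue_fact tables are hypothetical names, and this is only one possible reading of the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scenario_dim (scenario_id INTEGER PRIMARY KEY, scenario TEXT);
CREATE TABLE revenue_fact (month TEXT, scenario_id INTEGER, amount REAL);

INSERT INTO scenario_dim VALUES (1, 'planned'), (2, 'actual');
INSERT INTO revenue_fact VALUES
    ('2015-01', 1, 100.0), ('2015-01', 2, 90.0),
    ('2015-02', 1, 100.0), ('2015-02', 2, 120.0);
""")

# The scenario key partitions the same measure into planned and actual
# subtypes, so the two can be compared month by month.
variance = conn.execute("""
    SELECT f.month,
           SUM(CASE WHEN d.scenario = 'actual'  THEN f.amount ELSE 0.0 END)
         - SUM(CASE WHEN d.scenario = 'planned' THEN f.amount ELSE 0.0 END)
    FROM revenue_fact f
    JOIN scenario_dim d ON d.scenario_id = f.scenario_id
    GROUP BY f.month
    ORDER BY f.month
""").fetchall()
```

Keeping both subtypes in one fact table, separated by a dimension key, means the comparison needs no second table with a duplicate structure.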