Top Banner
1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data Warehousing Presented by Joseph M. Wilson EPA
34

1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

Mar 31, 2015

Download

Documents

Clifton Brace
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

1Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

An Introduction to Data Warehousing

Presented by

Joseph M. WilsonEPA

Page 2: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

2Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

In the Beginning, life was simple…

Page 3: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

3Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

But…

Page 4: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

4Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Our information needs…

Page 5: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

5Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Kept growing. (The Spider web)

SOURCE: William H. Inmon

Page 6: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

6Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Purpose

To explore and discuss the purpose and principles of data warehousing.

Page 7: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

7Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Briefing Contents

Data Warehouse Concepts

Building a Data Warehouse

STORET Warehouse Example

Page 8: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

8Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

So What Is a Data Warehouse?

Definition: A data warehouse is the data repository of an enterprise. It is generally used for research and decision support.

By comparison: an OLTP (on-line transaction processor) or operational system is used to deal with the everyday running of one aspect of an enterprise.

OLTP systems are usually designed independently of each other and it is difficult for them to share information.

Page 9: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

9Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Why Do We Need Data Warehouses?

Consolidation of information resources Improved query performance Separate research and decision support functions

from the operational systems Foundation for data mining, data visualization,

advanced reporting and OLAP tools

Page 10: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

10Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

What Is a Data Warehouse Used for?

Knowledge discovery Making consolidated reports Finding relationships and correlations Data mining Examples

Banks identifying credit risks Insurance companies searching for fraud Medical research

Page 11: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

11Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Goals Structure Size Performance optimization Technologies used

How Do Data Warehouses Differ From Operational Systems?

Page 12: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

12Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Comparison Chart of Database Types

Data warehouse Operational systemSubject oriented Transaction oriented

Large (hundreds of GB up to several TB)

Small (MB up to several GB)

Historic data Current data

De-normalized table structure (few tables, many columns per table)

Normalized table structure (many tables, few columns per table)

Batch updates Continuous updates

Usually very complex queries Simple to complex queries

Page 13: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

13Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Design Differences

Star Schema

Data WarehouseOperational System

ER Diagram

Page 14: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

14Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Supporting a Complete Solution

Operational System-Data Entry

Data Warehouse-Data Retrieval

Page 15: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

15Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Data Warehouses, Data Marts, and Operational Data Stores

Data Warehouse – The queryable source of data in the enterprise. It is comprised of the union of all of its constituent data marts.

Data Mart – A logical subset of the complete data warehouse. Often viewed as a restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group.

Operational Data Store (ODS) – A point of integration for operational systems that developed independent of each other. Since an ODS supports day to day operations, it needs to be continually updated.

SOURCE: Ralph Kimball

Page 16: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

16Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Briefing Contents

Data Warehouse Concepts

Building a Data Warehouse

STORET Warehouse Example

Page 17: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

17Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Building a Data Warehouse

Analysis Design Import data Install front-end tools Test and deploy

Data Warehouse Lifecycle

Page 18: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

18Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Stage 1: Analysis

Identify: Target Questions Data needs Timeliness of data Granularity

Create an enterprise-level data dictionary Dimensional analysis

Identify facts and dimensions

Analysis

– Design

– Import data

– Install front-end tools

– Test and deploy

Page 19: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

19Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Stage 2: Design

Star schema Data Transformation Aggregates Pre-calculated Values HW/SW Architecture

– Analysis

Design

– Import data

– Install front-end tools

– Test and deploy

Dimensional Modeling

Page 20: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

20Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Dimensional Modeling

Fact Table – The primary table in a dimensional model that is meant to contain measurements of the business.

Dimension Table – One of a set of companion tables to a fact table. Most dimension tables contain many textual attributes that are the basis for constraining and grouping within data warehouse queries.

SOURCE: Ralph Kimball

Page 21: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

21Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Stage 3: Import Data

Identify data sources Extract the needed data from

existing systems to a data staging area

Transform and Clean the data Resolve data type conflicts Resolve naming and key conflicts Remove, correct, or flag bad data Conform Dimensions

Load the data into the warehouse

– Analysis

– Design

Import data

– Install front-end tools

– Test and deploy

Page 22: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

22Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Importing Data Into the Warehouse

OLTP 1

OLTP 2

OLTP 3

Data Staging Area DataWarehouse

Operational Systems(source systems)

Page 23: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

23Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Stage 4: Install Front-end Tools

Reporting tools Data mining tools GIS Etc.

– Analysis

– Design

– Import data

Install front-end tools

– Test and deploy

Page 24: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

24Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Stage 5: Test and Deploy

Usability tests Software installation User training Performance tweaking based on usage

– Analysis

– Design

– Import data

– Install front-end tools

Test and deploy

Page 25: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

25Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Special Concerns

Time and expense Managing the complexity Update procedures and maintenance Changes to source systems over time Changes to data needs over time

Page 26: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

26Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Briefing Contents

Data Warehouse Concepts

Building a Data Warehouse

STORET Warehouse Example

Page 27: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

27Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Goals of the STORET Central Warehouse

Improved performance and faster data retrieval Ability to produce larger reports Ability to provide more data query options Streamlined application navigation

Page 28: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

28Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Old Web Application Flow

Page 29: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

29Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Central Warehouse Application Flow

Search Criteria Selection

Report Size Feedback/Report Customization

Report Generation

Page 30: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

30Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

http://epa.gov/storet/dw_home.html

STORET Central Warehouse:

Web Application Demo

Page 31: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

31Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

STORET Central Warehouse – Potential Future Enhancements

More query functionality Additional report types Web Services Additional source systems?

STORET

StateSystem A

StateSystem B

Page 32: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

32Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Data Warehouse Components

Data

Data Clean-up andProcessing

Data Mart #1:

Data Mart #2

Data Mart #3

End User Applications

Report Writers

Ad Hoc Query Tools

Data Mining

feed

feed

feed

feed

Populate,replicate,recover

Populate,replicate,recover

Populate,replicate,recover

Data

Data

extract

extract

extract

Conformed dimensionsConformed facts

Conformed dimensionsConformed facts

Source Systems(Legacy)

Data Staging Area“The Data Warehouse”Presentation Servers

End UserData Access

Upload model resultsUpload cleaned dimensions

SOURCE: Ralph Kimball

Page 33: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

33Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Data Warehouse Components – Detailed

SOURCE: Ralph Kimball

Data

Storage:flat file (fastest);RDBMS;other

Processing:clean;prune;combine;remove duplicates;household;standardize;conform dimensions;store awaiting replication;archive;export to data marts

No user query services

Data Mart #1:OLAP (ROLAP and/or MOLAP) query services;dimensional;subject oriented;locally implemented;user group driven;may store atomic data;may be frequentlyrefreshed;conforms to DW Bus

Data Mart #2

Data Mart #3

End User Applications

Report Writers

Ad Hoc Query Tools

Modelsforecasting;scoring;allocating;data mining;other downstream systems;other parameters;special UI

feed

feed

feed

feed

Populate,replicate,recover

Populate,replicate,recover

Populate,replicate,recover

Data

Data

extract

extract

extract

Conformed dimensionsConformed facts

Conformed dimensionsConformed facts

Source Systems(Legacy)

Data Staging Area“The Data Warehouse”Presentation Servers

End UserData Access

Upload model resultsUpload cleaned dimensions

Page 34: 1 Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation. An Introduction to Data.

34Use or disclosure of data contained on this sheet is subject to the restriction on the title page of this proposal or quotation.

Briefing Contents

Data Warehouse Concepts

Building a Data Warehouse

STORET Warehouse Example