Top Banner
1 se Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing Data Warehousing “An Introduction” “An Introduction” Dr. Akhtar Ali School of Computing, Engineering and Information Sciences
22

1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

Dec 25, 2015

Download

Documents

Stanley Knight
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

1Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Data WarehousingData Warehousing“An Introduction”“An Introduction”Data WarehousingData Warehousing“An Introduction”“An Introduction”

Dr. Akhtar Ali

School of Computing, Engineering and Information Sciences

Page 2: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

2Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Lecture OutlineLecture OutlineLecture OutlineLecture Outline

New Trends for data/information management Background Two Approaches

Data Warehousing (DW) Definitions and History

DW Architectures Strategies for building data warehouses

Problems and Issues Maintenance and Performance

DW Support in database management systems

Page 3: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

3Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

1: New Trends for 1: New Trends for data/information managementdata/information management

1: New Trends for 1: New Trends for data/information managementdata/information management

Secondary storage is becoming more and more affordable. So enterprises keep more and more data Data replication is becoming widespread to avoid single

point of failure What to do with large volumes of data ?

Decision makers want to get more of data Decision support systems (DSSs)

» Have long execution time» Are CPU-intensive» Involve Statistical Analysis/Analytical queries

Transaction-oriented databases are not suitable for DSSs. Transactional data usually change rapidly Database and application servers are already at peak

loads Transactional data is usually normalized while DSSs

require summarised and highly aggregated data – and possibly de-normalized

Page 4: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

4Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Data ManagementData ManagementPast, Present and FuturePast, Present and Future

Data ManagementData ManagementPast, Present and FuturePast, Present and Future

Past File Processing (e.g. COBOL) Network and Hierarchical Databases

Present Relational, Object-Relational and Object-Oriented

Databases Fragmentation of Information Systems

» Subject/User/Application-Driven Transaction Processing Systems

» Stand-alone systems e.g. Manufacturing (Inventory Control) Finance (Payroll, Stock Management) Sales Administration (Planning, Suppliers, Daily Sales)

Future Integration of Data and Applications Data Exchange, Interoperability and Homogeneity in the

presence of Heterogeneity.

Page 5: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

5Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Surviving in the Information Surviving in the Information JungleJungle

Surviving in the Information Surviving in the Information JungleJungle

Different interfaces and protocols Different data models and representations Duplicate and Inconsistent Information

Page 6: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

6Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

SolutionSolutionIntegrated Information StoreIntegrated Information Store

SolutionSolutionIntegrated Information StoreIntegrated Information Store

Integration Systems Collect and combine information from multiple sources Provide integrated view and uniform user interface Support sharing of data and processing capabilities

Page 7: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

7Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Two ApproachesTwo Approaches1: On-Demand/Query-Driven1: On-Demand/Query-Driven

Two ApproachesTwo Approaches1: On-Demand/Query-Driven1: On-Demand/Query-Driven

On-Demand (Lazy) Data Integration is a kind of Virtual Data Warehouse

Page 8: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

8Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Disadvantages of On-Demand Disadvantages of On-Demand ApproachApproach

Disadvantages of On-Demand Disadvantages of On-Demand ApproachApproach

Poor response time due to delay in query processing Slow or unavailable data sources Time consuming and complex filtering and

integration Inefficient and potentially expensive for frequent

queries Wrappers compete on resources with local applications

at data sources There are only few notable systems based on this

approach e.g. TAMBIS: Transparent Access to Multiple Bio-informatics

Information Systems SRS: Sequence Retrieval System OPM (Object Protocol Model) based multi-database tools

and query language (OPM-QL)

Page 9: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

9Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Two ApproachesTwo Approaches2: Data Warehousing2: Data Warehousing

Two ApproachesTwo Approaches2: Data Warehousing2: Data Warehousing

In advance/ Eager data integration

Integrated data is persistently stored in a database – data warehouse for direct querying and analysis

Page 10: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

10Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Advantages of Data Advantages of Data Warehousing ApproachWarehousing Approach

Advantages of Data Advantages of Data Warehousing ApproachWarehousing Approach

High performance query processing Though the information returned may not be most up-to-

date Does not interfere with local data processing at sources

Analytical Querying/Statistical Analysis or On-Line Analytical Processing (OLAP) at warehouse

On-Line Transaction Processing (OLTP) at data sources Data Persistently Stored at Warehouse

Data at the warehouse can be further re-structured, aggregated, summarized and modified if necessary.

A DW may store historical/archive data. Data warehousing approach has been widely used e.g.

The Maryland ADMS Project Supporting Data Integration and Warehousing Using H2O The Stanford Data Warehousing Project GIMS: Genome Information Management System Marks & Spencer Data Warehouse

Page 11: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

11Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Trade-off between Query-Trade-off between Query-Driven and Driven and

Data Warehousing ApproachesData Warehousing Approaches

Trade-off between Query-Trade-off between Query-Driven and Driven and

Data Warehousing ApproachesData Warehousing Approaches Query-driven approach is still better for:

Rapidly changing information/data sources; Accessing very large amounts of data from many

sources; Clients with unpredictable and dynamic requirements

Data Warehousing is more suitable when: Data sources on which a data warehouse is based are

not frequently changing; Data up-to-dateness is not crucially important; Querying and Analysis is complex; Data needs to be highly summarized and aggregated; Fast access to integrated and derived data is vital; and Keeping data warehouse consistent with the underlying

data sources is efficient and does not compromise on expected performance.

Page 12: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

12Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

What is a Data Warehouse?What is a Data Warehouse?(a practitioner’s viewpoint)(a practitioner’s viewpoint)

What is a Data Warehouse?What is a Data Warehouse?(a practitioner’s viewpoint)(a practitioner’s viewpoint)

“A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context” – Barry Devlin, IBM Consultant

“A data warehouse is a database of data gathered from many systems and intended to support management reporting and decision making” – Michael Corey et al, CTO of OneWarranty.com

Page 13: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

13Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

SubjectOriented

Integrated

Time VariantNon Volatile

DataWarehouse

What is a Data Warehouse?What is a Data Warehouse?(classical viewpoint)(classical viewpoint)

What is a Data Warehouse?What is a Data Warehouse?(classical viewpoint)(classical viewpoint)

According to W. H. Inmon (Building a Data Warehouse, 1992)

“A DW is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.”

Page 14: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

14Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

In a Nutshell, a DW isIn a Nutshell, a DW isIn a Nutshell, a DW isIn a Nutshell, a DW is

A persistent collection of diverse data Generally speaking, an efficient solution to data

integration A single repository of information

Subject-Oriented Organized by subject (not by application) Used for analysis, reporting, data mining, etc.

Structured and optimized differently from transaction-oriented databases

User interface aimed at executive – decision makers

Page 15: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

15Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Data Warehouse HistoryData Warehouse HistoryData Warehouse HistoryData Warehouse History

Page 16: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

16Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Standard DB v. DWStandard DB v. DWStandard DB v. DWStandard DB v. DW

Standard Database Mix of updates and querying

Many small-medium transactions

MBs to GBs in size Most Current snapshot

Heavily indexed

Raw Data Thousands of users (e.g.

clerical to mid-level-mangers)

Data Warehouse Mostly reads (infrequent

updates, append-only – very rarely data is deleted)

Queries are complex and long-running

GBs to TBs in size Not the most current

snapshot/Historical Lots of scans (as data is

readily accessible) Summarized/Aggregated Hundreds of users (e.g.

decision-makers, analysts)

Page 17: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

17Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Architectures (I)Architectures (I)SimpleSimple

Architectures (I)Architectures (I)SimpleSimple

Metadata and raw data of a traditional OLTP system is present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like December sales.

Page 18: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

18Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Architectures (II)Architectures (II)With Staging AreaWith Staging AreaArchitectures (II)Architectures (II)With Staging AreaWith Staging Area

We need to clean and process operational data before putting it into the warehouse. We can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management.

Page 19: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

19Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Architectures (III)Architectures (III)With Staging Area + Data With Staging Area + Data

MartsMarts

Architectures (III)Architectures (III)With Staging Area + Data With Staging Area + Data

MartsMarts

This is a customized warehouse architecture for different groups within an organization. By adding data marts, which are systems designed for a particular line of business, we can build a more customized DW.

Page 20: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

20Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Problems and IssuesProblems and IssuesProblems and IssuesProblems and Issues

Warehouse Maintenance Data sources (DSs) on which a DW is based may change

over time. Changes at DSs may require changes at a DW. How often to propagate changes to a DW?

» At night, weekly/fortnightly/monthly, immediately, etc. How to propagate changes to a DW?

» Completely re-build all affected tables at the DW (easy but inefficient)

» Apply changes to affected tables incrementally (efficient but difficult)

Performance How to assess if a DW is performing well? How to improve performance?

Miscellaneous Issues Data Quality Assurance (How good is data in a DW?) How to cope with data warehouse evolution?

Page 21: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

21Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

Data Systems Supporting DWData Systems Supporting DWData Systems Supporting DWData Systems Supporting DW

Oracle 8i, 9i IBM DB2 Sybase RedBrick Data Warehouse/Informix MS SQL Server Tandem (HP) Teradata MicroStrategy

Page 22: 1 Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing Data Warehousing “An Introduction” Dr. Akhtar Ali School of Computing,

22Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing

BibliographyBibliographyBibliographyBibliography

Advanced Topics in Database Systems by Sharma Chakravarthy, 2001, University of Texas at Arlington, USA.

Oracle9i Data Warehousing Guide Release 2 (9.2), 2002.

Oracle 8i Data Warehousing by Michael Corey, Michael Abbey, Ian Abramson, Ben Taub, Oracle Press, 2001.