Data Warehousing

Data Warehousing. Definition Data Warehouse: –A subject-oriented, integrated, time-variant, non- updatable collection of data used in support of management.

Jan 03, 2016


Matilda Hawkins
Transcript

Data Warehousing


Definition

• Data Warehouse:

– A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes

– Subject-oriented: e.g. customers, patients, students, products

– Integrated: Consistent naming conventions, formats, encoding structures; from multiple data sources

– Time-variant: Contains a time dimension so that it may be used to study trends and changes

– Non-updatable: Read-only, periodically refreshed

• Data Mart:

– A data warehouse that is limited in scope


Need for Data Warehousing

• Integrated, company-wide view of high-quality information (from disparate databases)

• Separation of operational and informational (decision support) systems and data (for improved performance)


Data Warehouse Architectures

• Generic Two-Level Architecture

• Independent Data Mart

All involve some form of extraction, transformation and loading (ETL)


Figure 11-2: Generic two-level data warehousing architecture

ETL

One, company-wide warehouse

Periodic extraction: data is not completely current in the warehouse


Figure 11-3 Independent data mart data warehousing architecture

Data marts: Mini-warehouses, limited in scope

Separate ETL for each independent data mart

Data access complexity due to multiple data marts


The ETL Process

• Capture/Extract

• Scrub or data cleansing

• Transform:

– Convert data from the format of the source to the format of the data warehouse.

• Load and Index

ETL = Extract, transform, and load
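
The four steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the `source_rows` sample, the `customer` table, and the cleansing rules (drop duplicates, drop rows with an empty name) are hypothetical and do not come from the slides.

```python
import sqlite3

# Hypothetical operational records; names and values are illustrative only.
source_rows = [
    {"id": "1", "name": " alice ", "city": "boston", "joined": "01/15/2015"},
    {"id": "2", "name": "Bob", "city": "BOSTON", "joined": "03/02/2015"},
    {"id": "2", "name": "Bob", "city": "BOSTON", "joined": "03/02/2015"},  # duplicate
    {"id": "3", "name": "", "city": "Denver", "joined": "07/30/2015"},     # missing name
]

def extract(rows):
    """Capture/Extract: take a snapshot of the source data."""
    return [dict(r) for r in rows]

def scrub(rows):
    """Scrub: drop duplicates and records that fail a basic quality check."""
    seen, clean = set(), []
    for r in rows:
        if r["id"] in seen or not r["name"].strip():
            continue
        seen.add(r["id"])
        clean.append(r)
    return clean

def transform(rows):
    """Transform: convert source formats to the warehouse format."""
    out = []
    for r in rows:
        month, day, year = r["joined"].split("/")
        out.append((int(r["id"]), r["name"].strip().title(),
                    r["city"].title(), f"{year}-{month}-{day}"))
    return out

def load_and_index(conn, rows):
    """Load and Index: write the rows, then build an index for queries."""
    conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, city TEXT, joined TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_customer_city ON customer (city)")
    conn.commit()

conn = sqlite3.connect(":memory:")
load_and_index(conn, transform(scrub(extract(source_rows))))
loaded = conn.execute("SELECT * FROM customer ORDER BY id").fetchall()
print(loaded)
```

Keeping each step as its own function mirrors the slide's list, so a different cleansing rule or target format can be swapped in without touching the other steps.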


Load/Index = place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals

Update mode: only changes in source data are written to data warehouse

Figure 11-10: Steps in data reconciliation (cont.)
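
The difference between refresh mode and update mode can be shown with two small loaders. This is a rough sketch, assuming the warehouse and the source are both dictionaries keyed by a hypothetical customer id; a real loader would also need a policy for rows deleted at the source, which this sketch ignores.

```python
def refresh_load(target, source):
    """Refresh mode: discard the target and rewrite it in bulk from the source."""
    target.clear()
    target.update(source)
    return len(source)  # every source row is written

def update_load(target, source):
    """Update mode: write only rows that are new or changed in the source."""
    written = 0
    for key, row in source.items():
        if target.get(key) != row:
            target[key] = row
            written += 1
    return written

# Hypothetical data keyed by customer id.
warehouse = {1: ("Alice", "Boston"), 2: ("Bob", "Denver")}
source = {1: ("Alice", "Boston"), 2: ("Bob", "Austin"), 3: ("Cara", "Miami")}

refresh_target = dict(warehouse)
update_target = dict(warehouse)
refresh_writes = refresh_load(refresh_target, source)
update_writes = update_load(update_target, source)
print(refresh_writes, update_writes)
```

Both targets end up with the same contents; the modes differ only in how many rows had to be written to get there.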


Index

• Bitmap index

• Join index


Figure 6-8 Bitmap index organization

Bitmap saves on space requirements

Rows - possible values of the attribute

Columns - table rows

Bit indicates whether the attribute of a row has the value
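
The organization described above (one bitmap per attribute value, one bit per table row) can be sketched with Python integers used as bit strings. The `ratings` and `cities` columns are hypothetical sample data, and this shows the idea only, not how any particular DBMS stores its bitmaps.

```python
# Hypothetical table: one attribute value per table row.
ratings = ["A", "B", "A", "C", "B", "A"]
cities = ["Boston", "Denver", "Denver", "Boston", "Boston", "Denver"]

def build_bitmap_index(column):
    """Return {value: bitmap}; bit i is set when row i holds that value."""
    index = {}
    for row_number, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_number)
    return index

def rows_of(bitmap):
    """Decode a bitmap back into the list of matching row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

rating_index = build_bitmap_index(ratings)
city_index = build_bitmap_index(cities)

# A conjunctive query becomes a bitwise AND of two bitmaps.
a_in_denver = rows_of(rating_index["A"] & city_index["Denver"])
print(a_in_denver)
```

Combining conditions with bitwise operations is what makes this layout attractive for low-cardinality attributes in a read-mostly warehouse.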


Figure 6-9 Join indexes

Join indexes speed up join operations
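
One way to picture a join index is as a precomputed list of row-id pairs that satisfy the join condition. The sketch below assumes two hypothetical tables, `stores` and `customers`, joined on city, with the list position standing in for the row id.

```python
# Hypothetical tables stored as lists; the list position acts as the row id.
stores = [("S1", "Boston"), ("S2", "Denver")]
customers = [("C1", "Denver"), ("C2", "Boston"), ("C3", "Denver")]

def build_join_index(left, right, left_col, right_col):
    """Precompute (left_row_id, right_row_id) pairs that satisfy the join."""
    by_value = {}
    for rid, row in enumerate(right):
        by_value.setdefault(row[right_col], []).append(rid)
    return [(lid, rid)
            for lid, row in enumerate(left)
            for rid in by_value.get(row[left_col], [])]

# Built once, at load time.
join_index = build_join_index(stores, customers, 1, 1)

# At query time the join is a lookup of stored row-id pairs;
# no column values are compared.
joined = [(stores[l][0], customers[r][0]) for l, r in join_index]
print(joined)
```

The cost of matching rows is paid when the index is built, which suits a warehouse that is loaded periodically and queried often.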


Star Schema for Data Warehouse

• Objectives

– Ease of use for decision support applications

– Fast response to predefined user queries

– Customized data for particular target audiences

Also called “dimensional model”

• Dimension:

– A dimension is a term used to describe any category used in analyzing data, such as time, geography, and product line.


Figure 11-13 Components of a star schema

Fact tables contain factual or quantitative data

Dimension tables contain descriptions about the subjects of the business

1:N relationship between dimension tables and fact tables

Excellent for ad-hoc queries, but bad for online transaction processing

Dimension tables are denormalized to maximize performance


Figure 11-14 Star schema example

Fact table provides statistics for sales broken down by product, period and store dimensions
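
A star schema along these lines can be tried out with Python's built-in `sqlite3` module. The figure itself is not reproduced in this transcript, so every table name, column name, and data value below is an assumption chosen to match the description (a sales fact table with product, period, and store dimensions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, one row per member.
CREATE TABLE product (product_key INTEGER PRIMARY KEY, description TEXT, category TEXT);
CREATE TABLE period  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE store   (store_key   INTEGER PRIMARY KEY, city TEXT, region TEXT);

-- Fact table: one foreign key per dimension plus the quantitative measures.
CREATE TABLE sales (
    product_key  INTEGER REFERENCES product(product_key),
    period_key   INTEGER REFERENCES period(period_key),
    store_key    INTEGER REFERENCES store(store_key),
    units_sold   INTEGER,
    dollars_sold REAL
);

INSERT INTO product VALUES (1, 'Jeans', 'Apparel'), (2, 'Kettle', 'Kitchen');
INSERT INTO period  VALUES (1, 2015, 1), (2, 2015, 2);
INSERT INTO store   VALUES (1, 'Boston', 'East'), (2, 'Denver', 'West');
INSERT INTO sales VALUES
    (1, 1, 1, 10, 400.0),
    (1, 2, 1,  5, 200.0),
    (2, 1, 2,  3,  90.0),
    (2, 2, 1,  4, 120.0);
""")

# Typical star query: join the fact table to the dimensions it needs, then aggregate.
totals = conn.execute("""
    SELECT s.region, p.category, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales f
    JOIN product p ON p.product_key = f.product_key
    JOIN store   s ON s.store_key   = f.store_key
    GROUP BY s.region, p.category
    ORDER BY s.region, p.category
""").fetchall()
print(totals)
```

Each dimension row relates to many fact rows (the 1:N relationship noted earlier), and the query only joins the dimensions it actually filters or groups by.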


Figure 11-15 Star schema with sample data


On-Line Analytical Processing (OLAP) Tools

• The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques

• Relational OLAP (ROLAP)

– Traditional relational representation

• Multidimensional OLAP (MOLAP)

– Cube structure

• OLAP Operations

– Cube slicing: come up with 2-D view of data

– Drill-down: going from summary to more detailed views


Figure 11-23 Slicing a data cube


Figure 11-24 Example of drill-down

Summary report

Drill-down with color added

Starting with summary data, users can obtain details for particular cells
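
Slicing and drill-down can both be expressed as simple operations over cube cells. The sketch below assumes a hypothetical three-dimensional cube (product, region, month) holding units sold; the figures in the slides are not reproduced here, so the data is invented for illustration.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, region, month) -> units sold.
cube = {
    ("Shoes", "East", "Jan"): 10, ("Shoes", "East", "Feb"): 12,
    ("Shoes", "West", "Jan"): 7,  ("Shoes", "West", "Feb"): 9,
    ("Hats",  "East", "Jan"): 4,  ("Hats",  "East", "Feb"): 6,
    ("Hats",  "West", "Jan"): 3,  ("Hats",  "West", "Feb"): 5,
}

def slice_cube(cells, product):
    """Slicing: fix one dimension value to get a 2-D (region x month) view."""
    return {(r, m): v for (p, r, m), v in cells.items() if p == product}

def summarize(cells):
    """Summary report: total units per product."""
    totals = defaultdict(int)
    for (p, _r, _m), v in cells.items():
        totals[p] += v
    return dict(totals)

def drill_down(cells, product):
    """Drill-down: break one summary cell out by the next level, region."""
    totals = defaultdict(int)
    for (p, r, _m), v in cells.items():
        if p == product:
            totals[r] += v
    return dict(totals)

shoes_slice = slice_cube(cube, "Shoes")
summary = summarize(cube)
shoes_by_region = drill_down(cube, "Shoes")
print(summary, shoes_by_region)
```

The drill-down totals for a product add back up to its summary cell, which is what lets a user move between the two views consistently.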


Data Mining and Visualization

• Knowledge discovery using a blend of statistical, AI, and computer graphics techniques

• Goals:

– Explain observed events or conditions

– Confirm hypotheses

– Explore data for new or unexpected relationships

• Techniques

– Statistical regression

– Decision tree induction

– Clustering and signal processing

– Affinity

– Sequence association

– Case-based reasoning

– Rule discovery

– Neural nets

– Fractals

• Data visualization: representing data in graphical/multimedia formats for analysis


Pivot Table

• Excel:

– Drill Down, Roll Up

• Access CrossTab query


SQL GROUPING SETS

• GROUPING SETS

    SELECT CITY, RATING, COUNT(CID) FROM HCUSTOMERS
    GROUP BY GROUPING SETS(CITY, RATING, (CITY, RATING), ())
    ORDER BY CITY;

• Note: () indicates that an overall total is desired.
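
To see what each grouping set contributes, the query can be rewritten as one GROUP BY per set, combined with UNION ALL. The sketch below does this in SQLite through Python's `sqlite3` module; SQLite does not implement GROUPING SETS, so the UNION ALL form is a stand-in for the semantics, not the original syntax. The `HCUSTOMERS` rows are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE HCUSTOMERS (CID TEXT, CITY TEXT, RATING TEXT)")
conn.executemany("INSERT INTO HCUSTOMERS VALUES (?, ?, ?)", [
    ("C1", "Boston", "A"), ("C2", "Boston", "B"),
    ("C3", "Denver", "A"), ("C4", "Denver", "A"),
])

# One SELECT per grouping set; columns not in the set are reported as NULL.
rows = conn.execute("""
    SELECT CITY, NULL AS RATING, COUNT(CID) FROM HCUSTOMERS GROUP BY CITY   -- (CITY)
    UNION ALL
    SELECT NULL, RATING, COUNT(CID) FROM HCUSTOMERS GROUP BY RATING         -- (RATING)
    UNION ALL
    SELECT CITY, RATING, COUNT(CID) FROM HCUSTOMERS GROUP BY CITY, RATING   -- (CITY, RATING)
    UNION ALL
    SELECT NULL, NULL, COUNT(CID) FROM HCUSTOMERS                           -- ()
""").fetchall()

for row in rows:
    print(row)
```

The last SELECT has no GROUP BY at all, which is the overall total that the empty set () asks for.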


SQL CUBE

• Perform aggregations for all possible combinations of columns indicated.

    SELECT CITY, RATING, COUNT(CID) FROM HCUSTOMERS
    GROUP BY CUBE(CITY, RATING)
    ORDER BY CITY, RATING;
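
"All possible combinations" means every subset of the listed columns, so CUBE over n columns expands to 2**n grouping sets. The following sketch enumerates them in plain Python and counts rows per group; the helper names and the customer rows are hypothetical, and the row count stands in for COUNT(CID) on the assumption that CID is never NULL.

```python
from collections import Counter
from itertools import combinations

def cube_grouping_sets(columns):
    """CUBE(c1, ..., cn) expands to every subset of the columns: 2**n grouping sets."""
    return [subset
            for size in range(len(columns), -1, -1)
            for subset in combinations(columns, size)]

def count_by(rows, columns, grouping_set):
    """Row count per group; columns outside the grouping set are reported as None."""
    counts = Counter(
        tuple(row[c] if c in grouping_set else None for c in columns)
        for row in rows
    )
    return dict(counts)

# Hypothetical customer rows.
customers = [
    {"CITY": "Boston", "RATING": "A"}, {"CITY": "Boston", "RATING": "B"},
    {"CITY": "Denver", "RATING": "A"}, {"CITY": "Denver", "RATING": "A"},
]
columns = ("CITY", "RATING")

sets = cube_grouping_sets(columns)
result = {}
for grouping_set in sets:
    result.update(count_by(customers, columns, grouping_set))
print(sets)
print(result)
```

For two columns the expansion is (CITY, RATING), (CITY), (RATING), and (), the same four sets that were listed by hand in the GROUPING SETS query.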


SQL ROLLUP

• The ROLLUP extension causes cumulative subtotals to be calculated for the columns indicated. If multiple columns are indicated, subtotals are computed for each of the columns except the far-right column, and a grand total is added.

    SELECT CITY, RATING, COUNT(CID) FROM HCUSTOMERS
    GROUP BY ROLLUP(CITY, RATING)
    ORDER BY CITY, RATING;
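
Put another way, ROLLUP over n columns expands to the n + 1 prefixes of the column list, from the full list down to the empty set. A small sketch in plain Python, using hypothetical customer rows and counting rows per group in place of COUNT(CID):

```python
from collections import Counter

def rollup_grouping_sets(columns):
    """ROLLUP(c1, ..., cn) expands to the n + 1 prefixes of the column list."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]

# Hypothetical (CITY, RATING) rows.
customers = [("Boston", "A"), ("Boston", "B"), ("Denver", "A"), ("Denver", "A")]
columns = ("CITY", "RATING")
sets = rollup_grouping_sets(columns)

result = Counter()
for grouping_set in sets:
    keep = len(grouping_set)  # a prefix keeps the first `keep` columns
    for row in customers:
        # Columns to the right of the prefix are rolled up and reported as None.
        result[row[:keep] + (None,) * (len(columns) - keep)] += 1

print(sets)
print(dict(result))
```

There is no per-RATING subtotal in the result, because (RATING) alone is not a prefix of (CITY, RATING); that is the practical difference from CUBE.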