Page 1: Datawarehouse and OLAP
Page 2: Datawarehouse and OLAP

What is a Data Warehouse?

A data warehouse is a subject-oriented, integrated, time-variant (historical perspective), nonvolatile collection of data in support of management’s decision-making process (William H. Inmon)

A semantically consistent data store that serves as a physical implementation of a decision support data model

Page 3: Datawarehouse and OLAP

Data Warehouse: Subject-Oriented

Organized around major subjects, such as customer, product, sales

Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Page 4: Datawarehouse and OLAP

Data Warehouse: Integrated

Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records

Data cleaning and data integration techniques are applied

Ensures consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources. E.g., hotel price: currency, tax, breakfast covered, etc.

When data is moved to the warehouse, it is converted.
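The hotel price example can be sketched in Python. This is only an illustration under stated assumptions: the two source record layouts, the field names, and the exchange rates are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass

# Hypothetical exchange rates to USD; a real warehouse would use a reference table.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10, "CAD": 0.75}


@dataclass
class HotelPrice:
    """Warehouse representation: price always in USD with tax included."""
    hotel: str
    price_usd: float
    breakfast_included: bool


def from_source_a(rec: dict) -> HotelPrice:
    # Assumed layout of source A: gross price, currency code, 'Y'/'N' breakfast flag.
    return HotelPrice(
        hotel=rec["name"].strip().title(),
        price_usd=round(rec["price"] * RATES_TO_USD[rec["cur"]], 2),
        breakfast_included=rec["bkfst"] == "Y",
    )


def from_source_b(rec: dict) -> HotelPrice:
    # Assumed layout of source B: net price plus a separate tax rate, 0/1 breakfast flag.
    gross = rec["net_price"] * (1 + rec["tax_rate"])
    return HotelPrice(
        hotel=rec["hotel_name"].strip().title(),
        price_usd=round(gross * RATES_TO_USD[rec["currency"]], 2),
        breakfast_included=bool(rec["breakfast"]),
    )


a = from_source_a({"name": "grand plaza ", "price": 100.0, "cur": "EUR", "bkfst": "Y"})
b = from_source_b({"hotel_name": "Grand Plaza", "net_price": 100.0, "tax_rate": 0.10,
                   "currency": "USD", "breakfast": 0})
print(a)
print(b)
```

Both records end up with the same naming convention, currency, and tax treatment, which is the point of the conversion step.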

Page 5: Datawarehouse and OLAP

Data Warehouse: Time-Variant

The time horizon for the data warehouse is significantly longer than that of operational systems

Operational database: current value data

Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouse contains an element of time, explicitly or implicitly

But the key of operational data may or may not contain a “time element”

Page 6: Datawarehouse and OLAP

Data Warehouse: Nonvolatile

A physically separate store of data transformed from the operational environment

Operational update of data does not occur in the data warehouse environment

Does not require transaction processing, recovery, and concurrency control mechanisms

Requires only two operations in data accessing: initial loading of data and access of data

Page 7: Datawarehouse and OLAP

Differences between Operational Database Systems and Data Warehouses

Online operational database systems:
Perform online transaction and query processing
Are called online transaction processing (OLTP) systems
Cover the day-to-day operations of an organization (purchasing, inventory, banking, payroll, etc.)

Data warehouses:
Serve users or knowledge workers in the role of data analysis and decision making
Are called online analytical processing (OLAP) systems

Page 8: Datawarehouse and OLAP

Differences between OLTP and OLAP

Users and system orientation:
OLTP: customer-oriented
OLAP: market-oriented

Data contents:
OLTP: manages current data that are typically too detailed to be easily used for decision making
OLAP: manages large amounts of historical data and provides facilities for summarization

Database design:
OLTP: entity-relationship (ER) data model and an application-oriented database design
OLAP: star or snowflake model and a subject-oriented database design

Page 9: Datawarehouse and OLAP

Differences between OLTP and OLAP (contd.)

View:
OLTP: manages current data within an enterprise or department, without referring to historical data or data in different organizations
OLAP: deals with information from different organizations, integrating information from many data stores

Access patterns:
OLTP: mainly short, atomic transactions
OLAP: complex queries

Page 10: Datawarehouse and OLAP

OLTP vs. OLAP

Feature | OLTP | OLAP
users | clerk, IT professional | knowledge worker
function | day-to-day operations | decision support
DB design | application-oriented | subject-oriented
data | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage | repetitive | ad-hoc
access | read/write, index/hash on prim. key | lots of scans
unit of work | short, simple transaction | complex query
# records accessed | tens | millions
# users | thousands | hundreds
DB size | 100MB-GB | 100GB-TB
metric | transaction throughput | query throughput, response

Page 11: Datawarehouse and OLAP

Why a Separate Data Warehouse?

High performance for both systems:
DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery)
Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation)

Different functions and different data:
Missing data: decision support requires historical data, which operational DBs do not typically maintain
Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled

Page 12: Datawarehouse and OLAP

Data Warehouse: A Multi-Tiered Architecture

Bottom tier: warehouse database server
Back-end tools and utilities are used to feed data into the bottom tier from other databases
Contains the metadata repository
Data are extracted using application program interfaces known as gateways, e.g., ODBC, OLE DB, JDBC

Middle tier: OLAP server, implemented using either
1) relational OLAP (ROLAP), or
2) multidimensional OLAP (MOLAP)

Top tier: front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools

Page 13: Datawarehouse and OLAP

Three tier data warehousing architecture

Page 14: Datawarehouse and OLAP

Data Warehouse: A Multi-Tiered Architecture

[Figure: Data Sources (operational DBs, other sources) are extracted, transformed, loaded, and refreshed into Data Storage (data warehouse, data marts, metadata, monitor & integrator). Data Storage serves the OLAP Engine (OLAP server), which feeds the Front-End Tools (analysis, query, reports, data mining).]

Page 15: Datawarehouse and OLAP

Three Data Warehouse Models (Architecture Point of View)

Enterprise warehouse:
Collects all of the information about subjects spanning the entire organization (either detailed or summarized data, in gigabytes, terabytes, or beyond)
Implemented on mainframes, superservers, or parallel architecture platforms

Data mart:
A subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart (customer, item, sales)
Implemented on low-cost departmental servers that are Unix/Linux or Windows based
Independent vs. dependent (sourced directly from the warehouse) data mart

Virtual warehouse:
A set of views over operational databases
For efficient query processing, only some of the possible summary views may be materialized

Page 16: Datawarehouse and OLAP

Extraction, Transformation, and Loading (ETL)

Back-end tools and utilities include the following functions:

Data extraction: get data from multiple, heterogeneous, and external sources

Data cleaning: detect errors in the data and rectify them when possible

Data transformation: convert data from legacy or host format to warehouse format

Load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions

Refresh: propagate the updates from the data sources to the warehouse
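The five functions can be sketched end to end with Python's built-in `sqlite3` module. This is a minimal illustration, not a production ETL tool: the source rows, the `sales` table, and the cleaning rule (drop rows with a missing amount) are assumptions made for the example.

```python
import sqlite3

# Hypothetical source rows; the second one has a missing amount.
source_rows = [
    {"id": "1", "item": " tv ", "amount": "499.90", "sold_on": "2024-01-15"},
    {"id": "2", "item": "VCR", "amount": "", "sold_on": "2024-01-16"},
    {"id": "3", "item": "pc", "amount": "1200", "sold_on": "2024-02-01"},
]


def extract():
    # Extraction: a real pipeline would read from several heterogeneous sources.
    return list(source_rows)


def clean(rows):
    # Cleaning: detect rows with a missing amount and drop them (assumed rule).
    return [r for r in rows if r["amount"].strip()]


def transform(rows):
    # Transformation: convert source strings to the warehouse types and format.
    return [(int(r["id"]), r["item"].strip().upper(), float(r["amount"]), r["sold_on"])
            for r in rows]


def load(conn, rows):
    # Load: create the table, insert rows, and build an index.
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(id INTEGER PRIMARY KEY, item TEXT, amount REAL, sold_on TEXT)")
    # INSERT OR REPLACE lets the same pipeline be re-run for a refresh.
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_item ON sales (item)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract())))
loaded = conn.execute("SELECT id, item, amount FROM sales ORDER BY id").fetchall()
print(loaded)

# Refresh: a new row appears at the source and is propagated by re-running the pipeline.
source_rows.append({"id": "4", "item": "tv", "amount": "450", "sold_on": "2024-02-03"})
load(conn, transform(clean(extract())))
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)
```

Re-running the whole pipeline is the simplest possible refresh; real tools usually propagate only the changed rows.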

Page 17: Datawarehouse and OLAP

Metadata Repository

Meta data is the data defining warehouse objects. It stores:

Description of the structure of the data warehouse: schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents

Operational meta-data: data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)

The algorithms used for summarization

The mapping from the operational environment to the data warehouse

Data related to system performance: warehouse schema, view and derived data definitions

Business data: business terms and definitions, ownership of data, charging policies

Page 18: Datawarehouse and OLAP

From Tables and Spreadsheets to Data Cubes

A data warehouse is based on a multidimensional data model, which views data in the form of a data cube defined by dimensions and facts

A data cube allows data to be modeled and viewed in multiple dimensions. Example: sales with dimensions time, branch, item, and location

Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)

Fact table: contains numeric measures (such as dollars_sold) and keys to each of the related dimension tables

In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

Page 19: Datawarehouse and OLAP

Cube: A Lattice of Cuboids

0-D (apex) cuboid: all

1-D cuboids: time; item; location; supplier

2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier

3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier

4-D (base) cuboid: time, item, location, supplier
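The lattice for the four dimensions time, item, location, and supplier can be enumerated as every subset of the dimension list. The sketch below assumes that no dimension has a concept hierarchy, in which case n dimensions give 2^n cuboids.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# Each subset of k dimensions is one k-D cuboid; the empty subset is the apex ("all").
lattice = {
    k: list(combinations(dimensions, k))
    for k in range(len(dimensions) + 1)
}

for k, cuboids in lattice.items():
    labels = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D: {labels}")

# Total number of cuboids, ignoring concept hierarchies: 2 ** len(dimensions).
total = sum(len(cuboids) for cuboids in lattice.values())
print(total)
```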

Page 20: Datawarehouse and OLAP

Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures

Star schema: a fact table in the middle connected to a set of dimension tables, i.e., 1) a large central table (fact table) and 2) a set of dimension tables

Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake

Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Page 21: Datawarehouse and OLAP

Example of Star Schema

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales

time: time_key, day, day_of_the_week, month, quarter, year

item: item_key, item_name, brand, type, supplier_type

branch: branch_key, branch_name, branch_type

location: location_key, street, city, state_or_province, country
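A star schema of this kind can be tried out with Python's built-in `sqlite3` module. The sketch below is a reduced version for illustration only: it keeps two of the four dimensions, renames the tables `time_dim`, `item_dim`, and `sales_fact`, and uses made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                       quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
-- The fact table holds foreign keys to the dimension tables plus numeric measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 15, 1, 'Q1', 2024), (2, 10, 4, 'Q2', 2024);
INSERT INTO item_dim VALUES (1, 'TV-100', 'Acme', 'TV'), (2, 'PC-200', 'Acme', 'PC');
INSERT INTO sales_fact VALUES (1, 1, 2, 1000.0), (1, 2, 1, 1200.0), (2, 1, 3, 1500.0);
""")

# A typical star join: the fact table joined to each dimension, then aggregated.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    JOIN item_dim AS i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
print(rows)
```

Every query against a star schema has this shape: one join per dimension used, followed by a group-by on dimension attributes.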

Page 22: Datawarehouse and OLAP

Example of Snowflake Schema

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales

time: time_key, day, day_of_the_week, month, quarter, year

item: item_key, item_name, brand, type, supplier_key

supplier: supplier_key, supplier_type

branch: branch_key, branch_name, branch_type

location: location_key, street, city_key

city: city_key, city, state_or_province, country

Page 23: Datawarehouse and OLAP

Example of Fact Constellation

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped

time: time_key, day, day_of_the_week, month, quarter, year

item: item_key, item_name, brand, type, supplier_type

branch: branch_key, branch_name, branch_type

location: location_key, street, city, province_or_state, country

shipper: shipper_key, shipper_name, location_key, shipper_type

Page 24: Datawarehouse and OLAP

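A Concept Hierarchy: Dimension (location)

Levels, from most general to most specific: all, region, country, city, office

all: all

region: Europe, ..., North_America

country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)

city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)

office: L. Chan, ..., M. Wind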
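A concept hierarchy like this one can be stored as child-to-parent mappings, which is enough to aggregate a measure one level up at a time. The sketch below is an assumption-laden illustration: the sales figures are invented, and only the city, country, and region levels are modeled.

```python
from collections import defaultdict

# Child -> parent links for two steps of the location hierarchy.
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}
country_to_region = {"Canada": "North_America", "Mexico": "North_America",
                     "Germany": "Europe", "Spain": "Europe"}

# Hypothetical sales totals at the city level.
sales_by_city = {"Vancouver": 100, "Toronto": 150, "Frankfurt": 80}


def roll_up(values, parent_of):
    """Aggregate a measure from one hierarchy level to the next higher level."""
    totals = defaultdict(int)
    for child, value in values.items():
        totals[parent_of[child]] += value
    return dict(totals)


sales_by_country = roll_up(sales_by_city, city_to_country)
sales_by_region = roll_up(sales_by_country, country_to_region)
print(sales_by_country)
print(sales_by_region)
```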

Page 25: Datawarehouse and OLAP

Data Cube Measures: Three Categories

Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning

E.g., count(), sum(), min(), max()

Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function

E.g., avg(), min_N(), standard_deviation()

Holistic: if there is no constant bound on the storage size needed to describe a subaggregate.

E.g., median(), mode(), rank()
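The three categories can be checked on a small example. The sketch below uses made-up data split into two partitions: sum() combines exactly from partial sums, avg() can be computed from two distributive aggregates (sum and count), and median() cannot in general be recovered from per-partition medians.

```python
from statistics import median

partitions = [[1, 2, 3], [4, 100]]  # hypothetical data, split into two partitions
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the partial sums equals the sum over all the data.
partial_sums = [sum(part) for part in partitions]
distributive_ok = sum(partial_sums) == sum(all_values)

# Algebraic: avg() is computed from a bounded number of distributive aggregates.
partial_counts = [len(part) for part in partitions]
avg_from_parts = sum(partial_sums) / sum(partial_counts)
algebraic_ok = avg_from_parts == sum(all_values) / len(all_values)

# Holistic: combining per-partition medians does not give the true median here.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(distributive_ok, algebraic_ok, median_of_medians, true_median)
```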

Page 26: Datawarehouse and OLAP

Multidimensional Data

Sales volume as a function of product, month, and region

[Figure: a 3-D cube with axes Product, Region, and Month]

Dimensions: Product, Location, Time

Hierarchical summarization paths:
Product: Industry, Category, Product
Location: Region, Country, City, Office
Time: Year, Quarter, Month or Week, Day

Page 27: Datawarehouse and OLAP

A Sample Data Cube

[Figure: a 3-D data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A., Canada, Mexico, sum). The highlighted cell holds the total annual sales of TVs in the U.S.A.]

Page 28: Datawarehouse and OLAP

Cuboids Corresponding to the Cube

0-D (apex) cuboid: all

1-D cuboids: product; date; country

2-D cuboids: product,date; product,country; date,country

3-D (base) cuboid: product, date, country

Page 29: Datawarehouse and OLAP

Typical OLAP Operations

Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up, from a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions

Slice and dice: project and select
Slice: a selection on one dimension, resulting in a subcube, e.g., time = “Q1”
Dice: a subcube defined by a selection on two or more dimensions, e.g., (location = “Toronto” or “Vancouver”) and (time = “Q1” or “Q2”)

Pivot (rotate): reorient the cube for visualization, e.g., 3-D to a series of 2-D planes

Other operations:
Drill across: involving (across) more than one fact table
Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
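Roll-up, slice, and dice can be illustrated on a tiny in-memory set of fact rows. This is a plain-Python sketch with invented data, not how an OLAP server implements these operations; the tuple layout (city, quarter, item, dollars_sold) is an assumption of the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, quarter, item, dollars_sold)
facts = [
    ("Toronto", "Q1", "TV", 100), ("Toronto", "Q2", "PC", 200),
    ("Vancouver", "Q1", "TV", 150), ("Vancouver", "Q3", "PC", 50),
    ("Frankfurt", "Q1", "PC", 80),
]
country_of = {"Toronto": "Canada", "Vancouver": "Canada", "Frankfurt": "Germany"}


def roll_up_to_country(rows):
    # Roll-up: climb the location hierarchy from city to country and re-aggregate.
    totals = defaultdict(int)
    for city, quarter, item, dollars in rows:
        totals[(country_of[city], quarter, item)] += dollars
    return dict(totals)


def slice_cube(rows, quarter):
    # Slice: select on a single dimension (time).
    return [r for r in rows if r[1] == quarter]


def dice_cube(rows, cities, quarters):
    # Dice: select on two or more dimensions (location and time).
    return [r for r in rows if r[0] in cities and r[1] in quarters]


by_country = roll_up_to_country(facts)
q1 = slice_cube(facts, "Q1")
diced = dice_cube(facts, {"Toronto", "Vancouver"}, {"Q1", "Q2"})
print(by_country[("Canada", "Q1", "TV")])
print(len(q1), len(diced))
```

Drill-down is the reverse of `roll_up_to_country`: it needs the city-level rows, which is why detailed data must be kept if users are to drill below a summary.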

Page 30: Datawarehouse and OLAP

Fig. 3.10 Typical OLAP Operations

Page 31: Datawarehouse and OLAP

Starnet Query Model

Used for querying multidimensional databases

Consists of radial lines emanating from a central point

Each line represents a concept hierarchy for a dimension

Each abstraction level in the hierarchy is called a footprint

Page 32: Datawarehouse and OLAP

A Star-Net Query Model

[Figure: a star-net with one radial line per dimension. Each circle on a line is called a footprint.]

Shipping Method: AIR-EXPRESS, TRUCK

Customer Orders: ORDER, CONTRACTS

Customer

Product: PRODUCT GROUP, PRODUCT LINE, PRODUCT ITEM

Organization: SALES PERSON, DISTRICT, DIVISION

Promotion

Location: CITY, COUNTRY, REGION

Time: DAILY, QTRLY, ANNUALLY

Page 33: Datawarehouse and OLAP

Browsing a Data Cube

Visualization

OLAP capabilities

Interactive manipulation

Page 34: Datawarehouse and OLAP

Data Warehouse design and usage

What goes into a data warehouse design?

How are data warehouses used?

How do data warehousing and OLAP relate to data mining?

Page 35: Datawarehouse and OLAP

Design of Data Warehouse: A Business Analysis Framework

What can business analysts gain from having a data warehouse?
Competitive advantage
Productivity
Customer relationship management
Cost reduction

Effective data warehouse design requires understanding and analyzing business needs and constructing a business analysis framework

Page 36: Datawarehouse and OLAP

Design of Data Warehouse: A Business Analysis Framework (contd.)

Four views regarding the design of a data warehouse:

Top-down view: allows selection of the relevant information necessary for the data warehouse (information that matches current and future business needs)

Data source view: exposes the information being captured, stored, and managed by operational systems (often modeled with the E-R model or CASE tools)

Data warehouse view: consists of fact tables and dimension tables, with precalculated totals and counts and information on the source, date, and time of origin

Business query view: sees the perspectives of data in the warehouse from the view of the end user

Page 37: Datawarehouse and OLAP

Design of Data Warehouse: A Business Analysis Framework (contd.)

Building a data warehouse is a complex, long-term task that requires:
Business skills
Technology skills
Program management skills

Implementation scope should be clearly defined. The goals of an initial data warehouse implementation should be:
Specific
Achievable
Measurable

Page 38: Datawarehouse and OLAP

Data Warehouse Design Process

Top-down, bottom-up approaches, or a combination of both:
Top-down: starts with overall design and planning (mature)
Bottom-up: starts with experiments and prototypes (rapid)

From a software engineering point of view:
Waterfall: structured and systematic analysis at each step before proceeding to the next
Spiral: rapid generation of increasingly functional systems, with short turnaround time

Page 39: Datawarehouse and OLAP

Data Warehouse Design Process (contd.)

Typical data warehouse design process:

Choose a business process to model, e.g., orders, invoices, etc.

Choose the grain (atomic level of data) of the business process, e.g., individual transactions

Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, etc.

Choose the measures that will populate each fact table record, e.g., dollars_sold or units_sold
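The outcome of these four choices can be written down as a record type. The sketch below is hypothetical: it assumes the business process is sales, the grain is one record per individual transaction, and the key values are arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalesFact:
    """One fact table record. Assumed grain: one individual sales transaction."""
    # Dimensions: keys into the time, item, and customer dimension tables.
    time_key: int
    item_key: int
    customer_key: int
    # Measures: numeric values recorded at the chosen grain.
    dollars_sold: float
    units_sold: int


fact = SalesFact(time_key=20240115, item_key=7, customer_key=42,
                 dollars_sold=499.9, units_sold=1)
print(fact)
```

Choosing a coarser grain, such as daily totals per item, would change what one record means and which dimensions still apply.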

Page 40: Datawarehouse and OLAP

Data Warehouse Usage for Information Processing

Information processing: supports query processing, statistical analysis, and reporting

Analytical processing: supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting

Data mining: supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the results

Page 41: Datawarehouse and OLAP

From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)

Integrates OLAP with data mining to uncover knowledge in multidimensional databases.

Why online analytical mining?

High quality of data in data warehouses: a DW contains integrated, consistent, cleaned data

Available information processing structure surrounding data warehouses: ODBC, OLE DB, Web accessing, service facilities, reporting and OLAP tools

OLAP-based exploratory data analysis: mining with drilling, dicing, pivoting, etc.

On-line selection of data mining functions: integration and swapping of multiple mining functions, algorithms, and tasks