DW_Intro

IT-357: Data-Warehousing

Motivation for Data Warehousing

• We have mountains of data in this company, but we can’t access it.

• You’ve got to make it easy for business people to get at the data directly.

• Just show me what is important.

• It drives me crazy to have two people present the same business metrics at a meeting, but with different numbers.

• We want people to use information to support more fact-based decision making.

• We need to “slice and dice” the data every which way.

Motivation for Data Warehousing

• The data warehouse must:

– Make an organization’s information easily accessible

– Present the organization’s information consistently

– Be adaptive and resilient to change

– Be secure, so that it protects the organization’s information assets

– Serve as the foundation for improved decision making

– Be accepted by the business community if it is to be deemed successful

Typical Queries on DW

• What was the total number of cell phones sold in India in 2013, grouped by company?

• What was the total revenue for property sales for each type of property in Mangalore between 2006 and 2008?

• What would be the effect on cell phone sales in Mangalore if a new college is opened?

• Which type of cell phone sells most in Mangalore?

• Which was the most travelled train in India in 2013?

Data Warehouse

• Defined in many different ways, but not rigorously.

– A decision support database that is maintained separately from the organization’s operational database

– Supports information processing by providing a solid platform of consolidated, historical data for analysis.

• “A data warehouse is a

– subject-oriented

– integrated

– time-variant

– nonvolatile

collection of data in support of management’s decision-making process.”

W. H. Inmon

Ralph Kimball

DW – Subject Oriented

• Organized around major subjects, such as customer, product, sales

• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

• Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

DW – Integrated

• Constructed by integrating multiple, heterogeneous data sources

– relational databases, flat files, on-line transaction records

• Data cleaning and data integration techniques are applied.

– Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources

• E.g., Hotel price: currency, tax, food products, etc.

– When data is moved to the warehouse, it is converted.

DW – Time Variant

• The time horizon for the data warehouse is significantly longer than that of operational systems

– Operational database: current value data

– Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

• Every key structure in the data warehouse

– Contains an element of time, explicitly or implicitly

– But the key of operational data may or may not contain a “time element”

DW – Non-volatile

• A physically separate store of data transformed from the operational environment

• Operational update of data does not occur in the data warehouse environment

– Does not require transaction processing, recovery, and concurrency control mechanisms

– Requires only two to three operations in data accessing:

• initial loading, incremental loading of data and access of data

Difference between OLTP and OLAP

                    OLTP                              OLAP
users               clerk, IT professional            knowledge worker
function            day to day operations             decision support
DB design           application-oriented              subject-oriented
data                current, up-to-date, detailed,    historical, summarized, multidimensional,
                    flat relational, isolated         integrated, consolidated
usage               repetitive                        ad-hoc/repetitive
access              read/write,                       lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction         complex query
# records accessed  tens                              millions
# users             thousands                         hundreds
DB size             100MB-GB                          100GB-TB
metric              transaction throughput            query throughput, response

A Typical Data Warehouse

DW Building Operations (ETL)

• Data extraction

– Get data from multiple, heterogeneous, and external sources

• Data cleaning

– Detect errors in the data and rectify them when possible

• Data transformation

– Convert data from legacy or host format to warehouse format

• Load/Refresh

– Sort, summarize, consolidate, compute views, check integrity, and build indexes and partitions

– Propagate updates from sources
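The four ETL steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production pipeline: the two source lists, the field names, the `hotel_price` table and the conversion rates in `RATES_TO_USD` are all made-up assumptions.

```python
import sqlite3

# Hypothetical source extracts; names, fields and rates are illustrative only.
source_a = [  # e.g. rows pulled from a relational source
    {"hotel": "Sea View", "price": "100", "currency": "USD"},
    {"hotel": "Hill Top", "price": "n/a", "currency": "USD"},  # dirty value
]
source_b = [  # e.g. rows read from a flat file
    {"hotel": "City Inn", "price": "80", "currency": "EUR"},
]
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25}  # assumed fixed conversion rates


def extract():
    """Extraction: gather rows from all sources."""
    return source_a + source_b


def clean(rows):
    """Cleaning: keep only rows whose price parses as a number."""
    good = []
    for row in rows:
        try:
            good.append({**row, "price": float(row["price"])})
        except ValueError:
            pass  # a real pipeline would log or repair the row instead
    return good


def transform(rows):
    """Transformation: convert every price to one warehouse currency (USD)."""
    return [(r["hotel"], r["price"] * RATES_TO_USD[r["currency"]]) for r in rows]


def load(conn, rows):
    """Load: insert into the warehouse table and build an index."""
    conn.execute("CREATE TABLE IF NOT EXISTS hotel_price (hotel TEXT, price_usd REAL)")
    conn.executemany("INSERT INTO hotel_price VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_hotel ON hotel_price (hotel)")
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract())))
loaded = conn.execute(
    "SELECT hotel, price_usd FROM hotel_price ORDER BY hotel"
).fetchall()
print(loaded)
```

The row with the unparseable price is dropped during cleaning, and the remaining prices arrive in the warehouse in a single currency.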

Who does not need a Data Warehouse

• If the business does not need a unified set of information/unified view

– If the same person is involved in research, manufacturing, sale, support, relationship

• If there is little or no latency time in the access and analysis of information

• If the same product is sold over and over

• If there is no need for historical data

• Family businesses

• Small businesses

Structure - Data Warehouse

What is Meta Data

• Meta data is the data defining warehouse objects (re-usable). It stores:

• Description of the structure of the data warehouse

– Schema, view, dimensions, hierarchies, derived data definition, data mart locations and contents

• Operational meta-data

– History of migrated data, currency of data (active, archived, or purged), algorithms used for queries

• The mapping from operational environment to the data warehouse

• Data related to system performance

– Number of users, system checks performed, failures

• Business data

– Business terms and definitions, ownership of data, policies (scope of DW, security)

Data Warehouse Meta Data

• Metadata for the data warehouse environment is one of the most important aspects. Metadata helps the DSS analyst find what data is in the warehouse and use that data effectively and efficiently.

• Some of the components of data warehouse metadata are:

– The structure and contents of the warehouse

– The mapping of data into the data warehouse

– The extract/transformation history

– Aging/purging criteria

– Ownership/stewardship information

Uses of Metadata

• Some of the uses

– Extraction and loading processes - metadata is used to map data sources to a common view of information within the warehouse

– Warehouse management process - metadata is used to automate the production of summary tables

– Query management process - metadata is used to direct a query to the most appropriate data source

RDBMS (OLTP) vs DW (OLAP) Structure

Region

Reg_ID  Reg_Name
1       Europe
2       North America
3       Asia

Country

Cntry_ID  Reg_ID  Cntry_Name
1         1       Germany
2         1       Spain
3         2       Canada
4         2       Mexico
5         3       India
6         3       China

City

City_ID  Reg_ID  Cntry_ID  City_Name
1        1       1         Frankfurt
2        2       3         Vancouver
3        2       3         Toronto
4        2       4         Mexico City
5        3       5         Delhi
6        3       6         Beijing
7        3       5         Mumbai
8        1       2         Madrid

Location

Region         Country  City
Europe         Germany  Frankfurt
Europe         Spain    Madrid
North America  Canada   Vancouver
North America  Mexico   Mexico City
Asia           India    Delhi
Asia           China    Beijing
Asia           India    Mumbai

Concept Hierarchy

SQL for OLTP and OLAP

• OLTP:

– select * from region, country, city where region.reg_id = country.reg_id and country.cntry_id = city.cntry_id

• OLAP:

– select * from location

• OLTP for quick Insert/Update

• OLAP for quick Data Access
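The contrast between the two queries can be tried out with Python's built-in `sqlite3` module. The sketch below loads the Region, Country and City tables, assuming Vancouver belongs to Canada (Cntry_ID 3), and builds the flat `location` table once from the three-way join. The lower-case column names are illustrative choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region  (reg_id INTEGER PRIMARY KEY, reg_name TEXT);
CREATE TABLE country (cntry_id INTEGER PRIMARY KEY, reg_id INTEGER, cntry_name TEXT);
CREATE TABLE city    (city_id INTEGER PRIMARY KEY, reg_id INTEGER,
                      cntry_id INTEGER, city_name TEXT);
""")
conn.executemany("INSERT INTO region VALUES (?, ?)",
                 [(1, "Europe"), (2, "North America"), (3, "Asia")])
conn.executemany("INSERT INTO country VALUES (?, ?, ?)",
                 [(1, 1, "Germany"), (2, 1, "Spain"), (3, 2, "Canada"),
                  (4, 2, "Mexico"), (5, 3, "India"), (6, 3, "China")])
# Assumption: Vancouver is assigned to Canada (cntry_id 3).
conn.executemany("INSERT INTO city VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, "Frankfurt"), (2, 2, 3, "Vancouver"),
                  (3, 2, 3, "Toronto"), (4, 2, 4, "Mexico City"),
                  (5, 3, 5, "Delhi"), (6, 3, 6, "Beijing"),
                  (7, 3, 5, "Mumbai"), (8, 1, 2, "Madrid")])

# OLTP style: the hierarchy is rebuilt with joins on every read.
join_sql = """
SELECT r.reg_name, c.cntry_name, t.city_name
FROM region r
JOIN country c ON c.reg_id = r.reg_id
JOIN city t ON t.cntry_id = c.cntry_id
"""

# OLAP style: pay for the join once at load time, then read a flat table.
conn.execute("CREATE TABLE location AS " + join_sql)

normalized = sorted(conn.execute(join_sql).fetchall())
denormalized = sorted(conn.execute("SELECT * FROM location").fetchall())
print(len(normalized), normalized == denormalized)
```

Both reads return the same rows. The normalized form keeps each fact in one place, which suits inserts and updates, while the flat table avoids join work at query time.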

Data Warehouse Usage

• Three kinds of data warehouse applications

– Information Processing

• supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs

– Analytical Processing

• multidimensional analysis of data warehouse data

• supports basic OLAP operations: slice-dice, drill down, roll-up

– Data Mining

• knowledge discovery from hidden patterns

• supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools

DW Queries – Complexity

• Complexity by just adding a maximum price:

• Query 1: A simple data cube query: Find the total sales in 2004, broken down by product, region, and month, with subtotals for each dimension.

• Query 2: A complex data cube query: Grouping by all subsets of product, region, month, find the maximum price in 2004 for each group and the total sales among all maximum price tuples

OLAP and Data Mining

• Is OLAP data mining? NO

• OLAP is a way to look at pre-aggregated query results (Evaluate query and results)

• Data Mining is building models of data

• Data mining tools model data and return actionable rules

• Practically speaking: OLAP tools can be used on Data Cubes as well as perform some form of Data Mining

Types of Data Warehouse Models

• Various Types of FACT/DIMENSION Tables

– Star schema: A fact table in the middle connected to a set of dimension tables

– Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake

– Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
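A star schema can be sketched with `sqlite3` as one fact table that references two dimension tables. The table names (`fact_sales`, `dim_product`, `dim_location`), the columns and the rows are invented for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, one row per member.
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- Fact table: foreign keys to the dimensions plus a numeric measure.
CREATE TABLE fact_sales   (product_id INTEGER, location_id INTEGER, units INTEGER);

INSERT INTO dim_product  VALUES (1, 'TV', 'Electronics'), (2, 'PC', 'Electronics');
INSERT INTO dim_location VALUES (1, 'Delhi', 'India'), (2, 'Mumbai', 'India'),
                                (3, 'Madrid', 'Spain');
INSERT INTO fact_sales   VALUES (1, 1, 10), (1, 2, 5), (2, 1, 7), (2, 3, 4);
""")

# A typical star join: the fact table joins to each dimension, then aggregates.
rows = conn.execute("""
SELECT l.country, p.name, SUM(f.units)
FROM fact_sales f
JOIN dim_product  p ON p.product_id  = f.product_id
JOIN dim_location l ON l.location_id = f.location_id
GROUP BY l.country, p.name
ORDER BY l.country, p.name
""").fetchall()
print(rows)
```

Here `dim_location` keeps city and country in the same table, which is the star form. A snowflake variant would move country into its own table and have `dim_location` reference it by key.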

Star Schema

Note the concept hierarchy within the Dimension tables. WHY?

Snowflake Schema

Advantages and Disadvantages?

FACT Constellation

Indexes for Data Warehouse

• OLTP and OLAP use different types of indexes

• Ordinary Index

• Bitmap Index

• JOIN Index

Ordinary Index

• Surrogate Key

– In most cases it is not a natural key; it is kept in addition to the business key.

– Represents an object in the database, but is not visible outside

• B-Tree (B Plus Tree)

• Must be used with unique or high-cardinality (near unique) data

B Plus Tree

How can a query be made to need fewer disk accesses?

Bitmap Index

• Index on a particular column

• Each value in the column has a bit vector

• The length of the bit vector: # of records in the base table

Base table

Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region

RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type

RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
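The bitmap tables above can be reproduced with a short Python sketch. It uses plain lists of 0/1 as bit vectors for readability; real systems store compressed bitmaps. The helper name `build_bitmap_index` is an illustrative choice.

```python
# Base table from the slide: (Cust, Region, Type)
base_table = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]


def build_bitmap_index(rows, column):
    """Return {value: bit vector}, with one bit per record of the base table."""
    index = {}
    for rec_no, row in enumerate(rows):
        value = row[column]
        # Create an all-zero vector the first time a value is seen.
        index.setdefault(value, [0] * len(rows))[rec_no] = 1
    return index


region_index = build_bitmap_index(base_table, 1)
type_index = build_bitmap_index(base_table, 2)

# "Region = Europe AND Type = Dealer" becomes a bitwise AND of two vectors.
mask = [a & b for a, b in zip(region_index["Europe"], type_index["Dealer"])]
matches = [base_table[i][0] for i, bit in enumerate(mask) if bit]
print(region_index["Asia"], mask, matches)
```

Combining predicates needs only bit operations on the vectors; the base table is touched only to fetch the matching records.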

Usage of Bitmap Index

• Static Tables

– Data is not updated frequently.

– Modification of a Bitmap Index is expensive

– They are a compressed index type

• DSS Systems

– Generally suited for low-cardinality columns (but need not be limited to them)

– Suited for systems which get changes during non-peak business hours

JOIN Index

• In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table

• Eg: Sales and two dimensions city and product

– A join index on city maintains for each distinct city a list of R-IDs (Prim key) of the tuples recording the Sales in the city

• Join indices can span multiple dimensions
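A join index can be pictured as a dictionary from dimension values to fact-table R-IDs. The sketch below is a simplified in-memory illustration; the `sales` rows and the helper `build_join_index` are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (R-ID, city, product, amount)
sales = [
    ("R1", "Delhi", "TV", 100),
    ("R2", "Mumbai", "PC", 50),
    ("R3", "Delhi", "PC", 70),
    ("R4", "Mumbai", "TV", 40),
]


def build_join_index(fact_rows, *columns):
    """Map each combination of dimension values to the R-IDs that carry it."""
    index = defaultdict(list)
    for row in fact_rows:
        key = tuple(row[c] for c in columns)
        index[key].append(row[0])
    return dict(index)


city_index = build_join_index(sales, 1)             # join index on city
city_product_index = build_join_index(sales, 1, 2)  # spans two dimensions

print(city_index[("Delhi",)])
print(city_product_index[("Mumbai", "TV")])
```

A query for sales in Delhi can go straight from the city value to the relevant fact rows without scanning the fact table.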

Join Index

Size of a Data Warehouse and Query Cost

• Number of Tuples/Records in DIMENSIONS

• Number of Tuples/Records in FACT

– Eg: Customer/Sale data

• Cost of a query to locate a tuple using an Index?

Optimizing Data Warehouse

1. Better understanding of the actual data

2. Better design (proper de-normalization)

3. Indexing:

– Faster access

4. Partitioning:

– Physical partitioning

– Eg: Partitioning Date/Time Dimension

Data Cube


• The key operation of OLAP is the formation of a data cube

• Pre-computed query result

• A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts

• A data cube is a multidimensional representation of data, together with all possible aggregates

A Spreadsheet Data

Date Location Product Sales

1-Jan-13 USA TV 100

2-Jan-13 Canada TV 250

3-Jan-13 Mexico TV 300

4-Jan-13 Brazil TV 200

1-Jan-13 USA PC 50

2-Jan-13 Canada PC 70

3-Jan-13 Mexico PC 40

4-Jan-13 Brazil PC 60

Represent in a 3-Dimension

• Consider previous sales of products at a number of locations at various dates

• This data can be represented as a 3 dimensional array
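As a minimal sketch, the spreadsheet rows can be loaded into a 3-dimensional array indexed by date, location and product. Nested Python lists stand in for the array here, and empty cells are filled with 0 as an assumption.

```python
# Rows from the spreadsheet: (date, location, product, sales)
rows = [
    ("1-Jan-13", "USA", "TV", 100), ("2-Jan-13", "Canada", "TV", 250),
    ("3-Jan-13", "Mexico", "TV", 300), ("4-Jan-13", "Brazil", "TV", 200),
    ("1-Jan-13", "USA", "PC", 50), ("2-Jan-13", "Canada", "PC", 70),
    ("3-Jan-13", "Mexico", "PC", 40), ("4-Jan-13", "Brazil", "PC", 60),
]

# Axis labels, kept in order of first appearance.
dates = list(dict.fromkeys(r[0] for r in rows))
locations = list(dict.fromkeys(r[1] for r in rows))
products = list(dict.fromkeys(r[2] for r in rows))

# cube[date][location][product] holds the sales figure for that cell.
cube = [[[0 for _ in products] for _ in locations] for _ in dates]
for date, location, product, sales in rows:
    cube[dates.index(date)][locations.index(location)][products.index(product)] = sales

# Aggregating along two axes gives a total per product.
tv = products.index("TV")
total_tv = sum(cube[d][l][tv] for d in range(len(dates)) for l in range(len(locations)))
print(cube[0][0][0], total_tv)
```

Only 8 of the 4 x 4 x 2 = 32 cells are filled for this data, which is typical: multidimensional arrays built from real sales data tend to be sparse.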

A Sample Data Cube

[Figure: a sample data cube. Dimensions: Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), Country (Americas, Asia Pacific, Europe, sum). Highlighted cell: total annual sales of TVs in U.S.A.]

Data Cube Browsing (Tools In Practice)

Data Cube: Lattice of Cuboids – 3D

0-D (apex) cuboid: All (highest level of summarization)

1-D cuboids: (product), (date), (country)

2-D cuboids: (product, date), (product, country), (date, country)

3-D (base) cuboid: (product, date, country) (lowest level of summarization)

A Simple Representation

• Base and Aggregate cells.

– Consider the data cube with the DIMENSIONS Date, Product, Country and the FACT Quantity.

– 1D Cells: (Jan, *, *, 350)

– 1D Cells: (Feb, *, *, 50)

– 2D Cells: (Jan, *, Mexico, 70)

– 3D Cells: (Aug, TV, USA, 80)

Data Cube: Lattice of Cuboids – 4D

0-D (apex) cuboid: All (highest level of summarization)

1-D cuboids: (time), (product), (location), (supplier)

2-D cuboids: (time, product), (time, location), (time, supplier), (product, location), (product, supplier), (location, supplier)

3-D cuboids: (time, product, location), (time, product, supplier), (time, location, supplier), (product, location, supplier)

4-D (base) cuboid: (time, product, location, supplier) (lowest level of summarization)
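The lattice can be enumerated mechanically: each cuboid is one subset of the dimensions, so n dimensions (with no hierarchies) give 2^n cuboids. A minimal sketch using `itertools.combinations`, with the helper name `cuboids` chosen for illustration:

```python
from itertools import combinations


def cuboids(dimensions):
    """Group every subset of the dimensions by its size (0-D up to n-D)."""
    return {k: list(combinations(dimensions, k)) for k in range(len(dimensions) + 1)}


lattice = cuboids(["time", "product", "location", "supplier"])
for k, group in lattice.items():
    print(f"{k}-D:", group)

total = sum(len(group) for group in lattice.values())
print("cuboids:", total)
```

For the four dimensions above this yields 1 + 4 + 6 + 4 + 1 = 16 cuboids, from the empty tuple (the apex) to the full 4-D base cuboid.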

Bottom Up Cube Calculation (BUC)

• 4 Dimension Computation (A, B, C, D attributes)

– 1D Cells: (Jan, *, *, *, 500)

– 2D Cells: (*, TV, *, Philips, 300)

The “Compute Cube” Operator

• Cube definition and computation in DMQL

define cube sales [product, city, year]: sum (sales_in_dollars)

compute cube sales

• Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)

SELECT product, city, year, SUM (amount)
FROM SALES
CUBE BY product, city, year

• Need to compute the following Group-Bys:

(city, product, year)

(city, product) (city, year) (product, year)

(city) (product) (year)

()
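What a cube-by computes can be sketched naively in Python: run one group-by per subset of the dimensions and mark the aggregated-away dimensions with `*`. This is a brute-force illustration, not the BUC algorithm or any database's implementation, and the `sales` rows are made up.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows of the SALES relation.
sales = [
    {"product": "TV", "city": "Delhi", "year": 2013, "amount": 100},
    {"product": "TV", "city": "Mumbai", "year": 2013, "amount": 250},
    {"product": "PC", "city": "Delhi", "year": 2013, "amount": 50},
    {"product": "PC", "city": "Delhi", "year": 2014, "amount": 70},
]
DIMS = ("product", "city", "year")


def compute_cube(rows, dims, measure):
    """Sum the measure for every group-by over every subset of dims."""
    cube = defaultdict(int)
    for k in range(len(dims) + 1):
        for group_by in combinations(dims, k):
            for row in rows:
                # Dimensions outside the group-by are aggregated away ("*").
                cell = tuple(row[d] if d in group_by else "*" for d in dims)
                cube[cell] += row[measure]
    return dict(cube)


cube = compute_cube(sales, DIMS, "amount")
print(cube[("*", "*", "*")])       # grand total, the () group-by
print(cube[("TV", "*", "*")])      # a cell of the (product) group-by
print(cube[("PC", "Delhi", "*")])  # a cell of the (product, city) group-by
```

This scans the input once per group-by, so 2^3 = 8 times here. Practical cube algorithms share work between group-bys instead of recomputing each one from the base rows.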

Oracle’s “CUBE” Operator

• CUBE extension will generate subtotals for all combinations of the dimensions specified

• As the number of dimensions increases, so do the combinations of subtotals that need to be calculated

• SQL:

SELECT Date_Id, Product_Id, City_id, SUM(sales_value) AS sales_value
FROM dimension_tab
GROUP BY CUBE (Date_Id, Product_Id, City_id)
ORDER BY Date_Id, Product_Id, City_id;

Attribute Oriented Induction

• Proposed in 1989 [before the Data Cube concept was introduced]

• How is it done?

– Collect the task-relevant data (data for analysis) using a relational database query (initial relation)

– Perform generalization by attribute removal or attribute generalization

– Apply aggregation by merging identical, generalized tuples and accumulating their respective counts

– Interaction with users for knowledge presentation

Example for Attribute Oriented Induction

• A DMQL Query

1. use DB_XYZ
mine characteristics as “Science Students”
in relevance to name, gender, major, birth place, birth date, residence, phone#, gpa
from student
where status in “graduate”

Algorithm to solve the Example

2. Remove/Generalize Attributes:

– For an attribute with a large set of distinct values:

• Case 1: EITHER there is no concept hierarchy defined for the attribute

• Case 2: OR its higher-level concepts are defined in terms of other attributes

– Remove the attribute when either case is true; otherwise generalize it.

3. Class Characterization:

– Generalization will result in groups of identical tuples

– Count the number of generalized (duplicate) tuples and mark the count
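Steps 2 and 3 can be sketched on the three student tuples shown in the Class Characterization table. The concept hierarchies are assumptions made for this sketch: `MAJOR_TO_FIELD`, the Canada/Foreign split and the GPA cutoff of 3.75 are illustrative. Birth date is left out because an age range needs a reference date that the slides do not give.

```python
from collections import Counter

# Initial relation: the three sample tuples from the Class Characterization table.
students = [
    {"name": "Jim Woodman", "gender": "M", "major": "CS",
     "birth_place": "Vancouver, BC, Canada",
     "residence": "3511 Main St., Richmond", "phone": "687-4598", "gpa": 3.67},
    {"name": "Scott Lachance", "gender": "M", "major": "CS",
     "birth_place": "Montreal, Que, Canada",
     "residence": "345 1st Ave., Richmond", "phone": "253-9106", "gpa": 3.70},
    {"name": "Laura Lee", "gender": "F", "major": "Physics",
     "birth_place": "Seattle, WA, USA",
     "residence": "125 Austin Ave., Burnaby", "phone": "420-5232", "gpa": 3.83},
]

# Assumed concept hierarchy for majors (illustrative only).
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science"}


def generalize(student):
    """Drop name and phone (no useful hierarchy); lift the rest one level up."""
    country = student["birth_place"].split(",")[-1].strip()
    return (
        student["gender"],                                      # retained
        MAJOR_TO_FIELD[student["major"]],                       # major -> field
        "Canada" if country == "Canada" else "Foreign",         # place -> region
        student["residence"].split(",")[-1].strip(),            # address -> city
        "Excellent" if student["gpa"] >= 3.75 else "Very-good", # assumed cutoff
    )


# Identical generalized tuples are merged and their counts accumulated.
counts = Counter(generalize(s) for s in students)
for generalized_tuple, count in counts.items():
    print(generalized_tuple, count)
```

The two CS students collapse into one generalized tuple with count 2, which is the merging step that produces the Count column of the generalized relation.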

Class Characterization

Initial relation:

Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …       …        …                      …           …                         …         …

Generalization applied to each attribute:

Name: Removed
Gender: Retained
Major: Sci, Eng, Bus
Birth-Place: Country
Birth_date: Age range
Residence: City
Phone #: Removed
GPA: Excl, VG, ..

Generalized relation:

Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …

Importance of Understanding the Data

• The Developer/Tech Analyst/Bus Analyst must know the data very well

• This is unlike relational databases, where design, efficiency of data retrieval, GUI, etc. are the main concerns (although the data is important there too)

• The data, how it is stored, and its quality become more and more important in OLAP databases/Data Warehouses

References

• Data Mining: Concepts and Techniques – Jiawei Han, Micheline Kamber, Jian Pei

• Building the Data Warehouse – William Inmon

• The Data Warehouse Toolkit – Ralph Kimball
