An Introduction to Data Warehousing


Anand Deshpande

Persistent Systems Pvt. Ltd.

http://www.persistent.co.in

Optimizing the Warehouse for Decision Support

IIT DB Class 3

Data -- Heart of the Data Warehouse

• Heart of the data warehouse is the data itself!
• Single version of the truth
• Corporate memory
• Data is organized in a way that represents the business -- subject orientation

Consider a Retail Sales Example

• A retail chain sells products in retail stores.
• We want to track the effect of promotions on retail sales.

• Product Table (60,000 SKUs)
• Store Table (500 stores)
• Promotion (3000 promotions per week)
• Sales (2 Billion)

• What are the total dollar sales and the total dollar costs of all candy sold in supermarket stores on Saturdays?

SELECT p.category, sum(f.dollar_sales), sum(f.dollar_cost)
FROM sales_fact f, product p, time t, store s
WHERE f.product_key = p.product_key
  and f.time_key = t.time_key
  and f.store_key = s.store_key
  and p.category = 'Candy'
  and t.day_of_week = 'Saturday'
  and s.floor_plan_type = 'Super_Market'
GROUP BY p.category

Time Dimension: time_key, SQL_date, day_of_week, month, fiscal_period, season

Store Dimension: store_key, store_ID, store_name, address, region, division, floor_plan_type

Product Dimension: product_key, SKU, description, brand, category, department, package_type, size, … etc.

Promotion Dimension: promotion_key, promotion_name, price_type, ad_type, display_type, … etc.

Sales Fact Table: time_key, product_key, store_key, promotion_key, dollars_sold, units_sold, dollars_cost
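The retail star schema and the candy question can be tried end to end in a few lines. The following is a minimal sketch using Python's built-in sqlite3 module, assuming a cut-down schema: only the columns the query touches are created, the time dimension is named time_dim here, the measure columns follow the fact table listing (dollars_sold, dollars_cost), and every sample row is invented for illustration.

```python
import sqlite3

# Hypothetical miniature of the retail star schema. The table name time_dim
# and all sample rows are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE store    (store_key INTEGER PRIMARY KEY, floor_plan_type TEXT);
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day_of_week TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER, product_key INTEGER, store_key INTEGER,
    dollars_sold REAL, dollars_cost REAL);
INSERT INTO product  VALUES (1, 'Candy'), (2, 'Soap');
INSERT INTO store    VALUES (1, 'Super_Market'), (2, 'Kiosk');
INSERT INTO time_dim VALUES (1, 'Saturday'), (2, 'Monday');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 10.0, 6.0),   -- candy, supermarket, Saturday: counted
    (1, 1, 1,  5.0, 3.0),   -- candy, supermarket, Saturday: counted
    (2, 1, 1,  7.0, 4.0),   -- Monday: filtered out
    (1, 2, 1,  9.0, 5.0),   -- soap: filtered out
    (1, 1, 2,  4.0, 2.0);   -- kiosk: filtered out
""")

# The fact table is constrained only through its dimension tables.
candy_query = """
SELECT p.category, SUM(f.dollars_sold), SUM(f.dollars_cost)
FROM sales_fact f
JOIN product  p ON f.product_key = p.product_key
JOIN time_dim t ON f.time_key    = t.time_key
JOIN store    s ON f.store_key   = s.store_key
WHERE p.category = 'Candy'
  AND t.day_of_week = 'Saturday'
  AND s.floor_plan_type = 'Super_Market'
GROUP BY p.category
"""
result = conn.execute(candy_query).fetchall()
print(result)  # [('Candy', 15.0, 9.0)]
```

Only two of the five fact rows satisfy all three dimension constraints, which is the typical shape of a star query: small dimension filters select a slice of a large fact table.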

SELECT
  NON-AGGREGATE FIELDNAME1,
  NON-AGGREGATE FIELDNAME2,
  SUM(AGGREGATE FIELDNAME3),
  SUM(AGGREGATE FIELDNAME4),
  SUM(AGGREGATE FIELDNAME5)
FROM
  DIMENSION TABLE1,
  DIMENSION TABLE2,
  DIMENSION TABLE3,
  DIMENSION TABLE4,
  FACT TABLE
WHERE
  JOINCONDITION1 AND
  JOINCONDITION2 AND
  JOINCONDITION3 AND
  JOINCONDITION4 AND
  DIMENSIONCONSTRAINT1 AND
  DIMENSIONCONSTRAINT2 AND
  DIMENSIONCONSTRAINT3 AND
  DIMENSIONCONSTRAINT4 AND
  DIMENSIONCONSTRAINT5 AND
  DIMENSIONCONSTRAINT6
GROUP BY
  NON-AGGREGATE FIELDNAME1,
  NON-AGGREGATE FIELDNAME2
ORDER BY
  NON-AGGREGATE FIELDNAME1,
  NON-AGGREGATE FIELDNAME2


A Real Data Warehouse Query

Dwquery.txt

Schema Design

• Database organization
  – must look like business
  – must be recognizable by business user
  – approachable by business user
  – must be simple

• Schema Types
  – Star Schema
  – Fact Constellation Schema
  – Snowflake schema

Dimension Tables

• Dimension tables
  – Define business in terms already familiar to users
  – Wide rows with lots of descriptive text
  – Small tables (about a million rows)
  – Joined to fact table by a foreign key
  – Heavily indexed
  – Typical dimensions
    • time periods, geographic region (markets, cities), products, customers, salesperson, etc.

Fact Table

• Central table
  – mostly raw numeric items
  – narrow rows, a few columns at most
  – large number of rows (millions to a billion)
  – access via dimensions

Star Schema

• A single fact table and, for each dimension, one dimension table
• Does not capture hierarchies directly

[Diagram: a central fact table (date, custno, prodno, cityname, ...) linked to the dimension tables time, prod, cust, and city]

Snowflake Schema

• Represents the dimensional hierarchy directly by normalizing tables
• Easy to maintain and saves storage

[Diagram: a central fact table (date, custno, prodno, cityname, ...) linked to the dimension tables time, prod, cust, and city, with an additional region table]

Fact Constellation

• Fact Constellation
  – Multiple fact tables that share many dimension tables
  – Booking and Checkout may share many dimension tables in the hotel industry

[Diagram: the fact tables Booking and Checkout, with the dimension tables Hotels, Travel Agents, Promotion, Room Type, and Customer]


Data Warehouse Structure

• Subject Orientation -- customer, product, policy, account etc... A subject may be implemented as a set of related tables. E.g., customer may be five tables

Data Granularity in Warehouse

• Summarized data stored
  – reduces storage costs
  – reduces CPU usage
  – increases performance since a smaller number of records has to be processed
  – design around traditional high-level reporting needs
  – tradeoff between the volume of data to be stored and detailed usage of data

Granularity in Warehouse

• Cannot answer some questions with summarized data
  – Did Anand call Seshadri last month? Not possible to answer if only the total duration of calls by Anand over a month is maintained and individual call details are not.
• Detailed data too voluminous
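The call example can be made concrete with a small sketch. It assumes a hypothetical call-detail layout of (caller, callee, minutes) with invented rows; the point is only that the monthly summary keeps the totals and drops the callee, so the "did Anand call Seshadri" question needs the detail rows.

```python
from collections import defaultdict

# Hypothetical call detail records: (caller, callee, minutes).
call_details = [
    ("Anand", "Seshadri", 12),
    ("Anand", "Ravi", 30),
    ("Seshadri", "Anand", 5),
]

# Summary level: total call minutes per caller for the month.
# The callee is discarded here, which is what saves the space.
monthly_summary = defaultdict(int)
for caller, _callee, minutes in call_details:
    monthly_summary[caller] += minutes

def called(details, caller, callee):
    """Answerable only from detail rows, which record who was called."""
    return any(c == caller and d == callee for c, d, _ in details)

print(dict(monthly_summary))                       # {'Anand': 42, 'Seshadri': 5}
print(called(call_details, "Anand", "Seshadri"))   # True
```

Nothing in monthly_summary can reproduce the second answer, however it is queried.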

Granularity in Warehouse

• Tradeoff is to have dual level of granularity
  – Store summary data on disks
    • 95% of DSS processing done against this data
  – Store detail on tapes
    • 5% of DSS processing against this data

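Vertical Partitioning

[Diagram: the product table (Product_key, SKU, description, brand, category, department, package_type, size, … etc.) is split into two tables that both carry Product_key: one with Product_key, SKU, . . . and one with Product_key, brand, category, department, . . . The columns are separated into frequently accessed and rarely accessed ones; the frequently accessed table is smaller and so needs less I/O.]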
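A minimal sketch of the split, assuming rows are plain dictionaries and treating SKU as the frequently accessed column purely for illustration (the sample rows and the helper names vertical_partition and rejoin are hypothetical). Both halves keep product_key so the original row can be rebuilt when the rarely used columns are needed.

```python
# Hypothetical product rows; all values are invented.
products = [
    {"product_key": 1, "SKU": "A-100", "description": "Mint candy",
     "brand": "Acme", "category": "Candy", "department": "Grocery"},
    {"product_key": 2, "SKU": "B-200", "description": "Bar soap",
     "brand": "Clean", "category": "Soap", "department": "Household"},
]

def vertical_partition(rows, key, hot_columns):
    """Split each row into a narrow 'hot' row and a 'cold' row sharing the key."""
    hot, cold = [], []
    for row in rows:
        hot.append({c: row[c] for c in (key, *hot_columns)})
        cold.append({c: v for c, v in row.items()
                     if c == key or c not in hot_columns})
    return hot, cold

def rejoin(hot, cold, key):
    """Rebuild full rows by joining the two partitions on the key."""
    cold_by_key = {row[key]: row for row in cold}
    return [{**h, **cold_by_key[h[key]]} for h in hot]

hot, cold = vertical_partition(products, "product_key", ["SKU"])
print(hot[0])  # {'product_key': 1, 'SKU': 'A-100'}
```

Queries that touch only the hot columns read the narrow table; the rejoin cost is paid only by the rare queries that need everything.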

Derived Data

• Introduction of derived (calculated) data may often help
• Have seen this in the context of dual levels of granularity
• Can keep auxiliary views and indexes to speed up query processing

Denormalization

• Normalization in a data warehouse may lead to lots of small tables
• Can lead to excessive I/Os since many tables have to be accessed
• Denormalization is the answer, especially since updates are rare

Creating Arrays

• Many times, each occurrence of a sequence of data is in a different physical location
• Beneficial to collect all occurrences together and store them as an array in a single row
• Makes sense only if there is a stable number of occurrences which are accessed together
• In a data warehouse, such situations arise naturally due to the time-based orientation
  – can create an array by month
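The array-by-month idea can be sketched as follows, assuming hypothetical input rows of (account, month number, amount) and filling months without data with 0.0; both the layout and the fill value are assumptions of this example.

```python
# Hypothetical rows scattered one per (account, month): (account, month 1-12, amount).
monthly_rows = [
    ("acct-1", 1, 100.0), ("acct-2", 1, 40.0),
    ("acct-1", 2, 120.0), ("acct-1", 3, 90.0), ("acct-2", 3, 55.0),
]

def to_month_arrays(rows, months=12):
    """Collect each account's occurrences into one fixed-length array row."""
    arrays = {}
    for account, month, amount in rows:
        arrays.setdefault(account, [0.0] * months)[month - 1] = amount
    return arrays

by_account = to_month_arrays(monthly_rows)

# One row lookup now yields the whole year; a quarter is a slice of it.
first_quarter = sum(by_account["acct-1"][:3])
print(first_quarter)  # 310.0
```

The fixed length of twelve is what makes this workable: the number of occurrences is stable, as the slide requires.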

Selective Redundancy

• Description of an item can be stored redundantly with the order table -- most often the item description is accessed together with the order table
• Updates have to be handled carefully

Partitioning

• Breaking data into several physical units that can be handled separately
• Not a question of whether to do it in data warehouses but how to do it
• Granularity and partitioning are key to effective implementation of a warehouse

Why Partitioning?

• Flexibility in managing data
• Smaller physical units allow
  – easy restructuring
  – free indexing
  – sequential scans if needed
  – easy reorganization
  – easy recovery
  – easy monitoring

Criterion for Partitioning

• Typically partitioned by
  – date
  – line of business
  – geography
  – organizational unit
  – any combination of above

Where to Partition?

• Application level or DBMS level
• Makes sense to partition at application level
  – Allows different definition for each year
    • Important since warehouse spans many years and as business evolves definition changes
  – Allows data to be moved between processing complexes easily


Indexing Techniques

• Bitmap index:
  – A collection of bitmaps -- one for each distinct value of the column
  – Each bitmap has N bits where N is the number of rows in the table
  – A bit corresponding to a value v for a row r is set if and only if r has the value v for the indexed attribute

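Bitmap Index

Customer query: select * from customer where gender = 'F' and vote = 'Y'

[Figure: a customer table with a gender column (M, F, F, F, F, M) and a vote column (Y, Y, Y, N, N, N), shown next to the corresponding 0/1 bitmaps that are combined to answer the query]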
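The customer query can be answered with a single bitwise AND once the bitmaps exist. The sketch below assumes the gender and vote values from the figure are listed in row order, numbers rows from 0, and stores each bitmap as a Python integer; the helper names are hypothetical.

```python
# Column values read from the figure, assumed to be in row order.
gender = ["M", "F", "F", "F", "F", "M"]
vote = ["Y", "Y", "Y", "N", "N", "N"]

def build_bitmap_index(column):
    """Return {value: int bitmap}; bit i is set iff row i holds that value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def rows_of(bitmap):
    """List the row numbers whose bits are set."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]

gender_index = build_bitmap_index(gender)
vote_index = build_bitmap_index(vote)

# gender = 'F' AND vote = 'Y' becomes one bitwise AND of two bitmaps.
matches = rows_of(gender_index["F"] & vote_index["Y"])
print(matches)  # [1, 2]
```

Each column has only two distinct values here, so each index is just two bitmaps of N bits, which is why bitmap indexes suit low-cardinality attributes.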

Join Indexes

• Pre-computed joins
• A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute
  – e.g., a join index on city dimension of calls fact table
  – correlates for each city the calls (in the calls table) that originated from that city

Join Indexes

• Join indexes can also span multiple dimension tables
  – e.g., a join index on city and time dimension of calls fact table
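A join index can be pictured as a precomputed map from a dimension value to the matching fact row ids. The following sketch assumes a hypothetical calls fact table of (city, month, minutes) rows whose row id is simply the list position; it builds one index on city and one spanning city and time.

```python
from collections import defaultdict

# Hypothetical calls fact rows: (city, month, minutes); row id = list position.
calls = [
    ("Pune", "Jan", 3), ("Mumbai", "Jan", 8),
    ("Pune", "Feb", 5), ("Pune", "Jan", 2),
]

def build_join_index(fact_rows, key_of):
    """Map each dimension key to the ids of the fact rows that join with it."""
    index = defaultdict(list)
    for row_id, row in enumerate(fact_rows):
        index[key_of(row)].append(row_id)
    return index

city_index = build_join_index(calls, lambda r: r[0])
city_time_index = build_join_index(calls, lambda r: (r[0], r[1]))

# Calls that originated from Pune, found without scanning the fact table.
pune_minutes = sum(calls[i][2] for i in city_index["Pune"])
print(pune_minutes)                       # 10
print(city_time_index[("Pune", "Jan")])   # [0, 3]
```

The index is built once at load time, so the join work is paid when the data arrives rather than on every query.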

Star Join Processing

• Use join indexes to join dimension and fact table

[Diagram: Calls is joined with Time (C+T), then with Location (C+T+L), then with Plan (C+T+L+P)]

Optimized Star Join Processing

[Diagram: Apply Selections to Time, Location, and Plan; a Virtual Cross Product of T, L, and P is joined with Calls]

Bitmapped Join Processing

[Diagram: Time, Location, and Plan are each matched against Calls, producing the bitmaps 101, 001, and 110, which are then combined with AND]

Intelligent Scan

• Piggyback multiple scans of a relation (Redbrick)
  – piggybacking also done if second scan starts a little while after the first scan

Parallel Query Processing

• Three forms of parallelism
  – Independent
  – Pipelined
  – Partitioned and "partition and replicate"

• Deterrents to parallelism
  – startup
  – communication

Parallel Query Processing

• Partitioned Data
  – Parallel scans
  – Yields I/O parallelism

• Parallel algorithms for relational operators
  – Joins, Aggregates, Sort

• Parallel Utilities
  – Load, Archive, Update, Parse, Checkpoint, Recovery

• Parallel Query Optimization

Pre-computed Aggregates

• Keep aggregated data for efficiency (pre-computed queries)

• Questions
  – Which aggregates to compute?
  – How to update aggregates?
  – How to use pre-computed aggregates in queries?

Pre-computed Aggregates

• Aggregated table can be maintained by the
  – warehouse server
  – middle tier
  – client applications

• Pre-computed aggregates -- special case of materialized views -- same questions and issues remain
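Two of the questions about pre-computed aggregates, how to update them and how to use them in queries, can be illustrated with a toy example. This is a sketch under stated assumptions: the detail rows are hypothetical (month, city, sales) tuples, the measure is a SUM (which can be maintained incrementally), and the function names are invented.

```python
from collections import defaultdict

# Hypothetical detail rows: (month, city, sales).
detail = [("Jan", "Pune", 10), ("Jan", "Mumbai", 20), ("Feb", "Pune", 5)]

def build_aggregate(rows):
    """Pre-compute SUM(sales) grouped by (month, city)."""
    agg = defaultdict(int)
    for month, city, sales in rows:
        agg[(month, city)] += sales
    return agg

def refresh(agg, new_rows):
    """Incremental update: fold only the newly loaded rows into the aggregate."""
    for month, city, sales in new_rows:
        agg[(month, city)] += sales

def sales_by_month(agg):
    """Answer a coarser query from the aggregate instead of the detail rows."""
    totals = defaultdict(int)
    for (month, _city), sales in agg.items():
        totals[month] += sales
    return dict(totals)

month_city = build_aggregate(detail)
refresh(month_city, [("Feb", "Pune", 7)])
print(sales_by_month(month_city))  # {'Jan': 30, 'Feb': 12}
```

Rewriting "sales by month" to read the (month, city) aggregate works because SUM can be re-aggregated; a measure such as median could not be answered this way.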

Summary Management

[Diagram labels: Operational Data, Extract, Staging File, Transforms, Incremental Details, Incremental Load and Refresh, Details and Aggregates, Query Rewrite, Extraction, MDDB, Analysis Tools]

SQL Extensions

• Extended family of aggregate functions
  – rank (top 10 customers)
  – percentile (top 30% of customers)
  – median, mode
  – Object Relational Systems allow addition of new aggregate functions

SQL Extensions

• Reporting features
  – running total, cumulative totals

• Cube operator
  – group by on all subsets of a set of attributes (month, city)
  – redundant scan and sorting of data can be avoided
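What the cube operator computes for (month, city) can be sketched in plain Python. This is a naive illustration with invented rows, not how a database engine implements it: a single pass over the data updates every grouping, from the grand total () up to (month, city), so the data is scanned only once.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows using the (month, city) attributes from the slide.
rows = [
    {"month": "Jan", "city": "Pune", "sales": 10},
    {"month": "Jan", "city": "Mumbai", "sales": 20},
    {"month": "Feb", "city": "Pune", "sales": 5},
]

def cube(rows, attributes, measure):
    """Group by every subset of attributes in one pass over the rows."""
    groupings = [subset for n in range(len(attributes) + 1)
                 for subset in combinations(attributes, n)]
    result = {g: defaultdict(int) for g in groupings}
    for row in rows:
        for g in groupings:
            result[g][tuple(row[a] for a in g)] += row[measure]
    return result

sales_cube = cube(rows, ("month", "city"), "sales")
print(sales_cube[()][()])                   # 35, the grand total
print(sales_cube[("month",)][("Jan",)])     # 30
print(sales_cube[("city",)][("Pune",)])     # 15
```

With n attributes there are 2^n groupings, here four: (), (month), (city), and (month, city).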

Server Scalability

• Scalability is the #1 IT requirement for Data Warehousing

• Hardware Platform options
  – SMP
  – Clusters (shared disk)
  – MPP
    • Loosely coupled (shared nothing)
    • Hybrid

SMP Characteristics

• SMP -- Symmetric multiprocessing -- shared everything
• Multiple CPUs share the same memory
• Workload is balanced across CPUs by the OS
• Scalability is limited by the bandwidth of the internal bus and the OS architecture
• Not tolerant to failure in a processing node
• Architecture is mostly invisible to applications

SMP Benefits

• Lower entry point -- can start with SMP
• Mature technology

MPP Characteristics

• Each node owns a portion of the database
• Nodes are connected via an interconnection network
• Each node can be a single CPU or SMP
• Load balancing done by application
• High scalability due to local processing isolation

MPP Benefits

• High availability
• High scalability

Viewing the Data with OLAP

Making Decision Support Possible

Limitations of SQL

"A Freshman in Business needs a Ph.D. in SQL"
-- Ralph Kimball

Typical OLAP Queries

• Write a multi-table join to compare sales for each product line YTD this year vs. last year.
• Repeat the above process to find the top 5 product contributors to margin.
• Repeat the above process to find the sales of a product line to new vs. existing customers.
• Repeat the above process to find the customers that have had negative sales growth.

What Is OLAP?

• Online Analytical Processing -- term coined by E. F. Codd in a 1994 paper contracted by Arbor Software*
• Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
• OLAP = Multidimensional Database
• MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
• ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)

* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html

Strengths of OLAP

• It is a powerful visualization paradigm
• It provides fast, interactive response times
• It is good for analyzing time series
• It can be useful to find some clusters and outliers
• Many vendors offer OLAP tools

Multi-dimensional Data

• "Hey…I sold $100M worth of goods"

[Figure: a data cube with the axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N), and Month (1 to 7)]

Dimensions: Product, Region, Time

Hierarchical summarization paths:

Product     Region    Time
Industry    Country   Year
Category    Region    Quarter
Product     City      Month, Week
            Office    Day

Visualizing Neighbors is simpler

[Figure: the same sales data shown two ways: as a grid of months (Apr through Mar) against stores (1 through 8), and as a flat table with the columns Month, Store, Sales (Apr 1, Apr 2, ..., Apr 8, May 1, ..., Jun 2, ...)]

"Slicing and Dicing"

[Figure: a cube with the axes Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special), and Regions (India, Far East, Europe), with the Telecomm slice highlighted]

Roll-up and Drill Down

• Sales Channel
• Region
• Country
• State
• Location Address
• Sales Representative

[Diagram: Roll Up leads to a higher level of aggregation; Drill-Down leads to low-level details]
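Roll-up and drill-down amount to re-aggregating the same detail rows at a different depth of the hierarchy. The sketch below assumes a shortened, hypothetical path Region > Country > State with invented sales figures; the LEVELS tuple and the aggregate function are illustrative names.

```python
from collections import defaultdict

# Hypothetical sales recorded at the State level of Region > Country > State.
sales = [
    ("Asia", "India", "Maharashtra", 50),
    ("Asia", "India", "Karnataka", 30),
    ("Asia", "Japan", "Tokyo", 40),
    ("Europe", "France", "Ile-de-France", 25),
]
LEVELS = ("region", "country", "state")

def aggregate(rows, level):
    """Total the measure at the given hierarchy level."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(int)
    for *path, amount in rows:
        totals[tuple(path[:depth])] += amount
    return dict(totals)

by_country = aggregate(sales, "country")
by_region = aggregate(sales, "region")   # roll up: higher level of aggregation
by_state = aggregate(sales, "state")     # drill down: low-level details
print(by_region)  # {('Asia',): 120, ('Europe',): 25}
```

Drilling down is only possible because the detail rows are kept; the by_region totals alone could not be split back into countries.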

Nature of OLAP Analysis

• Aggregation -- (total sales, percent-to-total)
• Comparison -- Budget vs. Expenses
• Ranking -- Top 10, quartile analysis
• Access to detailed and aggregate data
• Complex criteria specification
• Visualization

Organizationally Structured Data

• Different departments look at the same detailed data in different ways. Without the detailed, organizationally structured data as a foundation, there is no reconcilability of data.

[Diagram: marketing, manufacturing, sales, and finance views of the same detailed data]

Multidimensional Spreadsheets

• Analysts need spreadsheets that support
  – pivot tables (cross-tabs)
  – drill-down and roll-up
  – slice and dice
  – sort
  – selections
  – derived attributes

• Popular in retail domain

SQL Extensions

• Front-end tools require
  – Extended Family of Aggregate Functions
    • rank, median, mode
  – Reporting Features
    • running totals, cumulative totals
  – Results of multiple group by
    • total sales by month and total sales by product
  – Data Cube

Red Brick Formation™

Extraction, Transformation, and Data Loading

The Red Brick DecisionScape Environment

[Diagram: Corporate Data and Purchased Data pass through Red Brick Formation (Extract, Integrate, Transform, Cleanse, Transport, Load) and the Red Brick High Performance Loader into the Red Brick Warehouse RDBMS, which serves End User Query and, through Red Brick Data Mine, Data Mining. Also labeled: Services, Products, Methodology, Partnerships.]

Loading the Data Warehouse

• Data extraction, data transformation, and data loading is…
  – 40% of total cost
  – 80% of total time and effort

It’s All About Data Transformation

• Extensive Data Manipulation:
  – Integrating data from dissimilar sources
  – Cleansing source data
  – Creating and storing aggregates & summaries
  – Deriving new data
  – Modifying existing data
  – Adapting to change in the business
  – Capturing and storing metadata (documentation)

Traditional “Solutions”

• What are they?
  – 3GL (COBOL, BASIC, PL1, C, etc)
  – 4GL (SAS, FOCUS, EASYTRIEVE, etc)

• Why use them?
  – Availability of programming staff
  – Known technology
  – “I can do it better” attitude of in-house staff

• Why you shouldn’t use them:
  – Little to no metadata
  – Expensive to develop and maintain in an iterative development environment

• The Better Choice … A Data Extraction/Transformation Tool

Red Brick Formation: A Better Choice

• Model-driven transformation results in:

• Development productivity:
  – 2x more productive than hand-coding.
  – Less experience required to create jobs.

• Maintenance productivity:
  – 8-10x more productive than hand-coding.
  – Extensive code re-use
  – Self-documenting at every step.

Red Brick Formation Key Product Highlights

• Flexible and Easy to Use
  – Visual data flow diagramming with optimized, pre-built operators

• Scalability & Performance
  – Generates standard C++ code
  – Runs on Windows NT and UNIX
  – Single pass handles multiple sources and targets
  – Intelligent parallelization and synchronization

• Extensibility & ROI
  – 3rd-party integration
  – Productive use of people resource

• Added Bonus
  – Integration with Red Brick Warehouse

Red Brick Formation Key Product Features/Benefits

• Extensive Pre-built, Optimized Operators & Functions
  – Job performance is enhanced and job development is more standardized by utilizing pre-built operators and functions

• Visual, Process-Oriented Model for Designing & Maintaining Jobs
  – Formation’s unique coding of transformation rules via Visual Snippets makes development & maintenance easy

• Single Pass Processing of Multiple, Heterogeneous Sources & Targets
  – Single pass processing means fewer steps and less I/O overhead, leading to more efficient processing of data

• Scalable Client/Server Architecture
  – Scalable architecture means jobs designed for pilot or prototype will also work within “assigned windows” to process production volume of data

Red Brick Formation: Visual, process-oriented model

Red Brick Formation: Any Number of Data Sources

• Any heterogeneous mix of
  – Red Brick Warehouse
  – Flat Files (Fixed and Delimited)
  – RDBMS (Oracle, Microsoft SQLServer)

Red Brick Formation: Robust Data Transformation Operators

• Operators
  – Aggregate
  – Cursor
  – Deduplicate
  – Filter
  – Partition
  – Gather
  – Group By
  – Household
  – Join
  – Advanced Join
  – Cross Product Join
  – First Normal
  – Program
  – Sort
  – Split
  – Union
  – File Import
  – File Export
  – Red Brick Import
  – Red Brick Export
  – MS SQLServer Import
  – Oracle Import
  – Oracle Export

Red Brick Formation: More than 200 Built-in Functions

• Data Types
  – Integer, Unsigned Integer
  – Float, Double, Decimal
  – Date, Time, Timestamp, Interval
  – Text (Fixed and Variable) and BLOB

• Functions
  – Math (Add, Subtract, Multiply, Divide, Power, Square Root, Absolute Value, Max, Min and more)
  – Comparison (EQ, NE, GT, GE, LT, LE)
  – Logical (And, Or, Not)
  – Text (Search and Compare, Concatenations, Substring, Upcase, Downcase, and more)
  – Data Type Conversions (Implicit & Explicit)

Red Brick Formation: Major Components

Client / Server Implementation

• Client: Formation Architect and Integrated Metadata Facility
  – Intel/Windows NT

• Server: Formation Flow Engine
  – Intel/Windows NT
  – HP-UX
  – Sun Solaris
  – Compaq’s Digital Unix
  – IBM AIX

Red Brick Formation Architecture Overview

[Diagram: on the client (Windows NT), Formation Architect takes the extract and transform requirements and generates code. The code is moved to the server (Windows NT, HP-UX, Sun Solaris, Compaq’s Digital UNIX, IBM AIX), where the Formation Flow Engine compiles and links the code, executes the jobs, and autoloads the database in Red Brick Warehouse.]

Red Brick Formation Flow Engine: Intelligent Parallelization & Synchronization

[Diagram: operators (Op) between Source and Target, arranged into Group 1, Group 2, and Group 3 and connected through a Buffer]

Red Brick Formation Operator Templates

[Diagram labels: Operator Templates / Server Services, Generated C++ Code]

Red Brick Formation Demonstration

The Problem?

• An input file contains last night’s sales orders.
• Want to select items ordered that were sold at list price, not discounted.

The Solution?

• Red Brick Formation
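To make the demonstration problem concrete, the selection itself can be written as a plain filter. This sketch does not use or imitate Red Brick Formation; it assumes a hypothetical comma-separated order file whose column names (order_id, item, list_price, sale_price) and rows are invented for illustration.

```python
import csv
import io
from decimal import Decimal

# Hypothetical layout of last night's order file; columns and rows are invented.
orders_file = io.StringIO(
    "order_id,item,list_price,sale_price\n"
    "1,widget,10.00,10.00\n"
    "2,gadget,25.00,20.00\n"
    "3,gizmo,7.50,7.50\n"
)

def sold_at_list_price(lines):
    """Yield the order rows whose sale price equals the list price."""
    for row in csv.DictReader(lines):
        # Decimal avoids float rounding when comparing money amounts.
        if Decimal(row["sale_price"]) == Decimal(row["list_price"]):
            yield row

full_price_items = [row["item"] for row in sold_at_list_price(orders_file)]
print(full_price_items)  # ['widget', 'gizmo']
```

In a transformation tool the same step would be a single filter operator between a file import and a file export.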

Red Brick Formation: Summary of Benefits

Red Brick Formation automates the process of data warehouse generation and maintenance.

Designed for:

• Simplicity & Flexibility
• Scalability & Performance
• Changing as your business changes
