An Introduction to Data Warehousing
Anand Deshpande
Persistent Systems Pvt. Ltd.
http://www.persistent.co.in
Jan 06, 2016
Optimizing the Warehouse for Decision Support
IIT DB Class 3
Data -- Heart of the Data Warehouse
• Heart of the data warehouse is the data itself!
• Single version of the truth
• Corporate memory
• Data is organized in a way that represents business -- subject orientation
Consider a Retail Sales Example
• A retail chain sells products in retail stores.
• We want to track the effect of promotions on retail sales.
• Product Table (60,000 SKUs)
• Store Table (500 stores)
• Promotion (3000 promotions per week)
• Sales (2 Billion)
• What are the total dollar sales and the total dollar costs of all candy sold in supermarket stores on Saturdays?
SELECT p.category, sum(f.dollars_sold), sum(f.dollars_cost)
FROM sales_fact f, product p, time t, store s
WHERE f.product_key = p.product_key
and f.time_key = t.time_key
and f.store_key = s.store_key
and p.category = 'Candy'
and t.day_of_week = 'Saturday'
and s.floor_plan_type = 'Super_Market'
GROUP BY p.category
Time Dimension: time_key, SQL_date, day_of_week, month, fiscal_period, season
Store Dimension: store_key, store_ID, store_name, address, region, division, floor_plan_type
Product Dimension: product_key, SKU, description, brand, category, department, package_type, size, … etc.
Promotion Dimension: promotion_key, promotion_name, price_type, ad_type, display_type, … etc.
Sales Fact Table: time_key, product_key, store_key, promotion_key, dollars_sold, units_sold, dollars_cost
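The candy query can be tried against a toy version of this schema. The sketch below uses Python's built-in sqlite3 module; the table contents are made up for illustration, and only the columns the query touches are created.

```python
import sqlite3

# In-memory database with a cut-down star schema (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE time (time_key INTEGER PRIMARY KEY, day_of_week TEXT);
CREATE TABLE store (store_key INTEGER PRIMARY KEY, floor_plan_type TEXT);
CREATE TABLE sales_fact (time_key INTEGER, product_key INTEGER, store_key INTEGER,
                         dollars_sold REAL, dollars_cost REAL);
INSERT INTO product VALUES (1, 'Candy'), (2, 'Soap');
INSERT INTO time VALUES (1, 'Saturday'), (2, 'Monday');
INSERT INTO store VALUES (1, 'Super_Market'), (2, 'Kiosk');
INSERT INTO sales_fact VALUES
  (1, 1, 1, 10.0, 6.0),
  (1, 1, 1, 5.0, 3.0),
  (2, 1, 1, 7.0, 4.0),
  (1, 2, 1, 9.0, 5.0),
  (1, 1, 2, 8.0, 5.0);
""")

# The fact table is joined to each dimension on its key; the dimension
# attributes carry the constraints (category, day of week, floor plan).
row = conn.execute("""
SELECT p.category, sum(f.dollars_sold), sum(f.dollars_cost)
FROM sales_fact f, product p, time t, store s
WHERE f.product_key = p.product_key
  AND f.time_key = t.time_key
  AND f.store_key = s.store_key
  AND p.category = 'Candy'
  AND t.day_of_week = 'Saturday'
  AND s.floor_plan_type = 'Super_Market'
GROUP BY p.category
""").fetchone()
print(row)
```

Only the two fact rows that satisfy all three dimension constraints contribute to the sums.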
• SELECT NON-AGGREGATE FIELDNAME1, NON-AGGREGATE FIELDNAME2, SUM(AGGREGATE FIELDNAME3), SUM(AGGREGATE FIELDNAME4), SUM(AGGREGATE FIELDNAME5)
FROM DIMENSION TABLE1, DIMENSION TABLE2, DIMENSION TABLE3, DIMENSION TABLE4, FACT TABLE
WHERE JOINCONDITION1 AND JOINCONDITION2 AND JOINCONDITION3 AND JOINCONDITION4 AND DIMENSIONCONSTRAINT1 AND DIMENSIONCONSTRAINT2 AND DIMENSIONCONSTRAINT3 AND DIMENSIONCONSTRAINT4 AND DIMENSIONCONSTRAINT5 AND DIMENSIONCONSTRAINT6
GROUP BY NON-AGGREGATE FIELDNAME1, NON-AGGREGATE FIELDNAME2
ORDER BY NON-AGGREGATE FIELDNAME1, NON-AGGREGATE FIELDNAME2
A Real Data Warehouse Query
Dwquery.txt
Schema Design
• Database organization
  – must look like business
  – must be recognizable by business user
  – approachable by business user
  – must be simple
• Schema Types
  – Star Schema
  – Fact Constellation Schema
  – Snowflake schema
Dimension Tables
• Dimension tables
  – Define business in terms already familiar to users
  – Wide rows with lots of descriptive text
  – Small tables (about a million rows)
  – Joined to fact table by a foreign key
  – Heavily indexed
  – Typical dimensions
    • time periods, geographic region (markets, cities), products, customers, salesperson, etc.
Fact Table
• Central table
  – mostly raw numeric items
  – narrow rows, a few columns at most
  – large number of rows (millions to a billion)
  – access via dimensions
Star Schema
• A single fact table and for each dimension one dimension table
• Does not capture hierarchies directly
[Figure: fact table (date, custno, prodno, cityname, ...) linked to the dimension tables time, prod, cust and city]
Snowflake schema
• Represent dimensional hierarchy directly by normalizing tables.
• Easy to maintain and saves storage
[Figure: fact table (date, custno, prodno, cityname, ...) linked to the dimension tables time, prod, cust and city, with an additional region table]
Fact Constellation
• Fact Constellation
  – Multiple fact tables that share many dimension tables
  – Booking and Checkout may share many dimension tables in the hotel industry
[Figure: fact tables Booking and Checkout with dimension tables Hotels, Travel Agents, Promotion, Room Type and Customer]
Data Warehouse Structure
• Subject Orientation -- customer, product, policy, account etc... A subject may be implemented as a set of related tables. E.g., customer may be five tables
Data Granularity in Warehouse
• Summarized data stored
  – reduces storage costs
  – reduces CPU usage
  – increases performance since a smaller number of records has to be processed
  – design around traditional high level reporting needs
  – tradeoff with volume of data to be stored and detailed usage of data
Granularity in Warehouse
• Cannot answer some questions with summarized data
  – Did Anand call Seshadri last month? Not possible to answer if only the total duration of calls by Anand over a month is maintained and individual call details are not.
• Detailed data too voluminous
Granularity in Warehouse
• Tradeoff is to have dual level of granularity
  – Store summary data on disks
    • 95% of DSS processing done against this data
  – Store detail on tapes
    • 5% of DSS processing against this data
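The granularity tradeoff can be made concrete with a small sketch. The call records below are hypothetical; the point is only that a monthly summary answers "how long" but not "who called whom".

```python
from collections import defaultdict

# Hypothetical detail records: (caller, callee, month, minutes).
calls = [
    ("Anand", "Seshadri", "2016-01", 12),
    ("Anand", "Someone", "2016-01", 3),
    ("Seshadri", "Anand", "2016-01", 7),
]

# Summary level: total minutes per caller per month.
summary = defaultdict(int)
for caller, callee, month, minutes in calls:
    summary[(caller, month)] += minutes

# Answerable from the summary alone.
total_minutes = summary[("Anand", "2016-01")]

# Answerable only from the detail: the callee is not kept in the summary.
did_call = any(
    caller == "Anand" and callee == "Seshadri" and month == "2016-01"
    for caller, callee, month, _ in calls
)
print(total_minutes, did_call)
```

The summary holds one record per caller and month however many calls were made, which is why it is the level kept on fast storage.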
Vertical Partitioning
[Figure: the product table (Product_key, SKU, description, brand, category, department, package_type, size, … etc.) is split into two tables that share Product_key: one with Product_key, SKU, ... and one with Product_key, brand, category, department, ...; labels: frequently accessed, rarely accessed, smaller table and so less I/O]
Derived Data
• Introduction of derived (calculated) data may often help
• Have seen this in the context of dual levels of granularity
• Can keep auxiliary views and indexes to speed up query processing
Denormalization
• Normalization in a data warehouse may lead to lots of small tables
• Can lead to excessive I/Os since many tables have to be accessed
• Denormalization is the answer, especially since updates are rare
Creating Arrays
• Many times, each occurrence of a sequence of data is in a different physical location
• Beneficial to collect all occurrences together and store them as an array in a single row
• Makes sense only if there is a stable number of occurrences which are accessed together
• In a data warehouse, such situations arise naturally due to time based orientation
  » can create an array by month
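As a rough sketch of the array-by-month idea, the function below collects scattered per-month rows into one 12-slot array per key. The account ids and values are hypothetical; the fixed number of months is what makes the layout stable.

```python
# Hypothetical scattered rows: (account, month 1..12, balance).
rows = [("A1", 1, 100), ("A1", 2, 120), ("A2", 1, 50), ("A1", 3, 90)]

def to_monthly_arrays(rows, months=12):
    """Collect all occurrences for a key into one fixed-size array (one 'row' per key)."""
    arrays = {}
    for account, month, value in rows:
        arr = arrays.setdefault(account, [None] * months)
        arr[month - 1] = value  # month 1 goes to slot 0
    return arrays

arrays = to_monthly_arrays(rows)
print(arrays["A1"][:3])
```

Reading a whole year for one account then touches a single record instead of up to twelve.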
Selective Redundancy
• Description of an item can be stored redundantly with the order table -- most often the item description is accessed together with the order table
• Updates must be handled carefully
Partitioning
• Breaking data into several physical units that can be handled separately
• Not a question of whether to do it in data warehouses but how to do it
• Granularity and partitioning are key to effective implementation of a warehouse
Why Partitioning?
• Flexibility in managing data
• Smaller physical units allow
  – easy restructuring
  – free indexing
  – sequential scans if needed
  – easy reorganization
  – easy recovery
  – easy monitoring
Criterion for Partitioning
• Typically partitioned by
  – date
  – line of business
  – geography
  – organizational unit
  – any combination of above
Where to Partition?
• Application level or DBMS level
• Makes sense to partition at application level
  – Allows different definition for each year
    • Important since the warehouse spans many years and, as the business evolves, the definition changes
  – Allows data to be moved between processing complexes easily
Indexing Techniques
• Bitmap index:
  – A collection of bitmaps -- one for each distinct value of the column
  – Each bitmap has N bits where N is the number of rows in the table
  – The bit for row r in the bitmap of value v is set if and only if r has the value v for the indexed attribute
Bitmap Index
Customer
Query: select * from customer where gender = 'F' and vote = 'Y'
[Figure: customer table with a gender column (M, F, F, F, F, M) and a vote column (Y, Y, Y, N, N, N), shown next to the corresponding bitmaps]
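A minimal sketch of how such a query can be answered from bitmaps alone, using Python integers as bit vectors. The column values are taken from the figure; pairing them row by row (row 0 = M/Y, row 1 = F/Y, and so on) is an assumption made for illustration.

```python
def build_bitmap_index(values):
    """Return {distinct value: bitmap}; bit i of a bitmap is set iff row i has that value."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index

# Column values as listed in the figure (row alignment assumed).
gender = ["M", "F", "F", "F", "F", "M"]
vote = ["Y", "Y", "Y", "N", "N", "N"]

gender_idx = build_bitmap_index(gender)
vote_idx = build_bitmap_index(vote)

# gender = 'F' AND vote = 'Y' becomes a single bitwise AND of two bitmaps.
match = gender_idx["F"] & vote_idx["Y"]
matching_rows = [r for r in range(len(gender)) if (match >> r) & 1]
print(matching_rows)
```

The table itself is consulted only for the rows whose bit survives the AND.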
Join Indexes
• Pre-computed joins
• A join index between a fact table and a dimension table correlates a dimension tuple with the fact tuples that have the same value on the common dimensional attribute
  – e.g., a join index on city dimension of calls fact table
  – correlates for each city the calls (in the calls table) that originated from that city
Join Indexes
• Join indexes can also span multiple dimension tables
  – e.g., a join index on city and time dimension of calls fact table
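One way to picture a join index is as a pre-computed map from dimension values to positions of fact rows. The sketch below is an in-memory illustration with a hypothetical calls table, not how any particular DBMS stores the index.

```python
from collections import defaultdict

# Hypothetical calls fact table: (call_id, city, month, minutes).
calls = [
    (0, "Pune", "Jan", 5),
    (1, "Mumbai", "Jan", 3),
    (2, "Pune", "Feb", 8),
    (3, "Pune", "Jan", 2),
]

def build_join_index(fact_rows, key):
    """Map each dimension value (or tuple of values) to the positions of matching fact rows."""
    index = defaultdict(list)
    for pos, row in enumerate(fact_rows):
        index[key(row)].append(pos)
    return index

# Join index on the city dimension.
city_index = build_join_index(calls, key=lambda r: r[1])

# Join index spanning the city and time dimensions.
city_time_index = build_join_index(calls, key=lambda r: (r[1], r[2]))

# Total minutes for calls from Pune in January, without scanning the fact table.
pune_jan_minutes = sum(calls[pos][3] for pos in city_time_index[("Pune", "Jan")])
print(city_index["Pune"], pune_jan_minutes)
```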
Star Join Processing
• Use join indexes to join dimension and fact table
[Figure: Calls is joined with Time (C+T), then Location (C+T+L), then Plan (C+T+L+P)]
Optimized Star Join Processing
[Figure: Time, Location, Plan and Calls; labels: Apply Selections, Virtual Cross Product of T, L and P]
Bitmapped Join Processing
[Figure: Time, Location and Plan each yield a bitmap over Calls (101, 001, 110); the bitmaps are combined with AND]
Intelligent Scan
• Piggyback multiple scans of a relation (Redbrick)
  – piggybacking also done if second scan starts a little while after the first scan
Parallel Query Processing
• Three forms of parallelism
  – Independent
  – Pipelined
  – Partitioned and “partition and replicate”
• Deterrents to parallelism
  – startup
  – communication
Parallel Query Processing
• Partitioned Data
  – Parallel scans
  – Yields I/O parallelism
• Parallel algorithms for relational operators
  – Joins, Aggregates, Sort
• Parallel Utilities
  – Load, Archive, Update, Parse, Checkpoint, Recovery
• Parallel Query Optimization
Pre-computed Aggregates
• Keep aggregated data for efficiency (pre-computed queries)
• Questions
  – Which aggregates to compute?
  – How to update aggregates?
  – How to use pre-computed aggregates in queries?
Pre-computed Aggregates
• Aggregated table can be maintained by the
  – warehouse server
  – middle tier
  – client applications
• Pre-computed aggregates -- special case of materialized views -- same questions and issues remain
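The questions on updating and using aggregates can be illustrated with a small sketch. The detail rows and the choice of a (month, product) aggregate are assumptions; the example shows a coarser query answered from the aggregate and an incremental refresh when new detail arrives.

```python
from collections import Counter

# Hypothetical detail facts: (day, month, product, dollars).
detail = [
    ("2016-01-02", "2016-01", "Cola", 10),
    ("2016-01-09", "2016-01", "Cola", 5),
    ("2016-01-09", "2016-01", "Milk", 4),
    ("2016-02-01", "2016-02", "Cola", 6),
]

# Pre-computed aggregate: dollars by (month, product).
agg = Counter()
for _, month, product, dollars in detail:
    agg[(month, product)] += dollars

def sales_by_month(agg):
    """Answer a coarser query (by month) from the aggregate instead of scanning the detail."""
    out = Counter()
    for (month, _product), dollars in agg.items():
        out[month] += dollars
    return out

def apply_increment(agg, new_rows):
    """Incremental refresh: fold only the newly loaded rows into the aggregate."""
    for _, month, product, dollars in new_rows:
        agg[(month, product)] += dollars

apply_increment(agg, [("2016-02-03", "2016-02", "Milk", 2)])
monthly = sales_by_month(agg)
print(dict(monthly))
```

This works because SUM can be rolled up from a finer aggregate; a measure such as median could not be refreshed or reused this way.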
Summary Management
[Figure labels: Operational Data; Extract; Incremental Details; Transforms; Staging File; Incremental Load and Refresh; Details and Aggregates; Query Rewrite; Extraction; MDDB; Analysis Tools]
SQL Extensions
• Extended family of aggregate functions
  – rank (top 10 customers)
  – percentile (top 30% of customers)
  – median, mode
  – Object Relational Systems allow addition of new aggregate functions
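To show what these aggregates compute, here is a plain Python sketch over hypothetical customer revenue figures. It illustrates the semantics only and is not the syntax of any SQL extension.

```python
import statistics

# Hypothetical revenue per customer.
revenue = {"A": 500, "B": 300, "C": 300, "D": 120, "E": 80,
           "F": 60, "G": 40, "H": 30, "I": 20, "J": 10}

# rank: order customers by revenue, highest first (ties keep insertion order).
ranked = sorted(revenue, key=revenue.get, reverse=True)
top_5 = ranked[:5]

# percentile: the top 30% of customers by revenue.
top_30_percent = ranked[: max(1, len(ranked) * 30 // 100)]

# median and mode of the revenue values.
values = list(revenue.values())
median_revenue = statistics.median(values)
mode_revenue = statistics.mode(values)
print(top_5, top_30_percent, median_revenue, mode_revenue)
```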
SQL Extensions
• Reporting features
  – running total, cumulative totals
• Cube operator
  – group by on all subsets of a set of attributes (month, city)
  – redundant scan and sorting of data can be avoided
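The cube operator can be sketched as a group-by over every subset of the attributes, computed in one pass over the data. The rows below are hypothetical, and "ALL" is used here as an illustrative marker for a rolled-up attribute.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows: (month, city, sales).
rows = [("Jan", "Pune", 10), ("Jan", "Mumbai", 20), ("Feb", "Pune", 5)]
dims = ("month", "city")

def cube(rows, dims):
    """Aggregate sales for every subset of dims in a single scan of the rows."""
    subsets = [set(c) for n in range(len(dims) + 1)
               for c in combinations(range(len(dims)), n)]
    totals = defaultdict(int)
    for row in rows:
        *keys, sales = row
        for kept in subsets:
            # Keep the attributes in this subset, roll the others up to "ALL".
            group = tuple(k if i in kept else "ALL" for i, k in enumerate(keys))
            totals[group] += sales
    return dict(totals)

result = cube(rows, dims)
print(result[("ALL", "ALL")], result[("Jan", "ALL")], result[("ALL", "Pune")])
```

Four separate GROUP BY queries would scan the data four times; here each row is read once and contributes to all four groupings.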
Server Scalability
• Scalability is the #1 IT requirement for Data Warehousing
• Hardware Platform options
  – SMP
  – Clusters (shared disk)
  – MPP
    • Loosely coupled (shared nothing)
    • Hybrid
SMP Characteristics
• SMP -- Symmetric multi processing -- shared everything
• Multiple CPUs share same memory
• Workload is balanced across CPUs by OS
• Scalability is limited to bandwidth of internal bus and OS architecture
• Not tolerant to failure in processing node
• Architecture is mostly invisible to applications
SMP Benefits
• Lower entry point -- can start with SMP
• Mature technology
MPP Characteristics
• Each node owns a portion of the database
• Nodes are connected via an interconnection network
• Each node can be a single CPU or SMP
• Load balancing done by application
• High scalability due to local processing isolation
MPP benefits
• High availability
• High scalability
Viewing the Data with OLAP
Making Decision Support Possible
Limitations of SQL
“A Freshman in Business needs a Ph.D. in SQL” -- Ralph Kimball
Typical OLAP Queries
• Write a multi-table join to compare sales for each product line YTD this year vs. last year.
• Repeat the above process to find the top 5 product contributors to margin.
• Repeat the above process to find the sales of a product line to new vs. existing customers.
• Repeat the above process to find the customers that have had negative sales growth.
What Is OLAP?
• Online Analytical Processing - coined by EF Codd in 1994 paper contracted by Arbor Software*
• Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
• OLAP = Multidimensional Database
• MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
• ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html
Strengths of OLAP
• It is a powerful visualization paradigm
• It provides fast, interactive response times
• It is good for analyzing time series
• It can be useful to find some clusters and outliers
• Many vendors offer OLAP tools
Multi-dimensional Data
• “Hey…I sold $100M worth of goods”
[Figure: sales cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N) and Month (1 to 7)]
Dimensions: Product, Region, Time
Hierarchical summarization paths:
  Product: Industry > Category > Product
  Region: Country > Region > City > Office
  Time: Year > Quarter > Month, Week > Day
Visualizing Neighbors is simpler
[Figure: the same sales data shown as a grid of Month (Apr to Mar) by Store (1 to 8) and as a flat table with columns Month, Store, Sales]
“Slicing and Dicing”
[Figure: cube with axes Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special) and Regions (India, Far East, Europe); the Telecomm slice is highlighted]
Roll-up and Drill Down
• Sales Channel
• Region
• Country
• State
• Location Address
• Sales Representative
[Figure: rolling up moves toward a higher level of aggregation; drilling down moves toward low-level details]
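Roll-up and drill-down amount to aggregating the same detail at different depths of a hierarchy. The sketch below uses a shortened, hypothetical region > country > state hierarchy with made-up sales figures.

```python
from collections import defaultdict

# Hypothetical detail rows: (region, country, state, sales).
rows = [
    ("Asia", "India", "Maharashtra", 10),
    ("Asia", "India", "Karnataka", 7),
    ("Asia", "Japan", "Tokyo", 4),
    ("Europe", "France", "Paris", 6),
]
LEVELS = ("region", "country", "state")

def aggregate(rows, level):
    """Total sales grouped by the hierarchy path down to `level`."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(int)
    for *path, sales in rows:
        totals[tuple(path[:depth])] += sales
    return dict(totals)

by_country = aggregate(rows, "country")

# Roll up: move to a higher level of aggregation.
by_region = aggregate(rows, "region")

# Drill down: expand one country into its states.
india_states = {path: total for path, total in aggregate(rows, "state").items()
                if path[:2] == ("Asia", "India")}
print(by_region, india_states)
```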
Nature of OLAP Analysis
• Aggregation -- (total sales, percent-to-total)
• Comparison -- Budget vs. Expenses
• Ranking -- Top 10, quartile analysis
• Access to detailed and aggregate data
• Complex criteria specification
• Visualization
Organizationally Structured Data
• Different departments look at the same detailed data in different ways. Without the detailed, organizationally structured data as a foundation, there is no reconcilability of data
[Figure: departments marketing, manufacturing, sales and finance]
Multidimensional Spreadsheets
• Analysts need spreadsheets that support
  – pivot tables (cross-tabs)
  – drill-down and roll-up
  – slice and dice
  – sort
  – selections
  – derived attributes
• Popular in retail domain
SQL Extensions
• Front-end tools require
  – Extended Family of Aggregate Functions
    • rank, median, mode
  – Reporting Features
    • running totals, cumulative totals
  – Results of multiple group by
    • total sales by month and total sales by product
  – Data Cube
Red Brick Formation™
Extraction, Transformation, and Data Loading
The Red Brick DecisionScape Environment
[Figure labels: Corporate Data; Purchased Data; Red Brick Formation (Extract, Integrate, Transform, Cleanse, Transport, Load); Red Brick High Performance Loader; Red Brick Warehouse RDBMS; Red Brick Data Mine; End User Query; Data Mining; Services; Products; Methodology; Partnerships]
Loading the Data Warehouse
• Data extraction, data transformation, and data loading is…
  – 40% of total cost
  – 80% of total time and effort
It’s All About Data Transformation
• Extensive Data Manipulation:
  – Integrating data from dissimilar sources
  – Cleansing source data
  – Creating and storing aggregates & summaries
  – Deriving new data
  – Modifying existing data
  – Adapting to change in the business
  – Capturing and storing metadata (documentation)
Traditional “Solutions”
• What are they?
  – 3GL (COBOL, BASIC, PL1, C, etc)
  – 4GL (SAS, FOCUS, EASYTRIEVE, etc)
• Why use them?
  – Availability of programming staff
  – Known technology
  – “I can do it better” attitude of in-house staff
• Why you shouldn’t use them:
  – Little to no metadata
  – Expensive to develop and maintain in an iterative development environment
• The Better Choice … A Data Extraction/Transformation Tool
Red Brick Formation: A Better Choice
• Model-driven transformation results in:
• Development productivity:
  – 2x more productive than hand-coding.
  – Less experience required to create jobs.
• Maintenance productivity:
  – 8-10x more productive than hand-coding.
  – Extensive code re-use
  – Self-documenting at every step.
Red Brick Formation Key Product Highlights
• Flexible and Easy to Use
  – Visual data flow diagramming with optimized, pre-built operators
• Scalability & Performance
  – Generates standard C++ code
  – Runs on Windows NT and UNIX
  – Single pass handles multiple sources and targets
  – Intelligent parallelization and synchronization
• Extensibility & ROI
  – 3rd-party integration
  – Productive use of people resource
• Added Bonus
  – Integration with Red Brick Warehouse
Red Brick Formation Key Product Features/Benefits
• Feature: Extensive Pre-built, Optimized Operators & Functions
  – Benefit: Job performance is enhanced and job development is more standardized by utilizing pre-built operators and functions
• Feature: Visual, Process-Oriented Model for Designing & Maintaining Jobs
  – Benefit: Formation’s unique coding of transformation rules via Visual Snippets makes development & maintenance easy
• Feature: Single Pass Processing of Multiple, Heterogeneous Sources & Targets
  – Benefit: Single pass processing means fewer steps and less I/O overhead, leading to more efficient processing of data
• Feature: Scalable Client/Server Architecture
  – Benefit: Scalable architecture means jobs designed for pilot or prototype will also work within “assigned windows” to process production volumes of data
Red Brick Formation: Visual, process-oriented model
Red Brick Formation: Any Number of Data Sources
• Any heterogeneous mix of
  – Red Brick Warehouse
  – Flat Files (Fixed and Delimited)
  – RDBMS (Oracle, Microsoft SQLServer)
Red Brick Formation: Robust Data Transformation Operators
• Operators
  – Aggregate, Cursor, Deduplicate, Filter, Partition, Gather, Group By, Household
  – Join, Advanced Join, Cross Product Join
  – First Normal, Program, Sort, Split, Union
  – File Import, File Export
  – Red Brick Import, Red Brick Export
  – MS SQLServer Import, Oracle Import, Oracle Export
Red Brick Formation: More than 200 Built-in Functions
• Data Types
  – Integer, Unsigned Integer
  – Float, Double, Decimal
  – Date, Time, Timestamp, Interval
  – Text (Fixed and Variable) and BLOB
• Functions
  – Math (Add, Subtract, Multiply, Divide, Power, Square Root, Absolute Value, Max, Min and more)
  – Comparison (EQ, NE, GT, GE, LT, LE)
  – Logical (And, Or, Not)
  – Text (Search and Compare, Concatenations, Substring, Upcase, Downcase, and more)
  – Data Type Conversions (Implicit & Explicit)
Red Brick Formation: Major Components
Client / Server Implementation
• Client: Formation Architect and Integrated Metadata Facility
  – Intel/Windows NT
• Server: Formation Flow Engine
  – Intel/Windows NT
  – HP-UX
  – Sun Solaris
  – Compaq’s Digital Unix
  – IBM AIX
Red Brick Formation Architecture Overview
[Figure: on the client (Windows NT), Formation Architect takes extract and transform requirements and generates code; the code is moved to the server (Windows NT, HP-UX, Sun Solaris, Compaq’s Digital UNIX, IBM AIX), where the Formation Flow Engine compiles and links the code, executes jobs, and autoloads the database in Red Brick Warehouse]
Red Brick Formation Flow Engine: Intelligent Parallelization & Synchronization
[Figure: data flows from Source to Target through operators (Op) arranged in Group 1, Group 2 and Group 3, with a Buffer]
Red Brick Formation Operator Templates
[Figure labels: Operator Templates / Server Services; Generated C++ Code]
Red Brick Formation Demonstration
The Problem?
• An input file contains last night’s sales orders.
• Want to select items ordered that were sold at list price, not discounted.
The Solution?
• Red Brick Formation
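For comparison, the filtering step of this problem can be sketched by hand in a few lines of Python. This is not Red Brick Formation and says nothing about how the tool implements it; the file layout (order_id, item, list_price, sold_price) and the sample rows are assumptions made for illustration.

```python
import csv
import io
from decimal import Decimal

# Hypothetical input file with last night's sales orders.
orders_file = io.StringIO(
    "order_id,item,list_price,sold_price\n"
    "1,Cola,1.50,1.50\n"
    "2,Soap,2.00,1.80\n"
    "3,Milk,0.99,0.99\n"
)

def sold_at_list_price(lines):
    """Yield items whose sold price equals the list price (no discount applied)."""
    for rec in csv.DictReader(lines):
        # Decimal avoids floating-point surprises when comparing prices.
        if Decimal(rec["sold_price"]) == Decimal(rec["list_price"]):
            yield rec["item"]

full_price_items = list(sold_at_list_price(orders_file))
print(full_price_items)
```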
Red Brick Formation: Summary of Benefits
• Simplicity & Flexibility
• Scalability & Performance
• Changing as your business changes
Red Brick Formation automates the process of data warehouse generation and maintenance.
Designed for: