IT-357: Data-Warehousing
Motivation for Data Warehousing
• We have mountains of data in this company, but we can’t access it.
• You’ve got to make it easy for business people to get at the data directly.
• Just show me what is important.
• It drives me crazy to have two people present the same business metrics at a meeting, but with different numbers.
• We want people to use information to support more fact-based decision making.
• We need to “slice and dice” the data every which way.
Motivation for Data Warehousing
• The data warehouse must:
– Make an organization’s information easily accessible
– Present the organization’s information consistently
– Be adaptive and resilient to change
– Be secure, so that it protects our information assets
– Serve as the foundation for improved decision making
– Be accepted by the business community if it is to be deemed successful
Typical Queries on DW
• What was the total number of cell phones sold in India in 2013, grouped by company?
• What was the total revenue for property sales for each type of property in Mangalore between 2006 and 2008?
• What would be the effect on cell phone sales in Mangalore if a new college is opened?
• Which type of Cell Phone sells most in Mangalore?
• Which is the most travelled train in India in 2013?
Data Warehouse
• Defined in many different ways, but not rigorously.
– A decision support database that is maintained separately from the organization’s operational database
– Supports information processing by providing a solid platform of consolidated, historical data for analysis.
• “A data warehouse is a
– subject-oriented
– integrated
– time-variant
– nonvolatile
collection of data in support of management’s decision-making process.” (W. H. Inmon)
Ralph Kimball
DW – Subject Oriented
• Organized around major subjects, such as customer, product, sales
• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
DW – Integrated
• Constructed by integrating multiple, heterogeneous data sources
– relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
– Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
• E.g., Hotel price: currency, tax, food products, etc.
– When data is moved to the warehouse, it is converted.
DW – Time Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems
– Operational database: current value data
– Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain “time element”
DW – Non-volatile
• A physically separate store of data transformed from the operational environment
• Operational update of data does not occur in the data warehouse environment
– Does not require transaction processing, recovery, and concurrency control mechanisms
– Requires only two to three operations in data accessing:
• initial loading, incremental loading of data and access of data
Difference between OLTP and OLAP
                    OLTP                                                     OLAP
users               clerk, IT professional                                   knowledge worker
function            day to day operations                                    decision support
DB design           application-oriented                                     subject-oriented
data                current, up-to-date, detailed, flat relational, isolated historical, summarized, multidimensional, integrated, consolidated
usage               repetitive                                               ad-hoc/repetitive
access              read/write, index/hash on prim. key                      lots of scans
unit of work        short, simple transaction                                complex query
# records accessed  tens                                                     millions
# users             thousands                                                hundreds
DB size             100MB-GB                                                 100GB-TB
metric              transaction throughput                                   query throughput, response
DW Building Operations (ETL)
• Data extraction
– Get data from multiple, heterogeneous, and external sources
• Data cleaning
– Detect errors in the data and rectify them when possible
• Data transformation
– Convert data from legacy or host format to warehouse format
• Load/Refresh
– Sort, summarize, consolidate, compute views, check integrity, and build indexes and partitions
– Propagate updates from sources
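The four ETL steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the source rows, the conversion rates, and all function names are hypothetical, and a real extraction step would read from several heterogeneous systems rather than from a list.

```python
# Hypothetical source rows: (city, amount, currency). Values are illustrative only.
RAW_ROWS = [
    ("Mangalore", "1200", "INR"),
    ("mangalore ", "800", "INR"),   # inconsistent naming, fixed during cleaning
    ("Delhi", "n/a", "INR"),        # dirty amount that cannot be rectified
    ("Frankfurt", "10", "EUR"),
]
RATE_TO_INR = {"INR": 1.0, "EUR": 90.0}  # assumed fixed conversion rates

def extract():
    """Extraction: stand-in for reading from multiple external sources."""
    return list(RAW_ROWS)

def clean(rows):
    """Cleaning: normalize city names and drop rows whose amount is not numeric."""
    cleaned = []
    for city, amount, currency in rows:
        try:
            cleaned.append((city.strip().title(), float(amount), currency))
        except ValueError:
            continue  # error detected but not rectifiable, so the row is skipped
    return cleaned

def transform(rows):
    """Transformation: convert every amount to the warehouse currency (INR)."""
    return [(city, amount * RATE_TO_INR[currency]) for city, amount, currency in rows]

def load(rows):
    """Load/refresh: summarize the converted amounts per city."""
    totals = {}
    for city, amount in rows:
        totals[city] = totals.get(city, 0.0) + amount
    return totals

warehouse = load(transform(clean(extract())))
```

Keeping each step as a separate function mirrors the slide: a step can be rerun or replaced without touching the others.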
Who does not need a Data Warehouse
• If the business does not need a unified set of information/unified view
– If the same person is involved in research, manufacturing, sales, support, relationship
• If there is little or no latency time in the access and analysis of information
• If the same product is sold over and over
• If there is no need for historical data
• Family businesses
• Small businesses
What is Meta Data
• Meta data is the data defining warehouse objects (re-usable). It stores:
• Description of the structure of the data warehouse
– Schema, view, dimensions, hierarchies, derived data definition, data mart locations and contents
• Operational meta-data
– History of migrated data, currency of data (active, archived, or purged), algorithms used for queries
• The mapping from operational environment to the data warehouse
• Data related to system performance
– Number of users, system checks performed, failures
• Business data
– Business terms and definitions, ownership of data, policies (scope of DW, security)
Data Warehouse Meta Data
• Metadata for the data warehouse environment is one of the most important aspects. Metadata helps the DSS analyst find what data is in the warehouse and use that data effectively and efficiently.
• Some of the components of data warehouse metadata are:
– The structure and contents of the warehouse
– The mapping of data into the data warehouse
– The extract/transformation history
– Aging/purging criteria
– Ownership/stewardship information
Uses of Metadata
• Some of the uses
– Extraction and loading processes - metadata is used to map data sources to a common view of information within the warehouse
– Warehouse management process - metadata is used to automate the production of summary tables
– Query management process - metadata is used to direct a query to the most appropriate data source
RDBMS (OLTP) vs DW (OLAP) Structure
OLTP (normalized) tables:
Region
Reg_ID  Reg_Name
1       Europe
2       North America
3       Asia
Country
Cntry_ID  Reg_ID  Cntry_Name
1         1       Germany
2         1       Spain
3         2       Canada
4         2       Mexico
5         3       India
6         3       China
City
City_ID  Reg_ID  Cntry_ID  City_Name
1        1       1         Frankfurt
2        2       3         Vancouver
3        2       3         Toronto
4        2       4         Mexico City
5        3       5         Delhi
6        3       6         Beijing
7        3       5         Mumbai
8        1       2         Madrid
OLAP (denormalized) table:
Location
Region         Country  City
Europe         Germany  Frankfurt
Europe         Spain    Madrid
North America  Canada   Vancouver
North America  Mexico   Mexico City
Asia           India    Delhi
Asia           China    Beijing
Asia           India    Mumbai
SQL for OLTP and OLAP
• OLTP:
– select * from region, country, city where region.reg_id = country.reg_id and country.cntry_id = city.cntry_id
• OLAP:
– select * from location
• OLTP for quick Insert/Update
• OLAP for quick Data Access
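To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module with a subset of the rows from the structure slide. It assumes city is joined to country on cntry_id; the lowercase table and column names are illustrative choices, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region  (reg_id INTEGER PRIMARY KEY, reg_name TEXT);
CREATE TABLE country (cntry_id INTEGER PRIMARY KEY, reg_id INTEGER, cntry_name TEXT);
CREATE TABLE city    (city_id INTEGER PRIMARY KEY, reg_id INTEGER, cntry_id INTEGER, city_name TEXT);
CREATE TABLE location (region TEXT, country TEXT, city TEXT);
INSERT INTO region  VALUES (1,'Europe'),(3,'Asia');
INSERT INTO country VALUES (1,1,'Germany'),(5,3,'India');
INSERT INTO city    VALUES (1,1,1,'Frankfurt'),(5,3,5,'Delhi'),(7,3,5,'Mumbai');
INSERT INTO location VALUES
    ('Europe','Germany','Frankfurt'),('Asia','India','Delhi'),('Asia','India','Mumbai');
""")

# OLTP style: the answer has to be assembled with two joins.
oltp_rows = conn.execute("""
    SELECT r.reg_name, c.cntry_name, ci.city_name
    FROM region r
    JOIN country c ON c.reg_id = r.reg_id
    JOIN city ci   ON ci.cntry_id = c.cntry_id
    ORDER BY ci.city_name
""").fetchall()

# OLAP style: the denormalized table is read directly, with no joins.
olap_rows = conn.execute(
    "SELECT region, country, city FROM location ORDER BY city"
).fetchall()
```

Both queries return the same rows here; the difference is where the work happens. The normalized tables keep inserts and updates cheap, while the denormalized table keeps reads cheap.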
Data Warehouse Usage
• Three kinds of data warehouse applications
– Information Processing
• supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs
– Analytical Processing
• multidimensional analysis of data warehouse data
• supports basic OLAP operations: slice-dice, drill down, roll-up
– Data Mining
• knowledge discovery from hidden patterns
• supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
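The basic OLAP operations named under Analytical Processing can be illustrated on a handful of fact rows. This is a sketch only: the rows and the function names are hypothetical, and a real OLAP engine would work on pre-aggregated cubes rather than scanning a list.

```python
# Hypothetical fact rows: (month, country, product, sales).
FACTS = [
    ("Jan", "USA", "TV", 100), ("Jan", "USA", "PC", 50),
    ("Jan", "Canada", "TV", 250), ("Feb", "USA", "TV", 120),
    ("Feb", "Canada", "PC", 70),
]

def slice_(rows, month):
    """Slice: fix one dimension (here the month) to a single value."""
    return [r for r in rows if r[0] == month]

def dice(rows, months, products):
    """Dice: restrict two or more dimensions to sets of values."""
    return [r for r in rows if r[0] in months and r[2] in products]

def roll_up(rows, dim_index):
    """Roll-up: sum sales over every dimension except the one kept."""
    totals = {}
    for r in rows:
        totals[r[dim_index]] = totals.get(r[dim_index], 0) + r[3]
    return totals

jan = slice_(FACTS, "Jan")
tv_only = dice(FACTS, {"Jan", "Feb"}, {"TV"})
by_country = roll_up(FACTS, 1)
```

Drill-down is the reverse of roll-up: starting from by_country, one goes back to the more detailed rows that produced each total.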
DW Queries – Complexity
• Complexity can increase sharply just by adding a maximum price:
• Query 1: A simple data cube query: Find the total sales in 2004, broken down by product, region, and month, with subtotals for each dimension.
• Query 2: A complex data cube query: Grouping by all subsets of product, region, month, find the maximum price in 2004 for each group and the total sales among all maximum price tuples
OLAP and Data Mining
• Is OLAP data mining? NO
• OLAP is a way to look at pre-aggregated query results (Evaluate query and results)
• Data Mining is building models of data
• Data mining tools model data and return actionable rules
• Practically speaking: OLAP tools can operate on data cubes as well as perform some forms of data mining
Types of Data Warehouse Models
• Various types of FACT/DIMENSION table arrangements
– Star schema: A fact table in the middle connected to a set of dimension tables
– Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
– Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
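A star schema can be sketched with one fact table and two dimension tables. The example below uses Python's sqlite3 module; the table names (fact_sales, dim_product, dim_city) and the rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_city    (city_id INTEGER PRIMARY KEY, city_name TEXT);
-- Fact table in the middle: one row per sale, one foreign key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    city_id    INTEGER REFERENCES dim_city(city_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1,'TV'),(2,'PC');
INSERT INTO dim_city    VALUES (1,'Delhi'),(2,'Mumbai');
INSERT INTO fact_sales  VALUES (1,1,100),(1,2,250),(2,1,50),(1,1,40);
""")

# A typical star query: join the fact table to one dimension and aggregate.
sales_by_product = conn.execute("""
    SELECT p.product_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
```

A snowflake schema would split a dimension such as dim_city further (for example into city and country tables), and a fact constellation would add a second fact table that reuses the same dimension tables.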
Indexes for Data Warehouse
• Index types differ between OLTP and OLAP
• Ordinary Index
• Bitmap Index
• JOIN Index
Ordinary Index
• Surrogate Key
– In most cases it is not a natural key; it is kept in addition to the business key.
– Represents an object in the database, but is not visible outside it
• B-Tree (B+ Tree)
• Should be used with unique or high-cardinality (near-unique) data
Bitmap Index
• Index on a particular column
• Each value in the column has a bit vector
• The length of the bit vector: # of records in the base table
Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer
Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0
Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1
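The bitmap index on this slide can be rebuilt in a few lines. The sketch below stores each bit vector as a plain Python list for readability (real systems use compressed bit strings), and the function name is illustrative.

```python
BASE_TABLE = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]

def build_bitmap_index(rows, column):
    """Return {value: bit vector}, with one bit per record in the base table."""
    index = {}
    for rec_no, row in enumerate(rows):
        index.setdefault(row[column], [0] * len(rows))[rec_no] = 1
    return index

region_index = build_bitmap_index(BASE_TABLE, 1)
type_index = build_bitmap_index(BASE_TABLE, 2)

# "Region = Asia AND Type = Dealer" becomes a bitwise AND of two bit vectors.
hits = [a & d for a, d in zip(region_index["Asia"], type_index["Dealer"])]
matching_customers = [BASE_TABLE[i][0] for i, bit in enumerate(hits) if bit]
```

The query is answered without touching the base table until the final lookup, which is why bitmap indexes suit read-mostly, low-cardinality columns.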
Usage of Bitmap Index
• Static Tables
– Data is not updated frequently.
– Modification of a bitmap index is expensive
– They are a compressed index type
• DSS Systems
– Generally suited for low-cardinality values (but need not be limited to them)
– Suited for systems which get changes during non-peak business hours
JOIN Index
• In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table
• E.g.: Sales and two dimensions, city and product
– A join index on city maintains for each distinct city a list of R-IDs (primary keys) of the tuples recording the Sales in the city
• Join indices can span multiple dimensions
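A join index as described above can be sketched as a dictionary from dimension values to fact-table R-IDs. The Sales rows and the function name are hypothetical; passing more than one dimension position gives a join index that spans multiple dimensions.

```python
# Hypothetical fact table: R-ID -> (city, product, amount).
SALES = {
    "R1": ("Delhi", "TV", 100),
    "R2": ("Mumbai", "PC", 50),
    "R3": ("Delhi", "PC", 70),
    "R4": ("Mumbai", "TV", 250),
}

def build_join_index(fact, dim_positions):
    """Map each combination of dimension values to the list of fact R-IDs."""
    index = {}
    for rid, row in fact.items():
        key = tuple(row[p] for p in dim_positions)
        index.setdefault(key, []).append(rid)
    return index

city_index = build_join_index(SALES, [0])             # join index on city
city_product_index = build_join_index(SALES, [0, 1])  # spans two dimensions

# Total sales in Delhi, fetching only the fact rows the index points at.
delhi_total = sum(SALES[rid][2] for rid in city_index[("Delhi",)])
```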
Size of a Data Warehouse and Query Cost
• Number of Tuples/Records in DIMENSIONS
• Number of Tuples/Records in FACT
– Eg: Customer/Sale data
• Cost of a query to locate a tuple using an Index?
Optimizing Data Warehouse
1. Better understanding of the actual data
2. Better design (proper de-normalization)
3. Indexing:
– Faster access
4. Partitioning:
– Physical partitioning
– Eg: Partitioning Date/Time Dimension
Data Cube
• The key operation of OLAP is the formation of a data cube
• Pre-computed query result
• A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts
• A data cube is a multidimensional representation of data, together with all possible aggregates
A Spreadsheet Data
Date Location Product Sales
1-Jan-13 USA TV 100
2-Jan-13 Canada TV 250
3-Jan-13 Mexico TV 300
4-Jan-13 Brazil TV 200
1-Jan-13 USA PC 50
2-Jan-13 Canada PC 70
3-Jan-13 Mexico PC 40
4-Jan-13 Brazil PC 60
Representation in 3 Dimensions
• Consider previous sales of products at a number of locations at various dates
• This data can be represented as a 3 dimensional array
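As a sketch of that representation, the spreadsheet rows can be loaded into a nested list indexed by date, location, and product. The variable names are illustrative, and cells with no sale are assumed to hold 0.

```python
ROWS = [
    ("1-Jan-13", "USA", "TV", 100), ("2-Jan-13", "Canada", "TV", 250),
    ("3-Jan-13", "Mexico", "TV", 300), ("4-Jan-13", "Brazil", "TV", 200),
    ("1-Jan-13", "USA", "PC", 50), ("2-Jan-13", "Canada", "PC", 70),
    ("3-Jan-13", "Mexico", "PC", 40), ("4-Jan-13", "Brazil", "PC", 60),
]

# One axis per dimension; each distinct value gets a position on its axis.
dates = sorted({r[0] for r in ROWS})
locations = sorted({r[1] for r in ROWS})
products = sorted({r[2] for r in ROWS})

# cube[d][l][p] holds the sales for (date d, location l, product p).
cube = [[[0 for _ in products] for _ in locations] for _ in dates]
for date, location, product, sales in ROWS:
    cube[dates.index(date)][locations.index(location)][products.index(product)] = sales

usa_tv = cube[dates.index("1-Jan-13")][locations.index("USA")][products.index("TV")]
total_sales = sum(cell for plane in cube for line in plane for cell in line)
```

Most of the 4 x 4 x 2 cells are empty in this small example, which hints at why real cubes are usually stored sparsely.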
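A Sample Data Cube
[Figure: a 3-D sales cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum) and Country (Americas, Asia Pacific, Europe, sum). The highlighted aggregate cell is the total annual sales of TVs in U.S.A.]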
Data Cube: Lattice of Cuboids – 3D
0-D (apex) cuboid: All (highest level of summarization)
1-D cuboids: product; date; country
2-D cuboids: (product, date); (product, country); (date, country)
3-D (base) cuboid: (product, date, country) (lowest level of summarization)
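The lattice can be enumerated mechanically: a k-D cuboid is a choice of k of the n dimensions, so there are 2^n cuboids in total when no dimension has a concept hierarchy. A minimal sketch, with an illustrative function name:

```python
from itertools import combinations

def cuboids(dimensions):
    """Return the lattice as {k: [k-D cuboids]}, from the apex (k=0) to the base."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

lattice = cuboids(["product", "date", "country"])
total = sum(len(level) for level in lattice.values())  # 2**3 cuboids
```

The same function applied to four dimensions yields 16 cuboids, which is why the cost of full cube materialization grows quickly with the number of dimensions.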
A Simple Representation
• Base and Aggregate cells.
– Consider the data cube with the DIMENSIONS Date, Product, Country and the FACT Quantity.
– 1D Cells: (Jan, *, *, 350)
– 1D Cells: (Feb, *, *, 50)
– 2D Cells: (Jan, *, Mexico, 70)
– 3D Cells: (Aug, TV, USA, 80)
Data Cube: Lattice of Cuboids – 4D
0-D (apex) cuboid: All (highest level of summarization)
1-D cuboids: time; product; location; supplier
2-D cuboids: (time, product); (time, location); (time, supplier); (product, location); (product, supplier); (location, supplier)
3-D cuboids: (time, product, location); (time, product, supplier); (time, location, supplier); (product, location, supplier)
4-D (base) cuboid: (time, product, location, supplier) (lowest level of summarization)
Bottom Up Cube Calculation (BUC)
• 4 Dimension Computation (A, B, C, D attributes)
– 1D Cells: (Jan, *, *, *, 500)
– 2D Cells: (*, TV, *, Philips, 300)
The “Compute Cube” Operator
• Cube definition and computation in DMQL
define cube sales [product, city, year]: sum (sales_in_dollars)
compute cube sales
• Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.’96)
SELECT product, city, year, SUM (amount)
FROM SALES
CUBE BY product, city, year
• Need to compute the following Group-Bys:
(city, product, year),
(city, product), (city, year), (product, year),
(city), (product), (year),
()
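What the cube by operator has to produce can be sketched naively: run one group-by per subset of the dimensions and mark aggregated-away dimensions with "*". This is a brute-force illustration on hypothetical rows, not an efficient cube algorithm such as BUC.

```python
from itertools import combinations

# Hypothetical SALES rows: (product, city, year, amount).
SALES = [
    ("TV", "Delhi", 2013, 100),
    ("TV", "Mumbai", 2013, 250),
    ("PC", "Delhi", 2013, 50),
    ("TV", "Delhi", 2014, 40),
]
DIMS = ("product", "city", "year")

def cube_by(rows, dims):
    """Return {cell: SUM(amount)}; '*' marks a dimension that is aggregated away."""
    result = {}
    for k in range(len(dims) + 1):
        for kept in combinations(range(len(dims)), k):  # one group-by per subset
            for row in rows:
                cell = tuple(row[i] if i in kept else "*" for i in range(len(dims)))
                result[cell] = result.get(cell, 0) + row[-1]
    return result

cube = cube_by(SALES, DIMS)
```

For example, cube[("TV", "*", "*")] is the total for TVs over all cities and years, and cube[("*", "*", "*")] is the grand total from the empty group-by.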
Oracle’s “CUBE” Operator
• CUBE extension will generate subtotals for all combinations of the dimensions specified
• As the number of dimensions increases, so do the combinations of subtotals that need to be calculated
• SQL:
SELECT Date_Id, Product_Id, City_id, SUM(sales_value) AS sales_value
FROM dimension_tab
GROUP BY CUBE (Date_Id, Product_Id, City_id)
ORDER BY Date_Id, Product_Id, City_id;
Attribute Oriented Induction
• Proposed in 1989 [before the Data Cube concept was introduced]
• How is it done?
– Collect the task-relevant data (data for analysis) using a relational database query (initial relation)
– Perform generalization by attribute removal or attribute generalization
– Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
– Interaction with users for knowledge presentation
Example for Attribute Oriented Induction
• A DMQL Query
1. use DB_XYZ
mine characteristics as “Science Students”
in relevance to name, gender, major, birth place, birth date, residence, phone#, gpa
from student
where status in “graduate”
Algorithm to solve the Example
2. Removing/Generalizing Attributes:
– Applies to an attribute with a large set of distinct values
• Case 1: there is no concept hierarchy defined for the attribute, OR
• Case 2: its higher-level concepts are expressed in terms of other attributes
– Remove the attribute when either case holds; otherwise generalize it.
3. Class Characterization:
– Generalization will result in groups of identical tuples
– Count the number of generalized (duplicate) tuples and record the count
Class Characterization
Initial relation, with the generalization applied to each attribute in the last row:
Name            Gender    Major        Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M         CS           Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M         CS           Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F         Physics      Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …         …            …                      …           …                         …         …
Removed         Retained  Sci,Eng,Bus  Country                Age range   City                      Removed   Excl,VG,..
Generalized relation:
Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …
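The generalize-then-count procedure can be sketched on the three visible students. Several things here are assumptions made for illustration: the tuples are simplified (birth country and age are given directly instead of birth place and birth date, and phone# is already dropped), the concept hierarchies are hypothetical, and the GPA threshold of 3.75 is invented.

```python
from collections import Counter

# Simplified, hypothetical tuples: (name, gender, major, birth_country, age, city, gpa).
STUDENTS = [
    ("Jim Woodman", "M", "CS", "Canada", 22, "Richmond", 3.67),
    ("Scott Lachance", "M", "CS", "Canada", 23, "Richmond", 3.70),
    ("Laura Lee", "F", "Physics", "USA", 28, "Burnaby", 3.83),
]
MAJOR_HIERARCHY = {"CS": "Science", "Physics": "Science"}  # assumed concept hierarchy

def generalize(student):
    """Apply attribute removal and attribute generalization to one tuple."""
    name, gender, major, country, age, city, gpa = student
    # name is removed: many distinct values and no concept hierarchy.
    region = "Canada" if country == "Canada" else "Foreign"
    low = (age // 5) * 5
    age_range = f"{low}-{low + 5}"
    grade = "Excellent" if gpa >= 3.75 else "Very-good"  # assumed threshold
    return (gender, MAJOR_HIERARCHY[major], region, age_range, city, grade)

# Merge identical generalized tuples and accumulate their counts.
generalized = Counter(generalize(s) for s in STUDENTS)
```

With only three input tuples the counts are 2 and 1; on the full student relation the same merge step would produce the larger counts shown in the generalized relation.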
Importance of Understanding the Data
• The Developer/Tech Analyst/Business Analyst must know the data very well
• This is unlike relational databases, where the emphasis is on design, efficiency of data retrieval, GUI, etc. (although the data is important there too)
• The data and how the data is stored and the data quality become more and more important in OLAP databases/Data Warehouses