DW and the Multidimensional Model
(based on the slides for the book Data Mining: Concepts and Techniques)
SAD Tagus 2004/05 H. Galhardas

Outline
What is a data warehouse?
A multi-dimensional data model
OLAP operations

What is a Data Warehouse?
Defined in many different ways, but not rigorously.
  A decision support database that is maintained separately from the organization's operational database.
  Supports information processing by providing a solid platform of consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Data warehousing:
  The process of constructing and using data warehouses.

Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales.
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources.
  Relational databases, flat files, on-line transaction records.
Data cleaning and data integration techniques are applied.
  Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
    E.g., hotel price: currency, tax, breakfast covered, etc.
  When data is moved to the warehouse, it is converted.

Data Warehouse: Time-Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
  Operational database: current value data.
  Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years).
Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
  But the key of operational data may or may not contain a "time element".

Data Warehouse: Non-Volatile
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
  Does not require transaction processing, recovery, and concurrency control mechanisms.
  Requires only two operations in data accessing: initial loading of data and access of data.

Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration:
  Build wrappers/mediators on top of heterogeneous databases.
  Query-driven approach:
    When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set.
    Requires complex information filtering; queries compete for resources.
Data warehouse: update-driven, high performance.
  Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis.

Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
  Major task of traditional relational DBMS.
  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
  Major task of data warehouse system.
  Data analysis and decision making.
Distinct features (OLTP vs. OLAP):
  User and system orientation: customer vs. market
  Data contents: current, detailed vs. historical, consolidated
  Database design: ER + application vs. star + subject
  View: current, local vs. evolutionary, integrated
  Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP
                    OLTP                                  OLAP
users               clerk, IT professional                knowledge worker
function            day-to-day operations                 decision support
DB design           application-oriented                  subject-oriented
data                current, up-to-date, detailed,        historical, summarized, multidimensional,
                    flat relational, isolated             integrated, consolidated
usage               repetitive                            ad-hoc
access              read/write, index/hash on prim. key   lots of scans
unit of work        short, simple transaction             complex query
# records accessed  tens                                  millions
# users             thousands                             hundreds
DB size             100MB-GB                              100GB-TB
metric              transaction throughput                query throughput, response

Why a Separate Data Warehouse?
High performance for both systems:
  DBMS: tuned for OLTP (access methods, indexing, concurrency control, recovery).
  Warehouse: tuned for OLAP (complex OLAP queries, multidimensional view, consolidation).
Different functions and different data:
  Missing data: decision support requires historical data, which operational DBs do not typically maintain.
  Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources.
  Data quality: different sources typically use inconsistent data representations, codes and formats, which have to be reconciled.
Note: more and more systems perform OLAP analysis directly on relational databases.

From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
A data cube is defined by:
  Dimension tables, such as item(item_name, brand, type) or time(day, week, month, quarter, year).
  A fact table, which contains (usually numerical) measures or facts (such as dollars_sold) and keys to each of the related dimension tables.

[Figure: 2-D view of sales data according to the dimensions time and item]
[Figure: 3-D view of sales data according to the dimensions time, item, and location]
[Figure: 3-D data cube representation]
[Figure: 4-D view of sales data according to the dimensions time, item, location, and supplier]

Data Cube
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
Any n-D data can be displayed as a series of (n-1)-D cubes.
A data cube is a logical representation; its physical storage may be different.

Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
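The lattice can be enumerated mechanically: each cuboid is one subset of the dimensions. The following is a minimal sketch in Python; the function name `cuboid_lattice` is an illustrative assumption, not something defined in the slides.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Group every subset of `dimensions` by size: level 0 is the apex, level n the base."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)

# With n dimensions (and no concept hierarchies) there are 2^n cuboids in total.
total_cuboids = sum(len(c) for c in lattice.values())
```

For the four dimensions used here, the levels contain 1, 4, 6, 4 and 1 cuboids respectively.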

Conceptual Modeling of DW
A multidimensional model can exist in the following forms:
  Star schema: a fact table in the middle, without redundancy, connected to a set of dimension tables.
  Snowflake schema: a refinement of the star schema where some dimensions are normalized, thereby splitting the data into additional tables. The resulting schema graph has a shape similar to a snowflake.
  Fact constellation or galaxy schema: multiple fact tables share dimension tables; it can be viewed as a collection of stars.

Example of a Star Schema
Sales fact table:
  keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales
Dimension tables:
  time(time_key, day, day_of_the_week, month, quarter, year)
  item(item_key, item_name, brand, type, supplier_type)
  branch(branch_key, branch_name, branch_type)
  location(location_key, street, city, province_or_state, country)
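To see how a star schema is queried, here is a hedged sketch using Python's built-in `sqlite3` module. It keeps only two of the four dimensions and a few of their attributes, and every row (brands, cities, amounts) is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- The fact table holds only foreign keys to the dimensions plus the measures.
CREATE TABLE sales (
    item_key INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO item VALUES (1, 'TV', 'Acme', 'home entertainment'), (2, 'PC', 'Acme', 'computer');
INSERT INTO location VALUES (1, 'Vancouver', 'Canada'), (2, 'Toronto', 'Canada');
INSERT INTO sales VALUES (1, 1, 3, 900.0), (1, 2, 2, 600.0), (2, 1, 1, 1200.0);
""")

# A typical star join: the fact table joined to each dimension, then grouped.
rows = conn.execute("""
    SELECT i.item_name, l.country, SUM(s.units_sold), SUM(s.dollars_sold)
    FROM sales s
    JOIN item i ON i.item_key = s.item_key
    JOIN location l ON l.location_key = s.location_key
    GROUP BY i.item_name, l.country
    ORDER BY i.item_name
""").fetchall()
print(rows)
```

The grouping attributes come from the dimension tables and the aggregated columns from the fact table, which is the usual division of labour in a star schema.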

Example of a Snowflake Schema
Sales fact table:
  keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales
Dimension tables:
  time(time_key, day, day_of_the_week, month, quarter, year)
  item(item_key, item_name, brand, type, supplier_key)
    supplier(supplier_key, supplier_type)
  branch(branch_key, branch_name, branch_type)
  location(location_key, street, city_key)
    city(city_key, city, province_or_state, country)

Example of a Fact Constellation
Sales fact table:
  keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales
Shipping fact table:
  keys: time_key, item_key, shipper_key, from_location, to_location
  measures: dollars_cost, units_shipped
Dimension tables:
  time(time_key, day, day_of_the_week, month, quarter, year), shared by both fact tables
  item(item_key, item_name, brand, type, supplier_type), shared by both fact tables
  branch(branch_key, branch_name, branch_type)
  location(location_key, street, city, province_or_state, country), shared by both fact tables
  shipper(shipper_key, shipper_name, location_key, shipper_type)

Choosing the fact and dimension tables
Analysis of the queries:
  group-by attributes indicate the dimensions
  aggregated attributes refer to the measures or facts
  where attributes are attributes of the dimension tables
Example:
  select s.time_key, s.location_key, sum(s.units_sold)
  from item i, sales s
  where i.item_key = s.item_key and i.item_name = 'clothes'
  group by s.time_key, s.location_key

Categories of measures
Distributive: the result derived by applying the function to n aggregate values (one per partition of the data) is the same as that derived by applying the function to all the data without partitioning.
  Ex: count(), sum(), min(), max().
Algebraic: the measure can be computed by an algebraic function with M arguments (where M is a constant), each of which is obtained by applying a distributive aggregate function.
  Ex: avg(), min_N(), standard_deviation().
Holistic: there is no algebraic function with M arguments (where M is a constant) that characterizes the computation.
  Ex: median(), mode(), rank().
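The three categories can be checked on a small example. The sketch below uses invented numbers and an arbitrary partitioning; it shows sum() combining exactly from partial results, avg() being derived from two distributive values (sum and count), and median() failing to combine from per-partition medians.

```python
from statistics import median

data = [4, 8, 15, 16, 23, 42, 7]
partitions = [data[:3], data[3:5], data[5:]]  # arbitrary split of the data

# Distributive: combining the per-partition sums gives the overall sum.
partial_sums = [sum(p) for p in partitions]
partial_counts = [len(p) for p in partitions]
total_sum = sum(partial_sums)
total_count = sum(partial_counts)

# Algebraic: avg() is a function of M = 2 distributive aggregates (sum, count).
avg_from_parts = total_sum / total_count

# Holistic: the median of the partition medians is, in general, not the median.
median_of_medians = median(median(p) for p in partitions)
true_median = median(data)

print(total_sum, avg_from_parts, median_of_medians, true_median)
```

This is why distributive and algebraic measures are cheap to maintain in precomputed cuboids, while holistic ones generally need access to the underlying data.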

Concept hierarchy
Defines a sequence of mappings from a set of low-level concepts to a set of higher-level concepts.
Examples: the location dimension and the time dimension.
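
Location dimension
[Figure: concept hierarchy for location, one level per line, from most general to most specific]
  all:     all
  region:  Europe, ..., North_America
  country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
  city:    Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
  office:  L. Chan, ..., M. Wind (Vancouver)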

Time dimension
[Figure: concept hierarchy for time]
  day -> month -> quarter -> year
  day -> week -> year

Multidimensional Data
Sales volume as a function of product, month, and region.
[Figure: cube with axes Product, Region, Month]
Dimensions: Product, Location, Time
Hierarchical summarization paths:
  Product:  Industry > Category > Product
  Location: Region > Country > City > Office
  Time:     Year > Quarter > Month > Day, and Week > Day

A Star-Net Query Model
[Figure: star-net with one radial line per dimension; each circle on a line is called a footprint]
  Shipping Method: AIR-EXPRESS, TRUCK
  Customer Orders: ORDER, CONTRACTS
  Customer
  Product: PRODUCT GROUP, PRODUCT LINE, PRODUCT ITEM
  Organization: SALES PERSON, DISTRICT, DIVISION
  Promotion
  Location: CITY, COUNTRY, REGION
  Time: DAILY, QTRLY, ANNUALLY

OLAP queries (1)
Pivoting: aggregation on selected dimensions.
  E.g., pivoting on Location and Time yields this cross-tabulation:
           WI    CA   Total
    1995   63    81     144
    1996   38   107     145
    1997   75    35     110
    Total 176   223     399
Schema used in the examples:
  SALES (fact table): pid, timeid, locid, sales
  PRODUCTS: pid, pname, category, price
  LOCATIONS: locid, city, state, country
  TIMES: timeid, date, week, month, quarter, year, holiday_flag

OLAP Queries (2)
Roll-up: aggregating at different levels of a dimension hierarchy.
  E.g., given total sales by city, we can roll up to get sales by state.

SQL Group By operator
[Figure: a table partitioned by its grouping values, with Sum() applied to each group to produce the aggregate values]
The GROUP BY relational operator partitions a table into groups. Each group is then aggregated by a function. The aggregation function summarizes some column of each group, returning one value per group.

SQL to express OLAP queries
The cross-tabulation can be computed using a collection of SQL queries, e.g.:
  SELECT SUM(S.sales)
  FROM Sales S, Times T, Locations L
  WHERE S.timeid = T.timeid AND S.locid = L.locid
  GROUP BY T.year, L.state

  SELECT SUM(S.sales)
  FROM Sales S, Times T
  WHERE S.timeid = T.timeid
  GROUP BY T.year

  SELECT SUM(S.sales)
  FROM Sales S, Locations L
  WHERE S.locid = L.locid
  GROUP BY L.state
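As a rough check, the queries can be run against a toy database with Python's `sqlite3`. The sketch assumes one fact row per (year, state) cell, chosen so that the sums reproduce the cross-tabulation shown earlier; the real underlying data are not given in the slides. The grouping columns are added to each SELECT list so that the results are labelled, and a fourth query computes the grand total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Times (timeid INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE Locations (locid INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE Sales (timeid INTEGER, locid INTEGER, sales INTEGER);
INSERT INTO Times VALUES (1, 1995), (2, 1996), (3, 1997);
INSERT INTO Locations VALUES (1, 'WI'), (2, 'CA');
INSERT INTO Sales VALUES
    (1, 1, 63), (1, 2, 81), (2, 1, 38), (2, 2, 107), (3, 1, 75), (3, 2, 35);
""")

# Body of the cross-tab: one value per (year, state).
cells = conn.execute("""
    SELECT T.year, L.state, SUM(S.sales)
    FROM Sales S, Times T, Locations L
    WHERE S.timeid = T.timeid AND S.locid = L.locid
    GROUP BY T.year, L.state
""").fetchall()

# Right-hand margin: totals per year.
by_year = dict(conn.execute("""
    SELECT T.year, SUM(S.sales)
    FROM Sales S, Times T
    WHERE S.timeid = T.timeid
    GROUP BY T.year
""").fetchall())

# Bottom margin: totals per state.
by_state = dict(conn.execute("""
    SELECT L.state, SUM(S.sales)
    FROM Sales S, Locations L
    WHERE S.locid = L.locid
    GROUP BY L.state
""").fetchall())

# Bottom-right corner: grand total.
grand_total = conn.execute("SELECT SUM(sales) FROM Sales").fetchone()[0]
print(cells, by_year, by_state, grand_total)
```

Four separate queries are needed for one small table, which is the motivation for the CUBE operator introduced later.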

Problems with Group By
Histograms (aggregation over computed categories). Ex:
  SELECT day, nation, MAX(Temp)
  FROM Weather
  GROUP BY Day(Time) AS day, Nation(Latitude, Longitude) AS nation;
Roll-up totals and sub-totals for drill-downs
Cross-tabulations

Roll-up totals and sub-totals for drill-downs
SALES
Model  Year  Color  Sales
Chevy  1990  red        5
Chevy  1990  white     87
Chevy  1990  blue      62
Chevy  1991  red       54
Chevy  1991  white     95
Chevy  1991  blue      49
Chevy  1992  red       31
Chevy  1992  white     54
Chevy  1992  blue      71
Ford   1990  red       64
Ford   1990  white     62
Ford   1990  blue      63
Ford   1991  red       52
Ford   1991  white      9
Ford   1991  blue      55
Ford   1992  red       27
Ford   1992  white     62
Ford   1992  blue      39

Sales Roll Up by Model by Year by Color
Model  Year  Color  Sales by Model  Sales by Model  Sales
                    by Year         by Year         by Model
                    by Color
Chevy  1994  black              50
             white              40
                                                90
       1995  black              85
             white             115
                                               200
                                                       290

Sales Roll-Up by Model by Year by Color: relational representation
Model  Year  Color  Sales  Sales by Model by Year  Sales by Model
Chevy  1994  black     50                      90             290
Chevy  1994  white     40                      90             290
Chevy  1995  black     85                     200             290
Chevy  1995  white    115                     200             290

An Excel pivot table representation with Ford sales data included
Sum of Sales  1994                      1995                      Grand Total
Model         black  white  1994 Total  black  white  1995 Total
Chevy            50     40          90     85    115         200          290
Ford             50     10          60     85     75         160          220
Grand Total     100     50         150    170    190         360          510

Sales Summary: relational representation of the pivot table
Model Year Color Units
Chevy 1994 black 50
Chevy 1994 white 40
Chevy 1994 ALL 90
Chevy 1995 black 85
Chevy 1995 white 115
Chevy 1995 ALL 200
Chevy ALL ALL 290

SQL code
  SELECT 'ALL', 'ALL', 'ALL', SUM(Sales)
  FROM Sales
  WHERE Model = 'Chevy'
  UNION
  SELECT Model, 'ALL', 'ALL', SUM(Sales)
  FROM Sales
  WHERE Model = 'Chevy'
  GROUP BY Model
  UNION
  SELECT Model, Year, 'ALL', SUM(Sales)
  FROM Sales
  WHERE Model = 'Chevy'
  GROUP BY Model, Year
  UNION
  SELECT Model, Year, Color, SUM(Sales)
  FROM Sales
  WHERE Model = 'Chevy'
  GROUP BY Model, Year, Color;
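The four-way UNION can be tried out with Python's `sqlite3`, as sketched below on the Chevy 1994/1995 rows from the roll-up tables. This relies on SQLite's dynamic typing, which lets the string 'ALL' appear in the otherwise numeric Year column; a strictly typed engine may require a cast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sales (Model TEXT, Year INTEGER, Color TEXT, Sales INTEGER);
INSERT INTO Sales VALUES
    ('Chevy', 1994, 'black', 50), ('Chevy', 1994, 'white', 40),
    ('Chevy', 1995, 'black', 85), ('Chevy', 1995, 'white', 115);
""")

# One SELECT per level of the roll-up, glued together with UNION.
rollup_rows = set(conn.execute("""
    SELECT 'ALL', 'ALL', 'ALL', SUM(Sales) FROM Sales WHERE Model = 'Chevy'
    UNION
    SELECT Model, 'ALL', 'ALL', SUM(Sales) FROM Sales WHERE Model = 'Chevy'
    GROUP BY Model
    UNION
    SELECT Model, Year, 'ALL', SUM(Sales) FROM Sales WHERE Model = 'Chevy'
    GROUP BY Model, Year
    UNION
    SELECT Model, Year, Color, SUM(Sales) FROM Sales WHERE Model = 'Chevy'
    GROUP BY Model, Year, Color
""").fetchall())

for row in sorted(rollup_rows, key=str):
    print(row)
```

The result contains the four detail rows plus the sub-totals per year, the total for Chevy and the grand total, matching the relational representation of the pivot table.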

Cross tab: symmetric 2D aggregation
Table 6.a: Chevy Sales Cross Tab
Chevy 1994 1995 total (ALL)
black 50 85 135
white 40 115 155
total (ALL) 90 200 290

Problems with SQL representation
Expressing roll-up and cross-tab queries with SQL is daunting.
The resulting queries are too complex to analyze for optimization.
Proposal: an extension of the relational group by, the cube operator.

Cube Operator
[Figure: "The Data Cube and the Sub-Space Aggregates", showing aggregation over 0 to 3 dimensions]
  Aggregate (0-D): Sum
  Group By with total (1-D): by Color (RED, WHITE, BLUE), plus Sum
  Cross Tab (2-D): by Make (Chevy, Ford) and by Color (RED, WHITE, BLUE), plus Sum
  Data cube (3-D): by Make, by Year, by Color, by Make & Year, by Make & Color, by Color & Year, plus Sum

Sales
SALES
Model  Year  Color  Sales
Chevy  1990  red        5
Chevy  1990  white     87
Chevy  1990  blue      62
Chevy  1991  red       54
Chevy  1991  white     95
Chevy  1991  blue      49
Chevy  1992  red       31
Chevy  1992  white     54
Chevy  1992  blue      71
Ford   1990  red       64
Ford   1990  white     62
Ford   1990  blue      63
Ford   1991  red       52
Ford   1991  white      9
Ford   1991  blue      55
Ford   1992  red       27
Ford   1992  white     62
Ford   1992  blue      39

Data cube for sales
Model  Year  Color  Sales
Chevy  1990  blue      62
Chevy  1990  red        5
Chevy  1990  white     87
Chevy  1990  ALL      154
Chevy  1991  blue      49
Chevy  1991  red       54
Chevy  1991  white     95
Chevy  1991  ALL      198
Chevy  1992  blue      71
Chevy  1992  red       31
Chevy  1992  white     54
Chevy  1992  ALL      156
Chevy  ALL   blue     182
...

Semantics of the cube operator
First aggregates over all the <select list> attributes in the GROUP BY clause, as in a standard GROUP BY.
Then unions in each super-aggregate of the global cube, substituting ALL for the aggregation columns.
Ex. for Sales:
  select model, year, color, sum(sales) as sales
  from sales
  where model in ('Ford', 'Chevy')
    and year between 1990 and 1992
  group by cube model, year, color
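These semantics can be emulated in a few lines of plain Python, which avoids depending on any particular engine's CUBE support. The helper name `cube` is an illustrative assumption; the rows are the SALES table shown earlier.

```python
from collections import defaultdict
from itertools import product

SALES = [
    ("Chevy", 1990, "red", 5), ("Chevy", 1990, "white", 87), ("Chevy", 1990, "blue", 62),
    ("Chevy", 1991, "red", 54), ("Chevy", 1991, "white", 95), ("Chevy", 1991, "blue", 49),
    ("Chevy", 1992, "red", 31), ("Chevy", 1992, "white", 54), ("Chevy", 1992, "blue", 71),
    ("Ford", 1990, "red", 64), ("Ford", 1990, "white", 62), ("Ford", 1990, "blue", 63),
    ("Ford", 1991, "red", 52), ("Ford", 1991, "white", 9), ("Ford", 1991, "blue", 55),
    ("Ford", 1992, "red", 27), ("Ford", 1992, "white", 62), ("Ford", 1992, "blue", 39),
]

def cube(rows):
    """Sum the last field for every combination of kept and ALL-ed dimensions."""
    totals = defaultdict(int)
    for *dims, measure in rows:
        # Each dimension is either kept or replaced by ALL: 2^k keys per row.
        for mask in product((True, False), repeat=len(dims)):
            key = tuple(d if keep else "ALL" for d, keep in zip(dims, mask))
            totals[key] += measure
    return dict(totals)

sales_cube = cube(SALES)
print(sales_cube[("Chevy", 1990, "ALL")])    # sub-total for Chevy in 1990
print(sales_cube[("Chevy", "ALL", "blue")])  # sub-total for blue Chevys
print(len(sales_cube))                       # number of rows in the cube
```

With 2 models, 3 years and 3 colors, the cube has (2+1) x (3+1) x (3+1) = 48 rows, since every dimension gains the extra value ALL.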

SQL to express OLAP queries
The cross-tabulation can be computed using a collection of SQL queries, e.g.:
  SELECT SUM(S.sales)
  FROM Sales S, Times T, Locations L
  WHERE S.timeid = T.timeid AND S.locid = L.locid
  GROUP BY T.year, L.state

  SELECT SUM(S.sales)
  FROM Sales S, Times T
  WHERE S.timeid = T.timeid
  GROUP BY T.year

  SELECT SUM(S.sales)
  FROM Sales S, Locations L
  WHERE S.locid = L.locid
  GROUP BY L.state

The CUBE Operator (SQL:1999)
Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries, one for each subset of the dimensions.
  CUBE pid, locid, timeid BY SUM Sales
Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form:
  SELECT SUM(S.sales)
  FROM Sales S
  GROUP BY grouping-list
Lots of work on optimizing the CUBE operator!

A Sample Data Cube
[Figure: 3-D data cube with axes Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum) and Country (U.S.A, Canada, Mexico, sum); the highlighted cell is the total annual sales of TV in U.S.A.]

Cuboids Corresponding to the Cube
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: (product, date); (product, country); (date, country)
3-D (base) cuboid: (product, date, country)

Typical OLAP Operations
Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
Drill down (roll down): reverse of roll-up; from higher-level summary to lower-level summary or detailed data, or introducing new dimensions.
Slice and dice: slice performs a selection on one dimension of the cube; dice defines a subcube by performing a selection on two or more dimensions.
Pivot (rotate): reorient the cube for visualization, e.g., turn a 3D cube into a series of 2D planes.
Drill across: queries involving (across) more than one fact table.
Drill through: go through the bottom level of the cube to its back-end relational tables (using SQL).
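Slice, dice and roll-up by dimension reduction can be sketched on a cube stored as a Python dictionary. The cell values and the helper names (`dice`, `slice_`, `roll_up`) are invented for illustration; real OLAP servers use far more elaborate storage.

```python
from collections import defaultdict

# Hypothetical base cuboid: (product, quarter, country) -> sales
DIMS = ("product", "quarter", "country")
cube = {
    ("TV", "Q1", "Canada"): 10, ("TV", "Q2", "Canada"): 12,
    ("TV", "Q1", "Mexico"): 7, ("PC", "Q1", "Canada"): 20,
    ("PC", "Q2", "Mexico"): 5,
}

def dice(cells, **allowed):
    """Keep the cells whose value on each named dimension is in the allowed set."""
    return {
        key: v for key, v in cells.items()
        if all(key[DIMS.index(d)] in vals for d, vals in allowed.items())
    }

def slice_(cells, dim, value):
    """A slice is a selection on a single dimension."""
    return dice(cells, **{dim: {value}})

def roll_up(cells, drop):
    """Roll up by dimension reduction: sum out the dimension `drop`."""
    i = DIMS.index(drop)
    out = defaultdict(int)
    for key, v in cells.items():
        out[key[:i] + key[i + 1:]] += v
    return dict(out)

q1 = slice_(cube, "quarter", "Q1")
tv_canada = dice(cube, product={"TV"}, country={"Canada"})
by_product_country = roll_up(cube, "quarter")
print(q1, tv_canada, by_product_country)
```

Drill-down is the reverse of `roll_up` and cannot be computed from the rolled-up result alone; it needs the more detailed cells.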

Roll-up vs. Cube operator
Fact table:
  Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
Dimension tables:
  Market (Market_Id, City, State, Region)
  Product (Product_Id, Name, Category, Price)
  Time (Time_Id, Week, Month, Quarter)

Star Schema
The fact and dimension relations can be displayed in an E-R diagram, which looks like a star and is called a star schema.

Aggregation
Many OLAP queries involve aggregation of the data in the fact table.
For example, to find the total sales (over time) of each product in each market, we might use:
  SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
  FROM Sales S
  GROUP BY S.Market_Id, S.Product_Id
The aggregation is over the entire time dimension and thus produces a two-dimensional view of the data.

Aggregation over Time
The output of the previous query:
[Table: SUM(Sales_Amt), with one column per Market_Id (M1, M2, M3, M4) and one row per Product_Id (P1, P2, P3, P4, P5)]

Drilling Down and Rolling Up
Some dimension tables represent an aggregation hierarchy:
  Market_Id -> City -> State -> Region
When we execute a series of queries that moves down a hierarchy (e.g., from aggregation over regions to aggregation over states), we are said to be drilling down.
  Requires the use of the fact table or of information more specific than the requested aggregation (e.g., cities).
When we move up the hierarchy (e.g., from states to regions), we are said to be rolling up.
  Aggregates can be calculated from the result of a prior query.

Drilling Down
Drilling down on regions:
  Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
  Market (Market_Id, City, State, Region)
Starting from the aggregation over regions:
  SELECT S.Product_Id, M.Region, SUM (S.Sales_Amt)
  FROM Sales S, Market M
  WHERE M.Market_Id = S.Market_Id
  GROUP BY M.Region, S.Product_Id
we drill down to the aggregation over states:
  SELECT S.Product_Id, M.State, SUM (S.Sales_Amt)
  FROM Sales S, Market M
  WHERE M.Market_Id = S.Market_Id
  GROUP BY M.State, S.Product_Id

Rolling Up
Rolling up on regions:
If we have already created a table, State_Sales, containing the result of
  SELECT S.Product_Id, M.State, SUM (S.Sales_Amt) AS Sales_Amt
  FROM Sales S, Market M
  WHERE M.Market_Id = S.Market_Id
  GROUP BY M.State, S.Product_Id
then we can roll up from there to:
  SELECT T.Product_Id, R.Region, SUM (T.Sales_Amt)
  FROM State_Sales T,
       (SELECT DISTINCT M.Region, M.State FROM Market M) AS R (Region, State)
  WHERE R.State = T.State
  GROUP BY R.Region, T.Product_Id
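The claim that the roll-up can reuse State_Sales can be checked with Python's `sqlite3`. The markets and amounts below are invented, and the derived-table column list `AS R (Region, State)` is replaced by column aliases inside the subquery, since SQLite does not accept that form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Market (Market_Id INTEGER PRIMARY KEY, City TEXT, State TEXT, Region TEXT);
CREATE TABLE Sales (Market_Id INTEGER, Product_Id TEXT, Time_Id INTEGER, Sales_Amt INTEGER);
INSERT INTO Market VALUES
    (1, 'Madison', 'WI', 'Midwest'), (2, 'Milwaukee', 'WI', 'Midwest'),
    (3, 'Chicago', 'IL', 'Midwest'), (4, 'San Jose', 'CA', 'West');
INSERT INTO Sales VALUES
    (1, 'P1', 1, 10), (2, 'P1', 1, 20), (3, 'P1', 2, 30), (4, 'P1', 1, 40), (3, 'P2', 1, 5);
-- Materialize the state-level aggregate once.
CREATE TABLE State_Sales AS
    SELECT S.Product_Id AS Product_Id, M.State AS State, SUM(S.Sales_Amt) AS Sales_Amt
    FROM Sales S, Market M
    WHERE M.Market_Id = S.Market_Id
    GROUP BY M.State, S.Product_Id;
""")

# Roll up from the stored state-level result; DISTINCT yields one row per
# (Region, State), so states with several markets are not counted twice.
rolled_up = conn.execute("""
    SELECT T.Product_Id, R.Region, SUM(T.Sales_Amt)
    FROM State_Sales T,
         (SELECT DISTINCT M.Region AS Region, M.State AS State FROM Market M) AS R
    WHERE R.State = T.State
    GROUP BY R.Region, T.Product_Id
    ORDER BY T.Product_Id, R.Region
""").fetchall()

# Reference answer computed directly from the fact table.
direct = conn.execute("""
    SELECT S.Product_Id, M.Region, SUM(S.Sales_Amt)
    FROM Sales S, Market M
    WHERE M.Market_Id = S.Market_Id
    GROUP BY M.Region, S.Product_Id
    ORDER BY S.Product_Id, M.Region
""").fetchall()
print(rolled_up, direct)
```

Joining State_Sales with Market directly would count WI twice here, because WI has two markets; that is why the subquery with DISTINCT is needed.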

Pivoting
When we view the data as a multi-dimensional cube and group on a subset of the axes, we are said to be performing a pivot on those axes.
  Pivoting uses GROUP BY on the chosen axes; aggregation is applied over the remaining attributes.
  Example: pivot on product and time, aggregating over Market_Id:
    Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)
    Time (Time_Id, Week, Month, Quarter)
    SELECT S.Product_Id, T.Quarter, SUM (S.Sales_Amt)
    FROM Sales S, Time T
    WHERE T.Time_Id = S.Time_Id
    GROUP BY T.Quarter, S.Product_Id

ROLLUP
ROLLUP is similar to CUBE, except that instead of aggregating over all subsets of the arguments, it creates subsets by dropping arguments one at a time, moving from right to left.
ROLLUP is also in SQL:1999.

Example of ROLLUP Operator
  SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)
  FROM Sales S
  GROUP BY ROLLUP (S.Market_Id, S.Product_Id)
First aggregates with the finest granularity:
  GROUP BY S.Market_Id, S.Product_Id
Then with the next level of granularity:
  GROUP BY S.Market_Id
Then the grand total is computed with the empty GROUP BY clause:
  GROUP BY ()

ROLLUP vs. CUBE
By contrast, the same query with CUBE:
First aggregates with the finest granularity:
  GROUP BY S.Market_Id, S.Product_Id
Then with the next level of granularity (both subsets):
  GROUP BY S.Market_Id
  GROUP BY S.Product_Id
Then the grand total with:
  GROUP BY ()
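The difference between the two operators comes down to which grouping lists they generate. A minimal sketch in Python, with illustrative function names that are not part of any SQL API:

```python
from itertools import combinations

def rollup_grouping_sets(columns):
    """Prefixes of the column list, dropping columns from right to left."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]

def cube_grouping_sets(columns):
    """Every subset of the column list, largest first."""
    return [
        subset
        for n in range(len(columns), -1, -1)
        for subset in combinations(columns, n)
    ]

cols = ["S.Market_Id", "S.Product_Id"]
rollup_sets = rollup_grouping_sets(cols)
cube_sets = cube_grouping_sets(cols)
print(rollup_sets)  # k + 1 grouping lists
print(cube_sets)    # 2^k grouping lists
```

For k arguments, ROLLUP produces k + 1 grouping lists while CUBE produces 2^k, so ROLLUP is the cheaper choice when the arguments form a hierarchy.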

Bibliography
(Book) Data Mining: Concepts and Techniques, J. Han & M. Kamber, Morgan Kaufmann, 2001 (Sections 2.1 and 2.2)
(Book) Database Management Systems, R. Ramakrishnan & J. Gehrke, 3rd Ed. (Ch. 25)
(Article) An Overview of Data Warehousing and OLAP Technology, S. Chaudhuri & U. Dayal, SIGMOD Record, March 1997
(Article) Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, J. Gray et al., Tech. Report MSR-TR-97-32, 1997