Data Mining by Jiawei Han, Chapter 2


December 19, 2012 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques

Slides for Textbook Chapter 2

Jiawei Han and Micheline Kamber

Department of Computer Science
University of Illinois at Urbana-Champaign

www.cs.uiuc.edu/~hanj




Chapter 2: Data Warehousing and OLAP Technology for Data Mining

What is a data warehouse?

A multi-dimensional data model

Data warehouse architecture

Data warehouse implementation

Further development of data cube technology

From data warehousing to data mining




What Is a Data Warehouse?

Defined in many different ways, but not rigorously.

A decision support database that is maintained separately from the organization's operational database

Supports information processing by providing a solid platform of consolidated, historical data for analysis.

"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)

Data warehousing: the process of constructing and using data warehouses




Data Warehouse: Subject-Oriented

Organized around major subjects, such as customer, product, sales.

Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.




Data Warehouse: Integrated

Constructed by integrating multiple, heterogeneous data sources

relational databases, flat files, on-line transaction records

Data cleaning and data integration techniques are applied.

Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources

E.g., hotel price: currency, tax, breakfast covered, etc.

When data is moved to the warehouse, it is converted.




Data Warehouse: Time-Variant

The time horizon for the data warehouse is significantly longer than that of operational systems.

Operational database: current value data.

Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouse contains an element of time, explicitly or implicitly

But the key of operational data may or may not contain a time element.



Data Warehouse: Non-Volatile

A physically separate store of data transformed from the operational environment.

Operational update of data does not occur in the data warehouse environment.

Does not require transaction processing, recovery, and concurrency control mechanisms

Requires only two operations in data accessing: initial loading of data and access of data.



Data Warehouse vs. Heterogeneous DBMS

Traditional heterogeneous DB integration:

Build wrappers/mediators on top of heterogeneous databases

Query-driven approach

When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set

Complex information filtering, compete for resources

Data warehouse: update-driven, high performance

Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis



Data Warehouse vs. Operational DBMS

OLTP (on-line transaction processing)

Major task of traditional relational DBMS

Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

OLAP (on-line analytical processing)

Major task of data warehouse system

Data analysis and decision making

Distinct features (OLTP vs. OLAP):

User and system orientation: customer vs. market

Data contents: current, detailed vs. historical, consolidated

Database design: ER + application vs. star + subject

View: current, local vs. evolutionary, integrated

Access patterns: update vs. read-only but complex queries



OLTP vs. OLAP

                    OLTP                          OLAP
users               clerk, IT professional        knowledge worker
function            day-to-day operations         decision support
DB design           application-oriented          subject-oriented
data                current, up-to-date;          historical; summarized,
                    detailed, flat relational;    multidimensional;
                    isolated                      integrated, consolidated
usage               repetitive                    ad-hoc
access              read/write;                   lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction     complex query
# records accessed  tens                          millions
# users             thousands                     hundreds
DB size             100MB-GB                      100GB-TB
metric              transaction throughput        query throughput, response



Why Separate Data Warehouse?

High performance for both systems

DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery

Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.

Different functions and different data:

missing data: decision support requires historical data which operational DBs do not typically maintain

data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources

data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled



Chapter 2: Data Warehousing and OLAP Technology for Data Mining









From Tables and Spreadsheets to Data Cubes

A data warehouse is based on a multidimensional data model, which views data in the form of a data cube

A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions

Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)

Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.



Cube: A Lattice of Cuboids

0-D (apex) cuboid: all

1-D cuboids: time; item; location; supplier

2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier

3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier

4-D (base) cuboid: time,item,location,supplier
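The lattice above can be enumerated mechanically: a k-D cuboid is simply a choice of k of the n dimensions, so an n-dimensional cube has 2^n cuboids. The following is a minimal sketch of that enumeration; the function name `cuboid_lattice` is an illustrative assumption, not part of the slides.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Map each level k to the list of k-D cuboids (tuples of dimension names)."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}


lattice = cuboid_lattice(("time", "item", "location", "supplier"))
total_cuboids = sum(len(cuboids) for cuboids in lattice.values())

for level, cuboids in lattice.items():
    names = [",".join(c) if c else "all" for c in cuboids]
    print(f"{level}-D cuboids: {names}")
print("total:", total_cuboids)
```

Level 0 holds the single apex cuboid (printed as "all") and level 4 holds the single base cuboid, matching the listing above.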



Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures

Star schema: a fact table in the middle connected to a set of dimension tables

Snowflake schema: a refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake

Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation





Example of Snowflake Schema

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales

time: time_key, day, day_of_the_week, month, quarter, year

item: item_key, item_name, brand, type, supplier_key

supplier: supplier_key, supplier_type

branch: branch_key, branch_name, branch_type

location: location_key, street, city_key

city: city_key, city, state_or_province, country



Example of Fact Constellation

Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped

time: time_key, day, day_of_the_week, month, quarter, year

item: item_key, item_name, brand, type, supplier_type

branch: branch_key, branch_name, branch_type

location: location_key, street, city, province_or_state, country

shipper: shipper_key, shipper_name, location_key, shipper_type



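A Data Mining Query Language: DMQL

Cube Definition (Fact Table)

define cube <cube_name> [<dimension_list>]: <measure_list>

Dimension Definition (Dimension Table)

define dimension <dimension_name> as (<attribute_or_subdimension_list>)

Special Case (Shared Dimension Tables)

First time as "cube definition"

define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>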



Defining a Star Schema in DMQL

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)
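To make the star schema concrete, here is a minimal sketch of how the three measures of sales_star could be computed over a relational layout, using Python's built-in sqlite3 module. It is not a DMQL implementation. The table and column names follow the definitions above, but only the item and location dimensions are modeled, and all rows are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                   type TEXT, supplier_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
-- Fact table: one key per dimension plus the raw measure.
CREATE TABLE sales (time_key INTEGER, item_key INTEGER, branch_key INTEGER,
                    location_key INTEGER, sales_in_dollars REAL);
""")

# Hypothetical sample rows, for illustration only.
conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?, ?)", [
    (1, "TV", "BrandA", "home entertainment", "wholesale"),
    (2, "Laptop", "BrandB", "computer", "retail"),
])
conn.executemany("INSERT INTO location VALUES (?, ?, ?, ?, ?)", [
    (1, "1 Main St", "Vancouver", "BC", "Canada"),
    (2, "2 King St", "Toronto", "ON", "Canada"),
])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 400.0),
    (1, 1, 1, 1, 600.0),
    (2, 2, 1, 1, 1200.0),
    (2, 1, 1, 2, 500.0),
])

# One 2-D cuboid (city, type) with the three measures from the cube definition.
cuboid = conn.execute("""
SELECT l.city, i.type,
       SUM(s.sales_in_dollars) AS dollars_sold,
       AVG(s.sales_in_dollars) AS avg_sales,
       COUNT(*)                AS units_sold
FROM sales s
JOIN item i     ON s.item_key = i.item_key
JOIN location l ON s.location_key = l.location_key
GROUP BY l.city, i.type
ORDER BY l.city, i.type
""").fetchall()

for row in cuboid:
    print(row)
```

Each fact row joins to exactly one row in every dimension table, which is why the fact table sits "in the middle" of the star.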



Defining a Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city(city_key, province_or_state, country))



Defining a Fact Constellation in DMQL

define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

define dimension time as time in cube sales

define dimension item as item in cube sales

define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)

define dimension from_location as location in cube sales

define dimension to_location as location in cube sales




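A Concept Hierarchy: Dimension (location)

Levels, from most general to most specific: all, region, country, city, office

all: all

region: Europe, ..., North_America

country: Germany, ..., Spain, Canada, ..., Mexico

city: Frankfurt, ..., Vancouver, ..., Toronto

office: L. Chan, ..., M. Wind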
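A concept hierarchy is what makes roll-up possible: a measure recorded at the city level can be re-aggregated at the country level and then at the region level by following the parent links. The sketch below assumes that idea; the mapping tables use the cities, countries, and regions named above, while the sales figures and the helper `roll_up` are hypothetical.

```python
from collections import defaultdict

# Parent links of the location hierarchy (city -> country -> region).
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Frankfurt": "Germany"}
COUNTRY_TO_REGION = {"Canada": "North_America", "Mexico": "North_America",
                     "Germany": "Europe", "Spain": "Europe"}


def roll_up(totals, parent_of):
    """Aggregate totals one level up the hierarchy using the parent mapping."""
    rolled = defaultdict(float)
    for member, value in totals.items():
        rolled[parent_of[member]] += value
    return dict(rolled)


# Hypothetical city-level sales totals, for illustration only.
city_sales = {"Vancouver": 120.0, "Toronto": 80.0, "Frankfurt": 50.0}

country_sales = roll_up(city_sales, CITY_TO_COUNTRY)
region_sales = roll_up(country_sales, COUNTRY_TO_REGION)
print(country_sales)
print(region_sales)
```

Drill-down is the reverse direction and needs the detailed data, since a total cannot be split back into its parts.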




View of Warehouses and Hierarchies

Specification of hierarchies

Schema hierarchy

day < {month = minsup

Motivation

Only a small portion of cube cells may be above the water in a sparse cube

Only calculate interesting data, i.e., data above a certain threshold

Suppose 100 dimensions, only 1 base cell. How many aggregate (non-base) cells if count >= 1? What about count >= 2?
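The counting question above can be worked through on a small case. With a single base cell, every aggregate cell is obtained by replacing a non-empty subset of its values with "*", so n dimensions give 2^n - 1 aggregate cells, each with count 1. With the threshold count >= 2, none of them qualifies. The sketch below checks this by brute force for n = 3 and then evaluates the formula for n = 100; the function name and the cell values a1, a2, a3 are illustrative assumptions.

```python
from itertools import product


def aggregate_cells(base_cell):
    """All cells that generalize base_cell by replacing >= 1 value with '*'."""
    cells = []
    for mask in product((False, True), repeat=len(base_cell)):
        if any(mask):  # skip the base cell itself (no '*' at all)
            cells.append(tuple("*" if starred else value
                               for starred, value in zip(mask, base_cell)))
    return cells


cells = aggregate_cells(("a1", "a2", "a3"))
print(len(cells))  # 2**3 - 1 aggregate cells for 3 dimensions

# Every aggregate cell covers only the one base tuple, so its count is 1.
cells_with_count_at_least_2 = 0

# For 100 dimensions the same formula applies.
aggregate_cells_100_dims = 2**100 - 1
print(aggregate_cells_100_dims)
```

This is the motivation for iceberg cubes: almost all of those cells carry no information beyond the base cell.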




Drawbacks of BUC

Requires a significant amount of memory

On par with most other CUBE algorithms though

Does not obtain good performance with dense CUBEs

Overly skewed data or a bad choice of dimension ordering reduces performance

Cannot compute iceberg cubes with complex measures

CREATE CUBE Sales_Iceberg AS
SELECT month, city, cust_grp, AVG(price), COUNT(*)
FROM Sales_Infor
CUBEBY month, city, cust_grp
HAVING AVG(price) >= 800 AND COUNT(*) >= 50




Non-Anti-Monotonic Measures

The cubing query with avg is non-anti-monotonic!

(Mar, *, *, 600, 1800) fails the HAVING clause

(Mar, *, Bus, 1300, 360) passes the clause

CREATE CUBE Sales_Iceberg AS
SELECT month, city, cust_grp, AVG(price), COUNT(*)
FROM Sales_Infor
CUBEBY month, city, cust_grp
HAVING AVG(price) >= 800 AND COUNT(*) >= 50

Month  City  Cust_grp  Prod     Cost  Price
Jan    Tor   Edu       Printer   500    485
Jan    Tor   Hhd       TV        800   1200
Jan    Tor   Edu       Camera   1160   1280
Feb    Mon   Bus       Laptop   1500   2500
Mar    Van   Edu       HD        540    520
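The same effect can be seen in the five sample rows of the table above. The sketch below looks only at the AVG(price) >= 800 part of the HAVING clause and ignores the COUNT(*) >= 50 condition, since five rows cannot reach it. The helper `avg_price` is an illustrative assumption.

```python
from statistics import mean

# (month, city, cust_grp, price) taken from the sample table.
rows = [
    ("Jan", "Tor", "Edu", 485),
    ("Jan", "Tor", "Hhd", 1200),
    ("Jan", "Tor", "Edu", 1280),
    ("Feb", "Mon", "Bus", 2500),
    ("Mar", "Van", "Edu", 520),
]


def avg_price(cell):
    """Average price of the rows covered by a cell; '*' matches any value."""
    prices = [row[3] for row in rows
              if all(c == "*" or c == v for c, v in zip(cell, row[:3]))]
    return mean(prices)


ancestor = ("*", "*", "Edu")        # more general cell
descendant = ("Jan", "Tor", "Edu")  # more specific cell under it

print(avg_price(ancestor))    # below 800: the ancestor fails
print(avg_price(descendant))  # at least 800: the descendant passes
```

Because a descendant can pass after its ancestor fails, a failing cell cannot be used to prune its descendants, which is what anti-monotonicity would allow.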




Top-k Average

Let (*, Van, *) cover 1,000 records

Avg(price) is the average price of those 1,000 sales

Avg50(price) is the average price of the top-50 sales (top-50 according to the sales price)

Top-k average is anti-monotonic

The top 50 sales in Van. is with avg(price)




Binning for Top-k Average

Computing top-k avg is costly with large k

Binning idea

Avg50(c) >= 800

Large value collapsing: use a sum and a count to summarize records with measure >= 800

If count >= 50, there is no need to check the smaller records

Small value binning: a group of bins

One bin covers a range, e.g., 600~800, 400~600, etc.

Register a sum and a count for each bin




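Approximate Top-k Average

Suppose for (*, Van, *), we have

Range     Sum    Count
Over 800  28000  20
600~800   10600  15
400~600   15200  30

Approximate avg50() = (28000 + 10600 + 600*15)/50 = 952

The top 50 are the 20 records over 800, the 15 records in 600~800, and 15 records from 400~600, each bounded above by 600.

The cell may pass the HAVING clause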
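The arithmetic above can be sketched as a small function: bins that fall entirely inside the top k contribute their true sum, and the one bin that is only partly needed contributes its upper bound for each remaining record. The result is an upper bound on the real top-k average, which is why the cell "may" pass. The function name and the bin representation are assumptions made for illustration, and the sketch assumes the open-ended top bin holds fewer than k records.

```python
def approx_topk_avg(bins, k):
    """Upper-bound estimate of the top-k average from (upper, sum, count) bins.

    bins must be ordered from the highest value range to the lowest; the
    open-ended top bin has upper=None.
    """
    remaining = k
    total = 0.0
    for upper, bin_sum, bin_count in bins:
        if remaining == 0:
            break
        if bin_count <= remaining:
            # Whole bin is inside the top k: use its exact sum.
            total += bin_sum
            remaining -= bin_count
        else:
            if upper is None:
                raise ValueError("top bin alone exceeds k; no estimate needed")
            # Partial bin: bound each remaining record by the bin's upper limit.
            total += upper * remaining
            remaining = 0
    if remaining:
        raise ValueError("fewer than k records in the bins")
    return total / k


# Bins for (*, Van, *) from the table: (upper bound, sum, count).
bins = [(None, 28000, 20), (800, 10600, 15), (600, 15200, 30)]
estimate = approx_topk_avg(bins, 50)
print(estimate)
```

If the estimate were below the threshold, the cell and all of its descendants could be pruned without touching the individual records.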



Quant-Info for Top-k Average Binning

Accumulate quant-info for cells to compute average iceberg cubes efficiently

Three pieces: sum, count, top-k bins

Use top-k bins to estimate/prune descendants

Use sum and count to consolidate current cell

Conditions, from strongest to weakest:

avg(): not anti-monotonic

real avg50(): anti-monotonic, but computationally costly

approximate avg50(): anti-monotonic, can be computed efficiently



An Efficient Iceberg Cubing Method: Top-k H-Cubing

One can revise Apriori or BUC to compute a top-k avg iceberg cube. This leads to top-k-Apriori and top-k BUC.

Can we compute iceberg cube more efficiently?

Top-k H-cubing: an efficient method to compute iceberg cubes with average measure

H-tree: a hyper-tree structure

H-cubing: computing iceberg cubes using H-tree




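H-tree: A Prefix Hyper-tree

Month  City  Cust_grp  Prod     Cost  Price
Jan    Tor   Edu       Printer   500    485
Jan    Tor   Hhd       TV        800   1200
Jan    Tor   Edu       Camera   1160   1280
Feb    Mon   Bus       Laptop   1500   2500
Mar    Van   Edu       HD        540    520

[Figure: H-tree built from the table. Paths from the root (cust_grp, then month, then city): root, edu, Jan, Tor; root, edu, Mar, Van; root, hhd, Jan, Tor; root, bus, Feb, Mon. Each leaf holds quant-info (Q.I.), e.g., Sum: 1765, Cnt: 2, bins. A header table lists each attribute value (Edu, Hhd, Bus, Jan, Feb, Tor, Van, Mon) with its quant-info (e.g., Edu: Sum: 2285) and a side-link into the tree.]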




Properties of H-tree

Construction cost: a single database scan

Completeness: it contains the complete information needed for computing the iceberg cube

Compactness: # of nodes <= n*m + 1

n: # of tuples in the table

m: # of attributes
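These properties can be checked on the five-tuple sales table with a simplified H-tree. The sketch below is an assumption-laden illustration, not the published algorithm: it keeps only sum and count as quant-info (no top-k bins), stores them at the leaves and in the header table, and represents side-links as plain lists of nodes. The names `Node`, `build_htree`, and `count_nodes` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    value: str
    children: dict = field(default_factory=dict)
    total: float = 0.0  # quant-info: sum of price (filled in at leaves)
    count: int = 0      # quant-info: number of tuples (filled in at leaves)


def build_htree(tuples, order):
    """Build the tree and header table in a single scan over the tuples."""
    root = Node("root")
    header = {}  # (attribute, value) -> quant-info and side-links
    for dims, price in tuples:
        node = root
        for attr in order:
            value = dims[attr]
            entry = header.setdefault((attr, value),
                                      {"sum": 0.0, "count": 0, "links": []})
            entry["sum"] += price
            entry["count"] += 1
            child = node.children.get(value)
            if child is None:  # new node: create it and add a side-link
                child = Node(value)
                node.children[value] = child
                entry["links"].append(child)
            node = child
        node.total += price
        node.count += 1
    return root, header


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())


sales = [
    ({"month": "Jan", "city": "Tor", "cust_grp": "Edu"}, 485),
    ({"month": "Jan", "city": "Tor", "cust_grp": "Hhd"}, 1200),
    ({"month": "Jan", "city": "Tor", "cust_grp": "Edu"}, 1280),
    ({"month": "Feb", "city": "Mon", "cust_grp": "Bus"}, 2500),
    ({"month": "Mar", "city": "Van", "cust_grp": "Edu"}, 520),
]
root, header = build_htree(sales, order=("cust_grp", "month", "city"))

leaf = root.children["Edu"].children["Jan"].children["Tor"]
print(leaf.total, leaf.count)             # quant-info of the shared leaf
print(header[("cust_grp", "Edu")]["sum"])  # header-table quant-info for Edu
print(count_nodes(root), "<=", len(sales) * 3 + 1)
```

The two Edu/Jan/Tor tuples share one path, which is where the saving over n*m + 1 nodes comes from.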




Computing Cells Involving Month But No City

[Figure: the H-tree and header table from the H-tree slide, with quant-info (Q.I.) rolled up from the city level to the month level.]

1. Roll up quant-info

2. Compute cells involving month but no city

Top-k OK mark: if Q.I. in a child passes the top-k avg threshold, so do its parents. No binning is needed!

Computing Cells Involving Only Cust_grp

[Figure: the same H-tree and header table; the quant-info for each cust_grp value (e.g., Edu: Sum: 2285) is read from the header table.]

Check header table directly




Scalability w.r.t. Count Threshold (No min_avg Setting)

[Figure: run time in seconds (0 to 300) versus count threshold (0.00% to 0.10%) for top-k H-Cubing and top-k BUC.]



Computing Iceberg Cubes with Other Complex Measures

Computing other complex measures

Key point: find a function which is weaker but ensures certain anti-monotonicity

Examples

Avg() <= v: avgk(c) <= v (bottom-k avg)

Avg() >= v only (no count): max(price) >= v

Sum(profit) (profit can be negative): p_sum(c) >= v if p_count(c) >= k; or otherwise, sumk(c) >= v

Others: conjunctions of multiple conditions




Condensed Cube

W. Wang, H. Lu, J. Feng, J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE'02.

Iceberg cube cannot solve all the problems

Suppose 100 dimensions, only 1 base cell with count = 10. How many aggregate (non-base) cells if count >= 10?

Condensed cube

Only need to store one cell (a1, a2, ..., a100, 10), which represents all the corresponding aggregate cells

Adv.

Fully precomputed cube without compression

Efficient computation of the minimal condensed cube



Chapter 2: Data Warehousing and OLAP Technology for Data Mining










Data Warehouse Usage

Three kinds of data warehouse applications

Information processing

supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs

Analytical processing

multidimensional analysis of data warehouse data

supports basic OLAP operations, slice-dice, drilling, pivoting

Data mining

knowledge discovery from hidden patterns

supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.

Differences among the three tasks



From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)

Why online analytical mining?

High quality of data in data warehouses

DW contains integrated, consistent, cleaned data

Available information processing structure surrounding data warehouses

ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools

OLAP-based exploratory data analysis

mining with drilling, dicing, pivoting, etc.

On-line selection of data mining functions

integration and swapping of multiple mining functions, algorithms, and tasks.

Architecture of OLAM



An OLAM Architecture

[Figure: four-layer OLAM architecture.]

Layer 4, User Interface: User GUI API; mining query in, mining result out

Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine; Data Cube API

Layer 2, MDDB: MDDB and Meta Data; Database API; filtering & integration

Layer 1, Data Repository: Databases and Data Warehouse; data cleaning, data integration, filtering

Discovery-Driven Exploration of Data Cubes

Hypothesis-driven: exploration by user, huge search space

Discovery-driven (Sarawagi, et al. 98)

Effective navigation of large OLAP data cubes

pre-compute measures indicating exceptions, guide user in the data analysis, at all levels of aggregation

Exception: significantly different from the value anticipated, based on a statistical model

Visual cues such as background color are used to reflect the degree of exception of each cell



Kinds of Exceptions and their Computation

Parameters

SelfExp: surprise of cell relative to other cells at same level of aggregation

InExp: surprise beneath the cell

PathExp: surprise beneath cell for each drill-down path

Computation of exception indicator (model fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction

Exceptions themselves can be stored, indexed and retrieved like precomputed aggregates



Examples: Discovery-Driven Data Cubes



Complex Aggregation at Multiple Granularities: Multi-Feature Cubes

Multi-feature cubes (Ross, et al. 1998): compute complex queries involving multiple dependent aggregates at multiple granularities

Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 1997 for each group, and the total sales among all maximum price tuples

select item, region, month, max(price), sum(R.sales)
from purchases
where year = 1997
cube by item, region, month: R
such that R.price = max(price)

Continuing the last example, among the max price tuples, find the min and max shelf life, and find the fraction of the total sales due to tuples that have min shelf life within the set of all max price tuples
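The first query can be sketched in plain Python to show what "dependent aggregates" means: the sum of sales depends on the maximum price already computed for the same group. This is a naive illustration that recomputes every group-by from scratch, not an efficient multi-feature cube algorithm; the function name and all purchase rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("item", "region", "month")


def multi_feature_cube(purchases, year=1997):
    """For every subset of DIMS, map each group to (max price, sales at max price)."""
    rows = [p for p in purchases if p["year"] == year]
    cube = {}
    for r in range(len(DIMS) + 1):
        for dims in combinations(DIMS, r):
            groups = defaultdict(list)
            for p in rows:
                # Dimensions outside the group-by are generalized to '*'.
                key = tuple(p[d] if d in dims else "*" for d in DIMS)
                groups[key].append(p)
            for key, members in groups.items():
                max_price = max(p["price"] for p in members)
                # Dependent aggregate: restricted to the max-price tuples.
                sales_at_max = sum(p["sales"] for p in members
                                   if p["price"] == max_price)
                cube[key] = (max_price, sales_at_max)
    return cube


# Hypothetical purchase records, for illustration only.
purchases = [
    {"item": "TV", "region": "East", "month": "Jan", "year": 1997, "price": 500, "sales": 3},
    {"item": "TV", "region": "East", "month": "Feb", "year": 1997, "price": 500, "sales": 2},
    {"item": "TV", "region": "West", "month": "Jan", "year": 1997, "price": 450, "sales": 7},
    {"item": "PC", "region": "East", "month": "Jan", "year": 1997, "price": 900, "sales": 1},
    {"item": "PC", "region": "East", "month": "Jan", "year": 1996, "price": 1200, "sales": 9},
]
cube = multi_feature_cube(purchases)
print(cube[("TV", "*", "*")])
print(cube[("*", "*", "*")])
```

The 1996 row is filtered out by the year condition before any grouping, mirroring the where clause of the query.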


From Cubegrade to Multi-dimensional Constrained Gradients in Data Cubes

Significantly more expressive than association rules

Capture trends in user-specified measures

Serious challenges

Many trivial cells in a cube: use a significance constraint to prune trivial cells

Enumerating pairs of cells: use a probe constraint to select a subset of cells to examine

Only interesting changes wanted: use a gradient constraint to capture significant changes



MD Constrained Gradient Mining

Significance constraint C_sig: (cnt >= 100)

Probe constraint C_prb: (city = "Van", cust_grp = "busi", prod_grp = "*")

Gradient constraint C_grad(c_g, c_p): (avg_price(c_g)/avg_price(c_p) >= 1.3)

      Dimensions                    Measures
cid   Yr  City  Cst_grp  Prd_grp    Cnt    Avg_price
c1    00  Van   Busi     PC         300    2100
c2    *   Van   Busi     PC         2800   1800
c3    *   Tor   Busi     PC         7900   2350
c4    *   *     busi     PC         58600  2250

Figure labels: base cell, aggregated cell, siblings, ancestor

Probe cell: satisfied C_prb

(c4, c2) satisfies C_grad!



A LiveSet-Driven Algorithm

Compute probe cells using C_sig and C_prb

The set of probe cells P is often very small

Use probe P and constraints to find gradients

Pushing selection deeply

Set-oriented processing for probe cells

Iceberg growing from low to high dimensionalities

Dynamic pruning probe cells during growth

Incorporating efficient iceberg cubing method




References (I)

S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB'96.

D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD'97.

R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE'97.

K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. SIGMOD'99.

S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997.

OLAP council. MDAPI specification version 2.0. In http://www.olapcouncil.org/research/apily.htm, 1998.

G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-dimensional Constrained Gradients in Data Cubes. VLDB 2001.

J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.



References (II)

J. Han, J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes With Complex Measures. SIGMOD'01.

V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD'96.

Microsoft. OLEDB for OLAP programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap, 1998.

K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB'97.

K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities. EDBT'98.

S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98.

E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. John Wiley & Sons, 1997.

W. Wang, H. Lu, J. Feng, J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE'02.

Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97.



www.cs.uiuc.edu/~hanj

Thank you !!!


Work to be done

Add MS OLAP snapshots! A tutorial on MS/OLAP

Reorganize cube computation materials

Into cube computation and cube exploration
