Data Mining by Jiawei Han, Chapter 2

Apr 04, 2018

kishore_phani
  • 7/30/2019 data mning by jaiwei han chapter 2

    1/90

    December 19, 2012 Data Mining: Concepts and Techniques 1

    Data Mining: Concepts and Techniques

    Slides for Textbook Chapter 2

    Jiawei Han and Micheline Kamber

    Department of Computer Science, University of Illinois at Urbana-Champaign

    www.cs.uiuc.edu/~hanj

    Chapter 2: Data Warehousing and OLAP Technology for Data Mining

    What is a data warehouse?

    A multi-dimensional data model

    Data warehouse architecture

    Data warehouse implementation

    Further development of data cube technology

    From data warehousing to data mining

    What is a Data Warehouse?

    Defined in many different ways, but not rigorously.

    A decision support database that is maintained separately from the organization's operational database.

    Supports information processing by providing a solid platform of consolidated, historical data for analysis.

    "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)

    Data warehousing: the process of constructing and using data warehouses.

    Data Warehouse: Subject-Oriented

    Organized around major subjects, such as customer, product, sales.

    Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

    Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

    Data Warehouse: Integrated

    Constructed by integrating multiple, heterogeneous data sources:

    relational databases, flat files, on-line transaction records.

    Data cleaning and data integration techniques are applied.

    Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.

    E.g., hotel price: currency, tax, breakfast covered, etc.

    When data is moved to the warehouse, it is converted.
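The hotel price example can be made concrete with a small sketch. Everything here is an assumption for illustration: the record fields, the exchange rates, and the tax rate are hypothetical, and a real warehouse load would use its own conversion rules.

```python
# Hypothetical source records: each source encodes the hotel price differently.
SOURCE_A = {"hotel": "H1", "price": 100.0, "currency": "USD",
            "tax_included": False, "breakfast": True}
SOURCE_B = {"hotel": "H1", "price": 95.0, "currency": "EUR",
            "tax_included": True, "breakfast": False}

# Assumed constants, for illustration only (not real market data).
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25}
TAX_RATE = 0.10


def to_warehouse(record):
    """Convert one source record to a common form: USD, tax included."""
    price_usd = record["price"] * RATES_TO_USD[record["currency"]]
    if not record["tax_included"]:
        price_usd *= 1 + TAX_RATE
    return {
        "hotel": record["hotel"],
        "price_usd_with_tax": round(price_usd, 2),
        "breakfast_included": record["breakfast"],
    }


converted = [to_warehouse(r) for r in (SOURCE_A, SOURCE_B)]
for row in converted:
    print(row)
```

After conversion, both records use the same currency and tax convention, so their prices can be compared directly.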

    Data Warehouse: Time Variant

    The time horizon for the data warehouse is significantly longer than that of operational systems.

    Operational database: current value data.

    Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years).

    Every key structure in the data warehouse contains an element of time, explicitly or implicitly.

    But the key of operational data may or may not contain a time element.

    Data Warehouse: Non-Volatile

    A physically separate store of data transformed from the operational environment.

    Operational update of data does not occur in the data warehouse environment.

    Does not require transaction processing, recovery, and concurrency control mechanisms.

    Requires only two operations in data accessing: initial loading of data and access of data.

    Data Warehouse vs. Heterogeneous DBMS

    Traditional heterogeneous DB integration:

    Build wrappers/mediators on top of heterogeneous databases.

    Query-driven approach:

    When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set.

    Complex information filtering, compete for resources.

    Data warehouse: update-driven, high performance.

    Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis.

    Data Warehouse vs. Operational DBMS

    OLTP (on-line transaction processing)

    Major task of traditional relational DBMS.

    Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

    OLAP (on-line analytical processing)

    Major task of data warehouse system.

    Data analysis and decision making.

    Distinct features (OLTP vs. OLAP):

    User and system orientation: customer vs. market

    Data contents: current, detailed vs. historical, consolidated

    Database design: ER + application vs. star + subject

    View: current, local vs. evolutionary, integrated

    Access patterns: update vs. read-only but complex queries

    OLTP vs. OLAP

    Feature              OLTP                             OLAP
    users                clerk, IT professional           knowledge worker
    function             day-to-day operations            decision support
    DB design            application-oriented             subject-oriented
    data                 current, up-to-date,             historical, summarized,
                         detailed, flat relational,       multidimensional,
                         isolated                         integrated, consolidated
    usage                repetitive                       ad-hoc
    access               read/write,                      lots of scans
                         index/hash on prim. key
    unit of work         short, simple transaction        complex query
    # records accessed   tens                             millions
    # users              thousands                        hundreds
    DB size              100MB-GB                         100GB-TB
    metric               transaction throughput           query throughput, response

    Why Separate Data Warehouse?

    High performance for both systems

    DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery.

    Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.

    Different functions and different data:

    missing data: Decision support requires historical data which operational DBs do not typically maintain.

    data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources.

    data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled.

    Chapter 2: Data Warehousing and OLAP Technology for Data Mining

    What is a data warehouse?

    A multi-dimensional data model

    Data warehouse architecture

    Data warehouse implementation

    Further development of data cube technology

    From data warehousing to data mining

    From Tables and Spreadsheets to Data Cubes

    A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.

    A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.

    Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year).

    Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.

    In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.

    Cube: A Lattice of Cuboids

    0-D (apex) cuboid: all

    1-D cuboids: time; item; location; supplier

    2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)

    3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)

    4-D (base) cuboid: (time, item, location, supplier)
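The lattice above can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions, so n dimensions (without concept hierarchies) give 2^n cuboids. The following is a minimal sketch; the function name `cuboid_lattice` is made up for illustration.

```python
from itertools import combinations

DIMENSIONS = ("time", "item", "location", "supplier")


def cuboid_lattice(dimensions):
    """Return {k: list of k-D cuboids}, where a cuboid is a tuple of dimensions."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(DIMENSIONS)
total = sum(len(level) for level in lattice.values())

for k, level in lattice.items():
    print(f"{k}-D cuboids ({len(level)}): {level}")
print("total cuboids:", total)
```

The empty tuple at level 0 plays the role of the apex cuboid ("all"), and the single tuple at level 4 is the base cuboid.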

    Conceptual Modeling of Data Warehouses

    Modeling data warehouses: dimensions & measures

    Star schema: A fact table in the middle connected to a set of dimension tables.

    Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.

    Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation.

    Example of Snowflake Schema

    Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales

    time: time_key, day, day_of_the_week, month, quarter, year

    item: item_key, item_name, brand, type, supplier_key

    supplier: supplier_key, supplier_type

    branch: branch_key, branch_name, branch_type

    location: location_key, street, city_key

    city: city_key, city, state_or_province, country

    Example of Fact Constellation

    Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales

    Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped

    time: time_key, day, day_of_the_week, month, quarter, year

    item: item_key, item_name, brand, type, supplier_type

    branch: branch_key, branch_name, branch_type

    location: location_key, street, city, province_or_state, country

    shipper: shipper_key, shipper_name, location_key, shipper_type

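    A Data Mining Query Language: DMQL

    Cube Definition (Fact Table)

    define cube <cube_name> [<dimension_list>]: <measure_list>

    Dimension Definition (Dimension Table)

    define dimension <dimension_name> as (<attribute_or_subdimension_list>)

    Special Case (Shared Dimension Tables)

    First time as "cube definition"

    define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>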

    Defining a Star Schema in DMQL

    define cube sales_star [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

    define dimension time as (time_key, day, day_of_week, month, quarter, year)

    define dimension item as (item_key, item_name, brand, type, supplier_type)

    define dimension branch as (branch_key, branch_name, branch_type)

    define dimension location as (location_key, street, city, province_or_state, country)
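To see what the three measures in the cube definition compute, here is a rough Python sketch over a handful of fact rows. The rows and key values are invented for illustration, and the grouping shown corresponds only to the base cuboid (all four dimension keys); it is not a DMQL implementation.

```python
from collections import defaultdict

# Hypothetical fact rows:
# (time_key, item_key, branch_key, location_key, sales_in_dollars)
FACT_ROWS = [
    ("t1", "i1", "b1", "l1", 100.0),
    ("t1", "i1", "b1", "l1", 300.0),
    ("t1", "i2", "b1", "l1", 50.0),
]


def sales_star(rows):
    """Group rows by the four dimension keys and compute the three measures."""
    groups = defaultdict(list)
    for *keys, amount in rows:
        groups[tuple(keys)].append(amount)
    return {
        keys: {
            "dollars_sold": sum(amounts),               # sum(sales_in_dollars)
            "avg_sales": sum(amounts) / len(amounts),   # avg(sales_in_dollars)
            "units_sold": len(amounts),                 # count(*)
        }
        for keys, amounts in groups.items()
    }


cube = sales_star(FACT_ROWS)
for keys, measures in cube.items():
    print(keys, measures)
```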

    Defining a Snowflake Schema in DMQL

    define cube sales_snowflake [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

    define dimension time as (time_key, day, day_of_week, month, quarter, year)

    define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

    define dimension branch as (branch_key, branch_name, branch_type)

    define dimension location as (location_key, street, city(city_key, province_or_state, country))

    Defining a Fact Constellation in DMQL

    define cube sales [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

    define dimension time as (time_key, day, day_of_week, month, quarter, year)

    define dimension item as (item_key, item_name, brand, type, supplier_type)

    define dimension branch as (branch_key, branch_name, branch_type)

    define dimension location as (location_key, street, city, province_or_state, country)

    define cube shipping [time, item, shipper, from_location, to_location]:
        dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

    define dimension time as time in cube sales

    define dimension item as item in cube sales

    define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)

    define dimension from_location as location in cube sales

    define dimension to_location as location in cube sales

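    A Concept Hierarchy: Dimension (location)

    [Figure: concept hierarchy for the dimension location, with levels all, region, country, city, office. Example values: regions Europe and North_America; countries Germany, Spain, Canada, Mexico; cities Frankfurt, Vancouver, Toronto; offices L. Chan, M. Wind.]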

    View of Warehouses and Hierarchies

    Specification of hierarchies

    Schema hierarchy

    day < {month = minsup

    Motivation

    Only a small portion of cube cells may be above the water in a sparse cube.

    Only calculate interesting data: data above a certain threshold.

    Suppose 100 dimensions, only 1 base cell. How many aggregate (non-base) cells if count >= 1? What about count >= 2?
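One way to reason about the question above: with a single base cell, every aggregate cell is obtained by replacing a non-empty subset of the base cell's values with "*", and each such cell has count 1. The sketch below enumerates a 3-dimensional case and then applies the same counting to 100 dimensions. The function name and the sample values a1, a2, a3 are hypothetical.

```python
from itertools import product


def aggregate_cells(base_cell):
    """Enumerate all aggregate (non-base) cells covering a single base cell.

    Assumes no value in base_cell is itself the string "*".
    """
    options = [(value, "*") for value in base_cell]
    return [cell for cell in product(*options) if cell != tuple(base_cell)]


# Small case that can be listed explicitly: 3 dimensions.
small = aggregate_cells(("a1", "a2", "a3"))
print(len(small), small)

# Same argument for 100 dimensions, without enumerating.
n = 100
cells_with_count_ge_1 = 2 ** n - 1   # every aggregate cell has count exactly 1
cells_with_count_ge_2 = 0            # so none reaches a count of 2
print(cells_with_count_ge_1, cells_with_count_ge_2)
```

A full cube would materialize an enormous number of cells that all describe the same single tuple, whereas an iceberg condition of count >= 2 would keep none of them.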

    Drawbacks of BUC

    Requires a significant amount of memory.

    On par with most other CUBE algorithms, though.

    Does not obtain good performance with dense CUBEs.

    Overly skewed data or a bad choice of dimension ordering reduces performance.

    Cannot compute iceberg cubes with complex measures:

    CREATE CUBE Sales_Iceberg AS
    SELECT month, city, cust_grp, AVG(price), COUNT(*)
    FROM Sales_Infor
    CUBE BY month, city, cust_grp
    HAVING AVG(price) >= 800 AND COUNT(*) >= 50

    Non-Anti-Monotonic Measures

    The cubing query with avg is non-anti-monotonic!

    (Mar, *, *, 600, 1800) fails the HAVING clause.

    (Mar, *, Bus, 1300, 360) passes the clause.

    CREATE CUBE Sales_Iceberg AS
    SELECT month, city, cust_grp, AVG(price), COUNT(*)
    FROM Sales_Infor
    CUBE BY month, city, cust_grp
    HAVING AVG(price) >= 800 AND COUNT(*) >= 50

    Month  City  Cust_grp  Prod     Cost  Price
    Jan    Tor   Edu       Printer  500   485
    Jan    Tor   Hhd       TV       800   1200
    Jan    Tor   Edu       Camera   1160  1280
    Feb    Mon   Bus       Laptop   1500  2500
    Mar    Van   Edu       HD       540   520
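The two cells quoted on this slide can be checked against the HAVING clause directly. The sketch below is only an illustration: the helper name `passes_having` is made up, and the tuples reuse the (avg, count) figures given on the slide rather than values computed from the five-row table.

```python
MIN_AVG, MIN_COUNT = 800, 50  # thresholds from the HAVING clause


def passes_having(avg_price, count):
    """True when a cell satisfies AVG(price) >= 800 AND COUNT(*) >= 50."""
    return avg_price >= MIN_AVG and count >= MIN_COUNT


# Cells as quoted on the slide: (month, city, cust_grp, avg_price, count).
ancestor = ("Mar", "*", "*", 600, 1800)
descendant = ("Mar", "*", "Bus", 1300, 360)

ancestor_ok = passes_having(*ancestor[3:])
descendant_ok = passes_having(*descendant[3:])

# Records in the ancestor but outside the descendant: their low average
# is what pulls the ancestor below the threshold.
rest_count = ancestor[4] - descendant[4]
rest_avg = (ancestor[3] * ancestor[4] - descendant[3] * descendant[4]) / rest_count

print(ancestor_ok, descendant_ok, rest_count, rest_avg)
```

Because a descendant can pass while its ancestor fails, a failing cell cannot be used to prune its descendants, which is why plain avg does not fit Apriori-style or BUC-style pruning.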

    Top-k Average

    Let (*, Van, *) cover 1,000 records.

    Avg(price) is the average price of those 1,000 sales.

    Avg50(price) is the average price of the top-50 sales (top-50 according to the sales price).

    Top-k average is anti-monotonic.

    The top 50 sales in Van. is with avg(price)
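A small sketch of why the top-k average is anti-monotonic: the records of a descendant cell are a subset of its ancestor's records, so its k largest prices cannot sum to more than the ancestor's k largest. The price lists below are invented, and k = 3 stands in for the slide's k = 50 to keep the example short.

```python
import heapq


def top_k_avg(prices, k):
    """Average of the k largest prices, or None if there are fewer than k records."""
    if len(prices) < k:
        return None
    return sum(heapq.nlargest(k, prices)) / k


# Hypothetical prices: an ancestor cell and one descendant (a subset of it).
van_all = [1200, 950, 900, 700, 650, 400]   # e.g. (*, Van, *)
van_feb = [950, 700, 400]                   # e.g. (Feb, Van, *)

k = 3
ancestor_avg = top_k_avg(van_all, k)
descendant_avg = top_k_avg(van_feb, k)
print(ancestor_avg, descendant_avg)
```

If the ancestor's top-k average is already below the threshold, every descendant's is too, so the whole subtree can be pruned.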

    Binning for Top-k Average

    Computing top-k avg is costly with large k.

    Binning idea, for the condition avg50(c) >= 800:

    Large value collapsing: use a sum and a count to summarize records with measure >= 800.

    If count >= 50, no need to check small records.

    Small value binning: a group of bins.

    One bin covers a range, e.g., 600~800, 400~600, etc.

    Register a sum and a count for each bin.

    Approximate top-k average

    Suppose for (*, Van, *), we have

    Range      Sum    Count
    Over 800   28000  20
    600~800    10600  15
    400~600    15200  30

    Top 50: the 20 records over 800, the 15 records in 600~800, and 15 of the records in 400~600, each bounded above by 600.

    Approximate avg50() = (28000 + 10600 + 600*15) / 50 = 952

    The cell may pass the HAVING clause.
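The arithmetic above can be written as a short routine. This is a sketch under stated assumptions: the bin layout (upper bound, sum, count) and the function name are hypothetical, bins are listed from the highest range down, and a partially used bin is estimated optimistically by its upper bound, as in the slide's calculation.

```python
def approx_top_k_avg(bins, k):
    """Optimistic estimate of the top-k average from binned quant-info.

    bins: list of (upper_bound, total, count), highest range first.
          upper_bound is None for the open-ended top bin.
    Returns None if the bins hold fewer than k records.
    """
    remaining, total = k, 0.0
    for upper, bin_sum, bin_count in bins:
        if remaining == 0:
            break
        if bin_count <= remaining:
            # Whole bin falls inside the top k: use its exact sum.
            total += bin_sum
            remaining -= bin_count
        else:
            # Partially used bin: assume each record sits at the upper bound.
            if upper is None:
                raise ValueError("open-ended bin cannot be bounded from above")
            total += upper * remaining
            remaining = 0
    return None if remaining > 0 else total / k


BINS = [(None, 28000, 20), (800, 10600, 15), (600, 15200, 30)]
estimate = approx_top_k_avg(BINS, 50)
print(estimate)
```

Since the estimate never falls below the real top-k average, a cell whose estimate is under the threshold can be pruned safely, while a cell that passes still needs an exact check.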

    Quant-info for Top-k Average Binning

    Accumulate quant-info for cells to compute average iceberg cubes efficiently.

    Three pieces: sum, count, top-k bins.

    Use top-k bins to estimate/prune descendants.

    Use sum and count to consolidate the current cell.

    Conditions, from weakest to strongest:

    Approximate avg50(): anti-monotonic, can be computed efficiently.

    Real avg50(): anti-monotonic, but computationally costly.

    avg(): not anti-monotonic.

    An Efficient Iceberg Cubing Method: Top-k H-Cubing

    One can revise Apriori or BUC to compute a top-k avg iceberg cube. This leads to top-k-Apriori and top-k BUC.

    Can we compute iceberg cube more efficiently?

    Top-k H-cubing: an efficient method to compute iceberg cubes with average measure.

    H-tree: a hyper-tree structure.

    H-cubing: computing iceberg cubes using H-tree.

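    H-tree: A Prefix Hyper-tree

    Month  City  Cust_grp  Prod     Cost  Price
    Jan    Tor   Edu       Printer  500   485
    Jan    Tor   Hhd       TV       800   1200
    Jan    Tor   Edu       Camera   1160  1280
    Feb    Mon   Bus       Laptop   1500  2500
    Mar    Van   Edu       HD       540   520

    [Figure: H-tree with root; first level edu, hhd, bus; second level Jan and Mar (under edu), Jan (under hhd), Feb (under bus); third level Tor, Van, Tor, Mon. Leaf nodes carry quant-info (Q.I.), e.g., Sum: 1765, Cnt: 2, plus bins. A header table lists each attribute value (Edu, Hhd, Bus, Jan, Feb, Tor, Van, Mon) with its quant-info (e.g., Edu, Sum: 2285) and a side-link into the tree.]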

    Properties of H-tree

    Construction cost: a single database scan.

    Completeness: it contains the complete information needed for computing the iceberg cube.

    Compactness: # of nodes <= n*m + 1

    n: # of tuples in the table

    m: # of attributes
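These properties can be checked on the five-tuple example table with a simplified prefix tree. This is only a sketch: it assumes the attribute order cust_grp, month, city, keeps quant-info (sum and count of price) at the leaves, and omits the header table, side-links, and bins of the full H-tree. The helper names are hypothetical.

```python
# (cust_grp, month, city, price), taken from the example table.
ROWS = [
    ("Edu", "Jan", "Tor", 485),
    ("Hhd", "Jan", "Tor", 1200),
    ("Edu", "Jan", "Tor", 1280),
    ("Bus", "Feb", "Mon", 2500),
    ("Edu", "Mar", "Van", 520),
]


def new_node():
    return {"children": {}, "sum": 0, "cnt": 0}


def build_h_tree(rows):
    """Build the prefix tree in a single pass over the rows."""
    root = new_node()
    for *path, price in rows:
        node = root
        for value in path:
            if value not in node["children"]:
                node["children"][value] = new_node()
            node = node["children"][value]
        # Quant-info is accumulated at the leaf reached by this tuple.
        node["sum"] += price
        node["cnt"] += 1
    return root


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node["children"].values())


tree = build_h_tree(ROWS)
n, m = len(ROWS), 3
leaf = tree["children"]["Edu"]["children"]["Jan"]["children"]["Tor"]
print(count_nodes(tree), "<=", n * m + 1)
print(leaf["sum"], leaf["cnt"])
```

Shared prefixes (here, the two Edu/Jan/Tor tuples) are what keep the node count below the n*m + 1 bound.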

    Computing Cells Involving Month But No City

    1. Roll up quant-info.

    2. Compute cells involving month but no city.

    Top-k OK mark: if Q.I. in a child passes the top-k avg threshold, so do its parents. No binning is needed!

    [Figure: the H-tree (root; Edu., Hhd., Bus.; Jan., Mar., Jan., Feb.; Tor., Van., Tor., Mont.) and its header table (Edu. Sum: 2285, Hhd., Bus., Jan., Feb., Mar., Tor., Van., Mont.), with quant-info rolled up from the city level to the month level.]

    Computing Cells Involving Only Cust_grp

    Check header table directly.

    [Figure: the same H-tree and header table (Edu Sum: 2285, Hhd, Bus, Jan, Feb, Mar, Tor, Van, Mon); the quant-info for each Cust_grp value is read from its header table entry.]

    Scalability w.r.t. Count Threshold (No min_avg Setting)

    [Figure: run time (seconds, 0 to 300) versus count threshold (0.00% to 0.10%) for top-k H-Cubing and top-k BUC.]

    Computing Iceberg Cubes with Other Complex Measures

    Computing other complex measures

    Key point: find a function which is weaker but ensures certain anti-monotonicity.

    Examples

    Avg() <= v: avg_k(c) <= v (bottom-k avg)

    Avg() >= v only (no count): max(price) >= v

    Sum(profit) >= v (profit can be negative): p_sum(c) >= v if p_count(c) >= k; otherwise, sum_k(c) >= v

    Others: conjunctions of multiple conditions

    Condensed Cube

    W. Wang, H. Lu, J. Feng, J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE'02.

    Iceberg cube cannot solve all the problems.

    Suppose 100 dimensions, only 1 base cell with count = 10.

    How many aggregate (non-base) cells if count >= 10?

    Condensed cube

    Only need to store one cell (a1, a2, ..., a100, 10), which represents all the corresponding aggregate cells.

    Advantages

    Fully precomputed cube without compression.

    Efficient computation of the minimal condensed cube.

    Chapter 2: Data Warehousing and OLAP Technology for Data Mining

    What is a data warehouse?

    A multi-dimensional data model

    Data warehouse architecture

    Data warehouse implementation

    Further development of data cube technology

    From data warehousing to data mining

    Data Warehouse Usage

    Three kinds of data warehouse applications

    Information processing

    supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs

    Analytical processing

    multidimensional analysis of data warehouse data

    supports basic OLAP operations, slice-dice, drilling, pivoting

    Data mining

    knowledge discovery from hidden patterns

    supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools

    Differences among the three tasks

    From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)

    Why online analytical mining?

    High quality of data in data warehouses

    DW contains integrated, consistent, cleaned data

    Available information processing structure surrounding data warehouses

    ODBC, OLEDB, Web accessing, service facilities, reporting and OLAP tools

    OLAP-based exploratory data analysis

    mining with drilling, dicing, pivoting, etc.

    On-line selection of data mining functions

    integration and swapping of multiple mining functions, algorithms, and tasks

    Architecture of OLAM

    An OLAM Architecture

    [Figure: four-layer OLAM architecture. Layer 1, data repository: databases and data warehouse, fed through data cleaning, data integration, and filtering, accessed via a database API. Layer 2, MDDB: multidimensional database with meta data, accessed via a data cube API (filtering and integration). Layer 3, OLAP/OLAM: OLAM engine and OLAP engine. Layer 4, user interface: user GUI API, taking mining queries and returning mining results.]

    Discovery-Driven Exploration of Data Cubes

    Hypothesis-driven: exploration by user, huge search space.

    Discovery-driven (Sarawagi, et al. '98):

    Effective navigation of large OLAP data cubes.

    Pre-compute measures indicating exceptions, guide user in the data analysis, at all levels of aggregation.

    Exception: significantly different from the value anticipated, based on a statistical model.

    Visual cues such as background color are used to reflect the degree of exception of each cell.

    Kinds of Exceptions and their Computation

    Parameters

    SelfExp: surprise of cell relative to other cells at the same level of aggregation.

    InExp: surprise beneath the cell.

    PathExp: surprise beneath the cell for each drill-down path.

    Computation of exception indicators (model fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction.

    Exceptions themselves can be stored, indexed, and retrieved like precomputed aggregates.

    Examples: Discovery-Driven Data Cubes

    Complex Aggregation at Multiple Granularities: Multi-Feature Cubes

    Multi-feature cubes (Ross, et al. 1998): compute complex queries involving multiple dependent aggregates at multiple granularities.

    Ex. Grouping by all subsets of {item, region, month}, find the maximum price in 1997 for each group, and the total sales among all maximum price tuples.

    select item, region, month, max(price), sum(R.sales)
    from purchases
    where year = 1997
    cube by item, region, month: R
    such that R.price = max(price)

    Continuing the last example, among the max price tuples, find the min and max shelf life, and find the fraction of the total sales due to tuples that have min shelf life within the set of all max price tuples.
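A rough Python sketch of what the first query on this slide computes (maximum price per group, plus total sales among the maximum-price tuples). The purchase rows are invented, the function name is hypothetical, and the brute-force loop over all subsets of the grouping attributes is for illustration only; it is not how a cube engine would evaluate the query.

```python
from itertools import combinations

# Hypothetical purchases: (item, region, month, year, price, sales)
PURCHASES = [
    ("tv", "east", "jan", 1997, 500, 10),
    ("tv", "east", "jan", 1997, 700, 4),
    ("tv", "west", "jan", 1997, 700, 6),
    ("pc", "east", "feb", 1997, 900, 3),
    ("tv", "east", "jan", 1996, 999, 1),   # excluded by "where year = 1997"
]
DIMS = ("item", "region", "month")


def multi_feature_cube(rows, year=1997):
    """For every group-by subset, map group -> (max price, sales of max-price tuples)."""
    rows = [r for r in rows if r[3] == year]
    result = {}
    for k in range(len(DIMS) + 1):
        for subset in combinations(range(len(DIMS)), k):
            groups = {}
            for r in rows:
                # Dimensions outside the subset are generalized to "*".
                key = tuple(r[i] if i in subset else "*" for i in range(len(DIMS)))
                groups.setdefault(key, []).append(r)
            for key, members in groups.items():
                max_price = max(r[4] for r in members)
                # The dependent aggregate: only tuples R with R.price = max(price).
                total = sum(r[5] for r in members if r[4] == max_price)
                result[key] = (max_price, total)
    return result


cube = multi_feature_cube(PURCHASES)
print(cube[("tv", "east", "jan")], cube[("tv", "*", "*")], cube[("*", "*", "*")])
```

The second aggregate depends on the first within each group, which is what distinguishes a multi-feature cube from a plain cube of independent aggregates.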

    Constrained Gradients in Data Cubes

    Significantly more expressive than association rules.

    Capture trends in user-specified measures.

    Serious challenges

    Many trivial cells in a cube: use a significance constraint to prune trivial cells.

    Enumerating pairs of cells: use a probe constraint to select a subset of cells to examine.

    Only interesting changes wanted: use a gradient constraint to capture significant changes.

    MD Constrained Gradient Mining

    Significance constraint C_sig: (cnt >= 100)

    Probe constraint C_prb: (city = "Van", cust_grp = "busi", prod_grp = "*")

    Gradient constraint C_grad(c_g, c_p): (avg_price(c_g) / avg_price(c_p) >= 1.3)

    cid  Yr  City  Cst_grp  Prd_grp  Cnt    Avg_price
    c1   00  Van   Busi     PC       300    2100
    c2   *   Van   Busi     PC       2800   1800
    c3   *   Tor   Busi     PC       7900   2350
    c4   *   *     busi     PC       58600  2250

    (Yr, City, Cst_grp, and Prd_grp are dimensions; Cnt and Avg_price are measures.)

    Figure labels: base cell, aggregated cell, siblings, ancestor.

    Probe cell: satisfies C_prb. (c4, c2) satisfies C_grad!

    A LiveSet-Driven Algorithm

    Compute probe cells using C_sig and C_prb.

    The set of probe cells P is often very small.

    Use probe P and constraints to find gradients.

    Pushing selection deeply.

    Set-oriented processing for probe cells.

    Iceberg growing from low to high dimensionalities.

    Dynamic pruning of probe cells during growth.

    Incorporating an efficient iceberg cubing method.

    References (I)

    S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB'96.

    D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses. SIGMOD'97.

    R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE'97.

    K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. SIGMOD'99.

    S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65-74, 1997.

    OLAP council. MDAPI specification version 2.0. In http://www.olapcouncil.org/research/apily.htm, 1998.

    G. Dong, J. Han, J. Lam, J. Pei, and K. Wang. Mining Multi-dimensional Constrained Gradients in Data Cubes. VLDB'01.

    J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.

    References (II)

    J. Han, J. Pei, G. Dong, and K. Wang. Efficient Computation of Iceberg Cubes With Complex Measures. SIGMOD'01.

    V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD'96.

    Microsoft. OLEDB for OLAP programmer's reference version 1.0. In http://www.microsoft.com/data/oledb/olap, 1998.

    K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB'97.

    K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities. EDBT'98.

    S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98.

    E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. John Wiley & Sons, 1997.

    W. Wang, H. Lu, J. Feng, and J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE'02.

    Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97.

    www.cs.uiuc.edu/~hanj

    Thank you !!!

    Work to be done

    Add MS OLAP snapshots! A tutorial on MS/OLAP.

    Reorganize cube computation materials into cube computation and cube exploration.