Data Mining Tutorial
Tomasz Imielinski, Rutgers University

Transcript
Page 1: Data Mining Tutorial

Data Mining Tutorial

Tomasz Imielinski, Rutgers University

Page 2: Data Mining Tutorial

What is data mining?

• Finding interesting, useful, unexpected
• Finding patterns, clusters, associations, classifications
• Answering inductive queries
• Aggregations and their changes on multidimensional cubes

Page 3: Data Mining Tutorial

Table of Contents

• Association Rules
• Interesting Rules
• OLAP
• Cubegrades – unification of association rules and OLAP
• Classification and clustering methods – not included in this tutorial

Page 4: Data Mining Tutorial

Association Rules
• [AIS 1993] – Agrawal, Imielinski, Swami, “Mining Association Rules”, SIGMOD 1993
• [AS 1994] – Agrawal, Srikant, “Fast Algorithms for Mining Association Rules in Large Databases”, VLDB 1994
• [B 1998] – Bayardo, “Efficiently Mining Long Patterns from Databases”, SIGMOD 1998
• [SA 1996] – Srikant, Agrawal, “Mining Quantitative Association Rules in Large Relational Tables”, SIGMOD 1996
• [T 1996] – Toivonen, “Sampling Large Databases for Association Rules”, VLDB 1996
• [BMS 1997] – Brin, Motwani, Silverstein, “Beyond Market Baskets: Generalizing Association Rules to Correlations”, SIGMOD 1997
• [IV 1999] – Imielinski, Virmani, “MSQL: A Query Language for Database Mining”, DMKD 1999

Page 5: Data Mining Tutorial

Baskets

• I1, …, Im: a set of (binary) attributes called items
• T is a database of transactions
• t[k] = 1 if transaction t bought item k
• Association rule X => I with support s and confidence c
• Support – what fraction of T satisfies X together with I
• Confidence – what fraction of the transactions satisfying X also satisfy I

Page 6: Data Mining Tutorial

Baskets

• Minsup, minconf
• Frequent sets – sets of items X such that their support sup(X) > minsup
• If X is frequent, all its subsets are frequent (downward closure)

Page 7: Data Mining Tutorial

Examples
• 20% of transactions which bought cereal and milk also bought bread (support 2%)
• Worst case – an exponential number (in terms of the size of the set of items) of such rules
• What is the set of transactions which leads to an exponential blow-up of the rule set?
• Fortunately, worst cases are unlikely – not typical. Support provides excellent pruning ability.

Page 8: Data Mining Tutorial

General Strategy

• Generate frequent sets
• Get association rules X => I with support s = support(X + I) and confidence c = support(X + I) / support(X)
• Key property: downward closure of the frequent sets – we don’t have to consider supersets of X if X is not frequent
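To make the definitions concrete, here is a minimal sketch (not from the slides) that computes the support and confidence of a rule over basket data; the toy transactions and item names are illustrative assumptions only:

```python
# Toy sketch: support and confidence of a rule X => I over basket data.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "cereal"},
    {"milk", "cereal", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(body, head, transactions):
    """support(body + head) / support(body)."""
    return support(body | head, transactions) / support(body, transactions)

print(support({"milk", "bread"}, transactions))       # s = support(X + I)
print(confidence({"milk"}, {"bread"}, transactions))   # c = support(X + I) / support(X)
```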

Page 9: Data Mining Tutorial

General strategies

• Make repetitive passes through the database of transactions

• In each pass count support of CANDIDATE frequent sets

• In the next pass continue with frequent sets obtained so far by “expanding” them. Do not expand sets which were determined NOT to be frequent
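A minimal sketch of this level-wise loop in the spirit of Apriori [AS 1994], including the self-join candidate generation and downward-closure pruning discussed on the following slides; the transaction representation (sets of item ids) and minsup given as a fraction are assumptions for illustration:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) with support >= minsup."""
    n = len(transactions)
    # Level 1: frequent single items
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= minsup}
    frequent = set(current)
    k = 2
    while current:
        # Self-join: unite frequent (k-1)-sets that share k-2 items
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune candidates having an infrequent (k-1)-subset (downward closure)
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # One pass over the database to count support of the surviving candidates
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= minsup}
        frequent |= current
        k += 1
    return frequent

# Example: frequent sets at 50% support over toy baskets
# print(apriori([{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread"}], 0.5))
```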

Page 10: Data Mining Tutorial

AIS Algorithm

(R. Agrawal, T. Imielinski, A. Swami, “Mining Association Rules Between Sets of Items in Large Databases”, SIGMOD’93)

Page 11: Data Mining Tutorial

AIS – generating association rules

(R. Agrawal, T. Imielinski, A. Swami, “Mining Association Rules Between Sets of Items in Large Databases”, SIGMOD’93)

Page 12: Data Mining Tutorial

AIS – estimation part

(R. Agrawal, T. Imielinski, A. Swami, “Mining Association Rules Between Sets of Items in Large Databases”, SIGMOD’93)

Page 13: Data Mining Tutorial

Apriori

(R. Agrawal, R Srikant, “Fast Algorithms for Mining Association Rules”, VLDB’94)

Page 14: Data Mining Tutorial

Apriori algorithm

(R. Agrawal, R Srikant, “Fast Algorithms for Mining Association Rules”, VLDB’94)

Page 15: Data Mining Tutorial

Pruning in apriori through self-join

(R. Agrawal, R Srikant, “Fast Algorithms for Mining Association Rules”, VLDB’94)

Page 16: Data Mining Tutorial

Performance improvement due to Apriori pruning

(R. Agrawal, R Srikant, “Fast Algorithms for Mining Association Rules”, VLDB’94)

Page 17: Data Mining Tutorial

Other pruning techniques
• Key question: at any point in time, how to determine which extensions of a given candidate set are “worth” counting
• Apriori – only those for which all subsets are frequent
• Only those for which the estimated upper bound of the count is above minsup
• Take a risk – count a large superset of the given candidate set. If it is frequent, then all its subsets are also frequent – a large saving. If not, at least we have pruned all its supersets.

Page 18: Data Mining Tutorial

Jump ahead schemes: Bayardo’s MaxMine

(R. Bayardo, “Efficiently Mining Long Patterns from Databases, SIGMOD’98)

Page 19: Data Mining Tutorial

Jump ahead scheme

• h(g) and t(g): head and tail of an item group g. The tail is the maximal set of items with which g can possibly be extended.
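A small illustrative sketch of the jump-ahead idea behind this bookkeeping: before extending the head item by item, first count h(g) ∪ t(g); if that whole set is frequent, every extension of the head within the tail is frequent too. The group representation here is an assumption for illustration, not Bayardo's actual data structure:

```python
def jump_ahead(head, tail, transactions, minsup):
    """If head | tail is frequent, all extensions of head within tail are frequent too."""
    n = len(transactions)
    whole = head | tail
    sup = sum(whole <= t for t in transactions) / n
    if sup >= minsup:
        return [whole]   # one maximal frequent set covers all of its subsets
    # otherwise fall back to expanding the head one tail item at a time (not shown)
    return None
```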

Page 20: Data Mining Tutorial

Max-miner

(R. Bayardo, “Efficiently Mining Long Patterns from Databases, SIGMOD’98)

Page 21: Data Mining Tutorial

Max-miner

(R. Bayardo, “Efficiently Mining Long Patterns from Databases, SIGMOD’98)

Page 22: Data Mining Tutorial

Max-miner

(R. Bayardo, “Efficiently Mining Long Patterns from Databases, SIGMOD’98)

Page 23: Data Mining Tutorial

Max-miner

(R. Bayardo, “Efficiently Mining Long Patterns from Databases, SIGMOD’98)

Page 24: Data Mining Tutorial

Max-miner vs Apriori vs Apriori LB

• Max-miner is over two orders of magnitude faster than apriori in identifying maximal frequent patterns on data sets with long max patterns

• Considers fewer candidate sets
• Indexes only on head items
• Dynamic item reordering

Page 25: Data Mining Tutorial

Quantitative Rules

• Rules which involve continuous/quantitative attributes
• Standard approach: discretize into intervals
• Problem: the discretization is arbitrary, so we will miss rules
• MinSup problem: if the number of intervals is large, their support will be low
• MinConf problem: if intervals are large, rules may not meet the minimum confidence
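A minimal sketch of the standard first step, equi-width discretization of a quantitative attribute into interval items; the attribute and bin count are made up, and choosing the bin count badly runs into exactly the MinSup/MinConf trade-off above:

```python
def equi_width_bins(values, k):
    """Map numeric values to k equi-width interval labels, e.g. an age becomes '[23,33)'."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1
    def label(v):
        b = min(int((v - lo) / width), k - 1)
        return f"[{lo + b * width:.0f},{lo + (b + 1) * width:.0f})"
    return [label(v) for v in values]

ages = [23, 27, 31, 35, 44, 52, 61]
print(equi_width_bins(ages, 4))   # many narrow bins -> low support; few wide bins -> low confidence
```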

Page 26: Data Mining Tutorial

Correlation Rules [BMS 1997]
• Suppose the conditional probability that a customer buys coffee given that he buys tea is 80% – is this an important/interesting rule?
• It depends… if the a priori probability of a customer buying coffee is 90%, then it is not
• Need 2x2 contingency tables rather than just pure association rules: a chi-square test for correlation rather than the support/confidence framework alone, which can be misleading

Page 27: Data Mining Tutorial

Correlation Rules
• Events A and B are independent if p(AB) = p(A) x p(B)
• If any of AB, A(notB), (notA)B, (notA)(notB) are dependent, then A and B are correlated; likewise for three items, if any of the eight combinations of A, B and C are dependent, then A, B, C are correlated
• I = {i1, …, in} is a correlation rule iff the occurrences of i1, …, in are correlated
• Correlation is upward closed: if S is correlated, so is any superset of S
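A minimal sketch of the 2x2 contingency-table test suggested here, using scipy's chi-squared test; the counts are made-up numbers chosen to mirror the tea/coffee story from the previous slide:

```python
from scipy.stats import chi2_contingency

# Rows: tea / no tea; columns: coffee / no coffee (toy counts).
table = [[200, 50],    # bought tea:  200 also bought coffee (80%), 50 did not
         [700, 50]]    # no tea:      700 bought coffee, 50 did not

chi2, p, dof, expected = chi2_contingency(table)
correlated = p < 0.05   # reject independence p(AB) = p(A) * p(B) at the 5% level
print(chi2, p, correlated)
```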

Page 28: Data Mining Tutorial

Downward vs upward closure
• Downward closure (frequent sets) is a pruning property
• Upward closure – minimal correlated itemsets, such that no subsets of them are correlated. Then finding correlation is a pruning step – prune all the parents of a correlated itemset because they are not minimal.
• Border of correlation

Page 29: Data Mining Tutorial

Pruning based on support-correlation

• Correlation can be an additional pruning criterion next to support
• Unlike support/confidence, where confidence is not upward closed

Page 30: Data Mining Tutorial

Chi-square

(S. Brin, R. Motwani, C. Silverstein, “Beyond Market Baskets: Generalizing Association Rules to Correlations”, SIGMOD’97)

Page 31: Data Mining Tutorial

Correlation Rules

(S. Brin, R. Motwani, C. Silverstein, “Beyond Market Baskets: Generalizing Association Rules to Correlations”, SIGMOD’97)

Page 32: Data Mining Tutorial

(S. Brin, R. Motwani, C. Silverstein, “Beyond Market Baskets: Generalizing Association Rules to Correlations”, SIGMOD’97)

Page 33: Data Mining Tutorial

Algorithms for Correlation Rules
• The border can be large, exponential in terms of the size of the item set – need better pruning functions
• A support function needs to be defined, but also for negative dependencies
• A set of items S has support s at the p% level if at least p% of the cells in the contingency table for S have value s
• Problem: for p < 50%, all items have support at the level one
• For p > 25%, at least two cells in the contingency table will have support s

Page 34: Data Mining Tutorial

Pruning…

• Antisupport (for rare events)
• Prune itemsets with very high chi-square to eliminate obvious correlations
• Combine chi-squared correlation rules with pruning via support
• An itemset is significant iff it is supported and minimally correlated

Page 35: Data Mining Tutorial

Algorithm 2-support

(S. Brin, R. Motwani, C. Silverstein, “Beyond Market Baskets: Generalizing Association Rules to Correlations”, SIGMOD’97)

INPUT: A chi-squared significance level α, support s, support fraction p > 0.25, basket data B.
OUTPUT: A set SIG of minimal correlated itemsets from B.

1. For each item i, count O(i). These counts can be used to calculate any necessary expected value.
2. Initialize CAND, SIG and NOTSIG to empty.
3. For each pair of items whose counts pass the support threshold, add the pair to CAND.
4. If CAND is empty, return SIG and terminate.
5. For each itemset in CAND, construct the contingency table for the itemset. If fewer than p percent of the cells have count at least s, go to Step 7.
6. If the chi-squared value for the contingency table is at least the cutoff for significance level α, add the itemset to SIG, else add it to NOTSIG.
7. Continue with the next itemset in CAND. If there are no more itemsets in CAND, set CAND to be the set of all sets S such that every subset of S of size |S| − 1 is in NOTSIG. Go to Step 4.

Page 36: Data Mining Tutorial

Sampling Large Databases for Association Rules [T 1996]

• Pick a random sample
• Find all association rules which hold in that sample
• Verify the results against the rest of the database
• Missing rules can be found in a second pass

Page 37: Data Mining Tutorial

Key idea – more detail

• Find a collection of frequent sets in the sample using a lower support threshold. This collection is likely to be a superset of the frequent sets in the entire database

• Concept of negative border: the minimal sets which are not in a set collection S
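A minimal sketch of the negative border (the representation is an assumption for illustration, not Toivonen's code): the minimal itemsets not in S, i.e. those whose every proper non-empty subset is in S; these are the extra sets the full-database pass must count.

```python
from itertools import combinations

def negative_border(S, items):
    """Minimal itemsets over `items` not in S whose every proper non-empty subset is in S."""
    S = {frozenset(x) for x in S}
    border = set()
    for k in range(1, len(items) + 1):
        for cand in map(frozenset, combinations(items, k)):
            if cand not in S and all(frozenset(sub) in S
                                     for r in range(1, k)
                                     for sub in combinations(cand, r)):
                border.add(cand)
    return border

# Frequent sets found in the sample (at the lowered threshold)
sample_frequent = [{"a"}, {"b"}, {"c"}, {"a", "b"}]
print(negative_border(sample_frequent, ["a", "b", "c"]))   # {a,c} and {b,c}
```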

Page 38: Data Mining Tutorial

Algorithm

(H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)

Page 39: Data Mining Tutorial

Second pass

• Negative border consists of the “closest” itemsets which can be frequent too

• These have to be tried (measured)

Page 40: Data Mining Tutorial

(H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)

Page 41: Data Mining Tutorial

Probability that a sample s has exactly c rows that contain X

(H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)

Page 42: Data Mining Tutorial

Bounding error

(H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)

Page 43: Data Mining Tutorial

Approximate mining

(H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)

Page 44: Data Mining Tutorial

Approximate mining

(H. Toivonen, “Sampling Large Databases for Association Rules”, VLDB’96)

Page 45: Data Mining Tutorial

Summary

• Discover all frequent sets in one pass in a fraction 1 − Δ of the cases, where Δ is given by the user; missing sets may be found in a second pass

Page 46: Data Mining Tutorial

Rules and what’s next?

• Querying rules• Embedding rules in applications (API)

Page 47: Data Mining Tutorial

MSQL

(T. Imielinski, A. Virmani, “MSQL: A Query Language for Database Mining”, Data Mining and Knowledge Discovery 3, 99)

Page 48: Data Mining Tutorial

MSQL

(T. Imielinski, A. Virmani, “MSQL: A Query Language for Database Mining”, Data Mining and Knowledge Discovery 3, 99)

Page 49: Data Mining Tutorial

Applications with embedded rules (what are rules good for)

• Typicality
• Characteristic of
• Changing patterns
• Best N
• What if
• Prediction
• Classification

Page 50: Data Mining Tutorial

OLAP

• Multidimensional queries
• Dimensions
• Measures
• Cubes

Page 51: Data Mining Tutorial

Data CUBE

(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, Data Mining and Knowledge Discovery 1, 1997)

Page 52: Data Mining Tutorial

Data Cube

(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, Data Mining and Knowledge Discovery 1, 1997)

Page 53: Data Mining Tutorial

Data Cube

(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, Data Mining and Knowledge Discovery 1, 1997)

Page 54: Data Mining Tutorial

Data Cube

(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, Data Mining and Knowledge Discovery 1, 1997)

Page 55: Data Mining Tutorial

Data Cube

(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, Data Mining and Knowledge Discovery 1, 1997)

Page 56: Data Mining Tutorial

Data Cube

(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, Data Mining and Knowledge Discovery 1, 1997)

Page 57: Data Mining Tutorial

Data Cube

(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, Data Mining and Knowledge Discovery 1, 1997)

Page 58: Data Mining Tutorial

Measure Properties

(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, Data Mining and Knowledge Discovery 1, 1997)

Page 59: Data Mining Tutorial

Measure Properties

(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, Data Mining and Knowledge Discovery 1, 1997)

Page 60: Data Mining Tutorial

Measure Properties

(J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals”, Data Mining and Knowledge Discovery 1, 1997)

Page 61: Data Mining Tutorial

Monotonicity

• Iceberg queries
• COUNT, MAX, SUM etc. allow pruning
• AVG does not – the AVG of a cube extension can be larger or smaller than the AVG over the original cube: thus no pruning in the apriori sense
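A tiny illustration with made-up numbers of why AVG gives no Apriori-style pruning: restricting a cube can push the average either way, while COUNT can only drop.

```python
sales = [10, 10, 100]        # measure values in the original cube
subcube_a = [10, 10]         # one specialization: AVG drops (40 -> 10)
subcube_b = [100]            # another specialization: AVG rises (40 -> 100)

avg = lambda xs: sum(xs) / len(xs)
print(avg(sales), avg(subcube_a), avg(subcube_b))   # 40.0 10.0 100.0
print(len(sales), len(subcube_a), len(subcube_b))   # COUNT only decreases: 3 2 1
```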

Page 62: Data Mining Tutorial

Examples of Monotonic Conditions

• MAX, MIN
• TOP-k AVG

Page 63: Data Mining Tutorial

Cubegrades: combining OLAP and association rules

• Consider the rule: milk, butter => bread [s:100, c:75%].
• Consider it as a gradient or derivative of a cube.
• Body: a 2-D cube in multidimensional space representing transactions where milk and butter are bought together.
• Consequent: represents the specialization of the “body” cube by bread. “Body + consequent” represents the subcube where milk, butter and bread are bought together.
• Support: COUNT of records in the body cube.
• Confidence: measures how COUNT is affected when we specialize the “body” cube by the “consequent”.

Page 64: Data Mining Tutorial

A Different Perspective
• Consider the rule: milk, butter => bread [s:100, c:75%].
• Consider it as a gradient or derivative of a cube.
• Body: a 2-D cube in multidimensional space representing transactions where milk and butter are bought together.
• Consequent: represents the specialization of the “body” cube by bread. “Body + consequent” represents the subcube where milk, butter and bread are bought together.
• Support: COUNT of records in the body cube.
• Confidence: measures how COUNT is affected when we specialize the “body” cube by the “consequent”.

Page 65: Data Mining Tutorial

Cubegrades: Generalization of Association Rules

• We can generalize this in two ways:
  – Allow additional operators for cube transformation, including specializations, generalizations and mutations
  – Allow additional measures such as MIN, MAX, SUM, etc.
• Result: cubegrades – entities that describe how transforming a source cube X to a target cube Y affects a set of measure values.

Page 66: Data Mining Tutorial

Mathematical Similarity

• Similar to the gradient of a function: it measures how changes in the function's argument affect the function value.
• A cubegrade measures how changes in a cube affect measure (function) values.

Page 67: Data Mining Tutorial

Using cubegrades: Examples

• Data description: monthly summaries of item sales per customer + customer demographics.
• Examples:
  – How is the average amount of milk bought affected by different age categories among buyers of cereals?
  – What factors cause the average amount of milk bought to increase by more than 25% among suburban buyers?
  – How do buyers in rural cubes compare with buyers in suburban cubes in terms of the average amount spent on bread, milk and cereal?

Page 68: Data Mining Tutorial

Cubegrade lingo
• Consider the following cube: areaType=‘urban’, Age=[25,35] (Avg(salesMilk)=25)
• Descriptor: an attribute-value pair.
• k-conjunct: a conjunction of k descriptors.
• Cube: the set of objects in a database that satisfy the k-conjunct.
• Dimensions: the attributes used in the descriptors.
• Measures: attributes that are aggregated over the objects.

Page 69: Data Mining Tutorial

Cubegrade Definition
• Mathematically, a cubegrade is a 5-tuple <Source, Target, Measures, Values, Delta-Value>:
  – Source: the source or initial cube.
  – Target: the target cube obtained by applying a factor F to the source. Target = Source + Factor.
  – Measures: the set of measures evaluated.
  – Values: function evaluating a measure in the source.
  – Delta-Value: function evaluating the ratio of a measure's value in the target cube versus its value in the source cube.
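A minimal data-structure sketch of this 5-tuple; the field names and the example values are illustrative assumptions, not the thesis's actual API:

```python
from dataclasses import dataclass

@dataclass
class Cubegrade:
    source: dict          # descriptors of the source cube, e.g. {"areaType": "urban"}
    target: dict          # source descriptors plus the applied factor
    measures: list        # measures evaluated, e.g. ["Avg(salesMilk)"]
    values: dict          # measure -> value on the SOURCE cube
    delta_values: dict    # measure -> target value / source value

g = Cubegrade(
    source={"areaType": "urban"},
    target={"areaType": "urban", "Age": "[25,35]"},
    measures=["Avg(salesMilk)"],
    values={"Avg(salesMilk)": 25},
    delta_values={"Avg(salesMilk)": 1.25},   # 125%: specializing by Age raised the average
)
```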

Page 70: Data Mining Tutorial

Cubegrade Example:
• Source cube: areaType=‘urban’
• Target cube: areaType=‘urban’, Age=[25,35]
• Measure: Avg(salesMilk)
• Value (on the source): Avg(salesMilk) = 25
• Delta value: DeltaAvg(salesMilk) = 125%

Page 71: Data Mining Tutorial

Types of cubegrades

• Generalize on C: A=a1, B=b1, C=c1 → A=a1, B=b1
• Mutate C to c2: A=a1, B=b1, C=c1 → A=a1, B=b1, C=c2
• Specialize by D: A=a1, B=b1, C=c1 → A=a1, B=b1, C=c1, D=d1

Page 72: Data Mining Tutorial

Querying cubegrades

• CubeQL (for querying cubes) and CubegradeQL (for querying cubegrades).
• Features:
  – SQL-like, declarative style
  – Conditions on the source cube and the target cube
  – Conditions on measure values and delta values
  – Join conditions between source and target

Page 73: Data Mining Tutorial

How, which and what

(A. Abdulgani, Ph.D. Thesis, Rutgers University, 2000)

Page 74: Data Mining Tutorial

The Challenge

• Pruning was what made association rules practical.
• Computation was bottom-up: if a cube doesn’t satisfy the support threshold, no subcube will satisfy it.
• COUNT is no longer the sole constraint; there are new, additional constraints.

Page 75: Data Mining Tutorial

Assumptions

• Dealing with the SQL aggregate measures MIN, MAX, SUM, AVG.
• Each constraint is of the form AGG(X) [>, <, =] c, where c is a constant.

Page 76: Data Mining Tutorial

Monotonicity
• Consider a query Q, a database D and a cube X in D.
• Query Q is monotonic if: Q(X) is FALSE in D ⟹ Q(X′) is FALSE in D, for every subcube X′ ⊆ X

Page 77: Data Mining Tutorial

View Monotonicity
• Alternatively, define a cube’s view as the projection of the measure and dimension values holding on the cube.
• A view is not tied to a particular cube or database.
• Q is monotonic for view V if: for any cube X in any D such that V is a view for X, Q(X) is FALSE ⟹ Q(X′) is FALSE, for every X′ ⊆ X

Page 78: Data Mining Tutorial

GBP Sketch

• Grid construction for the input query:
  – Axes defined on the dimension/measure attributes used in the query
  – Axis intervals based on the constants used in the query
  – The Cartesian product of the intervals defines the individual cells
  – Query evaluation for each cell

[Figure: a grid over AVG(X) and MAX(X), with axis intervals at the query constants (0, 25, 50 and 50, 150); each cell is marked TRUE or FALSE for the query.]

Page 79: Data Mining Tutorial

Checking for satisfiability

• Cell C defined by:
  – mL ≤ MIN(A) ≤ mH
  – ML ≤ MAX(A) ≤ MH
  – AL ≤ AVG(A) ≤ AH
  – SL ≤ SUM(A) ≤ SH
  – CL ≤ COUNT() ≤ CH
• Reduce to the system:
  – (N−1)·mL + ML ≤ S ≤ (N−1)·MH + mH
  – SL ≤ S ≤ SH
  – AL·N ≤ S ≤ AH·N
  – CL ≤ N ≤ CH
• Solve for N and check the interval returned for N. For measures on multiple attributes, solve independently for the distinct attributes and check for a common shared interval for N.
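A minimal sketch of the satisfiability check this system suggests (an illustration, not the thesis's GBP implementation): for each candidate COUNT value N, intersect the bounds on the sum S and keep the values of N for which the interval is non-empty. The variable names mirror the slide's notation; the example bounds are made up.

```python
def cell_satisfiable(mL, mH, ML, MH, aL, aH, sL, sH, cL, cH):
    """Is there an N in [cL, cH] and a sum S consistent with all the cell's bounds?"""
    feasible = []
    for N in range(max(int(cL), 1), int(cH) + 1):
        lo = max((N - 1) * mL + ML, sL, aL * N)   # lower bounds on S
        hi = min((N - 1) * MH + mH, sH, aH * N)   # upper bounds on S
        if lo <= hi:
            feasible.append(N)
    return feasible   # empty list means no cube can fall in this cell

# Example: MIN in [0,10], MAX in [0,50], AVG in [46.5,50], COUNT in [1,19], SUM unconstrained
print(cell_satisfiable(0, 10, 0, 50, 46.5, 50, 0, float("inf"), 1, 19))   # -> [12, 13, ..., 19]
```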

Page 80: Data Mining Tutorial

View Reachability

[Figure: the same AVG(X)/MAX(X) grid of TRUE/FALSE cells, with a point V marked.]

• Question: Is there a cube X with view V such that X has a subcube which falls in a TRUE cell?
• Is a TRUE cell C reachable from V?

Page 81: Data Mining Tutorial

Defining View Reachability

• A view V defined by:
  – MIN(A) = m
  – MAX(A) = M
  – AVG(A) = a
  – SUM(A) = s
  – COUNT(A) = c
• A cell C defined by:
  – mL ≤ MIN(A) ≤ mH
  – ML ≤ MAX(A) ≤ MH
  – AL ≤ AVG(A) ≤ AH
  – SL ≤ SUM(A) ≤ SH
  – CL ≤ COUNT() ≤ CH
• Cell C is reachable from view V if there is a set X = {X1, X2, …, XN, …, Xc} of real elements which satisfies the view constraints and a subset X′ = {X1, X2, …, XN} which satisfies the cell constraints.

Page 82: Data Mining Tutorial

Checking for View Reachability
• View reachability on measures of a single attribute can be reduced to at most 4 systems with a constant number of linear constraints on N.
• For measures on multiple distinct attributes, obtain a set of intervals on every attribute separately. C is reachable from V if there is a shared interval obtained on N containing an integral point.

Page 83: Data Mining Tutorial

Example
• Consider a view V of 19 records X = {X1, …, X19} with:
  – MIN(X)=0, MAX(X)=75, SUM(X)=1000
• Let cell C be defined by:
  – [CL, CH]=[1, 19], [mL, mH]=[0, 10], [ML, MH]=[0, 50], [AL, AH]=[46.5, 50]
• C is reachable from V either with N=12 or with N=15.

Page 84: Data Mining Tutorial

Complexity Analysis

• Let Q be a query in disjunctive normal form consisting of m conjuncts in J dimensions and K distinct measure attributes.

• The monotonicity of Q for a given view can be tested in O(m(J + K log K)) time.

Page 85: Data Mining Tutorial

Computing cubegrades

Algorithm Cubegrade Gen Basic:
• Evaluate Q[source]
• For each S in Q[source]:
  – Evaluate Q[S]
  – For each T in Q[S]:
    • Form the cubegrade <S, T, Measures, Values, Delta Values>, where the Delta Values are calculated as ratios of each Measure evaluated on the target and on the source cube respectively.
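A minimal Python rendering of this nested loop; eval_query, eval_measures and q_target_for are hypothetical placeholders standing in for the prototype's actual primitives, which the slides do not show:

```python
def cubegrade_gen_basic(eval_query, eval_measures, q_source, q_target_for, measures):
    """Sketch of Algorithm Cubegrade Gen Basic (illustrative only).

    eval_query(q)        -> iterable of cubes satisfying cube query q
    eval_measures(c, ms) -> dict mapping each measure to its value on cube c
    q_target_for(s)      -> cube query describing the targets reachable from source s
    """
    cubegrades = []
    for source in eval_query(q_source):                     # Evaluate Q[source]
        source_vals = eval_measures(source, measures)        # Values on the source cube
        for target in eval_query(q_target_for(source)):      # Evaluate Q[S]
            target_vals = eval_measures(target, measures)
            # Delta Values: ratio of each measure on the target vs. on the source
            deltas = {m: target_vals[m] / source_vals[m] for m in measures}
            cubegrades.append((source, target, measures, source_vals, deltas))
    return cubegrades
```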

Page 86: Data Mining Tutorial

Cube and Cubegrade query classes
• Cube query classification:
  – Queries with strong monotonicity
  – Queries with weak monotonicity
  – Hopeless queries
• Cubegrade query classification, based on the source cube query classification and the target cube classification:
  – Focused
  – Weakly focused
  – Hopeless

Page 87: Data Mining Tutorial

Cubegrade Application Development

• Cubegrades are not end products; rather, an investment to drive a set of applications.
• Definition of an application framework for cubegrades. Features include:
  – Extension of the Dmajor data mining platform
  – Generation, storage and retrieval of cubegrades
  – Accessing internal components of cubegrades for browsing, comparisons and modifications
  – Traversals through a set of cubegrades
  – Primitives for correlating cubegrades with the underlying data and vice versa

Page 88: Data Mining Tutorial

Application Example: Effective Factors

• Find factors which are effective in changing a measure value m for a collection of cubes by a significant ratio.

• A factor F is effective for C iff for all G = <C′, C′+F, m, V, Delta> where C′ ∈ C, it holds that Delta(m) > (1+x) or Delta(m) < (1−x).

Page 89: Data Mining Tutorial

Cubegrades and OLAP

• Scope – Traditional OLAP: static multi-dimensional object [GBLP96]; Cubegrades: dynamic multi-dimensional object
• Query type – Traditional OLAP: query cubes [CT98, GL98], mostly structural querying; Cubegrades: query cubegrades, structural and value querying
• Query evaluation – Traditional OLAP: static top-down precomputation [AAD96, RS97]; Cubegrades: dynamic bottom-up computation with a novel pruning method

Page 90: Data Mining Tutorial

Future work

• Extending GBP to cover additional constraint types
• Monotonicity threshold of a query
• Domain-specific application: gene expression mining

Page 91: Data Mining Tutorial

Summary

• The cubegrade concept as a generalization of association rules and cubes.
• The concept of querying cubes and cubegrades.
• Description of the GBP method for efficient pruning of queries with constraints of the type Agg(a) {>, <, =} c, where Agg() can be MIN(), MAX(), SUM(), AVG().
• Experimentally demonstrated, through a cubegrade engine prototype, the viability of GBP and of the cubegrade generation process.
• Classification of a hierarchy of query classes based on theoretical pruning characteristics.
• Presentation of a framework for developing cubegrade applications.

Page 92: Data Mining Tutorial

Conclusions
• OLAP and association rules – really one approach
• Key problem – the set of rules/cubegrades can be orders of magnitude larger than the source data set
• Hence, the key issue is how to present/use the obtained rules in applications which provide real value for the user
• Discovery as querying