
Mar 28, 2015


Korey Pryer

Copyright Jiawei Han, modified by Charles Ling for CS411a


Course Outline

Introduction
Data warehousing and OLAP
Data preprocessing for mining and warehousing
Concept description: characterization and discrimination
Classification and prediction
Association analysis
Clustering analysis
Mining complex data and advanced mining techniques
Trends and research issues


Data Mining and Warehousing: Session 7

Clustering Analysis


Clustering analysis

What is Clustering Analysis?

Clustering in Data Mining Applications

Handling Different Types of Variables

Major Clustering Techniques

Outlier Discovery

Problems and Challenges


What Is Clustering?

Clustering is a process of partitioning a set of data (or objects)

into a set of meaningful sub-classes, called clusters.

May help users understand the natural grouping or

structure in a data set.

Cluster: a collection of data objects that are “similar” to one

another and thus can be treated collectively as one group.

Clustering: unsupervised classification: no predefined classes.

Used either as a stand-alone tool to get insight into data

distribution or as a preprocessing step for other algorithms.


What Is Good Clustering?

A good clustering method will produce high quality

clusters in which:

the intra-class (that is, intra-cluster) similarity is high.

the inter-class similarity is low.

The quality of a clustering result also depends on both the

similarity measure used by the method and its

implementation.

The quality of a clustering method is also measured by its

ability to discover some or all of the hidden patterns.


Requirements of Clustering in Data Mining

Scalability

Dealing with different types of attributes

Discovery of clusters with arbitrary shape

Able to deal with noise and outliers

Insensitive to order of input records

High dimensionality

Interpretability and usability.


Applications of Clustering

Clustering has wide applications in

Pattern Recognition

Spatial Data Analysis:

– create thematic maps in GIS by clustering feature spaces

– detect spatial clusters and explain them in spatial data mining.

Image Processing

Economic Science (especially market research)

WWW:

– Document classification

– Cluster Weblog data to discover groups of similar access patterns


Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.

Land use: Identification of areas of similar land use in an earth observation database.

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost.

City-planning: Identifying groups of houses according to their house type, value, and geographical location.


Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects.

Some popular ones include the Minkowski distance:

$$d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$$

where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer.

If q = 1, d is Manhattan distance:

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

If q = 2, d is Euclidean distance:

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
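The distance definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration; the function name `minkowski_distance` and the sample objects are assumptions, not part of the original slides.

```python
def minkowski_distance(x, y, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of dimensions")
    if q < 1:
        raise ValueError("q must be a positive integer")
    # Sum the q-th powers of the per-dimension differences, then take the q-th root.
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical sample objects with p = 3 dimensions.
i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)

manhattan = minkowski_distance(i, j, 1)  # |1-4| + |2-6| + |3-3|
euclidean = minkowski_distance(i, j, 2)  # sqrt(3^2 + 4^2 + 0^2)
print(manhattan, euclidean)
```

Setting q = 1 or q = 2 gives the Manhattan and Euclidean special cases named above.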


Measure Similarity

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.

Values should be scaled (normalized to 0-1).

Weights should be associated with different variables based on applications and data semantics.

It is hard to define “similar enough” or “good enough”: the answer is typically highly subjective.
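As a rough illustration of the scaling and weighting advice above, the sketch below rescales each variable to the 0-1 range and applies per-variable weights inside a Manhattan-style distance. The helper names, the sample values, and the weights are all hypothetical choices for this example.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_manhattan(x, y, weights):
    """Manhattan distance in which each variable contributes according to its weight."""
    return sum(w * abs(a - b) for a, b, w in zip(x, y, weights))


# Hypothetical data: two variables on very different scales.
ages = [20, 30, 40, 60]
incomes = [20000, 50000, 80000, 120000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)

# Distance between the first two objects, weighting age twice as much as income
# (the weights are an application-specific assumption).
first = (scaled_ages[0], scaled_incomes[0])
second = (scaled_ages[1], scaled_incomes[1])
d = weighted_manhattan(first, second, weights=(2.0, 1.0))
print(d)
```

Without the scaling step, the income differences would dominate the distance simply because of their units.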


Binary, Nominal, Continuous variables

Binary variable: d = 0 if x = y; d = 1 otherwise.

Nominal variables: more than 2 states, e.g., red, yellow, blue, green.

Simple matching: $d(i, j) = \frac{p - u}{p}$, where u is the number of matches and p is the total number of variables.

Also, one can use a large number of binary variables.

Continuous variables: d = |x - y|

Scaling and normalization
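The simple matching measure for nominal variables, and the alternative of recoding one nominal variable as several binary ones, might look like this in Python. The function names and sample objects are illustrative assumptions.

```python
def binary_distance(x, y):
    """Distance between two binary values: 0 if equal, 1 otherwise."""
    return 0 if x == y else 1


def simple_matching_distance(obj_i, obj_j):
    """d(i, j) = (p - u) / p, with u matches out of p nominal variables."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of variables")
    p = len(obj_i)
    u = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - u) / p


def one_hot(value, states):
    """Recode one nominal value as a list of binary variables, one per state."""
    return [1 if value == s else 0 for s in states]


# Hypothetical objects described by three nominal variables.
a = ("red", "small", "round")
b = ("red", "large", "round")
d_ab = simple_matching_distance(a, b)  # two of three variables match

colors = ["red", "yellow", "blue", "green"]
encoded = one_hot("blue", colors)
print(d_ab, encoded)
```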


Five Categories of Clustering Methods

Partitioning algorithms: Construct various partitions and

then evaluate them by some criterion.

Hierarchical algorithms: Create a hierarchical decomposition

of the set of data (or objects) using some criterion.

Density-based: based on connectivity and density functions

Grid-based: based on a multiple-level granularity structure

Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model.


Partitioning Algorithms: Basic Concept

Partitioning method: Construct a partition of a database D

of n objects into a set of k clusters

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.

– Global optimal: exhaustively enumerate all partitions.

– Heuristic methods: k-means and k-medoids algorithms.

– k-means (MacQueen’67): Each cluster is represented by the center of the cluster.

– k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster.


The K-Means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps:

1. Partition objects into k nonempty subsets.

2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.

3. Assign each object to the cluster with the nearest seed point.

4. Go back to Step 2; stop when no more new assignments are made.
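The four steps above can be sketched in plain Python as follows. The round-robin initial partition, the `max_iter` cap, and the rule of keeping the previous centroid when a cluster becomes empty are assumptions made for this illustration; the slide does not specify them.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Step 1: partition objects into k nonempty subsets (round-robin, an assumption).
    labels = [idx % k for idx in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid (mean point) of each current cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical 2-D data with two visually separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, 2)
print(labels, centroids)
```

On this small data set the loop settles after two passes, with the first three points in one cluster and the last three in the other.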

[Figure: four scatter-plot panels, each with both axes running from 0 to 10.]


Comments on the K-Means Method

Strength of the k-means:

– Relatively efficient: O(tkn), where n is # of objects, k is # of clusters, and t is # of iterations. Normally, k, t << n.

Often terminates at a local optimum.

Weakness of the k-means:

– Applicable only when mean is defined; then what about categorical data?

– Need to specify k, the number of clusters, in advance.

– Unable to handle noisy data and outliers.

– Not suitable to discover clusters with non-convex shapes.


The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters.

To achieve this goal, only the definition of distance between any two objects is needed.

PAM (Partitioning Around Medoids, 1987) starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering.

PAM works effectively for small data sets, but does not scale well for large data sets.
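The swap idea behind PAM can be sketched as below. This is a simplified illustration rather than the published algorithm in full detail: taking the first k objects as initial medoids and accepting any improving swap immediately are assumptions of this example.

```python
def total_cost(points, medoids, dist):
    """Sum over all objects of the distance to the nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k, dist):
    # Initial medoids: indices of the first k objects (an assumption).
    medoids = list(range(k))
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                # Try replacing one medoid by one non-medoid.
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(points, trial, dist)
                if cost < best:  # keep the swap only if total distance improves
                    medoids, best, improved = trial, cost, True
    return sorted(medoids), best


# Hypothetical 1-D data; only a distance function is needed, no mean.
data = [1, 2, 3, 10, 11, 12]
medoid_indices, cost = pam(data, 2, dist=lambda a, b: abs(a - b))
print([data[m] for m in medoid_indices], cost)
```

Every candidate swap recomputes the total distance over all objects, which hints at why the method becomes expensive on large data sets.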


Two Types of Hierarchical Clustering Algorithms

Agglomerative (bottom-up): merge clusters iteratively.

– Start by placing each object in its own cluster.

– Merge these atomic clusters into larger and larger clusters until all objects are in a single cluster.

– Most hierarchical methods belong to this category. They differ only in their definition of between-cluster similarity.

Divisive (top-down): split a cluster iteratively.

– It does the reverse by starting with all objects in one cluster and subdividing them into smaller pieces.

– Divisive methods are not generally available, and rarely have been applied.


Hierarchical Clustering

Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition.

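[Figure: agglomerative (AGNES) runs from Step 0 to Step 4, merging the objects a, b, c, d, e into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; divisive (DIANA) runs over the same diagram in the opposite direction, with its Step 0 to Step 4 numbered from the other end.]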


More on Hierarchical Clustering Methods

Between-cluster similarity:

– Minimal distance

– Maximal distance

– Center distance

Major weaknesses of agglomerative clustering methods:

– They do not scale well: time complexity of at least O(n^2), where n is the total number of objects.

– They can never undo what was done previously.
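A small agglomerative sketch with the three between-cluster measures listed above. It works on 1-D numbers for brevity and stops at a requested number of clusters as its termination condition; both choices, and the names used, are assumptions for illustration.

```python
from itertools import combinations


def agglomerative(values, num_clusters, linkage="min"):
    """Merge the two closest clusters repeatedly until num_clusters remain."""
    clusters = [[v] for v in values]  # start: each object in its own cluster

    def cluster_distance(a, b):
        pairwise = [abs(x - y) for x in a for y in b]
        if linkage == "min":      # minimal distance
            return min(pairwise)
        if linkage == "max":      # maximal distance
            return max(pairwise)
        if linkage == "center":   # distance between cluster means
            return abs(sum(a) / len(a) - sum(b) / len(b))
        raise ValueError("unknown linkage: " + linkage)

    while len(clusters) > num_clusters:
        # Examine every pair of clusters and pick the closest one.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: cluster_distance(clusters[pair[0]],
                                                     clusters[pair[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge; this is never undone
        del clusters[j]
    return clusters


# Hypothetical data.
result = agglomerative([1, 2, 6, 7, 20], 2, linkage="min")
print(result)
```

Each merge step scans all pairs of clusters, which is where the quadratic cost comes from, and a merge is never revisited once made.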

Integration of hierarchical clustering with distance-based method:


What Is Outlier Discovery?

What are outliers?

– A set of objects that are considerably dissimilar from the remainder of the data.

– Example: Sports: Michael Jordan, Wayne Gretzky, ...

Problem:

– Given: data points.

– Find the top n outlier points.

Applications:

– Credit card fraud detection

– Telecom fraud detection

– Customer segmentation

– Medical analysis


Outlier Discovery Methods

Distance-based vs. statistics-based outlier analysis:

– Most outlier analyses are univariate (single-variable) and distribution-based (how do we know whether the data follow a normal or gamma distribution?).

– We need multi-dimensional analysis without knowing the data distribution.

Distance-based outlier:

– An object O in a dataset T is a DB(p, D)-outlier if at least a fraction p of the objects in T lie at a distance greater than D from O.
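The DB(p, D) definition translates almost directly into code. The sketch below is a brute-force check; the function name, the sample data, and the chosen p and D are assumptions for illustration.

```python
def is_db_outlier(obj, dataset, p, D, dist):
    """True if at least a fraction p of the objects in dataset lie farther than D from obj."""
    far = sum(1 for other in dataset if dist(obj, other) > D)
    return far >= p * len(dataset)


# Hypothetical 1-D data with one isolated value.
T = [1.0, 1.2, 0.9, 1.1, 1.3, 9.0]


def absolute_difference(a, b):
    return abs(a - b)


outliers = [o for o in T if is_db_outlier(o, T, p=0.8, D=3.0, dist=absolute_difference)]
print(outliers)
```

No distribution is assumed here; only a distance function and the two thresholds are needed.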


Problems and Challenges

Considerable progress has been made in scalable clustering methods:

– Partitioning: k-means, k-medoids, CLARANS

– Hierarchical: BIRCH, CURE

– Density-based: DBSCAN, CLIQUE, OPTICS

– Grid-based: STING, WaveCluster

– Model-based: Autoclass, Denclue, Cobweb

Current clustering techniques do not address all the requirements adequately.

Constraint-based clustering analysis: Constraints exist in data space (bridges and highways) or in user queries.


Data Mining and Data Warehousing

Introduction
Data warehousing and OLAP
Data preprocessing for mining and warehousing
Concept description: characterization and discrimination
Classification and prediction
Association analysis
Clustering analysis
Mining complex data and advanced mining techniques
Trends and research issues


Data Mining and Warehousing: Session 6

Association Analysis


Session 6: Association Analysis

What is association analysis?

Mining single-dimensional Boolean association

rules in transactional databases

Mining multi-level association rules


What Is Association Mining?

Association rule mining:

– Finding association, correlation, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

Applications:

– Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.

Examples:

– Rule form: “Body → Head [support, confidence]”.

– buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]

– major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%]


What Is an Association Rule?

Given:

– A database of customer transactions.

– Each transaction is a list of items (purchased by a customer in a visit).

Find all rules that correlate the presence of one set of items with that of another set of items.

– Example: 98% of people who purchase tires and auto accessories also get automotive services done.

Any number of items in the consequent/antecedent of a rule.

Possible to specify constraints on rules (e.g., find only rules involving Home Laundry Appliances).


Application Examples

Market Basket Analysis:

– * → Maintenance Agreement: what the store should do to boost Maintenance Agreement sales.

– Home Electronics → *: what other products the store should stock up on if it has a sale on Home Electronics.

Attached mailing in direct marketing.

Detecting “ping-pong”ing of patients:

– transaction: patient

– item: doctor/clinic visited by a patient

– support of a rule: number of common patients


Rule Measures: Support and Confidence

Find all the rules X & Y → Z with minimum confidence and support:

– support, s: probability that a transaction contains {X, Y, Z}.

– confidence, c: conditional probability that a transaction having {X, Y} also contains Z.

Transaction ID    Items Bought
2000              A, B, C
1000              A, C
4000              A, D
5000              B, E, F

With minimum support 50% and minimum confidence 50%, we have:

– A → C (50%, 66.6%)

– C → A (50%, 100%)

[Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both.]


Mining Association Rules -- Example

For rule A → C:

support = support({A, C}) = 50%

confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: Any subset of a frequent itemset must be frequent.

Min. support 50%, min. confidence 50%.

Transaction ID    Items Bought
2000              A, B, C
1000              A, C
4000              A, D
5000              B, E, F

Frequent Itemset    Support
{A}                 75%
{B}                 50%
{C}                 50%
{A, C}              50%
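The support and confidence figures on this slide can be checked with a few lines of Python over the four transactions shown. The helper names are assumptions introduced for the example.

```python
# The four transactions from the slide, keyed by transaction ID.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({"A", "C"}, transactions))       # support of {A, C}
print(confidence({"A"}, {"C"}, transactions))  # confidence of A -> C
print(confidence({"C"}, {"A"}, transactions))  # confidence of C -> A
```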


Mining Frequent Itemsets: the Key Step

Find the frequent itemsets: the sets of items that have minimum support.

– A subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.

– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).

Use the frequent itemsets to generate association rules.


The Apriori Algorithm

C_k: candidate itemsets of size k
L_k: frequent itemsets of size k

L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database do
        increment the count of all candidates in C_{k+1} that are contained in t
    L_{k+1} = candidates in C_{k+1} with min_support
end
return ∪_k L_k;


The Apriori Algorithm -- Example

Database D:

TID    Items
100    1 3 4
200    2 3 5
300    1 2 3 5
400    2 5

Scan D to obtain C1:

itemset    sup.
{1}        2
{2}        3
{3}        3
{4}        1
{5}        3

L1:

itemset    sup.
{1}        2
{2}        3
{3}        3
{5}        3

C2 (candidates generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D to obtain the counts for C2:

itemset    sup
{1 2}      1
{1 3}      2
{1 5}      1
{2 3}      2
{2 5}      3
{3 5}      2

L2:

itemset    sup
{1 3}      2
{2 3}      2
{2 5}      3
{3 5}      2

C3: {2 3 5}

Scan D to obtain L3:

itemset    sup
{2 3 5}    2
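A compact Python sketch of the Apriori loop, run on the example database D from this slide. The minimum support count of 2 is inferred from the tables (itemsets with count 1 are dropped) and is therefore an assumption, as are the function and variable names.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return a dict mapping each frequent itemset (frozenset) to its support count."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the database: count how many transactions contain each candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    current = {c: n for c, n in count(items).items() if n >= min_count}  # L1
    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step: every k-item subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        current = {c: n for c, n in count(candidates).items() if n >= min_count}
        k += 1
    return frequent


# Database D from the slide (TIDs 100, 200, 300, 400).
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(D, min_count=2)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

The prune step is where the Apriori principle is applied: {1 2 3} and {1 3 5} are never counted because {1 2} and {1 5} are not frequent, leaving {2 3 5} as the only size-3 candidate.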


Generating Association Rules

A Naive Algorithm

for each frequent itemset F do
    for each subset c of F do
        if ( support(F) / support(F - c) >= minconf ) then
            output rule (F - c) → c,
                with confidence = support(F) / support(F - c)
                and support = support(F)
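The naive rule-generation loop could be written in Python roughly as follows. It reuses the frequent-itemset supports from the earlier "Mining Association Rules" example; restricting c to non-empty proper subsets of F is an interpretation made here, and the names are illustrative.

```python
from itertools import combinations


def generate_rules(supports, min_conf):
    """supports maps each frequent itemset (frozenset) to its support."""
    rules = []
    for itemset, sup in supports.items():
        # Enumerate every non-empty proper subset c of the itemset F.
        for size in range(1, len(itemset)):
            for c in combinations(sorted(itemset), size):
                consequent = frozenset(c)
                antecedent = itemset - consequent  # F - c
                conf = sup / supports[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, consequent, sup, conf))
    return rules


# Frequent itemsets and supports from the earlier example slide.
supports = {
    frozenset("A"): 0.75,
    frozenset("B"): 0.50,
    frozenset("C"): 0.50,
    frozenset("AC"): 0.50,
}

rules = generate_rules(supports, min_conf=0.5)
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), sup, round(conf, 3))
```

Looking up `supports[antecedent]` always succeeds for a complete set of frequent itemsets, because every subset of a frequent itemset is itself frequent.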


Multiple-Level Association Rules

Items often form a hierarchy.

Items at the lower level are expected to have lower support.

Rules regarding itemsets at appropriate levels could be quite useful.

Transaction database can be encoded based on dimensions and levels.

It is smart to explore shared multi-level mining (Han & Fu, VLDB’95).

[Figure: concept hierarchy. Food splits into milk and bread; milk into skim and 2%; bread into white and wheat; the lowest level holds brands such as Fraser and Sunset.]

TID    Items
T1     {111, 121, 211, 221}
T2     {111, 211, 222, 323}
T3     {112, 122, 221, 411}
T4     {111, 121}
T5     {111, 122, 211, 221, 413}
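One way to read the encoded table is that each digit of an item code names one level of the hierarchy, so truncating a code generalizes the item to a higher level. That reading is an assumption of this sketch (the slide does not spell out the encoding), as are the helper names.

```python
# Encoded transactions from the slide.
transactions = {
    "T1": {"111", "121", "211", "221"},
    "T2": {"111", "211", "222", "323"},
    "T3": {"112", "122", "221", "411"},
    "T4": {"111", "121"},
    "T5": {"111", "122", "211", "221", "413"},
}


def generalize(items, level):
    """Keep only the first `level` digits of each code (assumed: one digit per level)."""
    return {code[:level] for code in items}


def level_support(transactions, level):
    """Count, for each generalized item, the transactions that contain it."""
    counts = {}
    for items in transactions.values():
        for prefix in generalize(items, level):
            counts[prefix] = counts.get(prefix, 0) + 1
    return counts


print(level_support(transactions, 1))  # counts at the top level
print(level_support(transactions, 3))  # counts at the most specific level
```

Under this reading, counts can only shrink when moving to a lower level, which matches the remark that lower-level items are expected to have lower support.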


Mining Multi-Level Associations

A top-down, progressive deepening approach:

– First find high-level strong rules: milk → bread [20%, 60%].

– Then find their lower-level “weaker” rules: 2% milk → wheat bread [6%, 50%].

Variations in mining multiple-level association rules:

– Level-crossed association rules: 2% milk → Wonder wheat bread

– Association rules with multiple, alternative hierarchies: 2% milk → Wonder bread


Multi-Level Mining: Progressive Deepening

A top-down, progressive deepening approach:

– First mine high-level frequent items: milk (15%), bread (10%).

– Then mine their lower-level “weaker” frequent itemsets: 2% milk (5%), wheat bread (4%).

Different min_support thresholds across multi-levels lead to different algorithms:

– If adopting the same min_support across multi-levels, then toss t if any of t’s ancestors is infrequent.

– If adopting reduced min_support at lower levels, then examine only those descendants whose ancestor’s support is frequent/non-negligible.
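A minimal sketch of progressive deepening with a reduced minimum support at lower levels, restricted to single items. It reuses the encoded transactions from the "Multiple-Level Association Rules" slide and assumes that each digit of a code denotes one hierarchy level; the thresholds (4, 3, 3) are illustrative values, not taken from the slides.

```python
def progressive_deepening(transactions, min_counts):
    """Mine frequent single items level by level, descending only below frequent ancestors."""
    frequent_by_level = []
    allowed = None  # frequent items of the previous (higher) level
    for level, min_count in enumerate(min_counts, start=1):
        counts = {}
        for items in transactions:
            for prefix in {code[:level] for code in items}:
                # Examine a descendant only if its direct ancestor was frequent.
                if allowed is None or prefix[:-1] in allowed:
                    counts[prefix] = counts.get(prefix, 0) + 1
        frequent = {p: n for p, n in counts.items() if n >= min_count}
        frequent_by_level.append(frequent)
        allowed = set(frequent)
    return frequent_by_level


# Encoded transactions T1..T5 from the earlier slide.
encoded = [
    {"111", "121", "211", "221"},
    {"111", "211", "222", "323"},
    {"112", "122", "221", "411"},
    {"111", "121"},
    {"111", "122", "211", "221", "413"},
]

levels = progressive_deepening(encoded, min_counts=[4, 3, 3])
for depth, frequent in enumerate(levels, start=1):
    print(depth, dict(sorted(frequent.items())))
```

Items whose top-level ancestor is infrequent (codes starting with 3 or 4 here) are never counted at the lower levels, which is the pruning described in the last bullet above.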