Association Rule Mining and Clustering

Lecture Outline:

• Classification vs. Association Rule Mining vs. Clustering

• Association Rule Mining

• Clustering

– Types of Clusters

– Clustering Algorithms

∗ Hierarchical: agglomerative, divisive

∗ Non-hierarchical: k-means

Reading:

Sections 3.4, 3.9, 4.5, 4.8 and 6.6, Witten & Frank, 2nd ed.

Chapter 14, Foundations of Statistical Natural Language Processing, C.D. Manning & H. Schütze, MIT Press, 1999

Classification vs. Association Rule Mining vs. Clustering

• So far we have primarily focused on classification:

– Given: a set of training examples represented as pairs of attribute-value vectors (instance representations) + a designated target class

– Learn: how to predict the target class of an unseen instance

– Example: learn to distinguish edible/poisonous mushrooms; credit-worthy loan applicants

• Works well if we understand which attributes are likely to predict others and/or we have a clear-cut classification task in mind.

• However in other cases there may be no distinguished class attribute.

• We may want to learn association rules capturing regularities underlying a dataset:

– Given: set of training examples represented as attribute value vectors

– Learn: if-then rules expressing significant associations between attributes

– Example: learn associations between items consumers buy at the supermarket

• We may want to discover clusters in our data, either to understand the data or to train classifiers

– Given: set of training examples + a similarity measure

– Learn: a set of clusters capturing significant groupings amongst instances

– Example: cluster documents returned by a search engine

Association Rule Mining

• Could use rule learning methods studied earlier:

– consider each possible attribute + value and each possible combination of attributes + values as a potential consequent (RHS) of an if-then rule

– run a rule induction process to induce rules for each such consequent

– then prune resulting association rules by

∗ coverage – the number of instances the rule correctly predicts (also called support); and

∗ accuracy – the proportion of instances to which the rule applies that it correctly predicts (also called confidence)

• However, given the combinatorics, such an approach is computationally infeasible . . .

Association Rule Mining (cont)

• Instead, assume we are only interested in rules with some minimum coverage

– Look for combinations of attribute-value pairs with pre-specified minimum coverage – called item sets, where an item is an attribute-value pair (terminology borrowed from market basket analysis, where associations are sought between items customers buy)

– This approach is followed by the Apriori association rule miner in Weka (Agrawal et al.)

• Sequentially generate all 1-item, 2-item, 3-item, . . . , n-item sets that have minimum coverage (sketched in code below)

– This can be done efficiently by observing that an n-item set can achieve minimum coverage only if all of the (n−1)-item sets which are subsets of the n-item set have minimum coverage

– Example: in the PlayTennis dataset the 3-item set {humidity=normal, windy=false, play=yes} has coverage 4 (i.e. these three attribute-value pairs are true of 4 instances).
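To make the level-wise search concrete, here is a minimal Python sketch, assuming instances are represented as dictionaries of attribute-value pairs; the dictionary representation, function names and min_coverage parameter are illustrative choices, not Weka's implementation.

from itertools import combinations

def coverage(item_set, instances):
    """Number of instances in which every (attribute, value) pair of the item set holds."""
    return sum(all(inst.get(a) == v for a, v in item_set) for inst in instances)

def frequent_item_sets(instances, min_coverage):
    """Level-wise generation of all item sets meeting the minimum coverage."""
    # 1-item sets: every attribute-value pair occurring with minimum coverage
    items = {(a, v) for inst in instances for a, v in inst.items()}
    current = [frozenset([it]) for it in items
               if coverage(frozenset([it]), instances) >= min_coverage]
    frequent = list(current)
    while current:
        # candidate (n+1)-item sets: unions of frequent n-item sets differing in one item
        candidates = {a | b for a in current for b in current if len(a | b) == len(a) + 1}
        # prune candidates with an infrequent n-item subset (the key Apriori observation)
        frequent_n = set(current)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent_n for s in combinations(c, len(c) - 1))}
        current = [c for c in candidates if coverage(c, instances) >= min_coverage]
        frequent.extend(current)
    return frequent

With a minimum coverage of 4 on the PlayTennis data, such a search would recover item sets like the 3-item set in the example above.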

Association Rule Mining (cont)

• Next, form rules by considering, for each minimum coverage item set, all possible rules containing 0 or more attribute-value pairs from the item set in the antecedent and one or more attribute-value pairs from the item set in the consequent.

– From the 3-item set {humidity=normal, windy=false, play=yes} generate 7 rules:

Association Rule Accuracy

IF humidity=normal ∧ windy=false THEN play=yes 4/4

IF humidity=normal ∧ play=yes THEN windy=false 4/6

IF windy=false ∧ play=yes THEN humidity=normal 4/6

IF humidity=normal THEN windy=false ∧ play=yes 4/7

IF windy=false THEN humidity=normal ∧ play=yes 4/8

IF play=yes THEN humidity=normal ∧ windy=false 4/9

IF -- THEN humidity=normal ∧ windy=false ∧ play=yes 4/12

• Keep only those rules that meet the pre-specified desired accuracy – e.g. in this example only the first rule is kept if an accuracy of 100% is specified (see the sketch below)

• For the PlayTennis dataset there are:

– 3 rules with coverage 4 and accuracy 100%

– 5 rules with coverage 3 and accuracy 100%

– 50 rules with coverage 2 and accuracy 100%
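The rule-forming and pruning step can be sketched in the same style, reusing the coverage helper from the sketch above; the min_accuracy parameter and the counting conventions (e.g. for the empty antecedent) are illustrative and need not match Weka's Apriori exactly.

from itertools import combinations

def rules_from_item_set(item_set, instances, min_accuracy):
    """Form every rule splitting the item set into antecedent and (non-empty) consequent,
    keeping those whose accuracy (confidence) reaches min_accuracy.
    Reuses coverage() from the previous sketch."""
    items = list(item_set)
    kept = []
    for r in range(1, len(items) + 1):
        for consequent in combinations(items, r):
            antecedent = frozenset(items) - frozenset(consequent)
            applies = coverage(antecedent, instances)   # instances the rule applies to
            correct = coverage(item_set, instances)     # ...of which these are predicted correctly
            if applies and correct / applies >= min_accuracy:
                kept.append((antecedent, frozenset(consequent), correct, applies))
    return kept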

Types of Clusters

• Approaches to clustering can be characterised in various ways

• One characterisation is by the type of clusters produced – clusters may be:

1. partitions of the instance space – each instance is assigned to exactly one cluster

2. overlapping subsets of the instance space – instances may belong to more than one cluster

3. probabilities of cluster membership associated with each instance – each instance has some probability of belonging to each cluster

4. hierarchical structures – any given cluster may consist of subclusters or instances or both

[Figure: example clusterings of instances a–k illustrating (1) a partition, (2) overlapping clusters, (3) probabilistic membership, with each instance assigned a probability of belonging to each of clusters 1–3, and (4) a hierarchical clustering drawn as a dendrogram]

Clustering Algorithms: A Taxonomy

• Hierarchical clustering

– Agglomerative: bottom up – start with individual instances and group the most similar

– Divisive: top down – start with all instances in a single cluster and divide into groups so as to maximize within-group similarity

– Mixed: start with individual instances and either add to an existing cluster or form a new cluster, possibly merging or splitting instances in existing clusters (CobWeb)

• Non-hierarchical (“flat”) clustering

– Partitioning approaches – hypothesise k clusters, randomly pick cluster centres, and iteratively assign instances to centres and recompute centres until stable

– Probabilistic approaches – hypothesise k clusters, each with an associated (initially guessed) probability distribution of attribute values for instances in the cluster, then iteratively compute cluster probabilities for each instance and recompute cluster parameters until stability

• Incremental vs batch clustering: are clusters computed dynamically as instances become available (CobWeb) or statically on the presumption that the whole instance set is available?

Hierarchical Clustering: Agglomerative Clustering

Given: a set X = {x_1, . . . , x_n} of instances + a similarity function sim: 2^X × 2^X → R

for i = 1 to n
   c_i ← {x_i}
end for
C ← {c_1, . . . , c_n}
j ← n + 1
while |C| > 1
   (c_n1, c_n2) ← argmax_(c_u, c_v) ∈ C×C sim(c_u, c_v)
   c_j ← c_n1 ∪ c_n2
   C ← (C \ {c_n1, c_n2}) ∪ {c_j}
   j ← j + 1
return C

(Manning & Schütze, p. 502)

• Start with a separate cluster for each instance

• Repeatedly determine two most similar clusters and merge them together

• Terminate when a single cluster containing all instances has been formed
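A rough Python transcription of this bottom-up loop; the list-of-merges return value and the single-link similarity used in the toy call (negated distance between the closest members of two 1-D clusters) are illustrative choices rather than part of the original algorithm.

from itertools import combinations

def agglomerative_cluster(instances, sim):
    """Bottom-up clustering: start from singleton clusters and repeatedly merge the
    most similar pair, returning the sequence of merges (a dendrogram in list form)."""
    clusters = [frozenset([i]) for i in range(len(instances))]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: sim(pair[0], pair[1], instances))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((c1, c2))
    return merges

def single_link(c1, c2, instances):
    # similarity of the closest pair of members (here: negated 1-D distance)
    return max(-abs(instances[i] - instances[j]) for i in c1 for j in c2)

print(agglomerative_cluster([0.0, 0.1, 1.0, 1.1, 5.0], single_link))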

Hierarchical Clustering: Divisive Clustering

Given: a set X = {x_1, . . . , x_n} of instances + a coherence function coh: 2^X → R
       + a splitting function split: 2^X → 2^X × 2^X

C ← {X} (= {c_1})
j ← 1
while ∃ c_i ∈ C s.t. |c_i| > 1
   c_u ← argmin_(c_v ∈ C) coh(c_v)
   (c_(j+1), c_(j+2)) ← split(c_u)
   C ← (C \ {c_u}) ∪ {c_(j+1), c_(j+2)}
   j ← j + 2
return C

(Manning & Schütze, p. 502)

• Start with a single cluster containing all instances

• Repeatedly determine the least coherent cluster and split it into two subclusters

• Terminate when no cluster contains more than one instance

Similarity Functions used in Clustering (1)

• Single Link: similarity between two clusters = similarity of the two most similar members

• Complete Link: similarity between two clusters = similarity of the two least similar members

• Group average: similarity between two clusters = average similarity between members
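These three criteria can be written down directly. A minimal sketch, assuming instances are points in Euclidean space and taking similarity to be negated Euclidean distance (one common convention); the group-average function averages over all pairs in the merged cluster, matching the definition given later in these notes.

import math
from itertools import product

def sim(x, y):
    # similarity of two points: negated Euclidean distance (larger = more similar)
    return -math.dist(x, y)

def single_link(c1, c2):
    # similarity of the two *most* similar members
    return max(sim(x, y) for x, y in product(c1, c2))

def complete_link(c1, c2):
    # similarity of the two *least* similar members
    return min(sim(x, y) for x, y in product(c1, c2))

def group_average(c1, c2):
    # average similarity over all pairs of members of the merged cluster
    merged = list(c1) + list(c2)
    pairs = [(x, y) for i, x in enumerate(merged) for y in merged[i + 1:]]
    return sum(sim(x, y) for x, y in pairs) / len(pairs)

a, b = [(1.0, 1.0), (2.0, 1.0)], [(8.0, 1.0), (9.0, 1.0)]
print(single_link(a, b), complete_link(a, b), group_average(a, b))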

[Figure: eight points a, b, c, d (top row) and e, f, g, h (bottom row) plotted on a 2-D grid with marked inter-point distances d, 3/2·d and 2·d (Manning & Schütze, pp. 504-505)]

Similarity Functions used in Clustering (2)

• The best initial move is to merge a/b, c/d, e/f and g/h, since the similarities between these objects are greatest (assuming similarity is reciprocally related to distance)

Similarity Functions used in Clustering (3)

• Using single link clustering, the clusters {a,b} and {c,d}, and also {e,f} and {g,h}, are merged next since the pairs b/c and f/g are closer than other pairs not in the same cluster (e.g. than b/f or c/g)

• Single link clustering results in elongated clusters (“chaining effect”) that are locally coherent in that close objects are in the same cluster

– However, it may have poor global quality – a is closer to e than to d, but a and d are in the same cluster while a and e are not.

Similarity Functions used in Clustering (4)

• Complete link clustering avoids this problem by focusing on global rather than local quality – the similarity of two clusters is the similarity of their two most dissimilar members

• Results in “tighter” clusters in the example than single link clustering

– the minimally similar pairs for the complete link clusters (a/f or b/e) are closer than the minimally similar pair for the single link clusters (a/d)

Similarity Functions used in Clustering (5)

• Unfortunately complete link clustering has time complexity O(n³) – single link clustering is O(n²).

• Group average clustering is a compromise that is O(n²) but avoids the elongated clusters of single link clustering.

• The average similarity between vectors in a cluster c is defined as:

   S(c) = (1 / (|c|(|c|−1))) ∑_{x ∈ c} ∑_{y ∈ c, y ≠ x} sim(x, y)

• At each iteration the clustering algorithm picks the two clusters c_u and c_v that maximize S(c_u ∪ c_v)

• To carry out group average clustering efficiently, care must be taken to avoid recomputing average similarities from scratch after each of the merging steps

– this can be avoided by representing instances as length-normalised vectors in m-dimensional real-valued space and using the cosine similarity measure

– given this approach, the average similarity of a cluster can be computed in constant time from the average similarity of its two children – see Manning & Schütze for details
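A sketch of that bookkeeping, assuming vectors have been length-normalised so that cosine similarity is just the dot product; each cluster is stored as a (sum vector, size) pair, from which the average similarity of any merge can be computed in O(m) time. The representation and names are illustrative; see Manning & Schütze for the derivation.

import numpy as np

def average_sim(sum_vec, size):
    """S(c) for a cluster of `size` length-normalised vectors whose sum is `sum_vec`.
    sum_vec · sum_vec totals every ordered pair x·y, including the `size` self-pairs
    (each equal to 1), which are subtracted off."""
    return (sum_vec @ sum_vec - size) / (size * (size - 1))

def merged_average_sim(cluster_a, cluster_b):
    """Average similarity of the merged cluster, using only the (sum vector, size)
    pair stored for each child cluster."""
    (sa, na), (sb, nb) = cluster_a, cluster_b
    return average_sim(sa + sb, na + nb)

# toy usage: two clusters of unit-length vectors in R^3, each stored as (sum, size)
xs = np.array([[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]])
ys = np.array([[0.6, 0.8, 0.0], [0.0, 1.0, 0.0]])
a = (xs.sum(axis=0), len(xs))
b = (ys.sum(axis=0), len(ys))
print(merged_average_sim(a, b))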

Similarity Functions used in Clustering (6)

• In top down hierarchical clustering a measure is needed for cluster coherence and an operation

to split clusters must be defined

• The similarity measures already defined for bottom up clustering can be used for these tasks

• Coherence can be defined as

– the smallest similarity in the minimum spanning tree for the cluster (the tree connecting all instances the sum of whose edge lengths is minimal), according to the single link similarity measure

– the smallest similarity between any two instances in the cluster, according to the complete

link measure

– the average similarity between objects in the cluster, according to the group average

measure

• Once the least coherent cluster is identified it needs to be split

• Splitting can be seen as a clustering task – find two subclusters of a given cluster – any

clustering algorithm can be used for this task

Non-hierarchical Clustering

• Non-hierarchical clustering algorithms typically start with a partition based on randomly selected seeds and then iteratively refine this partition by reallocating instances to the current best cluster

– contrast with hierarchical algorithms which typically require only one pass

• Termination occurs when, according to some measure of goodness, clusters are no longer improving

– measures of goodness include: group average similarity; mutual information between adjacent clusters; likelihood of the data given the clustering model

• How many clusters?

– may have some prior knowledge about the right number of clusters

– can try various cluster numbers n and see how measures of cluster goodness compare, or whether there is a reduction in the rate of increase of goodness for some n (see the sketch below)

– can use Minimum Description Length to minimize the sum of the lengths of encodings of instances in terms of distance from clusters + encodings of clusters

• Hierarchical clustering does not require the number of clusters to be determined

– however full hierarchical clusterings are rarely usable and the tree must be cut at some point, which is equivalent to specifying a number of clusters
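As a small illustration of the second idea, one simple 'knee' heuristic stops increasing the number of clusters once the relative gain in goodness becomes small; the scores and threshold below are purely illustrative.

def pick_n_clusters(goodness, min_gain=0.05):
    """goodness[i] is the cluster-goodness score obtained with i+1 clusters. Return the
    smallest number of clusters beyond which the relative improvement falls below
    min_gain (a simple 'knee' criterion)."""
    for i in range(1, len(goodness)):
        gain = (goodness[i] - goodness[i - 1]) / abs(goodness[i - 1])
        if gain < min_gain:
            return i               # moving to i+1 clusters gained too little; keep i
    return len(goodness)

# illustrative goodness scores (e.g. group-average similarity) for 1..6 clusters
print(pick_n_clusters([0.30, 0.55, 0.70, 0.72, 0.73, 0.735]))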

K-means Clustering

Given: a set X = {x_1, . . . , x_n} ⊆ R^m
       + a distance measure d: R^m × R^m → R
       + a function for computing the mean µ: P(R^m) → R^m

select k initial centres f_1, . . . , f_k
while stopping criterion is not true
   for all clusters c_j
      c_j ← {x_i | ∀ f_l : d(x_i, f_j) ≤ d(x_i, f_l)}
   end for
   for all means f_j
      f_j ← µ(c_j)
   end for
end while

(Manning & Schütze, p. 516)

• The algorithm picks k initial centres and forms clusters by allocating each instance to its nearest centre

• Centres for each cluster are recomputed as the centroid or mean of the cluster's members, µ_j = (1/|c_j|) ∑_{x ∈ c_j} x, and instances are once again allocated to their nearest centre

• The algorithm iterates until stability or some measure of goodness is attained
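A compact NumPy sketch of this loop, using Euclidean distance, randomly chosen initial centres and "assignments no longer change" as the stopping criterion; all three are illustrative choices rather than requirements of the algorithm.

import numpy as np

def k_means(X, k, seed=0):
    """Assign each instance to its nearest centre, recompute each centre as the mean
    of its cluster, and repeat until the assignments no longer change."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]     # k initial centres
    assignment = None
    while True:
        # distances of every instance to every centre, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)                  # nearest centre per instance
        if np.array_equal(new_assignment, assignment):         # stopping criterion
            return centres, new_assignment
        assignment = new_assignment
        for j in range(k):                                     # recompute the means
            members = X[assignment == j]
            if len(members):                                   # keep old centre if a cluster empties
                centres[j] = members.mean(axis=0)

# toy usage: two obvious groups in the plane
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
print(k_means(X, k=2))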

K-means Clustering – Movement of Cluster Centres

[Figure: movement of cluster centres over successive k-means iterations: http://www.cs.umd.edu/~mount/Projects/KMeans/Images/centers.gif]

Probability-based Clustering and The EM Algorithm

• In probability-based clustering an instance is not placed categorically in a single cluster, but rather is assigned a probability of belonging to every cluster

• The basis of statistical clustering is the finite mixture model

– mixture of k probability distributions representing k clusters

– each distribution gives the probability that an instance would have a certain set of attribute values if it were known to be a member of that cluster

• The clustering problem is to take a set of instances and a pre-specified number of clusters and work out each cluster's mean and variance and the population distribution between the clusters

• EM – Expectation-Maximisation – is an algorithm for doing this

– Like k-means, start with a guess for the parameters governing the clusters

– Use these parameters to calculate cluster probabilities for each instance (expectation – of class values)

– Use these cluster probabilities for each instance to re-estimate cluster parameters

(maximisation – of the likelihood of the distributions given the data)

– Terminate when some goodness measure is met – usually when the increase in the log likelihood that the data came from the finite mixture model is negligible between iterations
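A sketch of these alternating steps for a one-dimensional mixture of k Gaussians; the initialisation, variance handling and stopping threshold are deliberate simplifications of a full EM implementation.

import numpy as np

def em_gaussian_mixture(x, k, n_iter=200, tol=1e-6, seed=0):
    """EM for a 1-D mixture of k Gaussians: the E-step computes each instance's cluster
    probabilities; the M-step re-estimates means, variances and mixing weights;
    iteration stops when the log likelihood barely improves."""
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False)       # initial guesses for the parameters
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: probability of each cluster for each instance, shape (n, k)
        dens = weights * np.exp(-(x[:, None] - means) ** 2 / (2 * variances)) \
               / np.sqrt(2 * np.pi * variances)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the cluster parameters from the soft assignments
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        weights = nk / len(x)
        ll = np.log(dens.sum(axis=1)).sum()            # log likelihood under the mixture
        if ll - prev_ll < tol:                         # negligible improvement: terminate
            break
        prev_ll = ll
    return means, variances, weights

# toy usage: data drawn around two centres
x = np.concatenate([np.random.default_rng(1).normal(0.0, 1.0, 100),
                    np.random.default_rng(2).normal(6.0, 1.0, 100)])
print(em_gaussian_mixture(x, k=2))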

Summary

• While classification – learning to predict an instance's class given a set of attribute values – is central to machine learning, it is not the only task of interest/value

• When there is no clearly distinguished class attribute we may want to

– learn association rules reflecting regularities underlying the data

– discover clusters in the data

• Association rules can be learned by a procedure which

– identifies sets of attribute-value pairs which occur together sufficiently often to be of

interest

– proposes rules relating these attribute-value pairs whose accuracy over the data set is sufficiently high as to be useful

• Clusters – which can be “hard” or “soft”, hierarchical or non-hierarchical – can be discovered using a variety of algorithms including:

– for hierarchical clusters: agglomerative or divisive clustering

– for non-hierarchical clusters: k-means or EM
