E.G.M. Petrakis Text Clustering 1
Clustering
“Clustering is the unsupervised classification of patterns (observations, data items or feature vectors) into groups (clusters)” [ACM CS’99]
Instances within a cluster are very similar
Instances in different clusters are very different
Example
[Figure: data points plotted in the term1–term2 space, forming visually separable clusters]
Applications
Faster retrieval
Faster and better browsing
Structuring of search results
Revealing classes and other data regularities
Directory construction
Better data organization in general
Cluster Searching
Similar instances tend to be relevant to the same requests
The query is mapped to the closest cluster by comparison with the cluster-centroids
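The slides give no code for this step; as a minimal sketch (the function names and toy vectors below are mine, not from the slides), mapping a query vector to the cluster with the closest centroid might look like:

```python
import math

def cosine(u, v):
    # Cosine of the angle between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(docs):
    # Component-wise mean of the cluster's document vectors.
    return [sum(d[i] for d in docs) / len(docs) for i in range(len(docs[0]))]

def closest_cluster(query, clusters):
    # Compare the query with each cluster centroid; search only the winner.
    return max(range(len(clusters)),
               key=lambda j: cosine(query, centroid(clusters[j])))
```

A query that leans toward the first term would then be routed to the cluster whose documents also lean that way, and only that cluster needs to be searched.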
Notation
N: number of elements
Class: a real-world grouping (ground truth)
Cluster: a grouping produced by the algorithm
The ideal clustering algorithm produces clusters equivalent to the real-world classes, with exactly the same members
Problems
How many clusters?
Complexity? N is usually large
Quality of clustering: when is one method better than another?
Overlapping clusters
Sensitivity to outliers

Approaches:
Hierarchical or flat clustering
Divisive: build clusters "top-down", starting from the entire data set (K-means, Bisecting K-means)
Agglomerative: build clusters "bottom-up", starting with individual instances and iteratively combining them to form larger clusters at higher levels (hierarchical clustering)
Combinations of the above (Buckshot algorithm)
Hierarchical – Flat Clustering
Flat: all clusters at the same level (K-means, Buckshot)
Hierarchical: nested sequence of clusters
A single cluster with all the data at the top, singleton clusters at the bottom
Intermediate levels are more useful
Every intermediate level combines two clusters from the next lower level (Agglomerative, Bisecting K-means)
Flat Clustering
[Figure: flat clustering — the data points partitioned into disjoint clusters at a single level]
Hierarchical Clustering
[Figure: hierarchical clustering — the data points grouped into nested clusters 1–7, shown as a dendrogram]
Text Clustering
Finds overall similarities among documents or groups of documents
Faster searching, browsing, etc.
Needs to know how to compute the similarity (or equivalently the distance) between documents
Query – Document Similarity
Similarity is defined as the cosine of the angle θ between the document (or query) vectors d1 and d2:

\mathrm{Sim}(d_1, d_2) = \frac{d_1 \cdot d_2}{|d_1|\,|d_2|} = \frac{\sum_{i=1}^{M} w_{i1} w_{i2}}{\sqrt{\sum_{i=1}^{M} w_{i1}^2}\,\sqrt{\sum_{i=1}^{M} w_{i2}^2}}

[Figure: the angle θ between vectors d1 and d2]
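As a worked instance of the cosine formula (the term weights here are made up purely for illustration):

```python
import math

# Two toy term-weight vectors over M = 3 terms.
d1 = [2.0, 1.0, 0.0]
d2 = [1.0, 1.0, 1.0]

dot = sum(a * b for a, b in zip(d1, d2))      # Σ w_i1 * w_i2 = 3
norm1 = math.sqrt(sum(a * a for a in d1))     # |d1| = √5
norm2 = math.sqrt(sum(b * b for b in d2))     # |d2| = √3
sim = dot / (norm1 * norm2)                   # 3 / (√5 · √3) = 3 / √15
```

The result lies in [0, 1] for non-negative term weights, with 1 meaning the vectors point in the same direction.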
Document Distance
Consider documents d1, d2 with vectors u1, u2
Their distance is defined as the length of the segment AB, where A and B are the endpoints of the unit vectors on the unit circle:

\mathrm{distance}(d_1, d_2) = 2\sin(\theta/2) = \sqrt{2(1 - \cos\theta)} = \sqrt{2\,(1 - \mathrm{Sim}(d_1, d_2))}
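The chord-length formula above can be checked numerically; a small sketch (the function name is mine):

```python
import math

def chord_distance(sim):
    # distance(d1, d2) = sqrt(2 * (1 - Sim(d1, d2))), valid for unit vectors.
    return math.sqrt(2.0 * (1.0 - sim))

# Geometry checks:
# identical vectors:   θ = 0,   sim = 1   -> distance 0
# orthogonal vectors:  θ = 90°, sim = 0   -> distance √2
# θ = 60°, sim = 0.5                      -> distance 2·sin(30°) = 1
```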
Normalization by Document Length

The longer the document is, the more likely it is for a given term to appear in it
Normalize the term weights by document length (so terms in long documents are not given more weight):

w'_{ij} = \frac{w_{ij}}{\sqrt{\sum_{k=1}^{M} w_{kj}^2}}
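A minimal sketch of this normalization (the function name is mine): each weight is divided by the Euclidean length of the document vector, giving a unit-length vector.

```python
import math

def normalize(weights):
    # Divide each term weight by the document vector's Euclidean length.
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights] if length else weights

# e.g. normalize([3.0, 4.0]) yields a unit vector proportional to (3, 4)
```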
Evaluation of Cluster Quality
Clusters can be evaluated using internal or external knowledge
Internal measures: intra-cluster cohesion and cluster separability
intra-cluster similarity
inter-cluster similarity
External measures: quality of clusters compared to real classes
Entropy (E), Harmonic Mean (F)
Intra-Cluster Similarity

A measure of cluster cohesion
Defined as the average pair-wise similarity of documents in a cluster:

S_{avg} = \frac{1}{|S|^2} \sum_{d \in S} \sum_{d' \in S} \mathrm{sim}(d, d') = \frac{1}{|S|^2} \sum_{d \in S} \sum_{d' \in S} d \cdot d' = c \cdot c = |c|^2

where c = \frac{1}{|S|} \sum_{d \in S} d is the cluster centroid
Documents (not centroids) have unit length
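The identity above (average pairwise similarity equals the squared centroid length, for unit-length documents) can be verified directly; a sketch with made-up vectors:

```python
import math

def unit(v):
    # Scale a vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def intra_sim(cluster):
    # Average dot-product similarity over all ordered pairs (d, d'),
    # including d paired with itself, as in the 1/|S|^2 double sum.
    s = 0.0
    for d in cluster:
        for e in cluster:
            s += sum(a * b for a, b in zip(d, e))
    return s / len(cluster) ** 2

docs = [unit([1.0, 0.2]), unit([0.9, 0.3]), unit([1.0, 0.0])]
c = [sum(d[i] for d in docs) / len(docs) for i in range(2)]
# For unit-length documents, intra_sim(docs) equals |c|^2.
```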
Inter Cluster Similarity
a) Single Link: similarity of the two most similar members

\mathrm{sim}(S, S') = \max\{\, \mathrm{sim}(c_i, c_j) : c_i \in S,\; c_j \in S' \,\}

b) Complete Link: similarity of the two least similar members

\mathrm{sim}(S, S') = \min\{\, \mathrm{sim}(c_i, c_j) : c_i \in S,\; c_j \in S' \,\}

c) Group Average: average similarity between members

\mathrm{sim}(S, S') = \frac{1}{|S|\,|S'|} \sum_{d \in S} \sum_{d' \in S'} \mathrm{sim}(d, d')
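The three inter-cluster measures translate almost literally into code; a sketch using cosine similarity (function names are mine):

```python
import math

def cos(u, v):
    # Cosine similarity between two vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def single_link(S, T):
    # Similarity of the two most similar members.
    return max(cos(u, v) for u in S for v in T)

def complete_link(S, T):
    # Similarity of the two least similar members.
    return min(cos(u, v) for u in S for v in T)

def group_average(S, T):
    # Average similarity over all cross-cluster pairs.
    return sum(cos(u, v) for u in S for v in T) / (len(S) * len(T))
```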
Example
[Figure: clusters S and S' with centroids c and c'; arrows mark the single-link, complete-link, and group-average pairs]
Entropy
Measures the quality of flat clusters using external knowledge
Pre-existing classification
Assessment by experts
P_{ij}: the probability that a member of cluster j belongs to class i
The entropy of cluster j is defined as E_j = -\sum_i P_{ij} \log P_{ij}
Entropy (con't)

Total entropy for all clusters:

E = \sum_{j=1}^{m} \frac{n_j}{N} E_j

where n_j is the size of cluster j, m is the number of clusters, and N is the number of instances
The smaller the value of E, the better the quality of the algorithm
The best entropy is obtained when each cluster contains exactly one instance
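Both definitions fit in a few lines; a sketch (function names and the toy labels are mine), where a pure cluster has entropy 0 and a 50/50 mixed cluster has entropy ln 2:

```python
import math

def cluster_entropy(labels_in_cluster):
    # labels_in_cluster: ground-truth class labels of one cluster's members.
    n = len(labels_in_cluster)
    ent = 0.0
    for cls in set(labels_in_cluster):
        p = labels_in_cluster.count(cls) / n   # P_ij estimated by frequency
        ent -= p * math.log(p)
    return ent

def total_entropy(clusters):
    # Weighted sum E = Σ (n_j / N) * E_j over all clusters.
    N = sum(len(c) for c in clusters)
    return sum(len(c) / N * cluster_entropy(c) for c in clusters)
```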
Harmonic Mean (F)
Treats each cluster as a query result
F combines precision (P) and recall (R)
F_{ij} for cluster j and class i is defined as

F_{ij} = \frac{2}{\frac{1}{P_{ij}} + \frac{1}{R_{ij}}}, \quad where \quad P_{ij} = \frac{n_{ij}}{n_j}, \quad R_{ij} = \frac{n_{ij}}{n_i}

n_{ij}: number of instances of class i in cluster j
n_i: number of instances of class i
n_j: number of instances of cluster j
Harmonic Mean (con’t)
The F value of any class i is the maximum value it achieves over all j:

F_i = \max_j F_{ij}

The F value of a clustering solution is computed as the weighted average over all classes:

F = \sum_{i} \frac{n_i}{N} F_i

where N is the number of data instances
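Putting the F-measure definitions together (the function name and toy inputs are mine; note that 2 / (1/P + 1/R) equals 2PR / (P + R)):

```python
def f_measure(classes, clusters):
    # classes: dict mapping class label -> set of instance ids (ground truth)
    # clusters: list of sets of instance ids (the clustering solution)
    N = sum(len(members) for members in classes.values())
    total = 0.0
    for label, members in classes.items():
        best = 0.0                       # F_i = max over clusters j of F_ij
        for cl in clusters:
            nij = len(members & cl)
            if nij == 0:
                continue
            p = nij / len(cl)            # precision P_ij = n_ij / n_j
            r = nij / len(members)       # recall    R_ij = n_ij / n_i
            best = max(best, 2 * p * r / (p + r))
        total += len(members) / N * best # weighted average over classes
    return total
```

A perfect clustering scores 1.0; merging everything into one cluster is penalized through precision.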
Quality of Clustering

A good clustering method:
Maximizes intra-cluster similarity
Minimizes inter-cluster similarity
Minimizes Entropy
Maximizes Harmonic Mean
Difficult to achieve all of them simultaneously
Maximize some objective function of the above
An algorithm is better than another if it has better values on most of these measures
K-means Algorithm
Select K centroids
Repeat I times or until the centroids do not change:
Assign each instance to the cluster represented by its nearest centroid
Compute new centroids
Reassign instances, compute new centroids, ...
Generates a flat partition of K clusters
K is the desired number of clusters and must be known in advance
Starts with K random cluster centroids
A centroid is the mean or the median of a group of instances
The mean rarely corresponds to a real instance
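The steps above can be sketched as plain K-means with Euclidean distance (all names and the toy points are illustrative, not from the slides):

```python
import math
import random

def kmeans(points, K, iters=10, seed=0):
    # Start from K randomly chosen instances as centroids.
    rng = random.Random(seed)
    centroids = rng.sample(points, K)
    for _ in range(iters):
        # Assign each instance to its nearest centroid.
        clusters = [[] for _ in range(K)]
        for p in points:
            j = min(range(K),
                    key=lambda k: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[k])))
            clusters[j].append(p)
        # Recompute centroids as cluster means (keep old one if empty).
        new = [[sum(p[i] for p in c) / len(c) for i in range(len(points[0]))]
               if c else centroids[k]
               for k, c in enumerate(clusters)]
        if new == centroids:   # centroids did not change: converged
            break
        centroids = new
    return centroids, clusters

pts = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
cents, cls = kmeans(pts, 2)
```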
Comments on K-Means (2)
Up to I=10 iterations
Keep the clustering that resulted in the best inter/intra similarity, or the final clusters after I iterations
Complexity O(IKN)
A repeated application of K-Means for K = 2, 4, ... can produce a hierarchical clustering
Choosing Centroids for K-means
Quality of clustering depends on the selection of initial centroids
Random selection may result in poor convergence rate, or convergence to sub-optimal clusterings.
Select good initial centroids using a heuristic or the results of another method
Buckshot algorithm
Incremental K-Means
Update each centroid as soon as a point is assigned to its cluster, rather than at the end of the iteration
Reassign instances to clusters at the end of each iteration
Converges faster than simple K-means
Usually 2-5 iterations
Bisecting K-Means
Starts with a single cluster containing all instances
Select a cluster to split: the larger cluster, or the cluster with the least intra-cluster similarity
The selected cluster is split into 2 partitions using K-means (K=2)
Repeat up to the desired depth h
Hierarchical clustering
Complexity O(2hN)
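A compact sketch of the bisecting loop (1-D points for brevity; `split_cluster` is a simplified stand-in for a K=2 K-means run, and all names are mine):

```python
def split_cluster(points):
    # Toy K=2 split: points below/above the mean (a stand-in for K-means).
    m = sum(points) / len(points)
    left = [p for p in points if p < m]
    right = [p for p in points if p >= m]
    return (left, right) if left and right else (points[:1], points[1:])

def bisecting_kmeans(points, depth):
    clusters = [sorted(points)]
    for _ in range(depth):
        # Select the largest cluster and split it in two.
        big = max(clusters, key=len)
        if len(big) < 2:
            break
        clusters.remove(big)
        clusters.extend(split_cluster(big))
    return clusters

# bisecting_kmeans([1, 2, 9, 10, 20], 2) separates low, middle, high values.
```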
Agglomerative Clustering
Compute the similarity matrix between all pairs of instances
Start from singleton clusters
Repeat until a single cluster remains:
Merge the two most similar clusters
Replace them with a single cluster
Replace the merged clusters in the matrix and update the similarity matrix
Complexity O(N²)
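A naive sketch of the merge loop, using single-link similarity between clusters (names and the toy 1-D data are mine; a real implementation would maintain the similarity matrix instead of recomputing it):

```python
def agglomerative(items, sim, merges):
    # Start from singleton clusters; repeatedly merge the most similar pair.
    clusters = [[x] for x in items]
    for _ in range(merges):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link: similarity of the two most similar members.
                s = max(sim(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# With sim(a, b) = -abs(a - b), numerically close points merge first.
points = [1.0, 1.1, 5.0, 5.2]
result = agglomerative(points, lambda a, b: -abs(a - b), merges=2)
```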
Similarity Matrix
        C1=d1   C2=d2   …   CN=dN
C1=d1     1      0.8    …    0.3
C2=d2    0.8      1     …    0.6
…         …       …     1     …
CN=dN    0.3     0.6    …     1
Update Similarity Matrix
        C1=d1   C2=d2   …   CN=dN
C1=d1     1      0.8    …    0.3   ← merged
C2=d2    0.8      1     …    0.6   ← merged
…         …       …     1     …
CN=dN    0.3     0.6    …     1
New Similarity Matrix
             C12={d1,d2}   …   CN=dN
C12={d1,d2}       1        …    0.4
…                 …        1     …
CN=dN            0.4       …     1
Single Link
Select the most similar clusters for merging using single link:

\mathrm{sim}(S, S') = \max\{\, \mathrm{sim}(c_i, c_j) : c_i \in S,\; c_j \in S' \,\}

Can result in long and thin clusters due to the "chaining effect"
Appropriate in some domains, such as clustering islands
Complete Link
Select the most similar clusters for merging using complete link:

\mathrm{sim}(S, S') = \min\{\, \mathrm{sim}(c_i, c_j) : c_i \in S,\; c_j \in S' \,\}

Results in compact, spherical clusters that are preferable
Group Average
Select the most similar clusters for merging using group average:

\mathrm{sim}(S, S') = \frac{1}{|S|\,|S'|} \sum_{d \in S} \sum_{d' \in S'} \mathrm{sim}(d, d')

A fast compromise between single and complete link
Example
[Figure: clusters A and B with centroids c1 and c2; arrows mark the single-link, complete-link, and group-average pairs]
Inter Cluster Similarity
A new cluster is represented by its centroid:

c = \frac{1}{|S|} \sum_{d \in S} d

The document-to-cluster similarity is computed as \mathrm{sim}(d, c) = d \cdot c
The cluster-to-cluster similarity can be computed as single, complete, or group-average similarity
Buckshot K-Means
Combines Agglomerative and K-Means
Agglomerative results in a good clustering solution but has O(N²) complexity
Randomly select a sample of √N instances
Apply Agglomerative on the sample, which takes O(N) time
Take the centroids of the resulting clusters as input to K-Means
Overall complexity is O(N)
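The wiring of the two stages can be sketched as follows (1-D points for brevity; `cluster_sample` and `refine` stand in for the Agglomerative and K-Means routines described above, and all names are mine):

```python
import math
import random

def buckshot(points, K, cluster_sample, refine):
    # 1. Draw a random sample of about √N instances.
    rng = random.Random(0)
    n = max(K, int(math.sqrt(len(points))))
    sample = rng.sample(points, n)
    # 2. Cluster the sample (agglomerative on √N points costs O(N)).
    seed_clusters = cluster_sample(sample, K)
    # 3. Use the sample-cluster centroids as initial centroids for K-Means.
    centroids = [sum(c) / len(c) for c in seed_clusters]
    # 4. Refine over the full data set.
    return refine(points, centroids)
```

The point of the construction is that the quadratic step only ever sees the sample, so the overall cost stays linear in N.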
Example
[Figure: hierarchical clustering of the sample (points 1–15); the cluster centroids at the cut level become the initial centroids for K-Means]
More on Clustering
Sound methods, based on the document-to-document similarity matrix
graph-theoretic methods
O(N²) time
Iterative methods, operating directly on the document vectors
O(N log N), O(N²/log N), O(mN) time
Soft Clustering
Hard clustering: each instance belongs to exactly one cluster
Does not allow for uncertainty
An instance may belong to two or more clusters
Soft clustering is based on probabilities that an instance belongs to each of a set of clusters
The probabilities over all categories must sum to 1
Expectation Maximization (EM) is the most popular approach
More Methods
Two documents with similarity > T (threshold) are connected with an edge [Duda&Hart73]
clusters: the connected components (maximal cliques) of the resulting graph
problem: selection of appropriate threshold T
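The threshold-graph method translates directly into code: connect documents whose similarity exceeds T, then take connected components as clusters. A sketch (names and the toy matrix are mine):

```python
def threshold_clusters(sim_matrix, T):
    # Connect documents i, j whenever sim(i, j) > T, then return the
    # connected components of the resulting graph as clusters.
    n = len(sim_matrix)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = [], [start]
        seen.add(start)
        while stack:                       # depth-first traversal
            i = stack.pop()
            comp.append(i)
            for j in range(n):
                if j not in seen and sim_matrix[i][j] > T:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(comp))
    return clusters
```

Raising T fragments the graph into more, tighter clusters, which is exactly why choosing T is the hard part.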
Zahn’s method [Zahn71]
Zahn’s method [Zahn71]
1. Find the minimum spanning tree
2. For each doc, delete edges with length l > l_avg
l_avg: the average length of its incident edges
3. Clusters: the connected components of the resulting graph

[Figure: the dashed (inconsistent) edge is deleted]
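A hedged sketch of the three steps on 1-D points (names, the 1-D setting, and the toy data are mine; the deletion rule is read as: an edge is dropped if it is longer than the incident-edge average of either endpoint):

```python
def zahn(points):
    n = len(points)
    dist = lambda i, j: abs(points[i] - points[j])
    # Step 1: minimum spanning tree via Prim's algorithm.
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(((i, j) for i in in_tree
                    for j in range(n) if j not in in_tree),
                   key=lambda e: dist(*e))
        in_tree.add(j)
        edges.append((i, j))
    # Step 2: per-node average incident-edge length; prune long edges.
    incident = {v: [dist(i, j) for i, j in edges if v in (i, j)]
                for v in range(n)}
    avg = {v: sum(ls) / len(ls) for v, ls in incident.items()}
    kept = [(i, j) for i, j in edges if dist(i, j) <= min(avg[i], avg[j])]
    # Step 3: connected components of the pruned graph (union-find).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for i, j in kept:
        parent[find(i)] = find(j)
    comps = {}
    for v in range(n):
        comps.setdefault(find(v), []).append(points[v])
    return sorted(comps.values())
```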
References

"Searching Multimedia Databases by Content",