Course on Data Mining (581550-4)
• Intro/Ass. Rules 24./26.10.
• Episodes 30.10.
• Text Mining 7.11.
• Clustering 14.11.
• KDD Process 21.11.
• Appl./Summary 28.11.
• Home Exam

14.11.2001 Data mining: Clustering 1
• If q = 1, the distance measure is the Manhattan (or city-block) distance:

    d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + ... + |x_{ip} - x_{jp}|

• If q = 2, the distance measure is the Euclidean distance:

    d(i,j) = sqrt( |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + ... + |x_{ip} - x_{jp}|^2 )
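As a sketch (function and variable names are my own, not from the slides), the two special cases of the Minkowski distance can be written in Python as:

```python
def manhattan(x, y):
    """Minkowski distance with q = 1 (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Minkowski distance with q = 2."""
    return sum(abs(a - b) ** 2 for a, b in zip(x, y)) ** 0.5

# Example: two objects described by p = 3 interval-scaled variables
i = [1.0, 2.0, 3.0]
j = [4.0, 6.0, 3.0]
print(manhattan(i, j))  # 7.0
print(euclidean(i, j))  # 5.0
```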
Binary variables (1)
• A binary variable has only two states: 0 or 1
• A contingency table for binary data:

                    Object j
                   1     0     sum
    Object i  1    a     b     a+b
              0    c     d     c+d
            sum   a+c   b+d     p
Binary variables (2)
• Simple matching coefficient (invariant similarity, if the binary variable is symmetric):

    d(i,j) = (b + c) / (a + b + c + d)

• Jaccard coefficient (noninvariant similarity, if the binary variable is asymmetric):

    d(i,j) = (b + c) / (a + b + c)
Binary variables (3)
Example: dissimilarity between binary variables:
• a patient record table
• eight attributes, of which
  o gender is a symmetric attribute, and
  o the remaining attributes are asymmetric binary

    Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
    Jack    M       Y      N      P       N       N       N
    Mary    F       Y      N      P       N       P       N
    Jim     M       Y      P      N       N       N       N
Binary variables (4)
• Let the values Y and P be set to 1, and the value N be set to 0
• Compute distances between patients based on the asymmetric variables by using the Jaccard coefficient:

    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
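The counts a, b, c can be computed directly from the 0/1 vectors; a minimal Python sketch (names are illustrative) reproduces the three distances for the asymmetric attributes (Fever through Test-4, with Y/P mapped to 1 and N to 0):

```python
def jaccard_dissimilarity(x, y):
    """Jaccard dissimilarity for asymmetric binary vectors:
    d = (b + c) / (a + b + c); negative matches d are ignored."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)  # 1-1 matches
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

# Asymmetric attributes only, Y/P -> 1 and N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))   # 0.75
```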
Nominal variables
• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
• Method 1: simple matching
  o m: # of matches, p: total # of variables

      d(i,j) = (p - m) / p

• Method 2: use a large number of binary variables
  o create a new binary variable for each of the M nominal states
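A minimal sketch of Method 1 in Python (the function name is illustrative):

```python
def nominal_dissimilarity(x, y):
    """Simple matching for nominal variables: d(i,j) = (p - m) / p,
    where m is the number of matching variables out of p."""
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)
    return (p - m) / p

# One of three variables differs, so d = 1/3
print(nominal_dissimilarity(["red", "small", "round"],
                            ["red", "large", "round"]))
```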
Ordinal variables
• An ordinal variable can be discrete or continuous
• Order of values is important, e.g., rank
• Can be treated like interval-scaled variables:
  o replace x_{if} by its rank r_{if} ∈ {1, ..., M_f}
  o map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

      z_{if} = (r_{if} - 1) / (M_f - 1)

  o compute the dissimilarity using methods for interval-scaled variables
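A small Python sketch of the rank-based mapping; the explicit `levels` argument is my own assumption, supplying the M_f states in their meaningful order:

```python
def ordinal_to_interval(values, levels):
    """Map ordinal values onto [0, 1] via z_if = (r_if - 1) / (M_f - 1).
    `levels` lists the M_f states in their meaningful order."""
    rank = {v: r + 1 for r, v in enumerate(levels)}  # r_if in {1, ..., M_f}
    M = len(levels)
    return [(rank[v] - 1) / (M - 1) for v in values]

print(ordinal_to_interval(["bronze", "gold", "silver"],
                          levels=["bronze", "silver", "gold"]))
# [0.0, 1.0, 0.5]
```

After this mapping the z values can be fed to any interval-scaled distance, as the slide suggests.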
Ratio-scaled variables
• A positive measurement on a nonlinear scale, approximately at exponential scale
  o for example, Ae^{Bt} or Ae^{-Bt}
• Methods:
  o treat them like interval-scaled variables: not a good choice! (why?)
  o apply a logarithmic transformation: y_{if} = log(x_{if})
  o treat them as continuous ordinal data and treat their ranks as interval-scaled
Variables of mixed types (1)
• A database may contain all the six types of variables
• One may use a weighted formula to combine their effects:

    d(i,j) = ( Σ_{f=1}^{p} δ_{ij}^{(f)} d_{ij}^{(f)} ) / ( Σ_{f=1}^{p} δ_{ij}^{(f)} )

  where δ_{ij}^{(f)} = 0 if x_{if} or x_{jf} is missing, or if x_{if} = x_{jf} = 0 and variable f is asymmetric binary; otherwise δ_{ij}^{(f)} = 1
Variables of mixed types (2)
Contribution of variable f to distance d(i,j):
• if f is binary or nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}; otherwise d_{ij}^{(f)} = 1
• if f is interval-based: use the normalized distance
• if f is ordinal or ratio-scaled:
  o compute the ranks r_{if} and z_{if} = (r_{if} - 1) / (M_f - 1)
  o and treat z_{if} as interval-scaled
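The weighted formula together with these per-variable contributions can be sketched as follows. The function, its type labels, and the `ranges` argument for normalizing interval variables are my own illustrative assumptions; ordinal and ratio-scaled variables are assumed to be pre-mapped as described above and passed as interval-scaled, and binary variables are treated as asymmetric:

```python
def mixed_dissimilarity(x, y, types, ranges=None):
    """d(i,j) = sum_f delta_ij^(f) * d_ij^(f) / sum_f delta_ij^(f).

    types[f] is 'binary', 'nominal', or 'interval'.
    ranges[f] = (min_f, max_f) normalizes interval variables."""
    num = den = 0.0
    for f, (xf, yf) in enumerate(zip(x, y)):
        # delta_ij^(f) = 0: missing value, or asymmetric 0-0 match
        if xf is None or yf is None:
            continue
        if types[f] == 'binary' and xf == 0 and yf == 0:
            continue
        # contribution d_ij^(f)
        if types[f] in ('binary', 'nominal'):
            d = 0.0 if xf == yf else 1.0
        else:
            lo, hi = ranges[f]
            d = abs(xf - yf) / (hi - lo)
        num += d
        den += 1.0
    return num / den

# Illustrative objects: one asymmetric binary, one nominal,
# one interval-scaled variable with range (0, 100)
x = [1, 'red', 30.0]
y = [0, 'red', 50.0]
print(round(mixed_dissimilarity(x, y, ['binary', 'nominal', 'interval'],
                                ranges={2: (0.0, 100.0)}), 2))  # 0.4
```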
Complex data types
• Not all objects considered in data mining are relational => complex types of data
  o examples of such data are spatial data, multimedia data, genetic data, time-series data, text data and data collected from the World-Wide Web
• Such data often require totally different similarity or dissimilarity measures than the ones above
  o this can, for example, mean using string and/or sequence matching, or methods of information retrieval
  o using new dissimilarity measures to deal with categorical objects
  o using a frequency-based method to update the modes of clusters
• A mixture of categorical and numerical data: the k-prototype method
K-medoids clustering method
• Input to the algorithm: the number of clusters k, and a database of n objects
• The algorithm consists of four steps:
  1. arbitrarily choose k objects as the initial medoids (representative objects)
  2. assign each remaining object to the cluster with the nearest medoid
  3. select a nonmedoid and replace one of the medoids with it if this improves the clustering
  4. go back to Step 2; stop when there are no more new assignments
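A minimal PAM-style sketch of these four steps (all names are illustrative; a production implementation would evaluate swaps far more efficiently):

```python
import random

def k_medoids(objects, k, dist, max_iter=100, seed=0):
    """Minimal sketch of the four-step k-medoids algorithm."""
    rng = random.Random(seed)
    # Step 1: arbitrarily choose k objects as the initial medoids
    medoids = rng.sample(range(len(objects)), k)

    def total_cost(meds):
        # Step 2 implicitly: each object joins its nearest medoid
        return sum(min(dist(o, objects[m]) for m in meds) for o in objects)

    best = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        # Step 3: try replacing each medoid with each nonmedoid
        for pos in range(k):
            for h in range(len(objects)):
                if h in medoids:
                    continue
                cand = medoids[:pos] + [h] + medoids[pos + 1:]
                c = total_cost(cand)
                if c < best:
                    medoids, best, improved = cand, c, True
        # Step 4: repeat until no swap improves the clustering
        if not improved:
            break
    assignment = [min(medoids, key=lambda m: dist(o, objects[m]))
                  for o in objects]
    return medoids, assignment

data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9)]
euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
medoids, assignment = k_medoids(data, k=2, dist=euclid)
print(len(set(assignment)))  # 2 clusters
```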
Hierarchical methods
• A hierarchical method constructs a hierarchy of clusterings, not just a single partition of the objects
• The number of clusters k is not required as an input
• Uses a distance matrix as the clustering criterion
• A termination condition can be used (e.g., a number of clusters)
A tree of clusterings
• The hierarchy of clusterings is often given as a clustering tree, also called a dendrogram
  o leaves of the tree represent the individual objects
  o internal nodes of the tree represent the clusters
Two types of hierarchical methods (1)
Two main types of hierarchical clustering techniques:
• agglomerative (bottom-up):
  o place each object in its own cluster (a singleton)
  o in each step, merge the two most similar clusters until there is only one cluster left or the termination condition is satisfied
• divisive (top-down):
  o start with one big cluster containing all the objects
  o divide the most distinctive cluster into smaller clusters and proceed until there are n clusters or the termination condition is satisfied
Two types of hierarchical methods (2)

[Figure: objects a, b, c, d, e. Agglomerative clustering merges them step by step (steps 0 -> 4) into {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e}; divisive clustering traverses the same hierarchy in the opposite direction (steps 4 -> 0)]
Inter-cluster distances
• Three widely used ways of defining the inter-cluster distance, i.e., the distance between two separate clusters C_i and C_j, are
  o single linkage method (nearest neighbor):

      d(i,j) = min_{x ∈ C_i, y ∈ C_j} d(x,y)

  o complete linkage method (furthest neighbor):

      d(i,j) = max_{x ∈ C_i, y ∈ C_j} d(x,y)

  o average linkage method (unweighted pair-group average):

      d(i,j) = avg_{x ∈ C_i, y ∈ C_j} d(x,y)
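The three linkage definitions translate directly into code; this sketch (names illustrative) evaluates them on two small one-dimensional clusters:

```python
def single_link(Ci, Cj, dist):
    """Nearest neighbor: minimum over all cross-cluster pairs."""
    return min(dist(x, y) for x in Ci for y in Cj)

def complete_link(Ci, Cj, dist):
    """Furthest neighbor: maximum over all cross-cluster pairs."""
    return max(dist(x, y) for x in Ci for y in Cj)

def average_link(Ci, Cj, dist):
    """Unweighted pair-group average over all cross-cluster pairs."""
    pairs = [dist(x, y) for x in Ci for y in Cj]
    return sum(pairs) / len(pairs)

d = lambda a, b: abs(a - b)
Ci, Cj = [0.0, 1.0], [4.0, 6.0]
print(single_link(Ci, Cj, d))    # 3.0  (pair 1 and 4)
print(complete_link(Ci, Cj, d))  # 6.0  (pair 0 and 6)
print(average_link(Ci, Cj, d))   # 4.5
```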
Strengths of hierarchical methods
• Conceptually simple
• Theoretical properties are well understood
• When clusters are merged/split, the decision is permanent => the number of different alternatives that need to be examined is reduced
Weaknesses of hierarchical methods
• Merging/splitting of clusters is permanent => erroneous decisions are impossible to correct later
• Divisive methods can be computationally hard
• Methods are not (necessarily) scalable for large data sets
Outlier analysis (1)
• Outliers
  o are objects that are considerably dissimilar from the remainder of the data
  o can be caused by a measurement or execution error, or
  o are the result of inherent data variability
• Many data mining algorithms try
  o to minimize the influence of outliers
  o to eliminate the outliers
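As an illustration of the idea (not an algorithm from the slides), a simple distance-based notion of outlier in the spirit of Knorr and Ng's distance-based outliers, cited in the references, flags objects that have too few neighbors within a given radius; all names and thresholds here are my own:

```python
def distance_based_outliers(objects, dist, radius, min_frac):
    """Flag objects for which fewer than min_frac of the other
    objects lie within the given radius."""
    n = len(objects)
    flags = []
    for i, o in enumerate(objects):
        near = sum(1 for j, q in enumerate(objects)
                   if j != i and dist(o, q) <= radius)
        flags.append(near / (n - 1) < min_frac)
    return flags

data = [1.0, 1.1, 0.9, 1.2, 9.0]
print(distance_based_outliers(data, lambda a, b: abs(a - b),
                              radius=1.0, min_frac=0.5))
# [False, False, False, False, True]: only 9.0 is isolated
```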
Outlier analysis (2)
• Minimizing the effect of outliers and/or eliminating the outliers may cause information loss
• Outliers themselves may be of interest => outlier mining
• Applications of outlier mining:
  o fraud detection
  o customized marketing
  o medical treatments
Summary (1)
• Cluster analysis groups objects based on their similarity
• Cluster analysis has wide applications
• A measure of similarity can be computed for various types of data
• Selection of the similarity measure depends on the data used and the type of similarity we are searching for
Summary (2)
• Clustering algorithms can be categorized into
  o partitioning methods,
  o hierarchical methods,
  o density-based methods,
  o grid-based methods, and
  o model-based methods
• There are still lots of research issues in cluster analysis
Classification of spatial data
• K. Koperski, J. Han, N. Stefanovic: "An Efficient Two-Step Method of Classification of Spatial Data", SDH'98
• K. Lagus, T. Honkela, S. Kaski, T. Kohonen: "Self-organizing Maps of Document Collections: A New Approach to Interactive Exploration", KDD'96
• T. Honkela, S. Kaski, K. Lagus, T. Kohonen: "WEBSOM – Self-Organizing Maps of Document Collections", WSOM'97
Thanks to Jiawei Han from Simon Fraser University for his slides, which greatly helped in preparing this lecture!

Course on Data Mining
References - clustering
• R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98
• M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
• M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
• P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
• M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
• M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
• D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
• D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. In Proc. VLDB’98.
• S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. SIGMOD'98.
• A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
• L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.
• E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98.
• G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.
• R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
• E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
• G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB’98.
• W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
• T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for very large databases. SIGMOD'96.