CLUSTER ANALYSIS
Slides prepared by:
Leilani Nora, Assistant Scientist
and Violeta Bartolome, Senior Associate Scientist
PBGB-CRIL

WHAT IS CLUSTER ANALYSIS?
An exploratory technique which may be used to search for category structure based on natural groupings in the data, or to reduce a very large body of data to a relatively compact description.

CLUSTER ANALYSIS vs. DISCRIMINANT ANALYSIS
Discriminant analysis: groups are prespecified.
Cluster analysis: categorization is based on the data.
CLUSTER ANALYSIS
No assumptions are made concerning the number of groups or the group structure. Grouping is done on the basis of similarities or distances (dissimilarities). There are various techniques for doing this, which may give different results; thus the researcher should consider the validity of the clusters found.
ILLUSTRATION
Consider sorting the 16 face cards of an ordinary deck into clusters of similar objects. Cards may be classified by color, suit, face value, etc.
GROUPING FACE CARDS
[Figure: the 16 face cards (A, K, Q, J of each suit) arranged to show six ways of grouping them]
(a) individual cards
(b) individual suits
(c) black and red suits
(d) major and minor suits
(e) hearts plus queen of spades, and other suits
(f) like face cards

TWO KINDS OF CLUSTER ANALYSIS
Hierarchical - involves the construction of a tree-like structure (dendrogram).
Non-hierarchical - designed to group objects into a collection of k clusters; k may either be specified in advance or determined as part of the clustering procedure.
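A minimal R sketch of the contrast between the two kinds, assuming a numeric data matrix X:
> hc <- hclust(dist(X))        # hierarchical: builds a full dendrogram, no k required
> plot(hc)                     # inspect the tree, choose groups afterwards
> km <- kmeans(X, centers=3)   # non-hierarchical: k clusters requested up front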
HIERARCHICAL CLUSTER ANALYSIS
Agglomerative - starts with individual objects. The most similar objects are grouped first; eventually, as similarity decreases, all subgroups are fused into a single cluster.
Divisive - an initial single group of objects is divided into two dissimilar subgroups. These subgroups are then further divided into dissimilar subgroups; the process continues until there are as many subgroups as objects. (Will not be covered in this course.)

DATA REQUIREMENT
Nominal - numbers or symbols that do not imply any ordering, e.g., color, sex, presence or absence.
Ordinal - numbers or symbols for which the relation > holds, e.g., damage score (1, 3, 5, 7, 9), size (small, medium, large).
Interval - numbers that have the characteristics of ordinal data and, in addition, the distances between any two numbers on the scale are known, e.g., scholastic grades, income.
Ratio - interval data with a meaningful zero point, e.g., yield, plant height, tiller number.

DATA REQUIREMENT: WARNING
Practically, data on all scales are amenable to cluster analysis. The measurement scale of the data affects the manner in which the resemblance coefficients are computed, but the overall clustering procedure is unaffected.

STEPS IN PERFORMING HIERARCHICAL CLUSTER ANALYSIS
1. Obtain the data matrix
2. Standardize the data matrix, if need be
3. Generate the resemblance or distance matrix
4. Execute the clustering method
DATAFRAME: Ratio_agro.csv

STEPS IN PERFORMING HIERARCHICAL CLUSTER ANALYSIS
Step 1. Obtain the data matrix Ratio_agro.csv
> Ratio <- read.csv("Ratio_agro.csv", row.names=1)   # genotype names assumed to be in column 1

DENDROGRAM : plot()
> denAVE <- as.dendrogram(AVE)   # AVE is the hclust object from Steps 2-4; see the sketch below
> par(cex=0.8, mar=c(3,2,2,2), cex.axis=0.8)
> plot(denAVE, horiz=FALSE, center=TRUE, nodePar=list(lab.cex=0.6, lab.col="forest green", pch=NA), main="Dendrogram using ALINK Clustering Method")
> rect.hclust(tree=AVE, k=4, border=c("red", "blue", "green", "purple"))
AGGLOMERATIVE METHOD : agnes()
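agnes() in the cluster package performs agglomerative clustering like hclust() and additionally reports the agglomerative coefficient; a minimal sketch, assuming Euclidean distance and average linkage:
> library(cluster)
> AGNE <- agnes(Ratio, metric="euclidean", stand=TRUE, method="average")
> AGNE$ac     # agglomerative coefficient
> plot(AGNE)  # banner plot and dendrogram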
CUTTING A CLUSTER: cutree()
Cuts a tree into several groups, either by specifying the desired number of groups or the cut heights.
> cutree(tree, k, h)
# tree a tree as produced by hclust
# k an integer scalar or vector with the desired number of groups
# h numeric scalar or vector with the desired cut heights
CUTTING A CLUSTER: cutree()
> AVE2 <- cutree(AVE, k=3)
> table(AVE2)
AVE2
 1  2  3
46  3  2
(top row: cluster number; bottom row: number of units)
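The labels returned by cutree() can be used to list the members of a cluster; a sketch, assuming the genotype names are the row names of Ratio:
> rownames(Ratio)[AVE2 == 2]   # units assigned to cluster 2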
COPHENETIC CORRELATION
Can be used as a measure of the goodness of fit of a particular dendrogram:
$$ r_{coph} = \frac{\sum_{i<j}^{n} (d_{ij} - \bar{d})(h_{ij} - \bar{h})}{\sqrt{\left[\sum_{i<j}^{n} (d_{ij} - \bar{d})^{2}\right]\left[\sum_{i<j}^{n} (h_{ij} - \bar{h})^{2}\right]}} $$

where d_ij is the original distance between objects i and j, h_ij is the cophenetic distance (the height at which i and j first join in the dendrogram), and \bar{d} and \bar{h} are the respective means over all pairs i < j.
Use hclust() and cophenetic()
COPHENETIC CORRELATION : cophenetic()
Computes the cophenetic distances for a hierarchical clustering.
> cophenetic(x)
# x an R object representing a hierarchical clustering
> AVECOP <- cophenetic(AVE)
> cor(Rdist, AVECOP)
[1] 0.8069435
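Since it measures goodness of fit, the cophenetic correlation can also be used to compare linkage methods; a sketch using the Rdist matrix from earlier:
> sapply(c("single", "complete", "average"), function(m)
+     cor(Rdist, cophenetic(hclust(Rdist, method=m))))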
K-MEANS CLUSTERING
A different method of clustering, aimed at finding more homogeneous subgroups within the data. Given k starting points, the data are classified, the centroids are recalculated, and the process iterates until stable (see the sketch below).
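A toy sketch of one k-means iteration (classify each point to its nearest centroid, then recompute the centroids), assuming a numeric matrix X and an initial matrix of k centers; the hypothetical helper km.step() would be called repeatedly until the assignments stop changing:
> km.step <- function(X, centers) {
+     k <- nrow(centers)
+     d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]     # point-to-centroid distances
+     cl <- apply(d, 1, which.min)                             # classify to nearest centroid
+     list(cluster=cl, centers=apply(X, 2, tapply, cl, mean))  # recompute centroids as cluster means
+ }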
K-MEANS CLUSTERING: kmeans()
Performs k-means clustering on a data matrix.
> kmeans(x, centers, algorithm)
# x a numeric matrix of data, or an object that can be coerced to such a matrix
# centers the number of clusters, or a set of initial distinct cluster centers
# algorithm character string to specify the algorithm used in the clustering method; the default algorithm is Hartigan-Wong
K-MEANS CLUSTERING: kmeans()
> KMRatio <- kmeans(Ratio, centers=3)   # number of clusters assumed, as in the cutree example
> names(KMRatio)
[1] "cluster" "size" "centers" "withinss"

Select cluster 1
> grp1 <- names(KMRatio$cluster[KMRatio$cluster == 1])
> grp1
[1] "08R144" "08R153" "08R160" "08R164" "08R168" "08R169"
K-MEANS CLUSTERING: plot()
> plot(Ratio, col=KMRatio$cluster, pch=KMRatio$cluster)
K-MEANS CLUSTERING: plot()
> plot(prcomp(Ratio, center=TRUE)$x[, c(1, 2, 3)], col=KMRatio$cluster, pch=KMRatio$cluster)
R APPLICATION: DIFFERENT LEVELS OF DATA

DATAFRAME: Flower.csv
Data with 8 characteristics for 18 popular flowers:
V1 asymmetric binary, indicates whether the plant may be left in the garden when it freezes
V2 binary, shows whether the plant needs to stand in the shadow
V3 asymmetric binary, distinguishes between plants with tubers and plants that grow in any other way
V4 nominal, specifies the flower's color
V5 ordinal, indicates whether the plant grows in dry, normal, or wet soil
V6 ordinal, preference ranking
V7 ratio, plant height in cm
V8 ratio, distance in cm between plants
DATAFRAME: Flower.csv
Read data file Flower.csv
> FLWR <- read.csv("Flower.csv")
> str(FLWR)
'data.frame': 18 obs. of 8 variables:
 $ V1: int 0 1 0 0 0 0 0 0 1 1 ...
 $ V2: int 1 0 1 0 1 1 0 0 1 1 ...
 $ V3: int 1 0 0 1 0 0 0 1 0 0 ...
 $ V4: int 4 2 3 4 5 4 4 2 3 5 ...
 $ V5: int 3 1 3 2 2 3 3 2 1 2 ...
 $ V6: int 15 3 1 16 2 12 13 7 4 14 ...
 $ V7: int 25 150 150 125 20 50 40 ...
 $ V8: int 15 50 50 50 15 40 20 15 ...
DATAFRAME: Flower.csv
Convert V1-V4 from integer to factor
> FLWR$V1 <- factor(FLWR$V1)
> FLWR$V2 <- factor(FLWR$V2)
> FLWR$V3 <- factor(FLWR$V3)
> FLWR$V4 <- factor(FLWR$V4)

Compute the dissimilarity matrix df1 and the clustering object AGN.FLWR (see the sketch below), then plot:
> plot(AGN.FLWR, hang=-1)
> rect.hclust(tree=AGN.FLWR, h=0.3, border=c("red", "blue"))
HIERARCHICAL CLUSTERING WITH P-VALUES VIA MULTISCALE BOOTSTRAP RESAMPLING

Package pvclust
An R package for assessing the uncertainty in hierarchical cluster analysis. P-values are calculated via multiscale bootstrap resampling; each p-value lies between 0 and 1 and indicates how strongly the cluster is supported by the data. Two types of p-values:
1. Approximately Unbiased (AU)
2. Bootstrap Probability (BP)
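pvclust is a contributed package and must be installed and loaded before use:
> install.packages("pvclust")   # once, if not yet installed
> library(pvclust)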
MULTISCALE BOOTSTRAP RESAMPLING
A method which assesses the accuracy of the clusters by calculating p-values from resampled data. For a cluster with p-value > 0.95, reject H0 (the cluster does not exist); thus, the cluster exists. The AU p-value is an approximation, but it is less biased than BP.
STEPS IN PERFORMING HIERARCHICAL CLUSTERING WITH P-VALUES
1. Perform hierarchical clustering with p-values via multiscale bootstrap resampling
2. Diagnostic: identify clusters with extremely high standard error values by examining the plot
3. Obtain the estimated values of the clusters in step 2 and evaluate the result
4. Apply a remedial measure whenever needed; this requires a large bootstrap sample size
DATAFRAME: lung.csv DNA Microarray data of lung tumors, where
rows correspond to genes and columns are individuals.
DATAFRAME: lung.csv
Read data file lung.csv
> LUNG <- read.csv("lung.csv")
> str(LUNG)
'data.frame': 916 obs. of 73 variables:
 $ X1: num -0.4 -2.22 -1.35 0.68 NA ...
 $ X2: num 4.28 5.21 -0.84 0.56 4.14 ...
 $ X3: num 3.68 4.75 -2.88 -0.45 3.58 ...
 $ X4: num -1.35 -0.91 3.35 -0.2 -0.4 ...
 $ X5: num -1.74 -0.33 3.02 1.14 ...
 $ X6: num 2.2 2.56 -4.48 0.22 1.59 ...
 ...
P-VALUES FOR HIERARCHICAL CLUSTERING : pvclust()
Performs hierarchical cluster analysis via the function hclust and calculates p-values for all the clusters contained in the clustering of the original data, via multiscale bootstrap resampling.
> pvclust(data, method.hclust, method.dist, nboot, r)
# data a numeric matrix, or a dataframe
# method.hclust the agglomerative method used in hierarchical clustering; the same methods as the method argument of hclust()
# method.dist distance measure to be used, such as correlation, uncentered, or abscor, or the same methods as in dist()
# nboot the number of bootstrap replications; default is 1000
# r numeric vector of relative sample sizes for the multiscale bootstrap

FIND CLUSTERS WITH HIGH P-VALUES: pvrect()
> pvrect(x, pv="au", alpha=0.95)
# x object of class pvclust
# alpha threshold value for p-values
# pv the p-value to be used, either au or bp

PRINT CLUSTERS WITH HIGH P-VALUES: pvpick()
> pvpick(x, pv="au", alpha=0.95)
# x object of class pvclust
# alpha threshold value for p-values
# pv the p-value to be used, either au or bp
P-VALUES FOR HIERARCHICAL CLUSTERING : pvclust()
> lung.pv <- pvclust(LUNG, method.hclust="average", method.dist="correlation", nboot=1000)
> windows(17, 10)
> plot(lung.pv, hang=-1, ylim=c(0, 1.2))
> pvrect(lung.pv, alpha=0.95)
> pvpick(lung.pv, alpha=0.95)
DENDROGRAM : plot()
[Figure: dendrogram with AU/BP p-values; pvrect() boxes the clusters with high p-values and pvpick() lists them]
DIAGNOSTIC PLOT FOR STANDARD ERROR OF P-VALUE : seplot()
Draws a diagnostic plot of the standard error of the p-values for a pvclust object.
> seplot(object, type, identify=FALSE)
# object object of class pvclust
# type the type of p-value to be plotted, either au or bp
# identify logical value to specify whether edge numbers can be identified interactively

DIAGNOSTIC PLOT FOR STANDARD ERROR OF P-VALUE : seplot()
> seplot(lung.pv, identify=TRUE)
PRINT VALUES : print()
A generic function, which means that new printing methods can easily be added for new classes.
> print(x, ...)
# x an object used to select a method for printing

Print the results for clusters 21, 65, and 67
> print(lung.pv, which=c(21, 65, 67))
REMEDIAL MEASURE
Increase the number of bootstrap replications specified in nboot.
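A sketch of the remedial measure, rerunning pvclust() with a larger (here hypothetical) nboot:
> lung.pv2 <- pvclust(LUNG, method.hclust="average", method.dist="correlation", nboot=10000)
> plot(lung.pv2, hang=-1)
> pvrect(lung.pv2, alpha=0.95)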
Thank you!