CLUSTER ANALYSIS
Slides prepared by:
Leilani Nora, Assistant Scientist
and Violeta Bartolome, Senior Associate Scientist
PBGB-CRIL

WHAT IS CLUSTER ANALYSIS?
An exploratory technique which may be used to search for category structure based on natural groupings in the data, or to reduce a very large body of data to a relatively compact description.

CLUSTER ANALYSIS vs. DISCRIMINANT ANALYSIS
Discriminant analysis: groups are prespecified.
Cluster analysis: categorization is based on the data.
CLUSTER ANALYSIS
No assumptions are made concerning the number of groups or the group structure. Grouping is done on the basis of similarities or distances (dissimilarities). There are various techniques for doing this, which may give different results; thus the researcher should consider the validity of the clusters found.
ILLUSTRATION
Consider sorting the 16 face cards of an ordinary deck into clusters of similar objects. Cards may be classified by color, suit, face value, etc.
GROUPING FACE CARDS
[Figure: the 16 face cards (A, K, Q, J of each suit) arranged to show six ways of grouping them]
(a) individual cards
(b) individual suits
(c) black and red suits
(d) major and minor suits
(e) hearts plus queen of spades, and other suits
(f) like face cards

TWO KINDS OF CLUSTER ANALYSIS
Hierarchical - involves the construction of a tree-like structure (dendrogram).
Non-hierarchical - designed to group objects into a collection of k clusters; k may either be specified in advance or determined as part of the clustering procedure.
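A minimal R sketch of the contrast between the two kinds, assuming a numeric data matrix X:
> hc <- hclust(dist(X))        # hierarchical: builds a full dendrogram, no k required
> plot(hc)                     # inspect the tree, choose groups afterwards
> km <- kmeans(X, centers=3)   # non-hierarchical: k clusters requested up front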
HIERARCHICAL CLUSTER ANALYSIS
Agglomerative - starts with individual objects. The most similar objects are grouped first; eventually, as similarity decreases, all subgroups are fused into a single cluster.
Divisive - an initial single group of objects is divided into two dissimilar subgroups. These subgroups are then further divided into dissimilar subgroups; the process continues until there are as many subgroups as objects. (Will not be covered in this course.)

DATA REQUIREMENT
Nominal - numbers or symbols that do not imply any ordering, e.g., color, sex, presence or absence.
Ordinal - numbers or symbols for which the relation > holds, e.g., damage score (1, 3, 5, 7, 9), size (small, medium, large).
Interval - numbers that have the characteristics of ordinal data and, in addition, the distances between any two numbers on the scale are known, e.g., scholastic grades, income.
Ratio - interval data with a meaningful zero point, e.g., yield, plant height, tiller number.

DATA REQUIREMENT: WARNING
Practically, data on all scales are amenable to cluster analysis. The measurement scale of the data affects the manner in which the resemblance coefficients are computed, but the overall clustering procedure is unaffected.

STEPS IN PERFORMING HIERARCHICAL CLUSTER ANALYSIS
1. Obtain the data matrix
2. Standardize the data matrix, if need be
3. Generate the resemblance or distance matrix
4. Execute the clustering method
DATAFRAME: Ratio_agro.csv

STEPS IN PERFORMING HIERARCHICAL CLUSTER ANALYSIS
Step 1. Obtain the data matrix Ratio_agro.csv
> Ratio <- read.csv("Ratio_agro.csv", row.names=1)   # genotype names assumed to be in column 1

DENDROGRAM : plot()
> denAVE <- as.dendrogram(AVE)   # AVE is the hclust object from Steps 2-4; see the sketch below
> par(cex=0.8, mar=c(3,2,2,2), cex.axis=0.8)
> plot(denAVE, horiz=FALSE, center=TRUE, nodePar=list(lab.cex=0.6, lab.col="forest green", pch=NA), main="Dendrogram using ALINK Clustering Method")
> rect.hclust(tree=AVE, k=4, border=c("red", "blue", "green", "purple"))
AGGLOMERATIVE METHOD : agnes()
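agnes() in the cluster package performs agglomerative clustering like hclust() and additionally reports the agglomerative coefficient; a minimal sketch, assuming Euclidean distance and average linkage:
> library(cluster)
> AGNE <- agnes(Ratio, metric="euclidean", stand=TRUE, method="average")
> AGNE$ac     # agglomerative coefficient
> plot(AGNE)  # banner plot and dendrogram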
CUTTING A CLUSTER: cutree()
Cuts a tree into several groups, either by specifying the desired number of groups or the cut heights.
> cutree(tree, k, h)
# tree a tree as produced by hclust
# k an integer scalar or vector with the desired number of groups
# h numeric scalar or vector with the desired cut heights
CUTTING A CLUSTER: cutree()
> AVE2 <- cutree(AVE, k=3)
> table(AVE2)
AVE2
 1  2  3
46  3  2
(top row: cluster number; bottom row: number of units)
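The labels returned by cutree() can be used to list the members of a cluster; a sketch, assuming the genotype names are the row names of Ratio:
> rownames(Ratio)[AVE2 == 2]   # units assigned to cluster 2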
COPHENETIC CORRELATION
Can be used as a measure of the goodness of fit of a particular dendrogram:
$$ r_{coph} = \frac{\sum_{i<j}^{n} (d_{ij} - \bar{d})(h_{ij} - \bar{h})}{\sqrt{\left[\sum_{i<j}^{n} (d_{ij} - \bar{d})^{2}\right]\left[\sum_{i<j}^{n} (h_{ij} - \bar{h})^{2}\right]}} $$

where d_ij is the original distance between objects i and j, h_ij is the cophenetic distance (the height at which i and j first join in the dendrogram), and \bar{d} and \bar{h} are the respective means over all pairs i < j.
Use hclust() and cophenetic()
COPHENETIC CORRELATION : cophenetic()
Computes the cophenetic distances for a hierarchical clustering.
> cophenetic(x)
# x an R object representing a hierarchical clustering
> AVECOP <- cophenetic(AVE)
> cor(Rdist, AVECOP)
[1] 0.8069435
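Since it measures goodness of fit, the cophenetic correlation can also be used to compare linkage methods; a sketch using the Rdist matrix from earlier:
> sapply(c("single", "complete", "average"), function(m)
+     cor(Rdist, cophenetic(hclust(Rdist, method=m))))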
K-MEANS CLUSTERING
A different method of clustering, aimed at finding more homogeneous subgroups within the data. Given k starting points, the data are classified, the centroids are recalculated, and the process iterates until stable (see the sketch below).
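A toy sketch of one k-means iteration (classify each point to its nearest centroid, then recompute the centroids), assuming a numeric matrix X and an initial matrix of k centers; the hypothetical helper km.step() would be called repeatedly until the assignments stop changing:
> km.step <- function(X, centers) {
+     k <- nrow(centers)
+     d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]     # point-to-centroid distances
+     cl <- apply(d, 1, which.min)                             # classify to nearest centroid
+     list(cluster=cl, centers=apply(X, 2, tapply, cl, mean))  # recompute centroids as cluster means
+ }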
K-MEANS CLUSTERING: kmeans()
Performs k-means clustering on a data matrix.
> kmeans(x, centers, algorithm)
# x a numeric matrix of data, or an object that can be coerced to such a matrix
# centers the number of clusters, or a set of initial distinct cluster centers
# algorithm character string to specify the algorithm used in the clustering method; the default algorithm is Hartigan-Wong
K-MEANS CLUSTERING: kmeans()
> KMRatio <- kmeans(Ratio, centers=3)   # number of clusters assumed, as in the cutree example
> names(KMRatio)
[1] "cluster" "size" "centers" "withinss"

Select cluster 1
> grp1 <- names(KMRatio$cluster[KMRatio$cluster == 1])
> grp1
[1] "08R144" "08R153" "08R160" "08R164" "08R168" "08R169"
K-MEANS CLUSTERING: plot()
> plot(Ratio, col=KMRatio$cluster, pch=KMRatio$cluster)
K-MEANS CLUSTERING: plot()
> plot(prcomp(Ratio, center=TRUE)$x[, c(1, 2, 3)], col=KMRatio$cluster, pch=KMRatio$cluster)
R APPLICATION: DIFFERENT LEVELS OF DATA

DATAFRAME: Flower.csv
Data with 8 characteristics for 18 popular flowers:
V1 asymmetric binary, indicates whether the plant may be left in the garden when it freezes
V2 binary, shows whether the plant needs to stand in the shadow
V3 asymmetric binary, distinguishes between plants with tubers and plants that grow in any other way
V4 nominal, specifies the flower's color
V5 ordinal, indicates whether the plant grows in dry, normal, or wet soil
V6 ordinal, preference ranking
V7 ratio, plant height in cm
V8 ratio, distance in cm between plants
DATAFRAME: Flower.csv
Read data file Flower.csv
> FLWR <- read.csv("Flower.csv")
> str(FLWR)
'data.frame': 18 obs. of 8 variables:
 $ V1: int 0 1 0 0 0 0 0 0 1 1 ...
 $ V2: int 1 0 1 0 1 1 0 0 1 1 ...
 $ V3: int 1 0 0 1 0 0 0 1 0 0 ...
 $ V4: int 4 2 3 4 5 4 4 2 3 5 ...
 $ V5: int 3 1 3 2 2 3 3 2 1 2 ...
 $ V6: int 15 3 1 16 2 12 13 7 4 14 ...
 $ V7: int 25 150 150 125 20 50 40 ...
 $ V8: int 15 50 50 50 15 40 20 15 ...
DATAFRAME: Flower.csv
Convert V1-V4 from integer to factor
> FLWR$V1 <- factor(FLWR$V1)
> FLWR$V2 <- factor(FLWR$V2)
> FLWR$V3 <- factor(FLWR$V3)
> FLWR$V4 <- factor(FLWR$V4)

Compute the dissimilarity matrix df1 and the clustering object AGN.FLWR (see the sketch below), then plot:
> plot(AGN.FLWR, hang=-1)
> rect.hclust(tree=AGN.FLWR, h=0.3, border=c("red", "blue"))
HIERARCHICAL CLUSTERING WITH P-VALUES VIA MULTISCALE BOOTSTRAP RESAMPLING

Package pvclust
An R package for assessing the uncertainty in hierarchical cluster analysis. P-values are calculated via multiscale bootstrap resampling; each p-value lies between 0 and 1 and indicates how strongly the cluster is supported by the data. Two types of p-values:
1. Approximately Unbiased (AU)
2. Bootstrap Probability (BP)
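pvclust is a contributed package and must be installed and loaded before use:
> install.packages("pvclust")   # once, if not yet installed
> library(pvclust)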
MULTISCALE BOOTSTRAP RESAMPLING
A method which assesses the accuracy of the clusters by calculating p-values from resampled data. For a cluster with p-value > 0.95, reject H0 (the cluster does not exist); thus, the cluster exists. The AU p-value is an approximation, but it is less biased than BP.
STEPS IN PERFORMING HIERARCHICAL CLUSTERING WITH P-VALUES
1. Perform hierarchical clustering with p-values via multiscale bootstrap resampling
2. Diagnostic: identify clusters with extremely high standard error values by examining the plot
3. Obtain the estimated values of the clusters in step 2 and evaluate the result
4. Apply a remedial measure whenever needed; this requires a large bootstrap sample size
DATAFRAME: lung.csv DNA Microarray data of lung tumors, where
rows correspond to genes and columns are individuals.
DATAFRAME: lung.csv
Read data file lung.csv
> LUNG <- read.csv("lung.csv")
> str(LUNG)
'data.frame': 916 obs. of 73 variables:
 $ X1: num -0.4 -2.22 -1.35 0.68 NA ...
 $ X2: num 4.28 5.21 -0.84 0.56 4.14 ...
 $ X3: num 3.68 4.75 -2.88 -0.45 3.58 ...
 $ X4: num -1.35 -0.91 3.35 -0.2 -0.4 ...
 $ X5: num -1.74 -0.33 3.02 1.14 ...
 $ X6: num 2.2 2.56 -4.48 0.22 1.59 ...
 ...
P-VALUES FOR HIERARCHICAL CLUSTERING : pvclust()
Performs hierarchical cluster analysis via the function hclust and calculates p-values for all the clusters contained in the clustering of the original data, via multiscale bootstrap resampling.
> pvclust(data, method.hclust, method.dist, nboot, r)
# data a numeric matrix, or a dataframe
# method.hclust the agglomerative method used in hierarchical clustering; the same methods as the method argument of hclust()
# method.dist distance measure to be used, such as correlation, uncentered, or abscor, or the same methods as in dist()
# nboot the number of bootstrap replications; default is 1000
# r numeric vector of relative sample sizes for the multiscale bootstrap

FIND CLUSTERS WITH HIGH P-VALUES: pvrect()
> pvrect(x, pv="au", alpha=0.95)
# x object of class pvclust
# alpha threshold value for p-values
# pv the p-value to be used, either au or bp

PRINT CLUSTERS WITH HIGH P-VALUES: pvpick()
> pvpick(x, pv="au", alpha=0.95)
# x object of class pvclust
# alpha threshold value for p-values
# pv the p-value to be used, either au or bp
P-VALUES FOR HIERARCHICAL CLUSTERING : pvclust()
> lung.pv <- pvclust(LUNG, method.hclust="average", method.dist="correlation", nboot=1000)
> windows(17, 10)
> plot(lung.pv, hang=-1, ylim=c(0, 1.2))
> pvrect(lung.pv, alpha=0.95)
> pvpick(lung.pv, alpha=0.95)
DENDROGRAM : plot()
[Figure: dendrogram with AU/BP p-values; pvrect() boxes the clusters with high p-values and pvpick() lists them]
DIAGNOSTIC PLOT FOR STANDARD ERROR OF P-VALUE : seplot()
Draws a diagnostic plot of the standard error of the p-values for a pvclust object.
> seplot(object, type, identify=FALSE)
# object object of class pvclust
# type the type of p-value to be plotted, either au or bp
# identify logical value to specify whether edge numbers can be identified interactively

DIAGNOSTIC PLOT FOR STANDARD ERROR OF P-VALUE : seplot()
> seplot(lung.pv, identify=TRUE)
PRINT VALUES : print()
A generic function, which means that new printing methods can easily be added for new classes.
> print(x, ...)
# x an object used to select a method for printing

Print the results for clusters 21, 65, and 67
> print(lung.pv, which=c(21, 65, 67))
REMEDIAL MEASURE
Increase the number of bootstrap replications specified in nboot.
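A sketch of the remedial measure, rerunning pvclust() with a larger (here hypothetical) nboot:
> lung.pv2 <- pvclust(LUNG, method.hclust="average", method.dist="correlation", nboot=10000)
> plot(lung.pv2, hang=-1)
> pvrect(lung.pv2, alpha=0.95)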
Thank you!