Statistics for Research II (Multivariate Techniques) Reference Materials Available articles (websites) Lecture-notes/power-points and the book: Multivariate Analysis By Joseph Hair, William Black, Barry Babin, Rolph Anderson, Ronald Dixon

Cluster Analysis Techniques

Aug 18, 2015


Multivariate Techniques

Dependence Techniques: Multiple Regression, Discriminant Analysis, Conjoint Analysis
Interdependence Techniques: Factor Analysis, Cluster Analysis, Multidimensional Scaling

Broad Learning Objectives

1. Explain cluster analysis and its role in multivariate analysis.
2. Understand how customer/product/firm etc. similarity is measured.
3. Understand the differences between clustering techniques.
4. Interpret the results of a cluster analysis.

Objective of Cluster Analysis

The objective of cluster analysis is to group population/sample observations into different groups/clusters so that observations within each group/cluster are similar to one another and the groups/clusters themselves stand apart from one another. In other words, the objective is to divide the items/objects of the population/sample (customers, products, firms, etc.) into homogeneous and distinct groups with respect to some variables/attributes of interest. The conceptual framework of the research problem and its objectives suggest which variables/attributes should be used for clustering such items/objects.

Objective of Cluster Analysis (continued)

Cluster analysis may be used in business for segmenting markets by identifying customers with similar needs, for tracking commodities in order of preference/quality, etc., and for understanding the dynamics involved in their changes over time. Cluster analysis is also occasionally used to group variables into homogeneous and distinct groups. This approach is used, for example, in revising a questionnaire on the basis of responses received to a draft questionnaire. Grouping the questions by means of cluster analysis helps to identify redundant questions and reduce their number.

A Simple Example of Clustering

Example 1: Assume the daily expenditures on food (X1) and clothing (X2) of five persons are as given in the table below.

Person   X1    X2
a        2     4
b        8     2
c        9     3
d        1     5
e        8.5   1

The data of this table are plotted in the figure on the next slide.

Explaining Methods of Clustering (contd.)

Figure: Grouping of observations

Inspection of the figure suggests that the five observations form two clusters: one consisting of persons a and d, and the other of b, c and e. The observations in each cluster are similar with respect to expenditures on food and clothing, and the two clusters are quite distinct from each other.

When observations are obtained for many variables/attributes and the number of observations is large, it is impractical to examine all possible clusterings of the available observations and to summarize each one according to the degree of proximity among the cluster elements and the separation of the clusters.

Popular Clustering Methods

There are many clustering methods. One method, for example, begins with as many groups as there are observations, and then systematically merges observations to reduce the number of groups by one, two, ..., until a single group containing all observations is formed. Another method begins with a given number of groups and an arbitrary assignment of the observations to the groups, and then reassigns the observations one by one so that ultimately each observation belongs to the nearest group. These methods employ a distance measure or a measure of association to indicate precisely the degree of similarity (proximity/nearness/closeness) among observations.

Measures of Distance

Definition (Measures of distance): The Euclidean distance between the two points i and j is the hypotenuse of the triangle ABC:

$$D(i, j) = \sqrt{A^2 + B^2} = \sqrt{(X_{1i} - X_{1j})^2 + (X_{2i} - X_{2j})^2}$$
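As a rough illustration of the Euclidean distance just defined, the sketch below computes D(i, j) for every pair of persons in Example 1. The coordinates are read from the Example 1 table (a = (2, 4), b = (8, 2), c = (9, 3), d = (1, 5), e = (8.5, 1)); the names `points`, `euclidean` and `distance_matrix` are illustrative and do not come from the slides.

```python
import math

# Example 1 data: (food X1, clothing X2) for each person.
points = {
    "a": (2.0, 4.0),
    "b": (8.0, 2.0),
    "c": (9.0, 3.0),
    "d": (1.0, 5.0),
    "e": (8.5, 1.0),
}


def euclidean(p, q):
    """Hypotenuse of the right triangle whose legs are the coordinate differences."""
    leg_a = p[0] - q[0]
    leg_b = p[1] - q[1]
    return math.sqrt(leg_a ** 2 + leg_b ** 2)


def distance_matrix(pts):
    """Return {(i, j): D(i, j)} for every unordered pair of labels with i < j."""
    labels = sorted(pts)
    return {
        (i, j): euclidean(pts[i], pts[j])
        for k, i in enumerate(labels)
        for j in labels[k + 1:]
    }


for (i, j), d in distance_matrix(points).items():
    print(f"D({i}, {j}) = {d:.3f}")
```

With these coordinates D(a, b) comes out at about 6.325 and D(b, c) at about 1.414, the same values that appear in the worked examples later in these slides.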

Observation i is closer (more similar) to j than observation k if D(i, j) < D(i, k).

An alternative measure is the squared Euclidean distance:

$$D_2(i, j) = A^2 + B^2 = (X_{1i} - X_{1j})^2 + (X_{2i} - X_{2j})^2$$

Yet another measure is the city block distance, defined as

$$D_3(i, j) = |A| + |B| = |X_{1i} - X_{1j}| + |X_{2i} - X_{2j}|$$

Measures of Distance (contd.)

All three measures of distance depend on the units in which X1 and X2 are measured, and are influenced by whichever variable takes numerically larger values. For this reason, the variables are often standardized so that they have mean 0 and variance 1 before cluster analysis is applied. Alternatively, weights w1, w2, ..., wk reflecting the importance of the variables could be used and a weighted measure of distance calculated.

Nearest Neighbor Method of Clustering

Begin with as many clusters as there are observations, that is, with each observation forming a separate cluster. Merge the pair of observations that are nearest one another, leaving n - 1 clusters for the next step. Next, merge into one cluster the pair of clusters that are nearest one another, leaving n - 2 clusters for the next step. Continue in this fashion, reducing the number of clusters by one at each step, until a single cluster consisting of all n observations is formed. At each step, keep track of the distance at which the clusters are formed. To determine the number of clusters, consider the step(s) at which the merging distance is relatively large.

A problem with this procedure is how to measure the distance between clusters consisting of two or more observations.

[...]

We arbitrarily select (ad) as the new cluster and obtain three clusters: (ad), (be) and c, as shown in Figure 2(b). The distance between (be) and (ad) is

$$D(be, ad) = \min\{D(be, a), D(be, d)\} = \min\{6.325, 7.616\} = 6.325,$$

while that between c and (ad) is

$$D(c, ad) = \min\{D(c, a), D(c, d)\} = \min\{7.071, 8.246\} = 7.071.$$

The three clusters remaining at this step and the distances between these clusters are shown in Figure 3(a).

Figure 3

Nearest Neighbor Method: Example (continued)

Nearest neighbor method, Step 3: Among these three clusters, merge the two that are closest, i.e. merge (be) with c to form the cluster (bce) shown in Figure 3(b). The distance between the two remaining clusters is

$$D(ad, bce) = \min\{D(ad, be), D(ad, c)\} = \min\{6.325, 7.071\} = 6.325$$

The grouping of these two clusters, it will be noted, occurs at a distance of 6.325, a much greater distance than that at which the earlier groupings took place. Figure 4 shows the final grouping.

Figure 4

Nearest Neighbor Method: Example (continued)

Nearest neighbor method, Step 4: The groupings and the distances at which they took place are also shown in the tree diagram (dendrogram) of Figure 5.

Figure 5 (dendrogram)

Nearest Neighbor Method: Example (continued)

One usually searches the dendrogram for large jumps in the grouping distance as guidance in arriving at the number of groups. In this illustration, it is clear that the elements in each of the clusters (ad) and (bce) are close (they were merged at a small distance), but the clusters themselves are far apart (the distance at which they merge is large). Hence, this method of clustering shows (ad) and (bce) as the clusters we are looking for.

Farthest Neighbor Method of Clustering

The nearest neighbor method is not the only way of measuring the distance between clusters. Under the farthest neighbor (or complete linkage) method, the distance between two clusters is the distance between their two most distant members. See the adjoining figure.

Figure: Cluster distance, farthest neighbor method

Farthest Neighbor Method: the Simple Example

This method of clustering the observations in Example 1 can also be started from Figure 1. Obviously, the farthest neighbor method also calls for grouping b and e at Step 1. However, the distances between (be), on the one hand, and the clusters (a), (c), and (d), on the other, are different from those obtained with the nearest neighbor method.

Farthest Neighbor Method: Example (continued)

$$D(be, a) = \max\{D(b, a), D(e, a)\} = \max\{6.325, 7.159\} = 7.159$$
$$D(be, c) = \max\{D(b, c), D(e, c)\} = \max\{1.414, 2.062\} = 2.062$$
$$D(be, d) = \max\{D(b, d), D(e, d)\} = \max\{7.616, 8.500\} = 8.500$$

The four clusters remaining at Step 1 and the distances between these clusters are shown in Figure 6(a).

Figure 6

In Step 2, the nearest clusters (a) and (d) are grouped into cluster (ad). The remaining steps are executed similarly.

Average Linkage Method of Clustering

The nearest and farthest neighbor methods produce the same results in Example 1. In other cases, however, the two methods may not agree. For instance, consider the following figure.

Figure: Two cluster
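The nearest and farthest neighbor procedures worked through above for Example 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the slides' own implementation: the coordinates are those read from the Example 1 table, the names `cluster_distance` and `agglomerate` are hypothetical, and ties between equally close pairs are broken by whichever pair is encountered first.

```python
import math
from itertools import combinations

# Example 1 data: (food X1, clothing X2) for each person.
points = {
    "a": (2.0, 4.0),
    "b": (8.0, 2.0),
    "c": (9.0, 3.0),
    "d": (1.0, 5.0),
    "e": (8.5, 1.0),
}


def cluster_distance(pts, c1, c2, linkage):
    """Distance between two clusters: min (nearest) or max (farthest) over member pairs."""
    return linkage(math.dist(pts[i], pts[j]) for i in c1 for j in c2)


def agglomerate(pts, linkage=min):
    """Merge the two closest clusters until one remains.

    linkage=min applies the nearest neighbor rule,
    linkage=max applies the farthest neighbor rule.
    Returns a list of (merged_cluster, merging_distance) in merge order.
    """
    clusters = [(label,) for label in sorted(pts)]  # start: one cluster per observation
    history = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pts, pair[0], pair[1], linkage),
        )
        merged = tuple(sorted(c1 + c2))
        history.append((merged, cluster_distance(pts, c1, c2, linkage)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
    return history


for name, rule in (("nearest", min), ("farthest", max)):
    print(name)
    for merged, d in agglomerate(points, rule):
        print(f"  merge -> ({''.join(merged)}) at distance {d:.3f}")
```

Reading the printed merging distances plays the same role as reading the dendrogram: a large jump before the last merge suggests stopping at two clusters. Under the nearest neighbor rule the last merge should occur at about 6.325, in line with the worked example.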
K-means method: Example 1 (continued)

We now calculate the distance of a and b from the two centroids. Since a is closer to the centroid of Cluster 1, to which it belongs, a is not reassigned. Since b is closer to Cluster 2's centroid than to that of Cluster 1, it is reassigned to Cluster 2. The new cluster centroids are calculated as shown in Figure 8(a) on the next slide and plotted in Figure 8(b).

Variants of K-means method

Other variants of the K-means method require that the first cluster centroids (the "seeds", as they are sometimes called) be specified. These seeds could be observations. Observations within a specified distance from a centroid are then included in the cluster. In some variants, the first observation found to be nearer another cluster centroid is immediately reassigned and the new centroids recalculated. In others, reassignment and recalculation wait until all observations have been examined and one observation is selected on the basis of certain criteria. The "quick" or "fast" clustering procedures used by computer programs such as SAS or S
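The variant in which the first observation found nearer to another centroid is immediately reassigned can be sketched as follows. This is only an illustration under assumptions: the starting assignment (a, b, d in Cluster 1; c, e in Cluster 2) is hypothetical, since the initial clusters of the worked example are not in the surviving text, and the names `centroid` and `kmeans_reassign` are not from the slides.

```python
import math

# Example 1 data: (food X1, clothing X2) for each person.
points = {
    "a": (2.0, 4.0),
    "b": (8.0, 2.0),
    "c": (9.0, 3.0),
    "d": (1.0, 5.0),
    "e": (8.5, 1.0),
}


def centroid(pts, members):
    """Mean of each coordinate over the members of one cluster."""
    coords = [pts[m] for m in members]
    return tuple(sum(axis) / len(coords) for axis in zip(*coords))


def kmeans_reassign(pts, assignment):
    """Reassign observations one at a time until each is nearest its own centroid.

    assignment maps label -> cluster id; a modified copy is returned.
    """
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        ids = sorted(set(assignment.values()))
        centroids = {
            k: centroid(pts, [m for m in assignment if assignment[m] == k])
            for k in ids
        }
        for label in sorted(pts):
            own = math.dist(pts[label], centroids[assignment[label]])
            nearest = min(ids, key=lambda k: math.dist(pts[label], centroids[k]))
            # Move only if another centroid is strictly closer than the current one.
            if math.dist(pts[label], centroids[nearest]) < own:
                assignment[label] = nearest
                changed = True
                break  # recalculate centroids before examining further observations
    return assignment


# Hypothetical arbitrary starting assignment (an assumption, not from the slides).
initial = {"a": 1, "b": 1, "d": 1, "c": 2, "e": 2}
final = kmeans_reassign(points, initial)
print(final)
```

Starting from this assumed assignment, a stays in Cluster 1 and b moves to Cluster 2, which agrees with the reassignment described in the example; a different starting assignment may need more passes.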