UVA CS 4501: Machine Learning
Lecture 22: Unsupervised Clustering (I)
Dr. Yanjun Qi
University of Virginia, Department of Computer Science
4/24/18


Where are we? Major sections of this course

•  Regression (supervised)
•  Classification (supervised)
•  Feature selection
•  Unsupervised models
    –  Dimension Reduction (PCA)
    –  Clustering (K-means, GMM/EM, Hierarchical)
•  Learning theory
•  Graphical models

An unlabeled Dataset X

A data matrix of n observations on p variables x1, x2, …, xp

•  Data/points/instances/examples/samples/records: [rows]
•  Features/attributes/dimensions/independent variables/covariates/predictors/regressors: [columns]

Unsupervised learning = learning from raw (unlabeled, unannotated, etc.) data, as opposed to supervised data where the label of each example is given.

Today: What is clustering?

•  Are there any "groups"?
•  What is each group?
•  How many?
•  How to identify them?

What is clustering?

•  Find groups (clusters) of data points such that data points in a group will be similar (or related) to one another and different from (or unrelated to) the data points in other groups
    –  Intra-cluster distances are minimized
    –  Inter-cluster distances are maximized
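As a rough illustration of the criterion above, the sketch below compares the mean intra-cluster distance with the mean inter-cluster distance on a toy 2-D dataset. The points, the grouping, and the helper name `mean_pairwise` are illustrative assumptions, not part of the lecture.

```python
import math
from itertools import combinations, product

def mean_pairwise(pairs):
    """Average Euclidean distance over a collection of point pairs."""
    dists = [math.dist(a, b) for a, b in pairs]
    return sum(dists) / len(dists)

# Hypothetical toy data: two well-separated groups in the plane.
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
cluster_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Pairs drawn from the same cluster vs. pairs that cross clusters.
within = list(combinations(cluster_a, 2)) + list(combinations(cluster_b, 2))
across = list(product(cluster_a, cluster_b))

intra = mean_pairwise(within)
inter = mean_pairwise(across)
print(f"mean intra-cluster distance: {intra:.3f}")
print(f"mean inter-cluster distance: {inter:.3f}")
```

For a grouping that matches the structure of the data, the first number should be much smaller than the second.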

What is clustering?

•  Clustering: the process of grouping a set of objects into classes of similar objects
    –  high intra-class similarity
    –  low inter-class similarity
    –  It is the most common form of unsupervised learning
•  A common and important task that finds many applications in science, engineering, information science, and other places, e.g.
    •  Group genes that perform the same function
    •  Group individuals that have similar political views
    •  Categorize documents of similar topics
    •  Identify similar objects from pictures


Toy Examples

•  People
•  Images
•  Language
•  Species

Application (I): Search Result Clustering

Application (II): Navigation

Issues for clustering

•  What is a natural grouping among these objects?
    –  Definition of "groupness"
•  What makes objects "related"?
    –  Definition of "similarity/distance"
•  Representation for objects
    –  Vector space? Normalization?
•  How many clusters?
    –  Fixed a priori?
    –  Completely data driven?
        •  Avoid "trivial" clusters: too large or too small
•  Clustering Algorithms
    –  Partitional algorithms
    –  Hierarchical algorithms
•  Formal foundation and convergence

Today Roadmap: clustering

§  Definition of "groupness"
§  Definition of "similarity/distance"
§  Representation for objects
§  How many clusters?
§  Clustering Algorithms
    §  Partitional algorithms
    §  Hierarchical algorithms
§  Formal foundation and convergence

What is a natural grouping among these objects?

Another example: clustering is subjective

[Figure: the same objects, each labeled A or B, grouped in two different ways.]

Two possible solutions…


What is Similarity?

Hard to define! But we know it when we see it.

•  The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
•  Depends on representation and algorithm. For many rep./alg., it is easier to think in terms of a distance (rather than similarity) between vectors.

What properties should a distance measure have?

•  D(A,B) = D(B,A)   Symmetry
•  D(A,A) = 0   Constancy of Self-Similarity
•  D(A,B) = 0 iff A = B   Positivity (Separation)
•  D(A,B) <= D(A,C) + D(B,C)   Triangular Inequality

Intuitions behind desirable properties of distance measure

•  D(A,B) = D(B,A)   Symmetry
    –  Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex"
•  D(A,A) = 0   Constancy of Self-Similarity
    –  Otherwise you could claim "Alex looks more like Bob, than Bob does"
•  D(A,B) = 0 iff A = B   Positivity (Separation)
    –  Otherwise there are objects in your world that are different, but you cannot tell apart.
•  D(A,B) <= D(A,C) + D(B,C)   Triangular Inequality
    –  Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl"
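The four properties can be spot-checked numerically on a finite sample of points. The sketch below is only a sanity check under assumed inputs (the sample points and the helper names `is_metric` and `squared_euclidean` are hypothetical): passing it does not prove that a function is a metric, but failing it shows that a property is violated.

```python
import math
from itertools import product

def is_metric(points, dist, tol=1e-12):
    """Check the four distance properties on every pair and triple of a finite sample."""
    for a, b in product(points, repeat=2):
        if abs(dist(a, b) - dist(b, a)) > tol:      # symmetry
            return False
        if a == b and abs(dist(a, b)) > tol:        # D(A, A) = 0
            return False
        if a != b and dist(a, b) <= tol:            # D(A, B) = 0 only if A = B
            return False
    for a, b, c in product(points, repeat=3):
        if dist(a, b) > dist(a, c) + dist(b, c) + tol:  # triangular inequality
            return False
    return True

def squared_euclidean(a, b):
    """Squared Euclidean distance, which breaks the triangular inequality."""
    return math.dist(a, b) ** 2

sample = [(0, 0), (3, 4), (1, 1), (-2, 5)]   # hypothetical sample points
print(is_metric(sample, math.dist))
print(is_metric(sample, squared_euclidean))
```

On this sample the Euclidean distance passes, while the squared Euclidean distance fails: going from (0, 0) to (3, 4) costs 25, but the detour through (1, 1) costs only 2 + 13 = 15.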

Distance Measures: Minkowski Metric

•  Suppose two objects x and y both have p features:

$$x = (x_1, x_2, \ldots, x_p), \qquad y = (y_1, y_2, \ldots, y_p)$$

•  The Minkowski metric is defined by

$$d(x, y) = \sqrt[r]{\sum_{i=1}^{p} |x_i - y_i|^r}$$

•  Most common Minkowski metrics:

1. r = 2 (Euclidean distance):

$$d(x, y) = \sqrt{\sum_{i=1}^{p} |x_i - y_i|^2}$$

2. r = 1 (Manhattan distance):

$$d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$$

3. r = +∞ ("sup" distance):

$$d(x, y) = \max_{1 \le i \le p} |x_i - y_i|$$
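The definition above translates directly into a few lines of code. This is a minimal sketch, assuming plain Python sequences as feature vectors; the function name `minkowski` and the sample points are illustrative choices.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance of order r between equal-length sequences x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)                       # r = +inf: "sup" distance
    return sum(d ** r for d in diffs) ** (1.0 / r)

# Hypothetical points whose coordinates differ by 4 and 3.
x, y = (0.0, 0.0), (4.0, 3.0)
print("Euclidean:", minkowski(x, y, 2))
print("Manhattan:", minkowski(x, y, 1))
print("sup:      ", minkowski(x, y, math.inf))
```

Treating r = +∞ as a separate branch avoids raising numbers to an infinite power; the maximum is the limit of the general formula as r grows.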

An Example

[Figure: two points x and y whose coordinates differ by 4 along one axis and by 3 along the other.]

1. Euclidean distance: $\sqrt{4^2 + 3^2} = 5$
2. Manhattan distance: $4 + 3 = 7$
3. "sup" distance: $\max\{4, 3\} = 4$


Hamming distance: discrete features

•  Manhattan distance is called Hamming distance when all features are binary or discrete.

$$d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$$

    –  E.g., Gene Expression Levels Under 17 Conditions (1-High, 0-Low)

Condition   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Gene A      0  1  1  0  0  1  0  0  1  0  0  1  1  1  0  0  1
Gene B      0  1  1  1  0  0  0  0  1  1  1  1  1  1  0  1  1

Hamming Distance: #(01) + #(10) = 4 + 1 = 5.
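A short sketch of the Hamming distance on the two 17-condition profiles follows. The bit strings are read from the slide; which row belongs to which gene is an assumption of this reconstruction, although the distance is symmetric and does not depend on it. The function name `hamming` is illustrative.

```python
def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y))

# Binary expression profiles over 17 conditions (1 = high, 0 = low).
# The row-to-gene assignment is an assumption; the distance is symmetric.
gene_a = [int(c) for c in "01100100100111001"]
gene_b = [int(c) for c in "01110000111111011"]

# For binary features the Manhattan distance gives the same count.
manhattan = sum(abs(a - b) for a, b in zip(gene_a, gene_b))
print(hamming(gene_a, gene_b), manhattan)
```

Counting mismatches with `a != b` also works for non-numeric discrete features such as nucleotide letters, where the Manhattan formula is not defined.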

Edit Distance: A generic technique for measuring similarity

•  To measure the similarity between two objects, transform one of the objects into the other, and measure how much effort it took. The measure of effort becomes the distance measure.

[Figure: Marge, Patty, and Selma.]

The distance between Patty and Selma:
    –  Change dress color, 1 point
    –  Change earring shape, 1 point
    –  Change hair part, 1 point
D(Patty, Selma) = 3

The distance between Marge and Selma:
    –  Change dress color, 1 point
    –  Add earrings, 1 point
    –  Decrease height, 1 point
    –  Take up smoking, 1 point
    –  Lose weight, 1 point
D(Marge, Selma) = 5

This is called the Edit distance or the Transformation distance.
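The slide describes edit distance generically. One common concrete instance, swapped in here purely for illustration, is the Levenshtein distance between strings, where each insertion, deletion, or substitution of a character costs 1 point. The function name and the example words are assumptions, not taken from the lecture.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))           # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                           # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,           # delete cs
                            curr[j - 1] + 1,       # insert ct
                            prev[j - 1] + cost))   # substitute, or keep a match
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))
```

Only two rows of the dynamic-programming table are kept at a time, so memory grows with the length of `t` rather than with the product of both lengths.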

Similarity Measures: Correlation Coefficient

•  Pearson correlation coefficient

$$s(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \times \sum_{i=1}^{p} (y_i - \bar{y})^2}}$$

where $\bar{x} = \frac{1}{p} \sum_{i=1}^{p} x_i$ and $\bar{y} = \frac{1}{p} \sum_{i=1}^{p} y_i$.

$$|s(x, y)| \le 1$$

•  Special case: cosine distance (the formula below is the cosine similarity, i.e. the correlation formula without mean-centering)

$$s(x, y) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}| \cdot |\vec{y}|}$$

•  Measures the linear correlation between two sequences, x and y,
•  giving a value between +1 and −1 inclusive, where 1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation.

Correlation is unit independent.
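Both formulas can be written out directly. The sketch below assumes plain Python lists as sequences; the function names and the sample vectors are illustrative. Neither function guards against a constant (or all-zero) input, for which the denominator is zero.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    p = len(x)
    mean_x, mean_y = sum(x) / p, sum(y) / p
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x) *
                    sum((b - mean_y) ** 2 for b in y))
    return num / den

def cosine(x, y):
    """Cosine similarity: the same ratio, without subtracting the means."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return num / den

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]   # increases linearly with x
z = [4.0, 3.0, 2.0, 1.0]   # decreases linearly with x
print(pearson(x, y), pearson(x, z), cosine(x, y))
```

Applying `cosine` to mean-centered copies of the inputs reproduces `pearson`, which is the sense in which one is a special case of the other.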

Similarity Measures: e.g., Correlation Coefficient on time series samples

[Figure: three panels plotting Expression Level against Time for Gene A and Gene B.]

Correlation is unit independent: if you scale one of the objects ten times, you will get different Euclidean distances and the same correlation distances.
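The scaling claim is easy to verify numerically. In this sketch the two expression profiles are made-up values, and `pearson` is a local helper written for the example rather than a library call.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    p = len(x)
    mean_x, mean_y = sum(x) / p, sum(y) / p
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x) *
                    sum((b - mean_y) ** 2 for b in y))
    return num / den

# Hypothetical expression levels of two genes at five time points.
gene_a = [1.0, 3.0, 2.0, 5.0, 4.0]
gene_b = [2.0, 3.5, 2.5, 6.0, 4.5]
gene_b_scaled = [10.0 * v for v in gene_b]   # same profile in different units

print("Euclidean:  ", math.dist(gene_a, gene_b), math.dist(gene_a, gene_b_scaled))
print("correlation:", pearson(gene_a, gene_b), pearson(gene_a, gene_b_scaled))
```

The Euclidean distance changes substantially after scaling, while the two correlation values agree up to floating-point rounding.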


Clustering Algorithms

•  Partitional algorithms
    –  Usually start with a random (partial) partitioning
    –  Refine it iteratively
        •  K-means clustering
        •  Mixture-Model based clustering
•  Hierarchical algorithms
    –  Bottom-up, agglomerative
    –  Top-down, divisive


Hierarchical Clustering

•  Build a tree-based hierarchical taxonomy (dendrogram) from a set of objects, e.g. organisms, documents.

[Figure: example taxonomy with branches "With backbone" and "Without backbone".]

•  Note that hierarchies are commonly used to organize information, for example in a web portal.
    –  The Yahoo! hierarchy is manually created; we will focus on automatic creation of hierarchies.

(How-to) Hierarchical Clustering

Clustering: the process of grouping a set of objects into classes of similar objects, with high intra-class similarity and low inter-class similarity.

The number of dendrograms with n leaves is

$$\frac{(2n - 3)!}{2^{\,n-2} \, (n - 2)!}$$

Number of Leaves    Number of Possible Dendrograms
2                   1
3                   3
4                   15
5                   105
...                 …
10                  34,459,425

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
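The counting formula can be checked against the table with exact integer arithmetic. This is a small sketch; the function name `num_dendrograms` is an illustrative choice.

```python
from math import factorial

def num_dendrograms(n):
    """Number of dendrograms with n leaves: (2n - 3)! / (2^(n-2) * (n-2)!)."""
    if n < 2:
        raise ValueError("need at least two leaves")
    # The division is exact, so integer floor division loses nothing.
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))
```

The count grows faster than exponentially, which is why an exhaustive search over all dendrograms is impractical and a greedy merging procedure is used instead.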

Bottom-up merging is a greedy procedure: it finds a locally optimal solution.

We begin with a distance matrix which contains the distances between every pair of objects in our database.

0   8   8   7   7
    0   2   4   4
        0   3   3
            0   1
                0

[Figure: object icons label the rows and columns; two entries are highlighted, D( , ) = 8 and D( , ) = 1.]

Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

•  Consider all possible merges… Choose the best.
•  Consider all possible merges… Choose the best.
•  Consider all possible merges… Choose the best.
•  …

But how do we compute distances between clusters rather than objects?

How to decide the distances between clusters?

•  Single-Link
    –  Nearest Neighbor: their closest members.
•  Complete-Link
    –  Furthest Neighbor: their furthest members.
•  Average
    –  Average of all cross-cluster pairs.

Computing distance between clusters: Single Link

•  cluster distance = distance between the two closest members, one from each cluster
    -  Potentially long and skinny clusters

Computing distance between clusters: Complete Link

•  cluster distance = distance between the two farthest members, one from each cluster
    +  Tight clusters

Computing distance between clusters: Average Link

•  cluster distance = average distance of all cross-cluster pairs
•  The most widely used measure
•  Robust against noise
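The three linkage rules differ only in how they aggregate the cross-cluster pairwise distances. The sketch below assumes a precomputed symmetric distance matrix `D` and clusters given as lists of row indices; the matrix values and function names are hypothetical.

```python
def single_link(D, A, B):
    """Distance between the closest pair of members, one from each cluster."""
    return min(D[i][j] for i in A for j in B)

def complete_link(D, A, B):
    """Distance between the farthest pair of members, one from each cluster."""
    return max(D[i][j] for i in A for j in B)

def average_link(D, A, B):
    """Mean distance over all cross-cluster pairs."""
    return sum(D[i][j] for i in A for j in B) / (len(A) * len(B))

# Hypothetical symmetric distance matrix over four objects (indices 0..3).
D = [[0, 1, 4, 5],
     [1, 0, 3, 6],
     [4, 3, 0, 2],
     [5, 6, 2, 0]]
A, B = [0, 1], [2, 3]
print(single_link(D, A, B), complete_link(D, A, B), average_link(D, A, B))
```

For any pair of clusters the single-link value is at most the average-link value, which is at most the complete-link value.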

Example: single link

Pairwise distances between objects 1 to 5:

      1    2    3    4    5
1     0
2     2    0
3     6    3    0
4    10    9    7    0
5     9    8    5    4    0

[Figure: dendrogram over the leaves 1, 2, 3, 4, 5.]


Example: single link

Objects 1 and 2 are the closest pair (distance 2), so they are merged first. The distances from the new cluster (1,2) to the remaining objects are:

$$d_{(1,2),3} = \min\{d_{1,3}, d_{2,3}\} = \min\{6, 3\} = 3$$
$$d_{(1,2),4} = \min\{d_{1,4}, d_{2,4}\} = \min\{10, 9\} = 9$$
$$d_{(1,2),5} = \min\{d_{1,5}, d_{2,5}\} = \min\{9, 8\} = 8$$

Updated distance matrix:

        (1,2)   3    4    5
(1,2)     0
3         3     0
4         9     7    0
5         8     5    4    0

Example: single link

The smallest remaining distance is between (1,2) and 3 (distance 3), so they are merged into (1,2,3):

$$d_{(1,2,3),4} = \min\{d_{(1,2),4}, d_{3,4}\} = \min\{9, 7\} = 7$$
$$d_{(1,2,3),5} = \min\{d_{(1,2),5}, d_{3,5}\} = \min\{8, 5\} = 5$$

Updated distance matrix:

          (1,2,3)   4    5
(1,2,3)      0
4            7      0
5            5      4    0

Example: single link

The smallest remaining distance is between 4 and 5 (distance 4), so they are merged into (4,5). The final merge joins (1,2,3) and (4,5):

$$d_{(1,2,3),(4,5)} = \min\{d_{(1,2,3),4}, d_{(1,2,3),5}\} = 5$$
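The worked example can be replayed with a naive single-link implementation. This sketch uses the 5 × 5 distance matrix from the example and recomputes every cluster distance from scratch at each step, which is simple but far from the most efficient approach; the function name and the shape of the returned history are illustrative assumptions.

```python
def single_link_hac(D):
    """Naive agglomerative clustering with single link.

    D is a full symmetric distance matrix. Returns the merge history as
    (cluster_a, cluster_b, distance) tuples using 1-based object labels.
    """
    clusters = [(i + 1,) for i in range(len(D))]
    history = []
    while len(clusters) > 1:
        best = None
        # Consider all possible merges and keep the closest pair of clusters.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i - 1][j - 1] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], d))
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history

# Distance matrix of the worked example (objects 1 to 5).
D = [[0, 2, 6, 10, 9],
     [2, 0, 3, 9, 8],
     [6, 3, 0, 7, 5],
     [10, 9, 7, 0, 4],
     [9, 8, 5, 4, 0]]

history = single_link_hac(D)
for left, right, dist in history:
    print(left, "+", right, "at distance", dist)
```

The merge distances come out as 2, 3, 4, and 5, matching the hand computation: (1,2) first, then 3 joins, then (4,5), and finally the two remaining clusters.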

[Figure: dendrograms over 30 objects, with panels labeled "Average linkage" and "Single linkage".]

Height represents distance between objects / clusters.

Partitions by cutting the dendrogram at a desired level: each connected component forms a cluster.
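Cutting a dendrogram at a given height amounts to keeping only the merges at or below that height and reading off the connected components. The sketch below does this with a small union-find structure; the merge history is a hypothetical one for five objects, and the function name `cut_dendrogram` is an assumption.

```python
def cut_dendrogram(n, merges, height):
    """Flat clusters obtained by keeping only merges at or below `height`.

    `merges` is a list of (members_a, members_b, distance) tuples with
    1-based object labels; returns a sorted list of clusters.
    """
    parent = list(range(n + 1))            # union-find over labels 1..n

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for members_a, members_b, dist in merges:
        if dist <= height:
            # Joining one representative of each side links the two clusters.
            parent[find(members_a[0])] = find(members_b[0])

    groups = {}
    for i in range(1, n + 1):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical merge history for five objects, with merge heights 2, 3, 4, 5.
merges = [((1,), (2,), 2), ((3,), (1, 2), 3),
          ((4,), (5,), 4), ((1, 2, 3), (4, 5), 5)]
print(cut_dendrogram(5, merges, 3.5))
print(cut_dendrogram(5, merges, 4.5))
```

Raising the cut level merges more components, so the number of clusters can only decrease as the height increases.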

Hierarchical Clustering

•  Bottom-Up Agglomerative Clustering
    –  Starts with each object in a separate cluster,
    –  then repeatedly joins the closest pair of clusters,
    –  until there is only one cluster.

The history of merging forms a binary tree or hierarchy (dendrogram).

•  Top-Down divisive
    –  Starting with all the data in a single cluster,
    –  consider every possible way to divide the cluster into two. Choose the best division,
    –  and recursively operate on both sides.

Computational Complexity

•  In the first iteration, all HAC (hierarchical agglomerative clustering) methods need to compute the similarity of all pairs of n individual instances, which is O(n²p).
•  In each of the subsequent n − 2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
•  For the subsequent steps, in order to maintain an overall O(n²) performance, computing the similarity to each other cluster must be done in constant time. Otherwise the cost is O(n² log n), or O(n³) if done naively.
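One standard way to get the constant-time update mentioned above is to derive the distance to a newly merged cluster from distances that are already stored, instead of revisiting individual objects. The sketch below shows such update rules for the three linkages discussed earlier; the function name and the sample numbers are illustrative, and this is not claimed to be the lecture's own implementation.

```python
def merged_distance(d_ki, d_kj, n_i, n_j, linkage):
    """Distance from cluster k to the merger of clusters i and j.

    d_ki and d_kj are the stored distances from k to i and from k to j;
    n_i and n_j are the sizes of i and j. Runs in constant time.
    """
    if linkage == "single":
        return min(d_ki, d_kj)
    if linkage == "complete":
        return max(d_ki, d_kj)
    if linkage == "average":
        # Size-weighted mean equals the mean over all cross-cluster pairs.
        return (n_i * d_ki + n_j * d_kj) / (n_i + n_j)
    raise ValueError(f"unknown linkage: {linkage}")

# Hypothetical step: cluster i has 2 members, cluster j has 1 member,
# and their stored distances to cluster k are 9 and 7.
for name in ("single", "complete", "average"):
    print(name, merged_distance(9, 7, 2, 1, name))
```

With these rules each merge costs one update per remaining cluster, so the n − 2 later iterations spend O(n²) time in total on distance updates.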

Summary of Hierarchical Clustering Methods

•  No need to specify the number of clusters in advance.
•  Hierarchical structure maps nicely onto human intuition for some domains.
•  They do not scale well: time complexity of at least O(n²), where n is the total number of objects.
•  As with any heuristic search algorithm, local optima are a problem.
•  Interpretation of results is (very) subjective.

Hierarchical Clustering

Task                   Clustering
Representation         n/a
Score Function         No clearly defined loss
Search/Optimization    Greedy bottom-up (or top-down)
Models, Parameters     Dendrogram (tree)

References

•  Hastie, Trevor, et al. The Elements of Statistical Learning. Vol. 2. No. 1. New York: Springer, 2009.
•  Big thanks to Prof. Eric Xing @ CMU for allowing me to reuse some of his slides
•  Big thanks to Prof. Ziv Bar-Joseph @ CMU for allowing me to reuse some of his slides