K-MEAN CLUSTER BY CHENG ZHAN HOUSTON MACHINE LEARNING MEETUP 1/7/2017

K means and dbscan

Apr 15, 2017

Data & Analytics

Yan Xu
Transcript
Page 1: K means and dbscan

K-MEAN CLUSTER BY CHENG ZHAN

HOUSTON MACHINE LEARNING MEETUP

1/7/2017

Page 5: K means and dbscan

INTRODUCTION

• K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms for solving the well-known clustering problem.

• The main idea is to define k centroids, one for each cluster.

Page 6: K means and dbscan

• Input

• M(set of points)

• k(number of clusters)

• Output

• μ_1 , …, μ_k (cluster centroids)

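• k-means partitions the points in M into k clusters by minimizing the squared error function

$$J = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$$

where C_j is the set of points assigned to centroid μ_j.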
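As a rough illustration of the squared error function, the sketch below sums each point's squared Euclidean distance to its nearest centroid. The function name `squared_error` and the sample points are assumptions made for this example, not part of the original slides.

```python
def squared_error(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        # Squared distance to the closest centroid.
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
    return total


# Hypothetical data: two tight pairs of points, far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = squared_error(points, [(0.0, 1.0), (10.0, 1.0)])  # one centroid per pair
bad = squared_error(points, [(5.0, 0.0), (5.0, 2.0)])    # centroids between the pairs
print(good, bad)
```

Centroids placed inside the two groups give a much smaller error than centroids placed between them, which is the quantity k-means tries to minimize.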

Page 7: K means and dbscan

HOW TO PICK K

Page 8: K means and dbscan

K-MEAN ALGORITHM

• 0. Initialize cluster centers

• 1. Assign observations to closest cluster center

• 2. Revise cluster centers as mean of assigned observations

• 3. Repeat 1 & 2 until convergence
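The four steps above can be sketched in plain Python as follows. This is a minimal sketch, not the presenter's code: the names `kmeans`, `sq_dist`, and `mean`, the choice to initialize from randomly sampled data points, the `max_iter` cap (assumed to be at least 1), and the sample data are all assumptions for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 0: initialize centers by sampling k distinct data points.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 1: assign each observation to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[j].append(p)
        # Step 2: revise each center as the mean of its assigned observations
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        # Step 3: repeat until the centers stop moving.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = kmeans(points, k=2)
print(sorted(centers))
```

On this small data set the two groups are recovered regardless of which points are drawn as initial centers; on harder data the result can depend on the initialization.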

Page 16: K means and dbscan

GOOD INITIAL POINTS

Page 17: K means and dbscan

UNLUCKY

Page 18: K means and dbscan

K-MEANS IN PRACTICE

• How to choose initial centroids

• select randomly among the data points

• generate completely randomly

• How to choose k

• study the data

• run k-Means for different k (measure squared error for each k)

• Run k-means many times!

• Get many choices of initial points
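One way to combine these suggestions is to run k-means several times per value of k, each time from different randomly chosen data points, and keep the run with the lowest squared error. The sketch below does this with a compact k-means loop; the names `kmeans_once` and `best_of_restarts`, the number of restarts, and the sample data are assumptions for illustration.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from random initial data points; returns (error, centers)."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: sq_dist(p, centers[i]))].append(p)
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Squared error: each point's squared distance to its nearest center.
    error = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return error, centers


def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Run k-means n_restarts times and keep the run with the lowest error."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_restarts))
    return min(runs, key=lambda run: run[0])


# Hypothetical data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
errors = {k: best_of_restarts(points, k)[0] for k in range(1, 5)}
for k, err in errors.items():
    print(k, round(err, 3))
```

Reading the printed errors, the value of k after which the error stops dropping sharply is a reasonable candidate; for this data the large drop occurs between k = 1 and k = 2.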

Page 19: K means and dbscan

WHAT ABOUT THIS?

Page 20: K means and dbscan

QUESTIONS

• Euclidean distance results in spherical clusters

• What cluster shape does the Manhattan distance give?

• Think of other distance measures. What cluster shapes will those yield?

Page 21: K means and dbscan

DBSCAN

Page 22: K means and dbscan

DENSITY-BASED SPATIAL CLUSTERING OF APPLICATIONS WITH NOISE

• DBSCAN is a Density-Based Clustering algorithm

• In density based clustering we partition points into dense regions separated by not-so-dense regions.

• Important Questions:

• How do we measure density and what is a dense region?

• DBSCAN:

• Density at point p: number of points within a circle of radius Eps centered at p

• Dense Region: A circle of radius Eps that contains at least MinPts points
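These two definitions can be sketched directly: count the points within Eps of p, and treat p as the center of a dense region when that count reaches MinPts. The function names below are assumptions for illustration, and so is the convention that a point counts itself among its neighbors.

```python
import math


def neighbors_within(points, p, eps):
    """All points (including p itself) within distance eps of p."""
    return [q for q in points if math.dist(p, q) <= eps]


def core_points(points, eps, min_pts):
    """Points whose Eps-neighborhood contains at least min_pts points."""
    return [p for p in points if len(neighbors_within(points, p, eps)) >= min_pts]


# Hypothetical data: four corners of a unit square plus one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
dense = core_points(points, eps=1.0, min_pts=3)
print(dense)
```

Each corner of the square has itself and two adjacent corners within Eps = 1, so it qualifies; the isolated point has only itself and does not.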

Page 23: K means and dbscan

WHEN DBSCAN WORKS WELL

Page 25: K means and dbscan

DBSCAN

Page 26: K means and dbscan

DEFINITIONS

Page 27: K means and dbscan

REACHABILITY AND CONNECTIVITY

Page 28: K means and dbscan

DBSCAN ALGORITHM: EXAMPLE

Page 31: K means and dbscan

DETERMINING EPS & MINPTS

• Idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance

• Noise points have the kth nearest neighbor at farther distance

• So, plot sorted distance of every point to its kth nearest neighbor

• Find the distance d where there is a “knee” in the curve

• Eps = d, MinPts = k
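The sorted k-distance values described above can be computed with a brute-force sketch like the one below. The function name `k_distances` and the sample data are assumptions, and the point itself is excluded from its own neighbor list, which is one common convention rather than something the slides specify.

```python
import math


def k_distances(points, k):
    """Sorted distance from every point to its k-th nearest neighbor (excluding itself)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


# Hypothetical data: four corners of a unit square plus one isolated noise point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
kd = k_distances(points, k=2)
print(kd)
```

Plotting these values in order would show a flat run for the clustered points followed by a sharp jump for the noise point; the distance just before the jump is the candidate for Eps.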

Page 33: K means and dbscan

SENSITIVE TO PARAMETERS

Page 35: K means and dbscan

APPLICATIONS

Page 36: K means and dbscan

DISTANCE METRIC FOR DOCUMENTS

• Motivations

• Identical – easy

• Modified or related (Ex: DNA, Plagiarism, Authorship)

• Did Francis Bacon write Shakespeare’s plays?

Page 37: K means and dbscan

DOCUMENT RETRIEVAL

Page 38: K means and dbscan

CHALLENGES

• How do we measure similarity?

• How do we search over articles?

Page 39: K means and dbscan

DOCUMENT REPRESENTATION

• Word count document representation

• Bag of words model

• Ignore order of words

• Count # of instances of each word in vocabulary

Page 40: K means and dbscan

EXAMPLE

• Word: Sequence of alphanumeric characters. For example, the phrase “6.006 is fun” has 4 words.

• Word Frequencies: Word frequency D(w) of a given word w is the number of times it occurs in a document D.

• For example, the words and word frequencies for the above phrase are as below:

Word  | 6 | The | Is | 006 | Easy | Fun
Count | 1 | 0   | 1  | 1   | 0    | 1
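The word and word-frequency definitions can be sketched with the standard library. The function name `word_frequencies` is an assumption for this example, as is the choice to lowercase the text before counting.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count each maximal run of alphanumeric characters, ignoring case."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words)


freq = word_frequencies("6.006 is fun")
print(freq)
```

The period splits "6.006" into the two words "6" and "006", which is why the phrase has 4 words. A `Counter` returns 0 for words that do not occur, such as "the" or "easy".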

Page 41: K means and dbscan

METRIC

• d(x,x) = 0

• d(x,y) = d(y,x)

• d(x,y) + d(y,z) >= d(x,z)

Page 42: K means and dbscan

METRIC

• Inner product of the vectors D1 and D2 containing the word frequencies for all words in the 2 documents. Equivalently, this is the length of the projection of D1 onto D2 multiplied by the length of D2 (or vice versa). Mathematically this is expressed as:

$$D_1 \cdot D_2 = \sum_{w} D_1(w)\, D_2(w)$$

• Angle Metric: The angle between the vectors D1 and D2 gives an indication of overlap between the 2 documents. Mathematically this angle is expressed as:

$$\theta(D_1, D_2) = \arccos\left(\frac{D_1 \cdot D_2}{|D_1|\,|D_2|}\right)$$
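A minimal sketch of the inner product and the angle metric on word-frequency vectors follows. It is not the linked MIT script: the function names, the use of `Counter` objects as vectors, and the sample documents are assumptions, and both documents are assumed to be non-empty.

```python
import math
from collections import Counter


def inner_product(d1, d2):
    """Sum of D1(w) * D2(w) over all words; a Counter returns 0 for missing words."""
    return sum(count * d2[word] for word, count in d1.items())


def angle(d1, d2):
    """Angle in radians between two non-empty word-frequency vectors."""
    cos = inner_product(d1, d2) / math.sqrt(inner_product(d1, d1) * inner_product(d2, d2))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cos)))


# Hypothetical documents, already split into words.
d1 = Counter(["6", "006", "is", "fun"])
d2 = Counter(["6", "006", "is", "easy"])
d3 = Counter(["the", "easy"])
print(angle(d1, d1), angle(d1, d2), angle(d1, d3))
```

Identical documents give an angle of 0, documents with no words in common give π/2, and partially overlapping documents fall in between.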

Page 43: K means and dbscan

PYTHON EXAMPLE

• https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/docdist2.py

Page 54: K means and dbscan

REFERENCE

• http://www.cs.haifa.ac.il/~rita/uml_course/lectures/kmeans.pdf

• https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf

• http://www.it.uu.se/edu/course/homepage/infoutv/ht09/a2t.pdf

• http://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf

• Machine Learning Specialization by University of Washington in Coursera