Clustering and PCA Recitation Nupur Chatterji, Kenny Marino, Colin White

Recitation Clustering and PCA Colin White Nupur Chatterji, Kenny …ninamf/courses/401sp18/recitations... · 2018-04-19 · 3 of 42 Applications (Clustering comes up everywhere...

Jul 27, 2020


Clustering and PCA Recitation

Nupur Chatterji, Kenny Marino, Colin White


Clustering

Unsupervised Learning - unlabeled data

● Automatically organize data
● Understand structure in data
● Preprocessing for further analysis


Euclidean k-means Clustering

● Input: a set of n points x_1, x_2, ..., x_n in R^d, and an integer k
● Output: k “centers” c_1, c_2, ..., c_k

Objective: choose the centers to minimize the sum of squared Euclidean distances from each point x_i to its closest center.
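The k-means objective is usually made precise as the sum, over all points, of the squared Euclidean distance to the nearest center. The following is a minimal sketch of that cost in plain Python; the function name `kmeans_cost` and the toy data are illustrative assumptions, not part of the recitation.

```python
def kmeans_cost(points, centers):
    """Sum over all points of the squared Euclidean distance to the closest center."""
    total = 0.0
    for x in points:
        # Squared distance from x to each center; keep the smallest.
        total += min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
    return total


# Toy data (assumed for illustration): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
cost = kmeans_cost(points, centers)  # every point is at distance 1 from its center
```

Evaluating the cost takes O(nkd) time, since each of the n points is compared against each of the k centers in d dimensions.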


K-means complexity

● NP-hard to solve optimally, even when k=2 or when d=2
● k=1 is easy to solve (the optimal center is the mean of the points)
● d=1 is easy to solve (dynamic programming)
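To illustrate the d=1 case, here is one possible dynamic program, sketched under the standard assumption that in one dimension an optimal clustering splits the sorted points into k contiguous groups. The function name `kmeans_1d` and the O(n^2 k) formulation are choices made for this example; the recitation does not specify a particular algorithm.

```python
def kmeans_1d(values, k):
    """Optimal k-means cost for 1-D data (requires 1 <= k <= len(values))."""
    xs = sorted(values)
    n = len(xs)

    # Prefix sums of x and x^2 give the cost of any contiguous group in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def group_cost(i, j):
        # Cost of xs[i:j] when its center is the group mean:
        # sum(x^2) - (sum(x))^2 / size.
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    inf = float("inf")
    # dp[j][m] = best cost of clustering the first j sorted points into m groups.
    dp = [[inf] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            # The last group is xs[i:j]; the first i points use m - 1 groups.
            dp[j][m] = min(dp[i][m - 1] + group_cost(i, j) for i in range(m - 1, j))
    return dp[n][k]
```

With k=1 the single group is the whole data set and its center is the mean, which matches the "k=1 is easy" case.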


Lloyd’s initialization

Initialization is very important for Lloyd’s method

● Random initialization
● Farthest-first traversal: iteratively choose the point farthest from the current set of centers
● d^2-sampling (k-means++): iteratively choose a point v with probability proportional to d_min(v, C)^2, where d_min(v, C) is the distance from v to its nearest center in C, the list of current centers

Theorem: k-means++ always attains an O(log k) approximation to the optimal k-means solution in expectation.
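The two non-trivial initialization rules can be sketched as follows. This is a minimal illustration in plain Python, assuming the usual k-means++ convention that the first center is chosen uniformly at random and each later center is sampled with probability proportional to its squared distance to the nearest center chosen so far. The function names, the `first_index` parameter, and the `rng` parameter are hypothetical.

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def farthest_first(points, k, first_index=0):
    """Farthest-first traversal, starting from points[first_index] (an assumption)."""
    centers = [points[first_index]]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(x, centers[0]) for x in points]
    while len(centers) < k:
        i = max(range(len(points)), key=lambda t: d2[t])
        centers.append(points[i])
        d2 = [min(old, sq_dist(x, points[i])) for old, x in zip(d2, points)]
    return centers


def kmeanspp_init(points, k, rng=None):
    """d^2-sampling: pick each new center with probability proportional to d2."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]
    d2 = [sq_dist(x, centers[0]) for x in points]
    while len(centers) < k:
        if sum(d2) > 0:
            nxt = rng.choices(points, weights=d2, k=1)[0]
        else:
            # Every point already coincides with a center; fall back to uniform.
            nxt = rng.choice(points)
        centers.append(nxt)
        d2 = [min(old, sq_dist(x, nxt)) for old, x in zip(d2, points)]
    return centers
```

Both routines maintain the nearest-center distances incrementally, so each new center costs O(nd) and the whole initialization costs O(nkd).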


K-means runtime

● K-means++ initialization: O(nkd) time
● Lloyd’s method: O(nkd) time per round

Number of rounds for Lloyd’s method:

● Exponential in the worst case
● Small in practice
● Polynomial in expectation under smoothed analysis
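To make the per-round cost concrete, here is a minimal sketch of Lloyd’s method, assuming its standard formulation: alternate between assigning every point to its nearest center and moving every center to the mean of its assigned points, until nothing changes. The function name `lloyd`, the `max_rounds` cap, and the handling of empty clusters are assumptions made for this example.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def lloyd(points, centers, max_rounds=100):
    """Run Lloyd's method; returns (centers, labels, rounds). Assumes max_rounds >= 1."""
    centers = [tuple(c) for c in centers]
    d = len(points[0])
    for rounds in range(1, max_rounds + 1):
        # Assignment step: n * k distance computations, each costing O(d).
        labels = [
            min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
            for x in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(x[t] for x in members) / len(members) for t in range(d))
                new_centers.append(mean)
            else:
                new_centers.append(centers[j])  # assumption: keep an empty cluster's center
        if new_centers == centers:
            return centers, labels, rounds  # converged: centers did not move
        centers = new_centers
    return centers, labels, max_rounds


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
final_centers, labels, rounds = lloyd(points, [(0.0, 0.0), (10.0, 0.0)])
```

The assignment step dominates each round at O(nkd); the total running time is that cost multiplied by the number of rounds.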


Hierarchical Clustering

● What if we don’t know the right value of k? (often the case)

Often leads to natural solutions


Runtime:

● O(n^3) is easy
● Can achieve O(n^2 log n)
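One way to see where an O(n^3) bound can come from is a naive bottom-up (agglomerative) procedure. The sketch below assumes single linkage, where the distance between two clusters is the smallest distance between their points; the transcript does not say which linkage rule the recitation used, so treat the rule, the function name `single_linkage`, and the merge-list output format as assumptions.

```python
import math


def single_linkage(points):
    """Naive agglomerative clustering; returns merges as (id_a, id_b, distance)."""
    n = len(points)
    active = set(range(n))  # ids of current clusters; initial points are 0..n-1
    # dist[(a, b)] with a < b holds the linkage distance between active clusters.
    dist = {
        (i, j): math.dist(points[i], points[j])
        for i in range(n)
        for j in range(i + 1, n)
    }
    merges = []
    next_id = n
    while len(active) > 1:
        # O(n^2) scan for the closest pair; n - 1 merges give O(n^3) overall.
        (a, b), d_ab = min(dist.items(), key=lambda kv: kv[1])
        active -= {a, b}
        # Single linkage: distance to the merged cluster is the smaller of the two.
        new_dist = {}
        for c in active:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            new_dist[(c, next_id)] = min(d_ac, d_bc)  # c < next_id always holds
        # Drop every entry that mentions a or b, then add the new cluster's row.
        dist = {p: d for p, d in dist.items() if a not in p and b not in p}
        dist.update(new_dist)
        active.add(next_id)
        merges.append((a, b, d_ab))
        next_id += 1
    return merges
```

The returned merge list describes the whole hierarchy, so a clustering for any k can be read off by stopping after n - k merges. Faster variants typically avoid the repeated full scan, for example by keeping candidate pairs in a priority queue.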