12. Clustering
Foundations of Machine Learning
CentraleSupélec — Fall 2016
Benoît Playe, Chloé-Agathe Azencott
Centre for Computational Biology, Mines ParisTech
[email protected]
Learning objectives
● Explain what clustering algorithms can be used for.
● Explain and implement three different ways to evaluate clustering algorithms.
● Implement hierarchical clustering, discuss its various flavors.
● Implement k-means clustering, discuss its advantages and drawbacks.
Goals of clustering
Group objects that are similar into clusters: classes that are unknown beforehand.
E.g.
– group genes that are similarly affected by a disease
– group people who share the same interests (marketing purposes)
– group pixels in an image that belong to the same object (image segmentation).
Applications of clustering
● Understand general characteristics of the data
● Visualize the data
● Infer some properties of a data point based on how it relates to other data points
E.g.
– find subtypes of diseases
– visualize protein families
– find categories among images
– find patterns in financial transactions
– detect communities in social networks
Centroids and medoids
● Centroid: mean of the points in the cluster.
● Medoid: point in the cluster that is closest to the centroid.
Similarities
● Kernels define similarities.
For a given mapping φ : X → H from the space of objects X to some Hilbert space H, the kernel between two objects x and x' is the inner product of their images in the feature space:
k(x, x') = ⟨φ(x), φ(x')⟩
Distances
● Assess how close / far
– data points are from each other
– a data point is from a cluster
– two clusters are from each other
● Distance metric: a function d satisfying
– non-negativity: d(x, z) ≥ 0, with d(x, z) = 0 if and only if x = z
– symmetry: d(x, z) = d(z, x)
– triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)
Distance & similarities
● Transform distances into similarities? E.g. with a Gaussian kernel: s(x, z) = exp(−d(x, z)² / (2σ²)).
● Transform similarities into distances? E.g. from a kernel k: d(x, z) = √(k(x, x) − 2 k(x, z) + k(z, z)).
Generalization: this kernel-induced distance extends the Euclidean distance to feature space.
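As an illustration, the two transformations above can be sketched as follows (the function names and the Gaussian bandwidth sigma are my own choices, not from the slides):

```python
import numpy as np

def dist_to_sim(d, sigma=1.0):
    """Turn a distance into a similarity via a Gaussian kernel."""
    return np.exp(-d**2 / (2 * sigma**2))

def kernel_to_dist(k_xx, k_xz, k_zz):
    """Distance in feature space induced by a kernel k."""
    return np.sqrt(k_xx - 2 * k_xz + k_zz)

# With the linear kernel k(x, z) = <x, z>, this recovers the Euclidean distance.
x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
d = kernel_to_dist(x @ x, x @ z, z @ z)  # sqrt(1 - 0 + 1) = sqrt(2)
```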
Distances
● L-q norm (Minkowski distance):
d_q(x, z) = ( Σ_{j=1..p} |x_j − z_j|^q )^(1/q)
q = 2 gives the Euclidean distance, q = 1 the Manhattan distance.
● Pearson's correlation
Measure of the linear correlation between two variables:
ρ(x, z) = Σ_j (x_j − x̄)(z_j − z̄) / √( Σ_j (x_j − x̄)² · Σ_j (z_j − z̄)² )
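A minimal sketch of both measures (the helper names `minkowski` and `pearson` are mine):

```python
import numpy as np

def minkowski(x, z, q=2):
    """L-q (Minkowski) distance between two vectors."""
    return np.sum(np.abs(x - z) ** q) ** (1.0 / q)

def pearson(x, z):
    """Pearson correlation between two vectors."""
    xc, zc = x - x.mean(), z - z.mean()
    return (xc @ zc) / np.sqrt((xc @ xc) * (zc @ zc))

x = np.array([1.0, 2.0, 3.0])
z = np.array([2.0, 4.0, 6.0])  # z = 2x: perfectly linearly related to x
d2 = minkowski(x, z, q=2)      # Euclidean distance: sqrt(1 + 4 + 9)
rho = pearson(x, z)            # correlation: 1.0
```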
Pearson's correlation
What does it correspond to if the features are centered?
Normalized dot product = cosine of the angle between x and z.
● Max value of ρ: 1
There is a positive linear relationship between the features of x and the features of z.
● Min value of ρ: −1
There is a negative linear relationship between the features of x and the features of z.
Pearson vs Euclidean
● Pearson's coefficient
Profiles of similar shapes will be close to each other, even if they differ in magnitude.
● Euclidean distance
Magnitude is taken into account.
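A small numeric illustration of this difference: two profiles with the same shape but different magnitudes have Pearson correlation 1, yet a large Euclidean distance.

```python
import numpy as np

# Two expression profiles with the same shape but different magnitudes:
x = np.array([1.0, 3.0, 2.0, 5.0])
z = 10 * x  # same shape, 10x the magnitude

xc, zc = x - x.mean(), z - z.mean()
rho = (xc @ zc) / np.sqrt((xc @ xc) * (zc @ zc))  # Pearson: 1.0 (same shape)
euc = np.linalg.norm(x - z)                       # Euclidean: large (magnitudes differ)
```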
Evaluating clusters
● Clustering is unsupervised. There is no ground truth.
● How do we evaluate the quality of a clustering algorithm?
1) Based on the shape of the clusters:
Points within the same cluster should be nearby/similar, and points far from each other should belong to different clusters.
2) Based on the stability of the clusters:
We should get the same results if we remove some data points, add noise, etc.
3) Based on domain knowledge:
The clusters should “make sense”.
Cluster shape
● Cluster tightness (homogeneity): T_k = (1/|C_k|) Σ_{x ∈ C_k} d(x, μ_k), where μ_k is the centroid of cluster C_k.
● Cluster separation: S_kl = d(μ_k, μ_l).
● Davies-Bouldin index:
D = (1/K) Σ_{k=1..K} max_{l ≠ k} (T_k + T_l) / S_kl
The smaller D, the better the clustering.
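The index can be computed directly from these definitions; a minimal sketch, assuming Euclidean distances (the helper name `davies_bouldin` is mine):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: average over clusters of the worst
    (T_k + T_l) / S_kl ratio; lower is better."""
    ks = np.unique(labels)
    mus = np.array([X[labels == k].mean(axis=0) for k in ks])
    # Tightness T_k: mean distance of a cluster's points to its centroid.
    T = np.array([np.mean(np.linalg.norm(X[labels == k] - mus[i], axis=1))
                  for i, k in enumerate(ks)])
    db = 0.0
    for i in range(len(ks)):
        ratios = [(T[i] + T[j]) / np.linalg.norm(mus[i] - mus[j])
                  for j in range(len(ks)) if j != i]
        db += max(ratios)
    return db / len(ks)

# Two tight, well-separated clusters -> small index.
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]])
score = davies_bouldin(X, np.array([0, 0, 1, 1]))
```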
Cluster shape
● Silhouette coefficient
a(x) = average dissimilarity of x with the other points in the same cluster
b(x) = lowest average dissimilarity of x with the points of any other cluster
s(x) = (b(x) − a(x)) / max(a(x), b(x))
– If x is very close to other points in the same cluster, and very different from points in other clusters, s(x) ≈ +1.
– If x is assigned to the wrong cluster, s(x) ≈ −1.
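A minimal sketch of the coefficient for a single point, following the definitions above (the function name `silhouette` is mine):

```python
import numpy as np

def silhouette(X, labels, x_idx):
    """Silhouette coefficient s(x) = (b - a) / max(a, b) for one point."""
    x = X[x_idx]
    own = labels[x_idx]
    d = np.linalg.norm(X - x, axis=1)
    # a(x): average distance to the other points of x's own cluster.
    same = (labels == own)
    same[x_idx] = False
    a = d[same].mean()
    # b(x): smallest average distance to the points of any other cluster.
    b = min(d[labels == k].mean() for k in np.unique(labels) if k != own)
    return (b - a) / max(a, b)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
labels = np.array([0, 0, 1, 1])
s_good = silhouette(X, labels, 0)                 # well clustered: close to +1
s_bad = silhouette(X, np.array([1, 0, 0, 1]), 0)  # mis-assigned: negative
```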
Cluster stability
● Several algorithms, or several runs of the same algorithm, might give you different answers (e.g. partitions of the same data into k = 2, k = 4, or k = 5 clusters).
Cluster stability
● Measuring cluster stability:
– Generate perturbed versions of the original dataset (for example by sub-sampling or adding noise).
– Cluster each perturbed dataset with the desired algorithm into k clusters.
– Instability measure: average pairwise distance between the clusterings obtained on the perturbed datasets.
● One can choose the number of clusters which minimizes the instability measure.
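The slide does not spell out the distance between two clusterings; one common choice (an assumption on my part, not necessarily the one used in the course) is the fraction of point pairs on which the two partitions disagree:

```python
import numpy as np
from itertools import combinations

def clustering_distance(labels1, labels2):
    """Fraction of point pairs on which two clusterings disagree
    (co-clustered in one partition but not in the other)."""
    n = len(labels1)
    disagree = sum(
        (labels1[i] == labels1[j]) != (labels2[i] == labels2[j])
        for i, j in combinations(range(n), 2)
    )
    return disagree / (n * (n - 1) / 2)

a = np.array([0, 0, 1, 1])
b = np.array([1, 1, 0, 0])  # same partition, different label names
c = np.array([0, 1, 0, 1])  # a genuinely different partition
d_same = clustering_distance(a, b)  # 0.0: label names do not matter
d_diff = clustering_distance(a, c)  # 4 of the 6 pairs disagree
```

The instability at a given k is then the average of this distance over pairs of clusterings of the perturbed datasets (restricted to the points they share).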
Domain knowledge: enrichment analysis
● Example: ontology
Entities may be grouped, related within a hierarchy, and subdivided according to similarities and differences.
Built by human experts.
● E.g.: The Gene Ontology http://geneontology.org/
– Describes genes with a common vocabulary, organized in categories
E.g. cellular process > cell death > programmed cell death > apoptotic process > execution phase of apoptosis
Ontology enrichment analysis
● Enrichment analysis:
Are there more data points from ontology category G in cluster C than expected by chance?
● TANGO [Tanay et al., 2003]
– Assume data points are sampled from a hypergeometric distribution.
– The probability for the intersection of G and C to contain more than t points is:
P(|G ∩ C| > t) = Σ_{i=t+1..min(|G|,|C|)} (|G| choose i) (n−|G| choose |C|−i) / (n choose |C|)
Each term is the probability of getting i points from G when drawing |C| points from a total of n samples.
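This hypergeometric tail probability can be computed with the standard library alone; a sketch of the generic computation (not the full TANGO procedure), here for the "at least t" convention:

```python
from math import comb

def enrichment_pvalue(n, G, C, t):
    """P(|G ∩ C| >= t) under the hypergeometric null: a cluster of C points
    drawn at random from n samples, of which G belong to the category."""
    return sum(
        comb(G, i) * comb(n - G, C - i) for i in range(t, min(G, C) + 1)
    ) / comb(n, C)

# 100 samples, 10 in category G, a cluster of 10 points: observing 5 or
# more category members in the cluster is very unlikely by chance.
p = enrichment_pvalue(n=100, G=10, C=10, t=5)
```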
Construction
● Agglomerative approach (bottom-up)
– Start with each element in its own cluster.
– Iteratively join neighboring clusters.
● Divisive approach (top-down)
– Start with all elements in the same cluster.
– Iteratively separate into smaller clusters.
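The agglomerative approach can be run in a few lines with scipy (a sketch on toy data; the dataset is made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups in the plane.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 5.1], [5.1, 5.3]])

# Bottom-up construction with average linkage (UPGMA).
Z = linkage(X, method="average")

# Cut the tree to obtain 2 flat clusters (labels start at 1).
labels = fcluster(Z, t=2, criterion="maxclust")
```

The linkage matrix `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` to plot the tree.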
Dendrogram
● The results of a hierarchical clustering algorithm are presented in a dendrogram.
● Branch length = cluster distance.
● How many clusters do I have? Cut the dendrogram at a chosen height: each branch below the cut is one cluster.
Linkage
How do we decide how to connect/split two clusters?
● Average linkage, or UPGMA (Unweighted Pair Group Method with Arithmetic mean): distance between two clusters = average of all pairwise distances between their points.
● Centroid linkage, or UPGMC (Unweighted Pair Group Method using Centroids): distance between two clusters = distance between their centroids.
Example: gene expression clustering
Breast cancer survival signature [Bergamashi et al. 2011]
[Figure: heatmap of genes × patients, with two patient clusters and two gene clusters.]
Linkage
How do we decide how to connect/split two clusters?
● Ward's method
Join the two clusters whose merge minimizes the increase in within-cluster variance.
Hierarchical clustering
● Advantages
– No need to pre-define the number of clusters
– Interpretability
● Drawbacks
– Computational complexity: e.g. single/complete linkage (naive) takes at least O(pn²), the cost of computing all pairwise distances between n points in p dimensions.
– Must decide at which level of the hierarchy to split
– Lack of robustness (unstable)
K-means clustering
● Minimize the intra-cluster variance:
arg min over C_1, …, C_K of Σ_k Σ_{x ∈ C_k} ||x − μ_k||², where μ_k is the centroid of cluster C_k.
● What will this partition of the space look like? Each point falls in the cell of its nearest centroid: a Voronoi partition.
Lloyd's algorithm
● K-means cannot be easily optimized exactly.
● We adopt a greedy strategy:
– Partition the data into K clusters at random.
– Compute the centroid of each cluster.
– Assign each point to the cluster whose centroid it is closest to.
– Repeat until cluster membership converges.
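The steps above can be sketched in numpy as follows (a minimal implementation on made-up data; the re-seeding of empty clusters is a detail I added so the loop never breaks down):

```python
import numpy as np

def lloyd(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate centroid updates and re-assignments."""
    rng = np.random.default_rng(seed)
    # Random initial partition of the points into K clusters.
    labels = rng.integers(K, size=len(X))
    for _ in range(n_iter):
        # Centroid of each cluster (re-seed empty clusters at random).
        centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k)
            else X[rng.integers(len(X))]
            for k in range(K)
        ])
        # Assign each point to the cluster of its closest centroid.
        new = np.argmin(
            np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2),
            axis=1)
        if np.array_equal(new, labels):
            break  # cluster membership converged
        labels = new
    return labels, centroids

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels, centroids = lloyd(X, K=2)
```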
K-means
● Advantages
– Computational time: O(T·k·n·p): at each of the T iterations, compute k·n distances in p dimensions. T can be small if there is indeed a cluster structure in the data, so the running time is essentially linear in n.
– Easily implementable
● Drawbacks
– Need to set K ahead of time
– Sensitive to noise and outliers
– Stochastic (different solutions with each initialization)
– The clusters are forced to have spherical shapes
K-means variants
● K-means++
– Seeding algorithm to initialize clusters with centroids “spread out” throughout the data.
– Randomized, but reduces the dependence on initialization and comes with approximation guarantees.
● K-medoids
– Use medoids instead of centroids: more robust to outliers.
● Kernel k-means
– Find clusters in feature space (can separate clusters that are not linearly separable in input space).
[Figure: k-means vs kernel k-means on the same data.]
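The k-means++ seeding step can be sketched as follows (the function name `kmeans_pp_seeds` is mine): each new centroid is drawn with probability proportional to its squared distance to the closest centroid chosen so far.

```python
import numpy as np

def kmeans_pp_seeds(X, K, seed=0):
    """k-means++ seeding: sample each new centroid with probability
    proportional to its squared distance to the nearest existing centroid."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform at random
    for _ in range(K - 1):
        # Squared distance of every point to its closest chosen centroid.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.1, 0.0]])
seeds = kmeans_pp_seeds(X, K=2)
```

Points far from the already-chosen centroids are much more likely to be picked, which is what spreads the seeds out.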
More clustering approaches
● Soft clustering
Each point gets a probability of belonging to each cluster.
● Disjunctive clustering
Each point can belong to multiple clusters.
● Density-based clustering
Look for dense regions of the space. E.g. DBSCAN.
Summary
● Clustering: unsupervised approach to group similar data points together.
● Evaluate clustering algorithms based on
– the shape of the clusters
– the stability of the results
– the consistency with domain knowledge.
● Hierarchical clustering
– top-down / bottom-up
– various linkage functions.
● k-means clustering.
Exam: Fri, Dec 16, 8am‒11am
● No documents, no calculators, no computers.
● Theoretical, technical, and practical questions.
● Last year's exam is online.
● Short answers!
● How to study:
– Homework + last year's exam
– Labs
– Answer the questions on the slides.
● Formulas
– To know: Bayes' rule, how to compute derivatives.
– Everything else will be given. Interpretation is key.
Challenge project
● Detailed instructions on the course website: http://cazencott.info/dotclear/public/lectures/ma2823_2016/kaggle-project.pdf
● Deadline for submissions & for the report: Fri, Dec 16 at 23:59
● Report: practical instructions:
– PDF document
● No more than 2 pages
● Lastname1_Lastname2.pdf
– Starts with
● Full names
● Kaggle user names
● Kaggle team names.
By email to all three of us!