Machine Learning
Lecture 12
Unsupervised Learning
Dr. Patrick Chan, [email protected]
South China University of Technology, China
Dr. Patrick Chan @ SCUT
Agenda
Introduction
Non-Parametric Approach
Similarity Measure
Criterion Function
Clustering Algorithm
Parametric
Gaussian Mixture Model
Supervised VS Unsupervised
Supervised Learning
Label is given
Someone (a supervisor) provides the true answer
Unsupervised Learning
No Label is given
Much harder than supervised learning
Also named “Learning without a teacher”
You never know the true, correct answer
How to evaluate the result?
Unsupervised Learning
How to evaluate the result?
External: Expert comments
Expert may be wrong
Internal: Objective functions
E.g. Distance between samples and centers
Very intuitive
Unlike supervised learning, the evaluation method is subjective
Why Unsupervised Learning?
Label is expensive
Especially for huge dataset
E.g. Medical application
No idea on the number of classes
Data Mining
Gain some insight into the data structure before designing classifiers
E.g. Feature selection
Unsupervised Learning Type
Parametric Approach
Assume structure of distribution is known
Only need to estimate parameters of the distribution
E.g. Maximum-Likelihood Estimate
Non-Parametric Approach
No assumption on the distribution
Group data into clusters
Samples in the same group share something in
common
Non-parametric Clustering
What are the characteristics of clusters?
Internal (intra-cluster): Distance within a cluster should be small
External (inter-cluster): Distance between clusters should be large
Non-parametric Clustering
Three Important Factors
Similarity (Distance) Measure
How similar are two samples?
Criterion Function
What kind of clustering result is expected?
Clustering Algorithm
E.g. optimize the criterion function
Non-parametric Clustering
Similarity Measure
No best measure for all cases
Application dependent
Examples:
Face recognition: rotation invariance (rotated faces should be similar)
Character recognition: NO rotation invariance (rotated characters should be different)
Non-parametric Clustering
Similarity Measure
Scale of features may be different
Different ranges: weight 80 – 300, waist width 28 – 45
Different units: km VS mile, cm VS meter
Should features be normalized? Sometimes not!
If the spread is due to the presence of clusters, normalization reduces the cluster effect (right diagram)
Non-parametric Clustering
Similarity Measure
Euclidean Distance (L2)
Manhattan Distance (L1)
Cosine Similarity
d_{L2}(x, y) = \sqrt{\sum_{k=1}^{d} (x_k - y_k)^2}
d_{L1}(x, y) = \sum_{k=1}^{d} |x_k - y_k|
s(x, y) = \frac{x^\top y}{\|x\| \, \|y\|}
Non-parametric Clustering
Similarity Measure
Mahalanobis Distance
Chebyshev Distance
d_{Mah}(x, y) = \sqrt{(x - y)^\top \Sigma^{-1} (x - y)}
d_{Che}(x, y) = \max_{k} |x_k - y_k|
(In the illustrated example, these two distances are the same.)
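As an illustration, these similarity measures can be sketched in a few lines of Python (a minimal sketch using NumPy; the function names are mine, not from the lecture):

```python
import numpy as np

def euclidean(x, y):
    # L2 distance: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # L1 distance: sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def cosine_similarity(x, y):
    # Cosine of the angle between x and y (1 = same direction, 0 = orthogonal)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def chebyshev(x, y):
    # L-infinity distance: the largest single coordinate difference
    return np.max(np.abs(x - y))

def mahalanobis(x, y, cov):
    # Scales and decorrelates features via the covariance matrix
    d = x - y
    return np.sqrt(d @ np.linalg.inv(cov) @ d)
```

With the identity covariance, the Mahalanobis distance reduces to the Euclidean distance.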
Non-parametric Clustering
Naïve Clustering Algorithm
A naïve clustering algorithm can be developed only based on similarity measure between samples
Calculate similarity for each sample pair
Group the samples into the same cluster if the measure between them is less than a threshold
Advantage:
Easy to understand
Simple to implement
Disadvantage:
Only local information is considered
Highly dependent on the threshold
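A minimal sketch of this naïve algorithm (assuming Euclidean distance, and merging transitively: if a is close to b and b is close to c, all three end up in one cluster):

```python
import numpy as np

def naive_cluster(X, threshold):
    """Group samples whose pairwise distance is below the threshold;
    linked pairs are merged transitively (connected components)."""
    n = len(X)
    parent = list(range(n))            # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Calculate the similarity for each sample pair; link close pairs
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(X[i] - X[j]) < threshold:
                parent[find(i)] = find(j)   # merge the two groups

    roots = [find(i) for i in range(n)]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]
```

Note how the result is controlled entirely by the threshold, as the next slide illustrates.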
Non-parametric Clustering
Naïve Clustering Algorithm
Large threshold → one cluster
Small threshold → every sample is a cluster
Medium threshold → reasonable result
Non-parametric Clustering
Criterion Function
Commonly used criteria
Within-cluster scatter
Variance of each cluster
Between-cluster scatter
Distance between clusters
Combination
Within-cluster scatter: J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - m_i\|^2, where m_i = \frac{1}{n_i} \sum_{x \in D_i} x
Between-cluster scatter: J_b = \sum_{i=1}^{c} n_i \|m_i - m\|^2, where m = \frac{1}{n} \sum_{x \in D} x
c: the number of clusters
n: the number of samples
n_i: the number of samples in class i
D: the set of all samples
D_i: the set of samples in class i
* Similar concepts are used in LDA
Non-parametric Clustering
Criterion Function:
Example:
Within-cluster scatter
m_1 = \frac{1}{4}([5\ 4] + [5\ 5] + [6\ 4] + [6\ 5]) = [5.5\ 4.5]
m_2 = \frac{1}{5}([1\ 1] + [1\ 2] + [2\ 1] + [2\ 3] + [3\ 1]) = [1.8\ 1.6]
J_e = (5-5.5)^2 + (4-4.5)^2 + (5-5.5)^2 + (5-4.5)^2
    + (6-5.5)^2 + (4-4.5)^2 + (6-5.5)^2 + (5-4.5)^2
    + (1-1.8)^2 + (1-1.6)^2 + (1-1.8)^2 + (2-1.6)^2
    + (2-1.8)^2 + (1-1.6)^2 + (2-1.8)^2 + (3-1.6)^2
    + (3-1.8)^2 + (1-1.6)^2
    = 8
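The arithmetic above can be verified with a short script (cluster memberships and points taken from the example):

```python
import numpy as np

def within_cluster_scatter(clusters):
    """J_e: sum over clusters of squared distances to the cluster mean."""
    je = 0.0
    for pts in clusters:
        pts = np.asarray(pts, dtype=float)
        mean = pts.mean(axis=0)          # cluster mean m_i
        je += np.sum((pts - mean) ** 2)  # squared distances to m_i
    return je

clusters = [
    [[5, 4], [5, 5], [6, 4], [6, 5]],          # mean [5.5, 4.5]
    [[1, 1], [1, 2], [2, 1], [2, 3], [3, 1]],  # mean [1.8, 1.6]
]
print(within_cluster_scatter(clusters))  # 8.0
```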
Non-parametric Clustering
Criterion Function
Smaller J_e is preferred
More reasonable result!
Non-parametric Clustering
Criterion Function
Is J_e a good criterion for all situations?
No!
Appropriate when:
The clusters form compact groups
Equally sized clusters
Not Appropriate
When natural groupings have very different sizes
Non-parametric Clustering
Criterion Function
Result 1 vs. Result 2
Result 1 is more reasonable
However, it has a larger value of J_e due to the large cluster, while Result 2 has a smaller J_e
Non-parametric Clustering
Clustering Algorithm
Find the optimal clustering result
Exhaustive search is impossible
Approximately c^n / c! possible partitions
Methods:
Hierarchical Clustering
Bottom Up Approach
Top Down Approach
Iterative Optimization Algorithm
K-means
Non-parametric Clustering
Hierarchical Clustering
In some applications, clusters may have subclusters, and so on
i.e. hierarchical clusters
Taxonomy is an example
Non-parametric Clustering
Hierarchical Clustering
Two types:
Top Down Approach
Start with 1 cluster
One cluster contains all samples
Form hierarchy by splitting the most dissimilar clusters
Bottom Up Approach
Start with n clusters
Each cluster contains one sample
Form hierarchy by merging the most similar clusters
Not efficient when there are many samples but only a small number of clusters is needed
Three Important Factors: Algorithm: Hierarchical Clustering
Top Down Approach
Start from one cluster
For each cluster with more than one sample
Break down a cluster into two
Any Iterative Optimization Algorithm can be applied by setting c = 2
Three Important Factors: Algorithm: Hierarchical Clustering
Bottom Up Approach
Initially each sample forms a cluster
Do while more than one cluster remains
Calculate distance between two clusters for all cluster pairs
Merge the nearest two clusters
Common cluster distance measures
Maximum Distance
Minimum Distance
Average Distance
Mean Distance
d_{min}(C_i, C_j) = \min_{x \in C_i,\ x' \in C_j} \|x - x'\|
d_{max}(C_i, C_j) = \max_{x \in C_i,\ x' \in C_j} \|x - x'\|
d_{mean}(C_i, C_j) = \|m_i - m_j\|
d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{x \in C_i} \sum_{x' \in C_j} \|x - x'\|
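A sketch of the bottom-up procedure with the four cluster distances above (my own minimal implementation; it merges down to k clusters instead of recording the full hierarchy):

```python
import numpy as np

def linkage_distance(A, B, mode="min"):
    """Distance between clusters A and B (arrays of points)."""
    diffs = A[:, None, :] - B[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=2))   # all pairwise distances
    if mode == "min":                       # single linkage
        return d.min()
    if mode == "max":                       # complete linkage
        return d.max()
    if mode == "avg":                       # average over all pairs
        return d.mean()
    if mode == "mean":                      # distance between the two means
        return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    raise ValueError(mode)

def agglomerate(X, k, mode="min"):
    """Bottom-up: start with one cluster per sample, merge the nearest pair."""
    clusters = [np.array([x]) for x in X]
    while len(clusters) > k:
        best = None
        # Calculate the distance between every cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dij = linkage_distance(clusters[i], clusters[j], mode)
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        _, i, j = best                      # merge the nearest two clusters
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return clusters
```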
Three Important Factors: Algorithm: Hierarchical Clustering
Bottom Up Approach
Single Linkage (Nearest-Neighbor)
Minimum Distance is used
Encourages growth of elongated clusters
Disadvantage: sensitive to noise (noisy samples can chain clusters together)
Cluster similarity: minimum distance between points of the two clusters
Three Important Factors: Algorithm: Hierarchical Clustering
Bottom Up Approach
Complete Linkage (Farthest Neighbor)
Maximum Distance is used
Encourages compact clusters
Disadvantage: does not work well if elongated clusters are present
(In the illustrated example, C2 and C3 should ideally be merged; however, C1 and C2 are merged instead.)
Cluster similarity: maximum distance between points of the two clusters
Three Important Factors: Algorithm: Hierarchical Clustering
Bottom Up Approach
The minimum and maximum distances are noise sensitive (especially the minimum)
The average and mean distances give results more robust to outliers
The mean distance is less time-consuming to compute than the average distance
Non-parametric Clustering: Hierarchical Clustering
Venn Diagram
A Venn diagram can show a hierarchical clustering
However, no quantitative information is provided
Sample points Venn Diagram
Non-parametric Clustering: Hierarchical Clustering
Dendrogram
A dendrogram is another way to represent a hierarchical clustering
Binary tree
Indicates the similarity value at which clusters merge
Sample points → Dendrogram (e.g. the similarity at which x1 joins (x2, x3), or x6 joins x7; cutting at a similarity level yields one cluster, two clusters, etc.)
Non-parametric Clustering
Iterative Optimization Algorithm
Find a reasonable initial partition
Repeat until stable
Move sample(s) from one cluster to another such that the objective function is improved the most
Example: moving sample [2 3] from cluster x to cluster o reduces J_e from 10.29 to 8.11
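A simplified sketch of this idea (an assumption on my part: each sample is moved to the cluster with the nearest mean, which in practice drives J_e down; the exact criterion-based transfer test uses slightly corrected distances):

```python
import numpy as np

def iterative_improve(X, labels, k, max_passes=100):
    """Greedy improvement: move a sample to the cluster whose mean
    is nearest, repeating passes until no sample moves."""
    labels = labels.copy()
    for _ in range(max_passes):
        moved = False
        for i in range(len(X)):
            # Recompute the cluster means after every potential move
            means = np.array([X[labels == c].mean(axis=0)
                              if np.any(labels == c)
                              else np.full(X.shape[1], np.inf)
                              for c in range(k)])
            best = int(np.argmin(np.linalg.norm(X[i] - means, axis=1)))
            if best != labels[i]:
                labels[i] = best
                moved = True
        if not moved:          # stable: no sample changed cluster
            break
    return labels
```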
Non-parametric Clustering: Iterative Optimization Algorithm
K-means
A well-known technique: K-means
Criterion Function:
J_e = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - m_i\|^2
Assume there are k clusters
Non-parametric Clustering: Iterative Optimization Algorithm
K-means
We use k=3 in the following example
1. Initialization: randomly assign the center of each cluster
2. Assign Samples: assign each sample to the closest center
3. Re-calculate Means: compute the new means from the newly assigned samples
Repeat steps 2–3 until stable (no sample moves again)
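The three steps can be sketched as follows (a minimal K-means, assuming Euclidean distance and initial centers drawn from the samples):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k samples as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # 2. Assign each sample to its closest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # 3. Re-calculate each mean from its assigned samples
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # stable: no center moved
            break
        centers = new
    return centers, labels
```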
(Figure: K-means iterations. After the 1st and 2nd steps the centers change; after the 3rd step the centers are unchanged, so K-means stops.)
Non-parametric Clustering: Iterative Optimization Algorithm
K-means
Decreases the objective function efficiently
Algorithm converges
Drawback:
May be trapped at a local minimum instead of the global minimum (similar to gradient descent)
Parametric Clustering
Unlike non-parametric clustering, an assumption is made about the distribution
Estimate the parameters of the assumed distribution
Similar to Maximum likelihood method in Supervised Learning
Parametric Clustering
Maximum Likelihood
Assume samples in each class follow a normal distribution
Use the samples to estimate \mu_i and \Sigma_i
Separate into c sub-problems based on the class label
One Gaussian for each class (each sub-problem)
Find the best Gaussian function for the samples in each class
Parametric Clustering
What if no label information is provided?
Assume the samples in each cluster follow a Gaussian distribution
However, since no label information is given, all samples must be considered as a whole
Gaussian Mixture Model
How can we determine \mu and \Sigma?
Parametric Clustering
Gaussian Mixture Model
Multivariate Gaussian Distribution
c : a cluster
d : the number of features
p(x \mid c) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_c|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_c)^\top \Sigma_c^{-1} (x - \mu_c) \right)
Parametric Clustering
Gaussian Mixture Model
Maximize the likelihood l_i for each class i
Our aim is to maximize J over all classes
l_i = \prod_{x \in D_i} p(x \mid \theta_i) \quad \text{(by i.i.d.)}
\ln l_i = \sum_{x \in D_i} \ln p(x \mid \theta_i) \quad \text{(ln is monotonically increasing)}
For a Gaussian, \ln p(x \mid \theta_i) = -\frac{1}{2}(x - \mu_i)^\top \Sigma_i^{-1}(x - \mu_i) - \frac{1}{2}\ln|\Sigma_i| + \text{const}
J = \sum_{i=1}^{c} \sum_{x \in D_i} \ln p(x \mid \theta_i)
Parametric Clustering
Gaussian Mixture Model
How to maximize J?
If the labels are known (as in supervised learning), \mu_i and \Sigma_i can be calculated directly:
\hat{\mu}_i = \frac{1}{n_i} \sum_{x \in D_i} x, \qquad \hat{\Sigma}_i = \frac{1}{n_i} \sum_{x \in D_i} (x - \hat{\mu}_i)(x - \hat{\mu}_i)^\top
Parametric Clustering
Gaussian Mixture Model
For an unsupervised problem, the class (cluster) of a sample is unknown
\mu_i and \Sigma_i cannot be calculated directly
Gradient descent can be used
Similar to K-means
Parametric Clustering
Gaussian Mixture Model
Randomly initialize \mu_i and \Sigma_i for each cluster
Repeat until stable
For each x, calculate the likelihood p(x|c) for all c
x is assigned to the cluster c with the largest likelihood
There may be more than one possible result due to the random initialization
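A sketch of this procedure with diagonal covariances, as used in the later example slides (the `init_mu` argument is my addition so the initialization can be fixed for reproducibility; the slide's version initializes randomly):

```python
import numpy as np

def gmm_hard(X, k, init_mu=None, iters=50, seed=1):
    """Hard-assignment Gaussian clustering with diagonal covariances:
    assign each sample to its most likely Gaussian, then re-estimate
    each cluster's mean and per-feature variance until stable."""
    rng = np.random.default_rng(seed)
    if init_mu is None:
        # Random initialization: pick k samples as starting means
        mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        mu = np.array(init_mu, dtype=float)
    var = np.ones((k, X.shape[1]))
    labels = np.full(len(X), -1)
    for _ in range(iters):
        # log p(x|c) for a diagonal Gaussian (additive constants dropped)
        ll = np.stack([-0.5 * (((X - mu[c]) ** 2) / var[c]).sum(axis=1)
                       - 0.5 * np.log(var[c]).sum()
                       for c in range(k)], axis=1)
        new_labels = ll.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # no sample changed cluster: stable
        labels = new_labels
        for c in range(k):
            pts = X[labels == c]
            if len(pts):
                mu[c] = pts.mean(axis=0)
                var[c] = pts.var(axis=0) + 1e-6   # avoid zero variance
    return mu, var, labels
```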
Parametric Clustering
Gaussian Mixture Model
Example: \Sigma_i is diagonal
Features are statistically independent, each with its own variance:
\Sigma_i = \mathrm{diag}(\sigma_{i,1}^2, \sigma_{i,2}^2, \ldots, \sigma_{i,d}^2)
Parametric Clustering
Gaussian Mixture Model
[Figure: a worked numerical example. Starting from random \hat{\mu} and \hat{\sigma}^2, the log-likelihoods \ln p(x|c=1) and \ln p(x|c=2) are computed for every sample; each sample is assigned to the more likely cluster, the parameters are re-estimated, and the process repeats until the assignments no longer change.]
\ln p(x \mid c = i) = -\frac{1}{2}\sum_{k=1}^{d} \frac{(x_k - \hat{\mu}_{i,k})^2}{\hat{\sigma}_{i,k}^2} - \frac{1}{2}\sum_{k=1}^{d} \ln \hat{\sigma}_{i,k}^2
\hat{\mu}_{i,k} = \frac{1}{n_i}\sum_{x \in D_i} x_k, \qquad \hat{\sigma}_{i,k}^2 = \frac{1}{n_i}\sum_{x \in D_i} (x_k - \hat{\mu}_{i,k})^2
Parametric Clustering
Gaussian Mixture Model
When \Sigma_i = \sigma^2 I for all classes,
GMM gives the same result as using the within-cluster scatter
The objective function can be simplified:
J = \sum_{i=1}^{c} \sum_{x \in D_i} \left( -\frac{1}{2\sigma^2} \sum_{k=1}^{d} (x_k - \mu_{i,k})^2 \right) + \text{const}
Vector form:
J = -\frac{1}{2\sigma^2} \sum_{i=1}^{c} \sum_{x \in D_i} \|x - \mu_i\|^2 + \text{const}
so maximizing J is equivalent to minimizing the within-cluster scatter J_e
Parametric Clustering
Gaussian Mixture Model
The distribution is taken into account
Pro: generally more stable than k-means
Less sensitive to noisy samples
Con: misleading if the data do not follow a Gaussian distribution
(The four datasets shown share the same mean and variance.)
Clustering: Number of Clusters
How to decide the number of clusters?
Possible solution:
Try a range of c and see which one gives the lowest criterion value
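A sketch of this idea (it restates a minimal K-means so the block is self-contained). Note that J_e decreases monotonically as c grows, so one typically looks for the c where the decrease flattens (the "elbow") rather than the literal minimum:

```python
import numpy as np

def kmeans_je(X, k, iters=100, seed=0):
    """Run a minimal k-means and return the within-cluster scatter J_e
    of the final partition."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return sum(np.sum((X[labels == j] - centers[j]) ** 2) for j in range(k))

# Two tight, well-separated blobs: J_e should fall sharply once c
# reaches the true number of clusters (here 2), then flatten.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
for c in range(1, 5):
    print(c, kmeans_je(X, c))
```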
References
http://users.umiacs.umd.edu/~jbg/teaching/INST_414/