Machine Learning
Lecture 12
Unsupervised Learning
Dr. Patrick Chan, [email protected]
South China University of Technology, China
Dr. Patrick Chan @ SCUT
Agenda
Introduction
Non-Parametric Approach
Similarity Measure
Criterion Function
Clustering Algorithm
Parametric
Gaussian Mixture Model
Supervised VS Unsupervised
Supervised Learning
Label is given
Someone (a supervisor) provides the true answer
Unsupervised Learning
No Label is given
Much harder than supervised learning
Also named “Learning without a teacher”
You never know the true, correct answer
How to evaluate the result?
Unsupervised Learning
How to evaluate the result?
External: Expert comments
Expert may be wrong
Internal: Objective functions
E.g. Distance between samples and centers
Very intuitive
Unlike supervised learning, the evaluation method is subjective
Why Unsupervised Learning?
Label is expensive
Especially for huge dataset
E.g. Medical application
No idea on the number of classes
Data Mining
Gain some insight into the data structure before designing classifiers
E.g. Feature selection
Unsupervised Learning Type
Parametric Approach
Assume structure of distribution is known
Only need to estimate parameters of the distribution
E.g. Maximum-Likelihood Estimate
Non-Parametric Approach
No assumption on the distribution
Group data into clusters
Samples in the same group share something in
common
Non-parametric Clustering
What are the characteristics of clusters?
Internal (intra-cluster): Distance within a cluster should be small
External (inter-cluster): Distance between clusters should be large
Non-parametric Clustering
Three Important Factors
Similarity (Distance) Measure
How similar are two samples?
Criterion Function
What kind of clustering result is expected?
Clustering Algorithm
E.g. optimize the criterion function
Non-parametric Clustering
Similarity Measure
No best measure for all cases
Application dependent
Examples:
Face recognition: rotation invariance (rotated faces should be similar)
Character recognition: NO rotation invariance (rotated characters should be different)
Non-parametric Clustering
Similarity Measure
Scale of features may be different
Different ranges: weight 80 – 300, waist width 28 – 45
Different units: km VS mile, cm VS meter
Should features be normalized? Sometimes not!
If the spread is due to the presence of clusters, normalization reduces the cluster effect (right diagram)
Non-parametric Clustering
Similarity Measure
Euclidean Distance (L2)
Manhattan Distance (L1)
Cosine Similarity
d_{L2}(x, y) = \sqrt{\sum_{k=1}^{d} (x_k - y_k)^2}
d_{L1}(x, y) = \sum_{k=1}^{d} |x_k - y_k|
s(x, y) = \frac{x^\top y}{\|x\| \, \|y\|}
Non-parametric Clustering
Similarity Measure
Mahalanobis Distance
Chebyshev Distance
d_{Mah}(x, y) = \sqrt{(x - y)^\top \Sigma^{-1} (x - y)}
d_{Che}(x, y) = \max_{k} |x_k - y_k|
(In the illustrated example, these two distances are the same.)
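As an illustration, these similarity measures can be sketched in a few lines of Python (a minimal sketch using NumPy; the function names are mine, not from the lecture):

```python
import numpy as np

def euclidean(x, y):
    # L2 distance: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # L1 distance: sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def cosine_similarity(x, y):
    # Cosine of the angle between x and y (1 = same direction, 0 = orthogonal)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def chebyshev(x, y):
    # L-infinity distance: the largest single coordinate difference
    return np.max(np.abs(x - y))

def mahalanobis(x, y, cov):
    # Scales and decorrelates features via the covariance matrix
    d = x - y
    return np.sqrt(d @ np.linalg.inv(cov) @ d)
```

With the identity covariance, the Mahalanobis distance reduces to the Euclidean distance.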
Non-parametric Clustering
Naïve Clustering Algorithm
A naïve clustering algorithm can be developed only based on similarity measure between samples
Calculate similarity for each sample pair
Group the samples into the same cluster if the measure between them is less than a threshold
Advantage:
Easy to understand
Simple to implement
Disadvantage:
Only local information is considered
Highly dependent on the threshold
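A minimal sketch of this naïve algorithm (assuming Euclidean distance, and merging transitively: if a is close to b and b is close to c, all three end up in one cluster):

```python
import numpy as np

def naive_cluster(X, threshold):
    """Group samples whose pairwise distance is below the threshold;
    linked pairs are merged transitively (connected components)."""
    n = len(X)
    parent = list(range(n))            # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Calculate the similarity for each sample pair; link close pairs
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(X[i] - X[j]) < threshold:
                parent[find(i)] = find(j)   # merge the two groups

    roots = [find(i) for i in range(n)]
    relabel = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [relabel[r] for r in roots]
```

Note how the result is controlled entirely by the threshold, as the next slide illustrates.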
Non-parametric Clustering
Naïve Clustering Algorithm
Large threshold → one cluster
Small threshold → every sample is a cluster
Medium threshold → reasonable result
Non-parametric Clustering
Criterion Function
Commonly used criteria
Within-cluster scatter
Variance of each cluster
Between-cluster scatter
Distance between clusters
Combination
Within-cluster scatter: J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - m_i\|^2, where m_i = \frac{1}{n_i} \sum_{x \in D_i} x
Between-cluster scatter: J_b = \sum_{i=1}^{c} n_i \|m_i - m\|^2, where m = \frac{1}{n} \sum_{x \in D} x
c: the number of clusters
n: the number of samples
n_i: the number of samples in class i
D: the set of all samples
D_i: the set of samples in class i
* Similar concepts are used in LDA
Non-parametric Clustering
Criterion Function:
Example:
Within-cluster scatter
m_1 = \frac{1}{4}([5\ 4] + [5\ 5] + [6\ 4] + [6\ 5]) = [5.5\ 4.5]
m_2 = \frac{1}{5}([1\ 1] + [1\ 2] + [2\ 1] + [2\ 3] + [3\ 1]) = [1.8\ 1.6]
J_e = (5-5.5)^2 + (4-4.5)^2 + (5-5.5)^2 + (5-4.5)^2
    + (6-5.5)^2 + (4-4.5)^2 + (6-5.5)^2 + (5-4.5)^2
    + (1-1.8)^2 + (1-1.6)^2 + (1-1.8)^2 + (2-1.6)^2
    + (2-1.8)^2 + (1-1.6)^2 + (2-1.8)^2 + (3-1.6)^2
    + (3-1.8)^2 + (1-1.6)^2
    = 8
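The arithmetic above can be verified with a short script (cluster memberships and points taken from the example):

```python
import numpy as np

def within_cluster_scatter(clusters):
    """J_e: sum over clusters of squared distances to the cluster mean."""
    je = 0.0
    for pts in clusters:
        pts = np.asarray(pts, dtype=float)
        mean = pts.mean(axis=0)          # cluster mean m_i
        je += np.sum((pts - mean) ** 2)  # squared distances to m_i
    return je

clusters = [
    [[5, 4], [5, 5], [6, 4], [6, 5]],          # mean [5.5, 4.5]
    [[1, 1], [1, 2], [2, 1], [2, 3], [3, 1]],  # mean [1.8, 1.6]
]
print(within_cluster_scatter(clusters))  # 8.0
```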
Non-parametric Clustering
Criterion Function
Smaller J_e is preferred
More reasonable result!
Non-parametric Clustering
Criterion Function
Is J_e a good criterion for all situations?
No!
Appropriate when:
The clusters form compact groups
Equally sized clusters
Not Appropriate
When natural groupings have very different sizes
Non-parametric Clustering
Criterion Function
Result 1 vs. Result 2
Result 1 is more reasonable
However, it has a larger value of J_e due to the large cluster, while Result 2 has a smaller J_e
Non-parametric Clustering
Clustering Algorithm
Find the optimal clustering result
Exhaustive search is impossible
Approximately c^n / c! possible partitions
Methods:
Hierarchical Clustering
Bottom Up Approach
Top Down Approach
Iterative Optimization Algorithm
K-means
Non-parametric Clustering
Hierarchical Clustering
In some applications, clusters may have subclusters, and so on
i.e. hierarchical clusters
Taxonomy is an example
Non-parametric Clustering
Hierarchical Clustering
Two types:
Top Down Approach
Start with 1 cluster
One cluster contains all samples
Form hierarchy by splitting the most dissimilar clusters
Bottom Up Approach
Start with n clusters
Each cluster contains one sample
Form hierarchy by merging the most similar clusters
Not efficient when there are many samples but only a small number of clusters is needed
Three Important Factors: Algorithm: Hierarchical Clustering
Top Down Approach
Start from one cluster
For each cluster with more than one sample
Break down a cluster into two
Any Iterative Optimization Algorithm can be applied by setting c = 2
Three Important Factors: Algorithm: Hierarchical Clustering
Bottom Up Approach
Initially each sample forms a cluster
Do while more than one cluster remains
Calculate distance between two clusters for all cluster pairs
Merge the nearest two clusters
Common cluster distance measures
Maximum Distance
Minimum Distance
Average Distance
Mean Distance
d_{min}(C_i, C_j) = \min_{x \in C_i,\ x' \in C_j} \|x - x'\|
d_{max}(C_i, C_j) = \max_{x \in C_i,\ x' \in C_j} \|x - x'\|
d_{mean}(C_i, C_j) = \|m_i - m_j\|
d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{x \in C_i} \sum_{x' \in C_j} \|x - x'\|
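A sketch of the bottom-up procedure with the four cluster distances above (my own minimal implementation; it merges down to k clusters instead of recording the full hierarchy):

```python
import numpy as np

def linkage_distance(A, B, mode="min"):
    """Distance between clusters A and B (arrays of points)."""
    diffs = A[:, None, :] - B[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=2))   # all pairwise distances
    if mode == "min":                       # single linkage
        return d.min()
    if mode == "max":                       # complete linkage
        return d.max()
    if mode == "avg":                       # average over all pairs
        return d.mean()
    if mode == "mean":                      # distance between the two means
        return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    raise ValueError(mode)

def agglomerate(X, k, mode="min"):
    """Bottom-up: start with one cluster per sample, merge the nearest pair."""
    clusters = [np.array([x]) for x in X]
    while len(clusters) > k:
        best = None
        # Calculate the distance between every cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dij = linkage_distance(clusters[i], clusters[j], mode)
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        _, i, j = best                      # merge the nearest two clusters
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return clusters
```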
Three Important Factors: Algorithm: Hierarchical Clustering
Bottom Up Approach
Single Linkage (Nearest-Neighbor)
Minimum Distance is used
Encourages growth of elongated clusters
Disadvantage: sensitive to noise (noisy samples can chain clusters together)
Cluster similarity: minimum distance between points of the two clusters
Three Important Factors: Algorithm: Hierarchical Clustering
Bottom Up Approach
Complete Linkage (Farthest Neighbor)
Maximum Distance is used
Encourages compact clusters
Disadvantage: does not work well if elongated clusters are present
(In the illustrated example, C2 and C3 should ideally be merged; however, C1 and C2 are merged instead.)
Cluster similarity: maximum distance between points of the two clusters
Three Important Factors: Algorithm: Hierarchical Clustering
Bottom Up Approach
The minimum and maximum distances are noise sensitive (especially the minimum)
The average and mean distances give results more robust to outliers
The mean distance is less time-consuming to compute than the average distance
Non-parametric Clustering: Hierarchical Clustering
Venn Diagram
A Venn diagram can show a hierarchical clustering
However, no quantitative information is provided
Sample points Venn Diagram
Non-parametric Clustering: Hierarchical Clustering
Dendrogram
A dendrogram is another way to represent a hierarchical clustering
Binary tree
Indicates the similarity value at which clusters merge
Sample points → Dendrogram (e.g. the similarity at which x1 joins (x2, x3), or x6 joins x7; cutting at a similarity level yields one cluster, two clusters, etc.)
Non-parametric Clustering
Iterative Optimization Algorithm
Find a reasonable initial partition
Repeat until stable
Move sample(s) from one cluster to another such that the objective function is improved the most
Example: moving sample [2 3] from cluster x to cluster o reduces J_e from 10.29 to 8.11
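A simplified sketch of this idea (an assumption on my part: each sample is moved to the cluster with the nearest mean, which in practice drives J_e down; the exact criterion-based transfer test uses slightly corrected distances):

```python
import numpy as np

def iterative_improve(X, labels, k, max_passes=100):
    """Greedy improvement: move a sample to the cluster whose mean
    is nearest, repeating passes until no sample moves."""
    labels = labels.copy()
    for _ in range(max_passes):
        moved = False
        for i in range(len(X)):
            # Recompute the cluster means after every potential move
            means = np.array([X[labels == c].mean(axis=0)
                              if np.any(labels == c)
                              else np.full(X.shape[1], np.inf)
                              for c in range(k)])
            best = int(np.argmin(np.linalg.norm(X[i] - means, axis=1)))
            if best != labels[i]:
                labels[i] = best
                moved = True
        if not moved:          # stable: no sample changed cluster
            break
    return labels
```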
Non-parametric Clustering: Iterative Optimization Algorithm
K-means
A well-known technique: K-means
Criterion Function:
J_e = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - m_i\|^2
Assume there are k clusters
Non-parametric Clustering: Iterative Optimization Algorithm
K-means
We use k=3 in the following example
1. Initialization: randomly assign the center of each cluster
2. Assign Samples: assign each sample to the closest center
3. Re-calculate Means: compute the new means from the newly assigned samples
Repeat steps 2–3 until stable (no sample moves again)
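The three steps can be sketched as follows (a minimal K-means, assuming Euclidean distance and initial centers drawn from the samples):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k samples as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # 2. Assign each sample to its closest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # 3. Re-calculate each mean from its assigned samples
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # stable: no center moved
            break
        centers = new
    return centers, labels
```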
(Figure: K-means iterations. After the 1st and 2nd steps the centers change; after the 3rd step the centers are unchanged, so K-means stops.)
Non-parametric Clustering: Iterative Optimization Algorithm
K-means
Decreases the objective function efficiently
Algorithm converges
Drawback:
May be trapped at a local minimum instead of the global minimum (similar to gradient descent)
Parametric Clustering
Unlike non-parametric clustering, an assumption is made about the distribution
Estimate the parameters of the assumed distribution
Similar to Maximum likelihood method in Supervised Learning
Parametric Clustering
Maximum Likelihood
Assume samples in each class follow a normal distribution
Use the samples to estimate \mu_i and \Sigma_i
Separate into c sub-problems based on the class label
One Gaussian for each class (each sub-problem)
Find the best Gaussian function for the samples in each class
Parametric Clustering
What if no label information is provided?
Assume the samples in each cluster follow a Gaussian distribution
However, since no label information is given, all samples must be considered as a whole
Gaussian Mixture Model
How can we determine \mu and \Sigma?
Parametric Clustering
Gaussian Mixture Model
Multivariate Gaussian Distribution
c : a cluster
d : the number of features
p(x \mid c) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_c|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_c)^\top \Sigma_c^{-1} (x - \mu_c) \right)
Parametric Clustering
Gaussian Mixture Model
Maximize the likelihood l_i for each class i
Our aim is to maximize J over all classes
l_i = \prod_{x \in D_i} p(x \mid \theta_i) \quad \text{(by i.i.d.)}
\ln l_i = \sum_{x \in D_i} \ln p(x \mid \theta_i) \quad \text{(ln is monotonically increasing)}
For a Gaussian, \ln p(x \mid \theta_i) = -\frac{1}{2}(x - \mu_i)^\top \Sigma_i^{-1}(x - \mu_i) - \frac{1}{2}\ln|\Sigma_i| + \text{const}
J = \sum_{i=1}^{c} \sum_{x \in D_i} \ln p(x \mid \theta_i)
Parametric Clustering
Gaussian Mixture Model
How to maximize J?
If the labels are known (as in supervised learning), \mu_i and \Sigma_i can be calculated directly:
\hat{\mu}_i = \frac{1}{n_i} \sum_{x \in D_i} x, \qquad \hat{\Sigma}_i = \frac{1}{n_i} \sum_{x \in D_i} (x - \hat{\mu}_i)(x - \hat{\mu}_i)^\top
Parametric Clustering
Gaussian Mixture Model
For an unsupervised problem, the class (cluster) of a sample is unknown
\mu_i and \Sigma_i cannot be calculated directly
Gradient descent can be used
Similar to K-means
Parametric Clustering
Gaussian Mixture Model
Randomly initialize \mu_i and \Sigma_i for each cluster
Repeat until stable
For each x, calculate the likelihood p(x|c) for all c
x is assigned to the cluster c with the largest likelihood
There may be more than one possible result due to the random initialization
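A sketch of this procedure with diagonal covariances, as used in the later example slides (the `init_mu` argument is my addition so the initialization can be fixed for reproducibility; the slide's version initializes randomly):

```python
import numpy as np

def gmm_hard(X, k, init_mu=None, iters=50, seed=1):
    """Hard-assignment Gaussian clustering with diagonal covariances:
    assign each sample to its most likely Gaussian, then re-estimate
    each cluster's mean and per-feature variance until stable."""
    rng = np.random.default_rng(seed)
    if init_mu is None:
        # Random initialization: pick k samples as starting means
        mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        mu = np.array(init_mu, dtype=float)
    var = np.ones((k, X.shape[1]))
    labels = np.full(len(X), -1)
    for _ in range(iters):
        # log p(x|c) for a diagonal Gaussian (additive constants dropped)
        ll = np.stack([-0.5 * (((X - mu[c]) ** 2) / var[c]).sum(axis=1)
                       - 0.5 * np.log(var[c]).sum()
                       for c in range(k)], axis=1)
        new_labels = ll.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # no sample changed cluster: stable
        labels = new_labels
        for c in range(k):
            pts = X[labels == c]
            if len(pts):
                mu[c] = pts.mean(axis=0)
                var[c] = pts.var(axis=0) + 1e-6   # avoid zero variance
    return mu, var, labels
```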
Parametric Clustering
Gaussian Mixture Model
Example: \Sigma_i is diagonal
Features are statistically independent, each with its own variance:
\Sigma_i = \mathrm{diag}(\sigma_{i,1}^2, \sigma_{i,2}^2, \ldots, \sigma_{i,d}^2)
Parametric Clustering
Gaussian Mixture Model
[Figure: a worked numerical example. Starting from random \hat{\mu} and \hat{\sigma}^2, the log-likelihoods \ln p(x|c=1) and \ln p(x|c=2) are computed for every sample; each sample is assigned to the more likely cluster, the parameters are re-estimated, and the process repeats until the assignments no longer change.]
\ln p(x \mid c = i) = -\frac{1}{2}\sum_{k=1}^{d} \frac{(x_k - \hat{\mu}_{i,k})^2}{\hat{\sigma}_{i,k}^2} - \frac{1}{2}\sum_{k=1}^{d} \ln \hat{\sigma}_{i,k}^2
\hat{\mu}_{i,k} = \frac{1}{n_i}\sum_{x \in D_i} x_k, \qquad \hat{\sigma}_{i,k}^2 = \frac{1}{n_i}\sum_{x \in D_i} (x_k - \hat{\mu}_{i,k})^2
Parametric Clustering
Gaussian Mixture Model
When \Sigma_i = \sigma^2 I for all classes,
GMM gives the same result as using the within-cluster scatter
The objective function can be simplified:
J = \sum_{i=1}^{c} \sum_{x \in D_i} \left( -\frac{1}{2\sigma^2} \sum_{k=1}^{d} (x_k - \mu_{i,k})^2 \right) + \text{const}
Vector form:
J = -\frac{1}{2\sigma^2} \sum_{i=1}^{c} \sum_{x \in D_i} \|x - \mu_i\|^2 + \text{const}
so maximizing J is equivalent to minimizing the within-cluster scatter J_e
Parametric Clustering
Gaussian Mixture Model
The distribution is taken into account
Pro: generally more stable than k-means
Less sensitive to noisy samples
Con: misleading if the data do not follow a Gaussian distribution
(The four datasets shown share the same mean and variance.)
Clustering: Number of Clusters
How to decide the number of clusters?
Possible solution:
Try a range of c and see which one gives the lowest criterion value
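A sketch of this idea (it restates a minimal K-means so the block is self-contained). Note that J_e decreases monotonically as c grows, so one typically looks for the c where the decrease flattens (the "elbow") rather than the literal minimum:

```python
import numpy as np

def kmeans_je(X, k, iters=100, seed=0):
    """Run a minimal k-means and return the within-cluster scatter J_e
    of the final partition."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return sum(np.sum((X[labels == j] - centers[j]) ** 2) for j in range(k))

# Two tight, well-separated blobs: J_e should fall sharply once c
# reaches the true number of clusters (here 2), then flatten.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
for c in range(1, 5):
    print(c, kmeans_je(X, c))
```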
References
http://users.umiacs.umd.edu/~jbg/teaching/INST_414/