
Machine Learning

Lecture 12

Unsupervised Learning

Dr. Patrick [email protected]

South China University of Technology, China


Agenda

Introduction

Non-Parametric Approach

Similarity Measure

Criterion Function

Clustering Algorithm

Parametric

Gaussian Mixture Model


Supervised VS Unsupervised

Supervised Learning

Label is given

Someone (a supervisor) provides the true answer

Unsupervised Learning

No Label is given

Much harder than supervised learning

Also named “Learning without a teacher”

You never know the true, correct answer

How to evaluate the result?


Unsupervised Learning

How to evaluate the result?

External: Expert comments

Expert may be wrong

Internal: Objective functions

E.g. Distance between samples and centers

Very intuitive

Unlike in supervised learning, the evaluation is subjective


Why Unsupervised Learning?

Label is expensive

Especially for huge dataset

E.g. Medical application

No idea on the number of classes

Data Mining

Gain some insight into the data structure before designing classifiers

E.g. Feature selection


Unsupervised Learning Type

Parametric Approach

Assume structure of distribution is known

Only need to estimate parameters of the distribution

E.g. Maximum-Likelihood Estimate

Non-Parametric Approach

No assumption on the distribution

Group data into clusters

Samples in the same group share something in common


Non-parametric Clustering

What are the characteristics of clusters?

Internal (intra-cluster): Distance within a cluster should be small

External (inter-cluster): Distance between clusters should be large

[Figure: two clusters, with the internal distances marked within each cluster and the external distance marked between them]


Non-parametric Clustering

Three Important Factors

Similarity (Distance) Measure

How similar are two samples?

Criterion Function

What kind of clustering result is expected?

Clustering Algorithm

E.g. optimize the criterion function


Non-parametric Clustering

Similarity Measure

No best measure for all cases

Application dependent

Examples:

Face recognition: rotation invariance (a rotated face should still be similar)

Character recognition: NO rotation invariance (a rotated character should be different)


Non-parametric Clustering

Similarity Measure

Scale of features may be different

Different ranges: weight 80 – 300, waist width 28 – 45

Different units: km VS mile, cm VS meter

Should features be normalized? Sometimes not!

If the spread is due to the presence of clusters, normalization reduces the cluster effect (right diagram)


Non-parametric Clustering

Similarity Measure

Euclidean Distance (L2): $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{d} (x_k - y_k)^2}$

Manhattan Distance (L1): $d(\mathbf{x}, \mathbf{y}) = \sum_{k=1}^{d} |x_k - y_k|$

Cosine Similarity: $s(\mathbf{x}, \mathbf{y}) = \dfrac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}$

[Figures: 2D examples of each measure on the same axes]
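The following is a minimal NumPy sketch of these three measures (function names are illustrative, not from the lecture):

```python
import numpy as np

def euclidean(x, y):
    # L2 distance: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # L1 distance: sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def cosine_similarity(x, y):
    # Cosine of the angle between x and y (1.0 = same direction)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x, y = np.array([2.0, 4.0]), np.array([4.0, 2.0])
print(euclidean(x, y), manhattan(x, y), cosine_similarity(x, y))
```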


Non-parametric Clustering

Similarity Measure

Mahalanobis Distance: $d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \mathbf{y})}$

Chebyshev Distance: $d(\mathbf{x}, \mathbf{y}) = \max_{k} |x_k - y_k|$

[Figures: 2D examples; the two marked distances in the diagram are the same]
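A matching NumPy sketch for these two measures (the example covariance matrix is assumed purely for illustration):

```python
import numpy as np

def mahalanobis(x, y, cov):
    # Distance weighted by the inverse covariance of the data
    diff = x - y
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)

def chebyshev(x, y):
    # Largest coordinate-wise difference
    return np.max(np.abs(x - y))

x, y = np.array([2.0, 4.0]), np.array([4.0, 2.0])
cov = np.array([[2.0, 0.3], [0.3, 1.0]])  # assumed example covariance
print(mahalanobis(x, y, cov), chebyshev(x, y))
```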


Non-parametric Clustering

Naïve Clustering Algorithm

A naïve clustering algorithm can be developed based only on the similarity measure between samples

Calculate the similarity for each sample pair

Group samples into the same cluster if the measure between them is less than a threshold

Advantage:

Easy to understand

Simple to implement

Disadvantage:

Only local information is considered

Highly dependent on the threshold
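A minimal sketch of this naïve algorithm, assuming Euclidean distance as the measure (grouping is transitive here: any chain of pairs below the threshold ends up in one cluster):

```python
import numpy as np

def naive_clustering(X, threshold):
    # Union-find over samples: merge any pair closer than the threshold
    n = len(X)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(X[i] - X[j]) < threshold:
                parent[find(i)] = find(j)

    labels = {}  # relabel the roots as 0, 1, 2, ...
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]

X = np.array([[5, 4], [5, 5], [6, 4], [1, 1], [2, 1]], dtype=float)
print(naive_clustering(X, threshold=1.5))  # -> [0, 0, 0, 1, 1]
```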


Non-parametric Clustering

Naïve Clustering Algorithm

Large threshold: one cluster

Small threshold: every sample is its own cluster

Medium threshold: reasonable result


Non-parametric Clustering

Criterion Function

Commonly used criteria

Within-cluster scatter

Variance of each cluster

Between-cluster scatter

Distance between clusters

Combination

Within-cluster scatter: $J_e = \sum_{i=1}^{c} \sum_{\mathbf{x} \in \mathcal{D}_i} \|\mathbf{x} - \mathbf{m}_i\|^2$

Between-cluster scatter: $J_B = \sum_{i=1}^{c} n_i \, \|\mathbf{m}_i - \mathbf{m}\|^2$

c: the number of clusters

n: the number of samples

$n_i$: the number of samples in cluster i

$\mathcal{D}$: the set of all samples

$\mathcal{D}_i$: the set of samples in cluster i

$\mathbf{m}_i$: the mean of cluster i; $\mathbf{m}$: the mean of all samples

* Similar concepts are used in LDA


Non-parametric Clustering

Criterion Function: Example

Within-cluster scatter: $J_e = \sum_{i=1}^{c} \sum_{\mathbf{x} \in \mathcal{D}_i} \|\mathbf{x} - \mathbf{m}_i\|^2$

Cluster means:

$\mathbf{m}_1 = \frac{1}{4}\big([5\ 4] + [5\ 5] + [6\ 4] + [6\ 5]\big) = [5.5\ 4.5]$

$\mathbf{m}_2 = \frac{1}{5}\big([1\ 1] + [1\ 2] + [2\ 1] + [2\ 3] + [3\ 1]\big) = [1.8\ 1.6]$

$J_e = (5-5.5)^2 + (4-4.5)^2 + (5-5.5)^2 + (5-4.5)^2 + (6-5.5)^2 + (4-4.5)^2 + (6-5.5)^2 + (5-4.5)^2$
$\quad + (1-1.8)^2 + (1-1.6)^2 + (1-1.8)^2 + (2-1.6)^2 + (2-1.8)^2 + (1-1.6)^2 + (2-1.8)^2 + (3-1.6)^2 + (3-1.8)^2 + (1-1.6)^2$
$\quad = 2 + 6 = 8$
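As a sanity check, this small NumPy snippet reproduces the computation above:

```python
import numpy as np

clusters = [
    np.array([[5, 4], [5, 5], [6, 4], [6, 5]], dtype=float),
    np.array([[1, 1], [1, 2], [2, 1], [2, 3], [3, 1]], dtype=float),
]

J_e = 0.0
for D_i in clusters:
    m_i = D_i.mean(axis=0)           # cluster mean
    J_e += np.sum((D_i - m_i) ** 2)  # sum of squared deviations
print(J_e)  # -> 8.0
```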


Non-parametric Clustering

Criterion Function

Smaller $J_e$ is preferred

[Figures: two candidate partitions of the same samples; the one with the smaller $J_e$ is the more reasonable result]


Non-parametric Clustering

Criterion Function

Is $J_e$ a good criterion for all situations?

No!

Appropriate when:

The clusters form compact groups

Equally sized clusters

Not Appropriate

When natural groupings have very different sizes


Non-parametric Clustering

Criterion Function

Result 1 is more reasonable

However, it has a larger value of $J_e$ due to the large cluster

[Figures: Result 1 (large $J_e$) vs. Result 2 (small $J_e$)]


Non-parametric Clustering

Clustering Algorithm

Find the optimal clustering result

Exhaustive search is impossible

Approximately $c^n / c!$ possible partitions

Methods:

Hierarchical Clustering

Bottom Up Approach

Top Down Approach

Iterative Optimization Algorithm

K-means


Non-parametric Clustering

Hierarchical Clustering

In some applications, clusters may have subclusters, and so on

i.e., a hierarchy of clusters

Taxonomy is an example


Non-parametric Clustering

Hierarchical Clustering

Two types:

Top Down Approach

Start with 1 cluster

One cluster contains all samples

Form hierarchy by splitting the most dissimilar clusters

Bottom Up Approach

Start with n clusters

Each cluster contains one sample

Form hierarchy by merging the most similar clusters

Not efficient when there are many samples but only a few clusters are needed


Three Important Factors: Algorithm: Hierarchical Clustering

Top Down Approach

Start from one cluster

For each cluster with more than one sample

Break down a cluster into two

Any Iterative Optimization Algorithm can be applied by setting c = 2


Three Important Factors: Algorithm: Hierarchical Clustering

Bottom Up Approach

Initially each sample forms a cluster

Repeat until only one cluster remains

Calculate distance between two clusters for all cluster pairs

Merge the nearest two clusters

Common cluster distance measures

Maximum Distance

Minimum Distance

Average Distance

Mean Distance

$d_{\min}(\mathcal{D}_i, \mathcal{D}_j) = \min_{\mathbf{x} \in \mathcal{D}_i,\, \mathbf{y} \in \mathcal{D}_j} \|\mathbf{x} - \mathbf{y}\|$

$d_{\max}(\mathcal{D}_i, \mathcal{D}_j) = \max_{\mathbf{x} \in \mathcal{D}_i,\, \mathbf{y} \in \mathcal{D}_j} \|\mathbf{x} - \mathbf{y}\|$

$d_{\text{mean}}(\mathcal{D}_i, \mathcal{D}_j) = \|\mathbf{m}_i - \mathbf{m}_j\|$

$d_{\text{avg}}(\mathcal{D}_i, \mathcal{D}_j) = \dfrac{1}{n_i n_j} \sum_{\mathbf{x} \in \mathcal{D}_i} \sum_{\mathbf{y} \in \mathcal{D}_j} \|\mathbf{x} - \mathbf{y}\|$
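A minimal bottom-up sketch of this procedure, with a selectable min/max linkage and stopping at a target number of clusters rather than building the full hierarchy (names are illustrative):

```python
import numpy as np

def agglomerate(X, target_c, linkage="min"):
    # Start with one cluster per sample; repeatedly merge the closest pair
    clusters = [[i] for i in range(len(X))]

    def cluster_dist(a, b):
        d = [np.linalg.norm(X[i] - X[j]) for i in a for j in b]
        return min(d) if linkage == "min" else max(d)

    while len(clusters) > target_c:
        # Find the nearest pair of clusters under the chosen linkage
        pairs = [(cluster_dist(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        merged = clusters[a] + clusters.pop(b)  # merge pair (a < b)
        clusters[a] = merged
    return clusters

X = np.array([[5, 4], [5, 5], [6, 4], [1, 1], [2, 1]], dtype=float)
print(agglomerate(X, target_c=2))  # -> [[0, 1, 2], [3, 4]]
```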


Three Important Factors: Algorithm: Hierarchical Clustering

Bottom Up Approach

Single Linkage (Nearest-Neighbor)

Minimum Distance is used

Encourages growth of elongated clusters

Cluster similarity: minimum distance between points of each cluster

Disadvantage: sensitive to noise

[Figures: merge result in the ideal case vs. with noisy data]


Three Important Factors: Algorithm: Hierarchical Clustering

Bottom Up Approach

Complete Linkage (Farthest Neighbor)

Maximum Distance is used

Encourages compact clusters

Cluster similarity: maximum distance between points of each cluster

Disadvantage: does not work well if elongated clusters are present

[Figure: elongated clusters C1, C2, C3 — ideally C2 and C3 should be merged, but C1 and C2 are merged because their maximum distance is smaller]


Three Important Factors: Algorithm: Hierarchical Clustering

Bottom Up Approach

Minimum and maximum distances are noise-sensitive (especially the minimum)

Results are more robust to outliers when the average or the mean distance is used

Mean distance is less time-consuming to compute than average distance


Non-parametric Clustering: Hierarchical Clustering

Venn Diagram

A Venn diagram can show a hierarchical clustering

However, no quantitative information is provided

[Figures: sample points and the corresponding Venn diagram]


Non-parametric Clustering: Hierarchical Clustering

Dendrogram

A dendrogram is another way to represent a hierarchical clustering

Binary tree

Indicates the similarity value at which clusters merge

[Figure: sample points and the corresponding dendrogram — cutting at the similarity where x1 joins (x2, x3) gives one cluster; cutting at the similarity of x6 and x7 gives two clusters]
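For real data, SciPy's hierarchical-clustering utilities can produce such a dendrogram directly; a small sketch (assuming SciPy and Matplotlib are available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[5, 4], [5, 5], [6, 4], [1, 1], [2, 1], [2, 3]], dtype=float)

# 'single' = minimum distance; try 'complete' or 'average' as well
Z = linkage(X, method="single")
dendrogram(Z)
plt.ylabel("merge distance")
plt.show()
```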


Non-parametric Clustering

Iterative Optimization Algorithm

Find a reasonable initial partition

Repeat until stable

Move sample(s) from one cluster to another such that the objective function is improved the most

[Figures: partitions before and after moving sample [2 3] from cluster × to cluster ○; $J_e$ drops from 10.29 to 8.11]


Non-parametric Clustering: Iterative Optimization Algorithm

K-means

A well-known technique: K-means

Criterion Function: $J_e = \sum_{i=1}^{k} \sum_{\mathbf{x} \in \mathcal{D}_i} \|\mathbf{x} - \mathbf{m}_i\|^2$

Assume there are k clusters


Non-parametric Clustering: Iterative Optimization Algorithm

K-means

We use k=3 in the following example

1. Initialization: randomly assign the center of each cluster

2. Assign samples: assign each sample to the closest center

3. Re-calculate means: compute the new means from the newly assigned samples

Repeat steps 2 and 3 until stable (no sample moves again)
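A compact NumPy sketch of these three steps (empty clusters are not handled here; this is an illustration, not the lecture's exact code):

```python
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct samples as initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # 2. Assign each sample to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Re-calculate each center as the mean of its assigned samples
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # stable: no center moved
            return labels, centers
        centers = new_centers

X = np.array([[5, 4], [5, 5], [6, 4], [1, 1], [2, 1], [2, 3]], dtype=float)
labels, centers = kmeans(X, k=2)
print(labels, centers)
```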


Non-parametric Clustering: Iterative Optimization Algorithm

K-means

[Figures: 1st step — centers are changed; 2nd step — centers are changed; 3rd step — centers are not changed, so K-means stops]


Non-parametric Clustering: Iterative Optimization Algorithm

K-means

Decreases the objective function efficiently

Algorithm converges

Drawback:

May be trapped at a local minimum rather than the global minimum (similar to gradient descent)


Parametric Clustering

Unlike non-parametric clustering, an assumption is made about the form of the distribution

Estimate the parameters of the assumed distribution

Similar to the maximum-likelihood method in supervised learning


Parametric Clustering

Maximum Likelihood

Assume samples in each class follow a normal distribution

Use the samples to estimate $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$

Separate into c sub-problems based on the class label

One Gaussian for each class (each sub-problem)

Find the best Gaussian function for the samples in each class


Parametric Clustering

What if no label information is provided?

Assume samples in a cluster follow a Gaussian distribution

However, since there is no label information, all samples must be considered as a whole

Gaussian Mixture Model

How can we determine $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$?


Parametric Clustering

Gaussian Mixture Model

Multivariate Gaussian Distribution:

$p(\mathbf{x} \mid c) = \dfrac{1}{(2\pi)^{d/2} \, |\boldsymbol{\Sigma}_c|^{1/2}} \exp\!\left( -\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_c)^\top \boldsymbol{\Sigma}_c^{-1} (\mathbf{x} - \boldsymbol{\mu}_c) \right)$

c: a cluster

d: the number of features


Parametric Clustering

Gaussian Mixture Model

Maximize the likelihood for class i:

$L_i = \prod_{\mathbf{x} \in \mathcal{D}_i} p(\mathbf{x} \mid \theta_i)$  (by i.i.d.)

$\ln L_i = \sum_{\mathbf{x} \in \mathcal{D}_i} \ln p(\mathbf{x} \mid \theta_i)$  (ln is monotonically increasing, so maximizing $\ln L_i$ also maximizes $L_i$)

Our aim is to maximize over all classes:

$J = \sum_{i=1}^{c} \sum_{\mathbf{x} \in \mathcal{D}_i} \left[ -\dfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^\top \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \dfrac{1}{2} \ln |\boldsymbol{\Sigma}_i| \right] + \text{const}$


Parametric Clustering

Gaussian Mixture Model

How to maximize J?

If the labels are known (as in supervised learning), $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ can be calculated directly:

$\hat{\boldsymbol{\mu}}_i = \dfrac{1}{n_i} \sum_{\mathbf{x} \in \mathcal{D}_i} \mathbf{x} \qquad \hat{\boldsymbol{\Sigma}}_i = \dfrac{1}{n_i} \sum_{\mathbf{x} \in \mathcal{D}_i} (\mathbf{x} - \hat{\boldsymbol{\mu}}_i)(\mathbf{x} - \hat{\boldsymbol{\mu}}_i)^\top$
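With labels available, these estimates are one line each in NumPy (a sketch with made-up labeled data):

```python
import numpy as np

# Labeled samples: estimate one Gaussian per class directly
X = np.array([[5, 4], [5, 5], [6, 4], [1, 1], [2, 1]], dtype=float)
y = np.array([0, 0, 0, 1, 1])

for i in np.unique(y):
    X_i = X[y == i]
    mu_i = X_i.mean(axis=0)                             # sample mean
    sigma_i = (X_i - mu_i).T @ (X_i - mu_i) / len(X_i)  # ML covariance (1/n_i)
    print(i, mu_i, sigma_i)
```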

Parametric Clustering

Gaussian Mixture Model

For unsupervised problem, the class(cluster) of a sample is unknown

$\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ cannot be calculated directly

Gradient descent can be used

Similar to K-means


Parametric Clustering

Gaussian Mixture Model

Randomly initialize $\boldsymbol{\mu}_c$ and $\boldsymbol{\Sigma}_c$ for each cluster

Repeat until stable

For each x, calculate the likelihood $p(\mathbf{x} \mid c)$ for all c

Assign x to the cluster c with the largest likelihood

Re-estimate $\boldsymbol{\mu}_c$ and $\boldsymbol{\Sigma}_c$ from the samples assigned to each cluster

There may be more than one possible result due to random initialization
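A sketch of this loop for the diagonal-covariance case used in the lecture's example — hard assignment by largest log-likelihood, then re-estimation (the variance floor and iteration cap are implementation choices, not from the slides):

```python
import numpy as np

def gaussian_cluster(X, c, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=c, replace=False)]  # random initial means
    var = np.ones((c, X.shape[1]))                     # initial unit variances
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # log p(x | cluster j) for a diagonal Gaussian, dropping constants
        ll = np.stack(
            [-0.5 * np.sum((X - mu[j]) ** 2 / var[j] + np.log(var[j]), axis=1)
             for j in range(c)], axis=1)
        new_labels = ll.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break                                      # stable: stop
        labels = new_labels
        for j in range(c):
            X_j = X[labels == j]
            if len(X_j):                               # skip empty clusters
                mu[j] = X_j.mean(axis=0)
                var[j] = X_j.var(axis=0) + 1e-6        # variance floor

    return labels, mu, var

# The ten 2-D samples from the lecture's worked example
X = np.array([[-3.7, -0.4], [0.4, 0.1], [0.4, -1.7], [-0.4, -1.0],
              [-1.3, -1.7], [1.0, 3.3], [1.2, 5.2], [1.3, 0.3],
              [1.1, -0.8], [0.5, 2.8]])
print(gaussian_cluster(X, c=2)[0])
```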


Parametric Clustering

Gaussian Mixture Model

Example: $\boldsymbol{\Sigma}_i$ is diagonal

Features are statistically independent, and each has its own variance:

$\boldsymbol{\Sigma}_i = \mathrm{diag}(\sigma_{i,1}^2, \sigma_{i,2}^2, \ldots, \sigma_{i,d}^2)$


Parametric Clustering

Gaussian Mixture Model

For the diagonal-covariance case, the log-likelihood of a sample under cluster c (dropping constant terms) and the maximum-likelihood updates are:

$\ln p(\mathbf{x} \mid c) = -\dfrac{1}{2} \sum_{k=1}^{d} \dfrac{(x_k - \mu_{c,k})^2}{\sigma_{c,k}^2} - \dfrac{1}{2} \sum_{k=1}^{d} \ln \sigma_{c,k}^2$

$\hat{\mu}_{c,k} = \dfrac{1}{n_c} \sum_{\mathbf{x} \in \mathcal{D}_c} x_k \qquad \hat{\sigma}_{c,k}^2 = \dfrac{1}{n_c} \sum_{\mathbf{x} \in \mathcal{D}_c} (x_k - \hat{\mu}_{c,k})^2$

Worked example on ten 2-D samples: (-3.7, -0.4), (0.4, 0.1), (0.4, -1.7), (-0.4, -1), (-1.3, -1.7), (1, 3.3), (1.2, 5.2), (1.3, 0.3), (1.1, -0.8), (0.5, 2.8)

[Tables: for each iteration, the per-sample log-likelihoods $\ln p(\mathbf{x} \mid c=1)$ and $\ln p(\mathbf{x} \mid c=2)$, the resulting cluster assignments, and the updated $\hat{\boldsymbol{\mu}}_c$ and $\hat{\sigma}_c^2$; the process stops when the parameters no longer change]


Parametric Clustering

Gaussian Mixture Model

When $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$ for all classes, GMM gives the same result as using the within-cluster scatter

The objective function simplifies:

$J = \sum_{i=1}^{c} \sum_{\mathbf{x} \in \mathcal{D}_i} \left( -\dfrac{1}{2\sigma^2} \sum_{k=1}^{d} (x_k - \mu_{i,k})^2 \right) + \text{const}$

Vector form: $J \propto -\sum_{i=1}^{c} \sum_{\mathbf{x} \in \mathcal{D}_i} \|\mathbf{x} - \boldsymbol{\mu}_i\|^2$, so maximizing $J$ minimizes the within-cluster scatter $J_e$


Parametric Clustering

Gaussian Mixture Model

Distribution is considered

Pro: Generally more stable than K-means

Less sensitive to noisy samples

Con: Misleading if the distribution is not Gaussian

[Figure: four datasets with the same mean and variance but very different shapes]


Clustering: Number of Clusters

How to decide the number of clusters?

Possible solution:

Try a range of c and see which one has the lowest criterion value
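One way to do this in practice (a sketch assuming scikit-learn is available; KMeans.inertia_ is its within-cluster scatter). Note that the criterion keeps decreasing as c grows, so in practice one looks for the "elbow" where it stops improving quickly:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data with three natural groups (assumed example data)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(50, 2))
               for m in ([0, 0], [4, 0], [2, 3])])

# Try a range of c and inspect the criterion value
for c in range(1, 8):
    km = KMeans(n_clusters=c, n_init=10, random_state=0).fit(X)
    print(c, round(km.inertia_, 1))
```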

