Classification & Clustering
魏志達 Jyh-Da Wei -- Parametric and Nonparametric Methods
Introduction to Machine Learning (Chap 4,5,7,8), E. Alpaydin

Jan 14, 2016

Transcript
Page 1

Classification & Clustering

魏志達 Jyh-Da Wei

-- Parametric and Nonparametric Methods

Introduction to Machine Learning (Chap 4,5,7,8), E. Alpaydin

Page 2

Classes vs. Clusters

Classification: supervised learning – Pattern Recognition, K Nearest Neighbor, Multilayer Perceptron

Clustering: unsupervised learning – K-Means, Expectation Maximization, Self-Organizing Map

           Parametric    Nonparametric    Networks
Classes    PR            Kernel, KNN      MLP
Clusters   K-Means, EM   Agglomerative    SOM


Page 4

Bayes' Rule

P(C|x) = p(x|C) P(C) / p(x)

posterior = likelihood × prior / evidence

p(x) = p(x|C=1) P(C=1) + p(x|C=0) P(C=0)

P(C=0|x) + P(C=1|x) = 1

choose C=1 if P(C=1|x) > P(C=0|x)
(i.e., p(x|C=1) P(C=1) > p(x|C=0) P(C=0))

Because once x is given, p(x) is the same for every class.
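The decision rule above can be sketched concretely. This is an illustrative example, not from the slides: two classes with 1-D Gaussian class-conditional densities sharing one sigma; the names mu0, mu1, sigma, and prior1 are assumptions for the sketch.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """N(mu, sigma^2) density at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def posterior_c1(x, mu0, mu1, sigma, prior1):
    """P(C=1 | x) by Bayes' rule: likelihood * prior / evidence."""
    prior0 = 1.0 - prior1
    lik0 = gaussian_pdf(x, mu0, sigma) * prior0   # p(x|C=0) P(C=0)
    lik1 = gaussian_pdf(x, mu1, sigma) * prior1   # p(x|C=1) P(C=1)
    return lik1 / (lik0 + lik1)                   # evidence = sum over both classes

def choose_class(x, mu0, mu1, sigma, prior1):
    """Pick C=1 iff P(C=1|x) > P(C=0|x); p(x) cancels in the comparison."""
    return 1 if posterior_c1(x, mu0, mu1, sigma, prior1) > 0.5 else 0
```

With equal priors and equal variances, the posterior crosses 0.5 exactly halfway between the two means, as the later slides show.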

Page 5

Bayes' Rule: K>2 Classes

P(Ci|x) = p(x|Ci) P(Ci) / p(x) = p(x|Ci) P(Ci) / Σk=1..K p(x|Ck) P(Ck)

P(Ci) ≥ 0 and Σi=1..K P(Ci) = 1

choose Ci if P(Ci|x) = maxk P(Ck|x)
(i.e., p(x|Ci) P(Ci) = maxk p(x|Ck) P(Ck))

Because once x is given, p(x) is the same for every class.

Page 6

Gaussian (Normal) Distribution

p(x) = 1/(√(2π)σ) exp(−(x − μ)²/(2σ²)),  p(x) = N(μ, σ²)

Estimate μ and σ²:

m = Σt xt / N

s² = Σt (xt − m)² / N

p̂(x) = 1/(√(2π)s) exp(−(x − m)²/(2s²))
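The two estimators above can be sketched directly; this is a minimal illustration (function names are mine, not the slides'):

```python
import math

def estimate_gaussian(samples):
    """Return (m, s2): m = sum(x^t)/N and s2 = sum((x^t - m)^2)/N, as on the slide."""
    n = len(samples)
    m = sum(samples) / n
    s2 = sum((x - m) ** 2 for x in samples) / n   # ML (biased) variance estimate
    return m, s2

def normal_pdf(x, m, s2):
    """Plug the estimates back into p(x) = 1/(sqrt(2*pi*s2)) * exp(-(x-m)^2/(2*s2))."""
    return math.exp(-(x - m) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
```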

Page 7

Equal variances

Single boundary at halfway between means

P(C1)=P(C2)

Page 8

Variances are different

Two boundaries

P(C1)=P(C2)

Page 9

Multivariate Normal Distribution

x ~ Nd(μ, ∑)

p(x) = 1/((2π)^(d/2) |∑|^(1/2)) exp(−(1/2)(x − μ)T ∑−1 (x − μ))

Page 10

Multivariate Normal Distribution

Mahalanobis distance: (x – μ)T ∑–1 (x – μ)

measures the distance from x to μ in terms of ∑ (normalizes for difference in variances and correlations)

Bivariate: d = 2, with zi = (xi − μi)/σi

p(x1, x2) = 1/(2πσ1σ2√(1 − ρ²)) exp(−(1/(2(1 − ρ²)))(z1² − 2ρz1z2 + z2²))
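A minimal sketch of the Mahalanobis distance for the bivariate case, inverting the 2×2 covariance matrix by hand; the function name and inputs are illustrative, not from the slides:

```python
def mahalanobis2(x, mu, sigma):
    """(x - mu)^T Sigma^{-1} (x - mu) for d = 2.
    x, mu: length-2 vectors; sigma: symmetric 2x2 covariance [[a, b], [b, c]]."""
    a, b = sigma[0]
    _, c = sigma[1]
    det = a * c - b * b
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Sigma^{-1} = (1/det) * [[c, -b], [-b, a]], expanded into the quadratic form
    return (c * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
```

With the identity covariance this reduces to squared Euclidean distance; larger variance along a direction shrinks distances along that direction, which is the normalization the slide describes.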

Page 11

Bivariate Normal

Page 12

Page 13

Estimation of Parameters

P̂(Ci) = Σt rti / N

mi = Σt rti xt / Σt rti

Si = Σt rti (xt − mi)(xt − mi)T / Σt rti
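The per-class estimates above can be sketched for 1-D data (so Si reduces to a scalar variance); variable names here are illustrative:

```python
def class_estimates(xs, rs):
    """Per-class ML estimates from labeled data, as on the slide.
    xs: 1-D samples x^t; rs: 0/1 labels r_i^t for membership in class Ci."""
    n = len(xs)
    ni = sum(rs)                                            # samples in Ci
    prior = ni / n                                          # P̂(Ci) = sum_t r_i^t / N
    mi = sum(r * x for x, r in zip(xs, rs)) / ni            # class mean
    si = sum(r * (x - mi) ** 2 for x, r in zip(xs, rs)) / ni  # class variance
    return prior, mi, si
```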

Page 14

likelihoods

posterior for C1

discriminant: P(C1|x) = 0.5

With only two classes, the boundary falls exactly at 0.5.

Page 15

break

Page 16

Classes vs. Clusters

Classification: supervised learning – Pattern Recognition, K Nearest Neighbor, Multilayer Perceptron

Clustering: unsupervised learning – K-Means, Expectation Maximization, Self-Organizing Map

           Parametric    Nonparametric    Networks
Classes    PR            Kernel, KNN      MLP
Clusters   K-Means, EM   Agglomerative    SOM

Page 17

Parametric vs. Nonparametric

Parametric Methods

– Advantage: reduces the problem of estimating a probability density function (pdf), discriminant, or regression function to estimating the values of a small number of parameters.

– Disadvantage: this assumption does not always hold, and we may incur a large error if it does not.

Nonparametric Methods

– Keep the training data; "let the data speak for itself"

– Given x, find a small number of closest training instances and interpolate from these

– Nonparametric methods are also called memory-based or instance-based learning algorithms.

Page 18

Density Estimation

Given the training set X = { xt }t drawn iid (independent and identically distributed) from p(x)

Divide the data into bins of size h

Histogram estimator: (Figure – next page)

p̂(x) = #{xt in the same bin as x} / (Nh)

Extreme case: p̂(x) = 1/h, when the bin holds all N samples

Here xt is the t-th element of the sample set.
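The histogram estimator can be sketched in a few lines; the `origin` parameter (where the bin grid starts) is an assumed choice, not something the slide fixes:

```python
import math

def histogram_estimate(x, samples, h, origin=0.0):
    """p̂(x) = #{x^t in the same bin as x} / (N h),
    with bins [origin + k*h, origin + (k+1)*h)."""
    bin_index = math.floor((x - origin) / h)
    count = sum(1 for xt in samples
                if math.floor((xt - origin) / h) == bin_index)
    return count / (len(samples) * h)
```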

Page 19

[Figure: histogram estimate p̂(x) = #{xt in the same bin as x} / (Nh) for a sample data set]

Page 20

Density Estimation

Given the training set X = { xt }t drawn iid from p(x); x is always at the center of a bin of size 2h

Naive estimator: (Figure – next page)

p̂(x) = #{x − h < xt ≤ x + h} / (2Nh)

or, letting each xt vote,

p̂(x) = (1/(Nh)) Σt w((x − xt)/h), where w(u) = 1/2 if |u| < 1, 0 otherwise

w(u): votes by proximity; each supporting vote counts 1/2, so the integral over [−1, 1] equals 1.
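The voting form of the naive estimator translates directly to code; a minimal sketch:

```python
def w(u):
    """The slide's weight function: 1/2 inside (-1, 1), 0 outside."""
    return 0.5 if abs(u) < 1 else 0.0

def naive_estimate(x, samples, h):
    """p̂(x) = (1/(N h)) * sum_t w((x - x^t)/h)
    = #{x - h < x^t <= x + h} / (2 N h)."""
    n = len(samples)
    return sum(w((x - xt) / h) for xt in samples) / (n * h)
```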

Page 21

[Figure: naive estimate p̂(x) = #{x − h < xt ≤ x + h} / (2Nh) for h = 0.25, 0.5, and 1]

Page 22

Kernel Estimator

Kernel function, e.g., Gaussian kernel:

K(u) = (1/√(2π)) exp(−u²/2)

Kernel estimator (Parzen windows): Figure – next page

p̂(x) = (1/(Nh)) Σt K((x − xt)/h)

If K is Gaussian, then p̂ will be smooth, having all the derivatives.

K(u): scores by proximity; its integral over the real line equals 1.
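A Parzen-window sketch using the Gaussian kernel from the slide:

```python
import math

def gaussian_kernel(u):
    """K(u) = exp(-u^2/2) / sqrt(2*pi)."""
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)

def kernel_estimate(x, samples, h):
    """p̂(x) = (1/(N h)) * sum_t K((x - x^t)/h)."""
    n = len(samples)
    return sum(gaussian_kernel((x - xt) / h) for xt in samples) / (n * h)
```

Because each sample contributes a smooth bump, the resulting estimate is smooth everywhere, unlike the histogram and naive estimators.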

Page 23

[Figure: Gaussian kernel K(u) = (1/√(2π)) exp(−u²/2), plotted for u ∈ [−5, 5]]

Page 24

[Figure: kernel estimate p̂(x) = (1/(Nh)) Σt K((x − xt)/h) for a sample data set]

Page 25

Generalization to Multivariate Data

Kernel density estimator

p̂(x) = (1/(Nh^d)) Σt K((x − xt)/h)

with the requirement that ∫Rd K(x) dx = 1

Multivariate Gaussian kernel

spheric: K(u) = (1/(2π)^(d/2)) exp(−‖u‖²/2)

ellipsoid: K(u) = 1/((2π)^(d/2)|S|^(1/2)) exp(−(1/2) uT S−1 u)

Page 26

k-Nearest Neighbor Estimator

Instead of fixing the bin width h and counting the number of instances, fix the number of neighbors k and check the bin width

dk(x): distance to the kth closest instance to x

p̂(x) = k / (2N dk(x))
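The k-NN estimator above fits in a few lines for 1-D data; a minimal sketch:

```python
def knn_estimate(x, samples, k):
    """p̂(x) = k / (2 N d_k(x)), where d_k(x) is the distance
    to the k-th closest sample to x."""
    dists = sorted(abs(x - xt) for xt in samples)
    dk = dists[k - 1]                     # distance to the k-th nearest instance
    return k / (2.0 * len(samples) * dk)
```

Where samples are dense, dk(x) is small and the estimate is high; where they are sparse, the window must grow and the estimate drops.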

Page 27

[Figure: k-NN estimate p̂(x) = k / (2N dk(x))]

Grow the window in both directions at once, and see how far it must reach to take in k samples.

Page 28

Nonparametric Classification (kernel estimator)

p̂(x|Ci) = (1/(Ni h^d)) Σt K((x − xt)/h) rti

p̂(x|Ci) P̂(Ci) = (1/(N h^d)) Σt K((x − xt)/h) rti

rti is 1 or 0 according to whether xt belongs to Ci.

Originally we would compare the values p(Ci|x) = p(x, Ci)/p(x); but once x is given, p(x) is the same for every class, so everyone drops it and the formula looks cleaner.

Ignoring the coefficient, look only at the sum: it accumulates the scores of the committee members, positive real values assigned by proximity.
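Since the factor 1/(Nh^d) is shared by every class, it suffices to compare the per-class score sums gi(x) = Σt rti K((x − xt)/h). A 1-D sketch (names are illustrative):

```python
import math

def gaussian_kernel(u):
    """K(u) = exp(-u^2/2) / sqrt(2*pi)."""
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)

def kernel_classify(x, samples, labels, h):
    """Return argmax_i g_i(x), where each labeled sample scores the
    query by proximity; the shared 1/(N h^d) factor is dropped."""
    scores = {}
    for xt, c in zip(samples, labels):
        scores[c] = scores.get(c, 0.0) + gaussian_kernel((x - xt) / h)
    return max(scores, key=scores.get)
```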

Page 29

Nonparametric Classification: k-nn estimator (1)

For the special case of the k-nn estimator

p̂(x|Ci) = ki / (Ni Vk(x))

where

ki: the number of neighbors out of the k nearest that belong to Ci

Vk(x): the volume of the d-dimensional hypersphere centered at x, with radius r = ‖x − x(k)‖

cd: the volume of the unit sphere in d dimensions, so Vk = cd r^d

For example,

d = 1: V = c1 r = 2r
d = 2: V = c2 r² = πr²
d = 3: V = c3 r³ = (4/3)πr³

Page 30

Nonparametric Classification: k-nn estimator (2)

From

p̂(x|Ci) = ki / (Ni Vk(x)),  P̂(Ci) = Ni / N,  p̂(x) = k / (N Vk(x))

Then

P̂(Ci|x) = p̂(x|Ci) P̂(Ci) / p̂(x) = ki / k

We want to compare the values p(Ci|x) = p(x, Ci)/p(x); although p(x) is the same for every class once x is given, writing it out here yields the cleaner derived formula.

Meaning: by the time k samples have been gathered, the class with the most members present wins.
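The result P̂(Ci|x) = ki/k is just a vote among the k nearest neighbors; a 1-D sketch:

```python
def knn_posteriors(x, samples, labels, k):
    """Return {class: k_i / k} over the classes present among
    the k nearest neighbors of x."""
    nearest = sorted(zip(samples, labels), key=lambda p: abs(x - p[0]))[:k]
    counts = {}
    for _, c in nearest:
        counts[c] = counts.get(c, 0) + 1
    return {c: n / k for c, n in counts.items()}
```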

Page 31

break

Page 32

Classes vs. Clusters

Classification: supervised learning – Pattern Recognition, K Nearest Neighbor, Multilayer Perceptron

Clustering: unsupervised learning – K-Means, Expectation Maximization, Self-Organizing Map

           Parametric    Nonparametric    Networks
Classes    PR            Kernel, KNN      MLP
Clusters   K-Means, EM   Agglomerative    SOM

Page 33

Classes vs. Clusters

Supervised: X = { xt, rt }t

Classes Ci, i = 1, ..., K

where p(x|Ci) ~ N(μi, ∑i)

Φ = {P(Ci), μi, ∑i}, i = 1..K

p(x) = Σi=1..K p(x|Ci) P(Ci)

P̂(Ci) = Σt rti / N

mi = Σt rti xt / Σt rti

Si = Σt rti (xt − mi)(xt − mi)T / Σt rti

Unsupervised: X = { xt }t

Clusters Gi, i = 1, ..., k

where p(x|Gi) ~ N(μi, ∑i)

Φ = {P(Gi), μi, ∑i}, i = 1..k

p(x) = Σi=1..k p(x|Gi) P(Gi)

Labels rti?

Page 34

k-Means Clustering

Find k reference vectors (prototypes / codebook vectors / codewords) which best represent the data

Reference vectors: mj, j = 1, ..., k

Use the nearest (most similar) reference:

‖xt − mi‖ = minj ‖xt − mj‖

bti = 1 if ‖xt − mi‖ = minj ‖xt − mj‖, 0 otherwise

Reconstruction error

E({mi}i=1..k | X) = Σt Σi bti ‖xt − mi‖²

We want the cluster centers that minimize the total deviation.
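The procedure above can be sketched on 1-D data: assign each xt to its nearest center (winner takes all), then recompute each center as its group mean in one step. Names and the iteration cap are illustrative.

```python
def kmeans(samples, centers, iters=20):
    """Batch k-means: alternate hard assignment and group-mean update."""
    centers = list(centers)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in samples:
            i = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            groups[i].append(x)                      # winner takes all
        centers = [sum(g) / len(g) if g else c       # group mean in one step
                   for g, c in zip(groups, centers)]
    return centers

def reconstruction_error(samples, centers):
    """E = sum_t min_i ||x^t - m_i||^2, the slide's objective."""
    return sum(min((x - c) ** 2 for c in centers) for x in samples)
```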

Page 35

Encoding/Decoding

[Figure: encoder/decoder, with bti = 1 if ‖xt − mi‖ = minj ‖xt − mj‖, 0 otherwise]

Page 36

k-means Clustering

1. Winner takes all
2. No step-by-step correction; each group mean is recomputed in one shot
3. A worked example follows on the next page; a counterexample (boundary points "defecting" to another cluster) will be given in class

Page 37

Page 38

EM in Gaussian Mixtures

zti = 1 if xt belongs to Gi, 0 otherwise (the labels rti of supervised learning); assume p(x|Gi) ~ N(μi, ∑i)

E-step:

hti = E[zti | X, Φl] = P(Gi|xt, Φl) = p(xt|Gi, Φl) P(Gi) / Σj p(xt|Gj, Φl) P(Gj)

M-step:

P(Gi) = Σt hti / N

mi(l+1) = Σt hti xt / Σt hti

Si(l+1) = Σt hti (xt − mi(l+1))(xt − mi(l+1))T / Σt hti

Use estimated labels in place of the unknown labels

With P(Gi) as backing, there is no fear of boundary points defecting.
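One EM iteration can be sketched for a 1-D mixture of Gaussians, following the E- and M-steps above; all names are illustrative, and this is a sketch rather than a production implementation (no convergence check or variance floor).

```python
import math

def normal_pdf(x, m, s2):
    """N(m, s2) density at x."""
    return math.exp(-(x - m) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

def em_step(xs, priors, means, variances):
    """One EM iteration for a 1-D Gaussian mixture."""
    k, n = len(means), len(xs)
    # E-step: h_i^t = p(x^t|G_i) P(G_i) / sum_j p(x^t|G_j) P(G_j)
    h = []
    for x in xs:
        num = [priors[i] * normal_pdf(x, means[i], variances[i]) for i in range(k)]
        z = sum(num)
        h.append([v / z for v in num])
    # M-step: weighted re-estimates, soft labels h in place of the unknown labels
    new_priors, new_means, new_vars = [], [], []
    for i in range(k):
        hi = sum(h[t][i] for t in range(n))
        new_priors.append(hi / n)
        mi = sum(h[t][i] * xs[t] for t in range(n)) / hi
        new_means.append(mi)
        new_vars.append(sum(h[t][i] * (xs[t] - mi) ** 2 for t in range(n)) / hi)
    return new_priors, new_means, new_vars
```

Unlike k-means, each sample contributes to every component in proportion to hti, and the priors P(Gi) are re-estimated alongside the means and variances.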

Page 39

P(G1|x)=h1=0.5

Page 40

Classes vs. Clusters

Classification: supervised learning – Pattern Recognition, K Nearest Neighbor, Multilayer Perceptron

Clustering: unsupervised learning – K-Means, Expectation Maximization, Self-Organizing Map

           Parametric    Nonparametric    Networks
Classes    PR            Kernel, KNN      MLP
Clusters   K-Means, EM   Agglomerative    SOM

Page 41

Agglomerative Clustering

Start with N groups, each holding one instance, and merge the two closest groups at each iteration

Distance between two groups Gi and Gj:

– Single-link: d(Gi, Gj) = min over xr ∈ Gi, xs ∈ Gj of d(xr, xs)

– Complete-link: d(Gi, Gj) = max over xr ∈ Gi, xs ∈ Gj of d(xr, xs)

– Average-link, centroid
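The merge loop above can be sketched for single-link on 1-D points; `target` (the number of groups at which to stop) is an assumed parameter, since the slide leaves the stopping point to the dendrogram.

```python
def single_link(points, target):
    """Start with singleton groups; repeatedly merge the two groups whose
    minimum pairwise distance is smallest, until `target` groups remain."""
    groups = [[p] for p in points]
    while len(groups) > target:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = min(abs(a - b) for a in groups[i] for b in groups[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups
```

Replacing `min` with `max` in the pairwise distance gives complete-link.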

Page 42

Dendrogram

Example: Single-Link Clustering

[Dendrogram over: human, bonobo (pygmy chimpanzee), chimpanzee, gorilla, gibbon, macaque]

The grouping can be chosen dynamically by cutting the dendrogram at any level.