Unsupervised Learning and Clustering

Shyh-Kang Jeng
Department of Electrical Engineering / Graduate Institute of Communication / Graduate Institute of Networking and Multimedia, National Taiwan University

Supervised vs. Unsupervised Learning

Supervised training procedures
  – Use samples labeled by their category membership
Unsupervised training procedures
  – Use unlabeled samples

Reasons for interest

Collecting and labeling a large set of sample patterns can be costly
  – e.g., speech
Training with a large amount of unlabeled data, and then using supervision to label the groupings found
  – For "data mining" applications
Improved performance for data whose pattern characteristics change slowly, by tracking in an unsupervised mode
  – Automated food classification when seasons change

Reasons for interest

Can use unsupervised methods to find features that will then be useful for categorization
  – Data-dependent "smart preprocessing" or "smart feature extraction"
Perform exploratory data analysis and gain insights into the nature or structure of the data
  – Discovery of distinct clusters may suggest altering the approach to designing the classifier

Basic Assumptions to Begin with

Samples come from a known number $c$ of classes
Prior probabilities $P(\omega_j)$ for each class are known
Forms for the class-conditional probability densities $p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j)$ are known
Values for parameter vectors $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_c$ are unknown
Category labels are unknown

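Mixing Density

Probability density function for samples, of the form

$$p(\mathbf{x}|\boldsymbol{\theta}) = \sum_{j=1}^{c} p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j)\, P(\omega_j), \qquad \boldsymbol{\theta} = (\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_c)^t$$

$p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j)$: component densities
$P(\omega_j)$: mixing parameters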
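A mixture density of the form $p(x|\theta) = \sum_j p(x|\omega_j, \theta_j) P(\omega_j)$ can be evaluated numerically in a few lines. The following is a minimal sketch, assuming univariate normal components with unit variance; the function names `gaussian_pdf` and `mixture_density` and the parameter values are illustrative choices, not part of the slides.

```python
import math

def gaussian_pdf(x, mean, std=1.0):
    """Univariate normal density, used here as an illustrative component density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_density(x, priors, means):
    """p(x | theta) = sum_j p(x | w_j, theta_j) * P(w_j)."""
    return sum(P * gaussian_pdf(x, m) for P, m in zip(priors, means))

# Hypothetical two-component mixture (values chosen only for illustration).
priors = [1.0 / 3.0, 2.0 / 3.0]   # mixing parameters P(w_j), must sum to 1
means = [-2.0, 2.0]               # component parameters theta_j

# Crude Riemann sum over [-12, 12): a sanity check that the mixture integrates to about 1.
step = 0.01
total = sum(mixture_density(-12.0 + i * step, priors, means) * step for i in range(2400))
print(f"integral of mixture density ~ {total:.4f}")
```

Because the mixing parameters sum to one and each component is a valid density, the mixture is itself a valid density, which the numerical integral confirms.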

Goal and Approach

Use samples drawn from the mixture density to estimate the unknown parameter vector $\boldsymbol{\theta}$
With known $\boldsymbol{\theta}$, we can decompose the mixture into its components and use a maximum a posteriori classifier on the derived densities

Existence of Solutions

Suppose an unlimited number of samples and nonparametric methods are available
If there is only one value of $\boldsymbol{\theta}$ that will produce the observed values for $p(\mathbf{x}|\boldsymbol{\theta})$, a solution is possible in principle
If several different values of $\boldsymbol{\theta}$ can produce the same values for $p(\mathbf{x}|\boldsymbol{\theta})$, then there is no hope of obtaining a unique solution

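Identifiable Density

$p(\mathbf{x}|\boldsymbol{\theta})$ is identifiable if $\boldsymbol{\theta} \neq \boldsymbol{\theta}'$ implies that there exists at least one $\mathbf{x}$ such that $p(\mathbf{x}|\boldsymbol{\theta}) \neq p(\mathbf{x}|\boldsymbol{\theta}')$
$p(\mathbf{x}|\boldsymbol{\theta})$ is not identifiable if we cannot recover a unique $\boldsymbol{\theta}$, even from an infinite amount of data
$p(\mathbf{x}|\boldsymbol{\theta})$ is completely unidentifiable if we cannot infer any of the individual parameters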

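An Example of Unidentifiable Mixture of Discrete Distributions

$x$: binary

$$P(x|\boldsymbol{\theta}) = \frac{1}{2}\theta_1^{x}(1-\theta_1)^{1-x} + \frac{1}{2}\theta_2^{x}(1-\theta_2)^{1-x} = \begin{cases} \frac{1}{2}(\theta_1+\theta_2) & \text{if } x = 1 \\ 1 - \frac{1}{2}(\theta_1+\theta_2) & \text{if } x = 0 \end{cases}$$

$$P(x=1|\boldsymbol{\theta}) = 0.6, \quad P(x=0|\boldsymbol{\theta}) = 0.4 \quad \Rightarrow \quad \theta_1 + \theta_2 = 1.2$$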
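The discrete example can be checked directly: any pair $(\theta_1, \theta_2)$ with $\theta_1 + \theta_2 = 1.2$ yields the same observable probabilities, so the individual parameters cannot be recovered. A small sketch follows; the function name `bernoulli_mixture` and the three candidate pairs are illustrative assumptions.

```python
def bernoulli_mixture(x, theta1, theta2):
    """P(x | theta) for an equal-weight mixture of two Bernoulli distributions."""
    return (0.5 * theta1 ** x * (1 - theta1) ** (1 - x)
            + 0.5 * theta2 ** x * (1 - theta2) ** (1 - x))

# Three different parameter pairs, all satisfying theta1 + theta2 = 1.2.
candidates = [(0.6, 0.6), (0.3, 0.9), (0.5, 0.7)]

tables = [(bernoulli_mixture(1, t1, t2), bernoulli_mixture(0, t1, t2))
          for t1, t2 in candidates]

for (t1, t2), (p1, p0) in zip(candidates, tables):
    print(f"theta=({t1}, {t2}): P(x=1)={p1:.3f}, P(x=0)={p0:.3f}")
```

Every row reports the same pair of probabilities, which is exactly what makes this mixture unidentifiable: the data constrain only the sum $\theta_1 + \theta_2$.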

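An Example of Unidentifiable Mixture of Gaussian Distributions

$$p(x|\boldsymbol{\theta}) = \frac{P(\omega_1)}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\theta_1)^2\right] + \frac{P(\omega_2)}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\theta_2)^2\right]$$

Unidentifiable when $P(\omega_1) = P(\omega_2)$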
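With equal priors, interchanging $\theta_1$ and $\theta_2$ leaves the two-component normal mixture unchanged, which is why a unique $\boldsymbol{\theta}$ cannot be recovered in that case. The sketch below illustrates this numerically; the function name `two_gaussian_mixture`, the test points, and the parameter values are assumptions made for the example.

```python
import math

def two_gaussian_mixture(x, theta1, theta2, prior1):
    """p(x | theta) for two unit-variance normal components, with P(w_2) = 1 - P(w_1)."""
    norm = 1.0 / math.sqrt(2.0 * math.pi)
    return (prior1 * norm * math.exp(-0.5 * (x - theta1) ** 2)
            + (1.0 - prior1) * norm * math.exp(-0.5 * (x - theta2) ** 2))

xs = [-3.0, -1.0, 0.0, 0.5, 2.0]

# Equal priors: swapping theta_1 and theta_2 leaves the density unchanged.
gap_equal = max(abs(two_gaussian_mixture(x, -1.0, 2.0, 0.5)
                    - two_gaussian_mixture(x, 2.0, -1.0, 0.5)) for x in xs)

# Unequal priors: the swap changes the density, so the two orderings are distinguishable.
gap_unequal = max(abs(two_gaussian_mixture(x, -1.0, 2.0, 0.3)
                      - two_gaussian_mixture(x, 2.0, -1.0, 0.3)) for x in xs)

print(f"largest difference, equal priors:   {gap_equal:.3e}")
print(f"largest difference, unequal priors: {gap_unequal:.3e}")
```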

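Maximum-Likelihood Estimates

$D = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$: $n$ unlabeled samples drawn independently from

$$p(\mathbf{x}|\boldsymbol{\theta}) = \sum_{j=1}^{c} p(\mathbf{x}|\omega_j, \boldsymbol{\theta}_j)\, P(\omega_j)$$

The full parameter vector $\boldsymbol{\theta}$ is fixed and unknown
Likelihood of the observed samples:

$$p(D|\boldsymbol{\theta}) = \prod_{k=1}^{n} p(\mathbf{x}_k|\boldsymbol{\theta})$$

Maximum-likelihood estimate $\hat{\boldsymbol{\theta}}$: the value of $\boldsymbol{\theta}$ that maximizes $p(D|\boldsymbol{\theta})$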
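Since the samples are independent, the log of the likelihood is a sum of per-sample log mixture densities, which is the quantity usually computed in practice. A minimal sketch, assuming unit-variance univariate normal components and synthetic data; the names `component_pdf` and `log_likelihood`, the seed, and all numeric values are illustrative assumptions.

```python
import math
import random

def component_pdf(x, mean):
    """Unit-variance univariate normal density (an assumed component form)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def log_likelihood(samples, priors, means):
    """ln p(D | theta) = sum_k ln sum_j p(x_k | w_j, theta_j) P(w_j)."""
    return sum(
        math.log(sum(P * component_pdf(x, m) for P, m in zip(priors, means)))
        for x in samples
    )

# Synthetic unlabeled data drawn from a hypothetical two-component mixture.
random.seed(0)
priors = [1.0 / 3.0, 2.0 / 3.0]
true_means = [-2.0, 2.0]
samples = [
    random.gauss(true_means[0] if random.random() < priors[0] else true_means[1], 1.0)
    for _ in range(200)
]

l_true = log_likelihood(samples, priors, true_means)   # at the generating parameters
l_off = log_likelihood(samples, priors, [0.0, 0.0])    # at a poor guess
print(f"log-likelihood at true means: {l_true:.1f}, at (0, 0): {l_off:.1f}")
```

The generating parameters score a much higher log-likelihood than the poor guess, which is the behavior a maximum-likelihood search exploits.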

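Log-likelihood:

$$l = \ln p(D|\boldsymbol{\theta}) = \sum_{k=1}^{n} \ln p(\mathbf{x}_k|\boldsymbol{\theta})$$

Gradient with respect to $\boldsymbol{\theta}_i$:

$$\nabla_{\boldsymbol{\theta}_i} l = \sum_{k=1}^{n} \frac{1}{p(\mathbf{x}_k|\boldsymbol{\theta})} \nabla_{\boldsymbol{\theta}_i} \left[ \sum_{j=1}^{c} p(\mathbf{x}_k|\omega_j, \boldsymbol{\theta}_j)\, P(\omega_j) \right]$$

Assume that the elements of $\boldsymbol{\theta}_i$ and $\boldsymbol{\theta}_j$ are functionally independent if $i \neq j$
Posterior probability:

$$P(\omega_i|\mathbf{x}_k, \boldsymbol{\theta}) = \frac{p(\mathbf{x}_k|\omega_i, \boldsymbol{\theta}_i)\, P(\omega_i)}{p(\mathbf{x}_k|\boldsymbol{\theta})}$$

The gradient then becomes

$$\nabla_{\boldsymbol{\theta}_i} l = \sum_{k=1}^{n} P(\omega_i|\mathbf{x}_k, \boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k|\omega_i, \boldsymbol{\theta}_i)$$

and the maximum-likelihood estimate $\hat{\boldsymbol{\theta}}_i$ must satisfy

$$\sum_{k=1}^{n} P(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k|\omega_i, \hat{\boldsymbol{\theta}}_i) = 0, \qquad i = 1, \ldots, c$$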
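The gradient expression $\nabla_{\theta_i} l = \sum_k P(\omega_i|x_k, \theta)\, \nabla_{\theta_i} \ln p(x_k|\omega_i, \theta_i)$ can be sanity-checked against a finite-difference approximation of the log-likelihood. The sketch below assumes unit-variance univariate normal components, for which $\nabla_{\mu_i} \ln p(x|\omega_i, \mu_i) = x - \mu_i$; the data, function names, and step size `h` are made up for the illustration.

```python
import math

def pdf(x, mean):
    """Unit-variance univariate normal density (assumed component form)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def log_likelihood(samples, priors, means):
    return sum(math.log(sum(P * pdf(x, m) for P, m in zip(priors, means)))
               for x in samples)

def posterior(i, x, priors, means):
    """P(w_i | x, theta) = p(x | w_i, theta_i) P(w_i) / p(x | theta)."""
    joint = [P * pdf(x, m) for P, m in zip(priors, means)]
    return joint[i] / sum(joint)

def analytic_gradient(i, samples, priors, means):
    # For a unit-variance normal, d/d(mu_i) ln p(x | w_i, mu_i) = x - mu_i.
    return sum(posterior(i, x, priors, means) * (x - means[i]) for x in samples)

def numeric_gradient(i, samples, priors, means, h=1e-6):
    """Central finite difference of the log-likelihood with respect to mu_i."""
    up, down = list(means), list(means)
    up[i] += h
    down[i] -= h
    return (log_likelihood(samples, priors, up)
            - log_likelihood(samples, priors, down)) / (2.0 * h)

samples = [-3.1, -1.7, -0.4, 0.9, 1.5, 2.2, 2.8]   # made-up data
priors = [1.0 / 3.0, 2.0 / 3.0]
means = [-1.0, 1.0]

grads = [(analytic_gradient(i, samples, priors, means),
          numeric_gradient(i, samples, priors, means)) for i in range(2)]
for i, (a, n) in enumerate(grads):
    print(f"component {i}: analytic={a:.6f}, numeric={n:.6f}")
```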

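Maximum-Likelihood Estimates for Unknown Priors

$$l = \sum_{k=1}^{n} \ln p(\mathbf{x}_k|\boldsymbol{\theta}) = \sum_{k=1}^{n} \ln \sum_{j=1}^{c} p(\mathbf{x}_k|\omega_j, \boldsymbol{\theta}_j)\, P(\omega_j)$$

The search for the maximum value of $l$ extends over $\boldsymbol{\theta}$ and $P(\omega_i)$, subject to the constraints

$$P(\omega_i) \geq 0, \quad i = 1, \ldots, c, \qquad \sum_{i=1}^{c} P(\omega_i) = 1$$

Maximum-likelihood estimates for $P(\omega_i)$ and $\boldsymbol{\theta}_i$:

$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})$$

$$\sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})\, \nabla_{\boldsymbol{\theta}_i} \ln p(\mathbf{x}_k|\omega_i, \hat{\boldsymbol{\theta}}_i) = 0$$

$$\hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}) = \frac{p(\mathbf{x}_k|\omega_i, \hat{\boldsymbol{\theta}}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}_k|\omega_j, \hat{\boldsymbol{\theta}}_j)\, \hat{P}(\omega_j)}$$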
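The prior estimate $\hat{P}(\omega_i) = \frac{1}{n}\sum_k \hat{P}(\omega_i|x_k, \hat{\theta})$ depends on posteriors that themselves depend on the priors, so one natural way to use it is as a repeated update. The sketch below is one such scheme under simplifying assumptions: the component densities are taken as known unit-variance normals, only the priors are estimated, and the data are synthetic. The function name `update_priors`, the seed, and the iteration count are illustrative.

```python
import math
import random

def pdf(x, mean):
    """Unit-variance univariate normal density (components assumed known here)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def update_priors(samples, priors, means):
    """P(w_i) <- (1/n) * sum_k P(w_i | x_k, theta), using the current priors."""
    totals = [0.0] * len(priors)
    for x in samples:
        joint = [P * pdf(x, m) for P, m in zip(priors, means)]
        evidence = sum(joint)
        for i, j in enumerate(joint):
            totals[i] += j / evidence          # posterior P(w_i | x_k, theta)
    return [t / len(samples) for t in totals]

# Synthetic data: true priors 1/3 and 2/3 (hypothetical values).
random.seed(1)
means = [-2.0, 2.0]
samples = [random.gauss(means[0] if random.random() < 1.0 / 3.0 else means[1], 1.0)
           for _ in range(1000)]

priors = [0.5, 0.5]                             # deliberately wrong starting guess
for _ in range(50):
    priors = update_priors(samples, priors, means)
print("estimated priors:", [round(p, 3) for p in priors])
```

Each update averages posteriors that sum to one per sample, so the estimates automatically satisfy the non-negativity and sum-to-one constraints.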

Application to Normal Mixtures

Component densities $p(\mathbf{x}|\omega_i, \boldsymbol{\theta}_i) \sim N(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$

Three cases (? = unknown, X = known):

Case    $\boldsymbol{\mu}_i$    $\boldsymbol{\Sigma}_i$    $P(\omega_i)$    $c$
1       ?       X       X       X
2       ?       ?       ?       X
3       ?       ?       ?       ?

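Case 1: Unknown Mean Vectors

$$\ln p(\mathbf{x}|\omega_i, \boldsymbol{\mu}_i) = -\ln\left[(2\pi)^{d/2} |\boldsymbol{\Sigma}_i|^{1/2}\right] - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^t \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)$$

$$\nabla_{\boldsymbol{\mu}_i} \ln p(\mathbf{x}|\omega_i, \boldsymbol{\mu}_i) = \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)$$

$$\sum_{k=1}^{n} P(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\mu}})\, \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i) = 0, \qquad \hat{\boldsymbol{\mu}} = (\hat{\boldsymbol{\mu}}_1, \ldots, \hat{\boldsymbol{\mu}}_c)^t$$

$$\hat{\boldsymbol{\mu}}_i = \frac{\sum_{k=1}^{n} P(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\mu}})\, \mathbf{x}_k}{\sum_{k=1}^{n} P(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\mu}})}$$

$$P(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\mu}}) = \frac{p(\mathbf{x}_k|\omega_i, \hat{\boldsymbol{\mu}}_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}_k|\omega_j, \hat{\boldsymbol{\mu}}_j)\, P(\omega_j)}$$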
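The equation for $\hat{\mu}_i$ is implicit, because the posteriors on the right-hand side depend on the means being estimated. One common way to use it is as a fixed-point iteration: compute posteriors from the current means, then recompute the means as posterior-weighted averages. The sketch below does this under stated assumptions: univariate data, known unit variances, known priors, synthetic samples. The function name `update_means`, the starting guess, and the seed are illustrative and not taken from the slides.

```python
import math
import random

def pdf(x, mean):
    """Unit-variance univariate normal density (covariance known, as in Case 1)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def update_means(samples, priors, means):
    """One pass of mu_i <- sum_k P(w_i|x_k, mu) x_k / sum_k P(w_i|x_k, mu)."""
    weighted_sum = [0.0] * len(means)
    weight = [0.0] * len(means)
    for x in samples:
        joint = [P * pdf(x, m) for P, m in zip(priors, means)]
        evidence = sum(joint)
        for i, j in enumerate(joint):
            post = j / evidence                 # posterior P(w_i | x_k, mu)
            weighted_sum[i] += post * x
            weight[i] += post
    return [s / w for s, w in zip(weighted_sum, weight)]

# Synthetic data from a hypothetical mixture with known priors.
random.seed(2)
priors = [1.0 / 3.0, 2.0 / 3.0]
true_means = [-2.0, 2.0]
samples = [random.gauss(true_means[0] if random.random() < priors[0] else true_means[1], 1.0)
           for _ in range(500)]

means = [-1.0, 1.0]                             # starting guess
for _ in range(100):
    means = update_means(samples, priors, means)
print("estimated means:", [round(m, 3) for m in means])
```

The result depends on the starting guess: the likelihood can have more than one local maximum, and the iteration settles in whichever one is reachable from the initial means.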

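Example: two-component one-dimensional normal mixture

$$p(x|\mu_1, \mu_2) = \frac{1}{3\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\mu_1)^2\right] + \frac{2}{3\sqrt{2\pi}} \exp\left[-\frac{1}{2}(x-\mu_2)^2\right]$$

True means: $\mu_1 = -2$, $\mu_2 = 2$
Two local maxima of the log-likelihood:
  – $\hat{\mu}_1 = -2.130$, $\hat{\mu}_2 = 1.668$
  – $\hat{\mu}_1 = 2.085$, $\hat{\mu}_2 = -1.257$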

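Case 2: All Parameters Unknown

$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})$$

$$\hat{\boldsymbol{\mu}}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})\, \mathbf{x}_k}{\sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})}$$

$$\hat{\boldsymbol{\Sigma}}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})\, (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)^t}{\sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})}$$

where

$$\hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}) = \frac{p(\mathbf{x}_k|\omega_i, \hat{\boldsymbol{\theta}}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}_k|\omega_j, \hat{\boldsymbol{\theta}}_j)\, \hat{P}(\omega_j)} = \frac{|\hat{\boldsymbol{\Sigma}}_i|^{-1/2} \exp\left[-\frac{1}{2} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)^t \hat{\boldsymbol{\Sigma}}_i^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)\right] \hat{P}(\omega_i)}{\sum_{j=1}^{c} |\hat{\boldsymbol{\Sigma}}_j|^{-1/2} \exp\left[-\frac{1}{2} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_j)^t \hat{\boldsymbol{\Sigma}}_j^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_j)\right] \hat{P}(\omega_j)}$$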
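When priors, means, and covariances are all unknown, the three estimate equations can be applied in turn, each using posteriors computed from the current parameter values. The sketch below shows one such iteration in the univariate case, where each covariance matrix reduces to a scalar variance. This is an illustration under assumptions, not the slides' prescribed procedure: the function names, synthetic data, starting values, iteration count, and the small variance floor (added only to guard against a component collapsing onto a single point) are all hypothetical choices.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Univariate normal density with variance `var`."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def update_all(samples, priors, means, variances):
    """One pass over the estimates of P(w_i), mu_i and sigma_i^2."""
    c, n = len(priors), len(samples)
    # Posterior P(w_i | x_k, theta) for every sample, from the current parameters.
    post = []
    for x in samples:
        joint = [P * normal_pdf(x, m, v) for P, m, v in zip(priors, means, variances)]
        evidence = sum(joint)
        post.append([j / evidence for j in joint])
    weight = [sum(p[i] for p in post) for i in range(c)]
    new_priors = [w / n for w in weight]
    new_means = [sum(p[i] * x for p, x in zip(post, samples)) / weight[i]
                 for i in range(c)]
    # Variance uses the freshly updated mean; the floor is a practical safeguard (assumption).
    new_vars = [max(sum(p[i] * (x - new_means[i]) ** 2
                        for p, x in zip(post, samples)) / weight[i], 1e-6)
                for i in range(c)]
    return new_priors, new_means, new_vars

# Synthetic data from a hypothetical mixture: 0.4 N(-3, 1) + 0.6 N(2, 0.25).
random.seed(3)
samples = [random.gauss(-3.0, 1.0) if random.random() < 0.4 else random.gauss(2.0, 0.5)
           for _ in range(600)]

priors, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(200):
    priors, means, variances = update_all(samples, priors, means, variances)
print("priors:", [round(p, 3) for p in priors])
print("means:", [round(m, 3) for m in means])
print("variances:", [round(v, 3) for v in variances])
```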

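k-Means Clustering

$\hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})$ is large when the squared Mahalanobis distance $(\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)^t \hat{\boldsymbol{\Sigma}}_i^{-1} (\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i)$ is small
Merely compute the squared Euclidean distance $\|\mathbf{x}_k - \hat{\boldsymbol{\mu}}_i\|^2$, find the mean $\hat{\boldsymbol{\mu}}_m$ nearest to $\mathbf{x}_k$, and approximate $\hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})$ as

$$\hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}}) \approx \begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise} \end{cases}$$

Iteratively apply

$$\hat{\boldsymbol{\mu}}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})\, \mathbf{x}_k}{\sum_{k=1}^{n} \hat{P}(\omega_i|\mathbf{x}_k, \hat{\boldsymbol{\theta}})}$$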

begin initialize $n$, $c$, $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_c$
    do classify $n$ samples according to nearest $\boldsymbol{\mu}_i$
       recompute $\boldsymbol{\mu}_i$
    until no change in $\boldsymbol{\mu}_i$
    return $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_c$
end
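The k-means procedure (classify every sample by its nearest mean, recompute the means, stop when the means no longer change) translates almost line for line into code. The following is a minimal sketch in plain Python; the function names, the `max_iter` safety cap, the rule that an empty cluster keeps its previous mean, and the synthetic two-cluster data are assumptions added for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def k_means(samples, initial_means, max_iter=100):
    means = [tuple(m) for m in initial_means]
    for _ in range(max_iter):
        # Classify each sample according to its nearest mean.
        clusters = [[] for _ in means]
        for x in samples:
            nearest = min(range(len(means)),
                          key=lambda i: squared_distance(x, means[i]))
            clusters[nearest].append(x)
        # Recompute each mean; an empty cluster keeps its old mean (an assumption).
        new_means = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else m
            for cl, m in zip(clusters, means)
        ]
        if new_means == means:          # no change in the means: stop
            break
        means = new_means
    return means

# Synthetic 2-D data: two well-separated groups (hypothetical values).
random.seed(0)
samples = ([(random.gauss(-2.0, 0.5), random.gauss(0.0, 0.5)) for _ in range(50)]
           + [(random.gauss(2.0, 0.5), random.gauss(0.0, 0.5)) for _ in range(50)])

means = k_means(samples, initial_means=[samples[0], samples[-1]])
print("cluster means:", [tuple(round(v, 3) for v in m) for m in means])
```

Initializing the means with actual samples, as done here, is one simple choice; the final clustering can depend on that choice.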

Complexity: $O(ndcT)$ for $n$ samples of dimension $d$, $c$ clusters, and $T$ iterations

In practice, the number of iterations $T$ is generally much less than the number of samples
The values obtained can be accepted as the answer, or can be used as starting points for more exact computations

Maximum-likelihood: $\hat{\mu}_1 = -2.130$, $\hat{\mu}_2 = 1.668$
k-means: $\hat{\mu}_1 = -2.176$, $\hat{\mu}_2 = 1.684$
