Page 1: Statistics and classification


INF4300

Unsupervised classification,

classifier evaluation and

data exploration

Asbjørn Berge 16-09-2009

Today’s plan


Motivation for exploring data and reducing dimensionality

Feature selection
  Distance and performance measures
  Search strategies

Unsupervised classification / clustering
  k-means
  Hierarchical clustering
  Probabilistic clustering (Mixture of Gaussians)

Classifier performance and errors
  Estimating error
  Confusion matrix
  Training and test dataset, and how to efficiently use your data
  Comment on the complexity vs. generalization performance tradeoff
  Outliers / rejection and doubt

Other classification techniques (non-parametric methods)
  k-NN
  Parzen windows

prtools

Bayesian statistics – Decision making

P(ωj | x) = p(x | ωj) P(ωj) / p(x)

(Figure: P(data) and P(class|data) for three classes; example posteriors Pr(class 1) = 0.3, Pr(class 2) = 0.2, Pr(class 3) = 0.5.)

Classification using a Gaussian model

Train the classifier by estimating μi and Σi for each class

Classifying a new sample:

Compute for each class the conditional probability density:

Compute the posterior probability

Assign the label corresponding to the class with the highest posterior probability

μ̂s = (1/Ms) Σ_{m=1..Ms} x_m
Σ̂s = (1/Ms) Σ_{m=1..Ms} (x_m − μ̂s)(x_m − μ̂s)^T
where the sums are over all Ms training samples belonging to class s

p(x | ωs) = 1 / ( (2π)^(P/2) |Σs|^(1/2) ) · exp( −½ (x − μs)^T Σs⁻¹ (x − μs) )

P(ωs | x) ∝ p(x | ωs) P(ωs)

normc

qdc
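As an illustration of the training and classification steps above, here is a minimal numpy sketch of a quadratic Gaussian classifier (the role played by qdc in prtools); the function names and the use of log-posteriors are my own choices, not part of the lecture.

import numpy as np

def fit_gaussians(X, y):
    """Estimate a mean, covariance matrix and prior for each class (the training step)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0),               # class mean mu
                     np.cov(Xc, rowvar=False),       # class covariance Sigma (np.cov divides by Ms-1, not Ms)
                     len(Xc) / len(X))               # prior P(omega)
    return params

def classify(x, params):
    """Assign x to the class with the highest posterior; p(x) and other terms common
    to all classes are dropped from the comparison."""
    best_class, best_score = None, -np.inf
    for c, (mu, sigma, prior) in params.items():
        d = x - mu
        # log of p(x | omega) * P(omega) for a multivariate Gaussian, up to a constant
        score = (-0.5 * d @ np.linalg.solve(sigma, d)
                 - 0.5 * np.log(np.linalg.det(sigma))
                 + np.log(prior))
        if score > best_score:
            best_class, best_score = c, score
    return best_class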

Two special cases for covariance

Diagonal covariance matrices, Σj = σ²I

Discriminant functions are linear functions

With P features, estimate μi (a 1×P vector) for each class.

Classes can be thought of as hyperspheres

Class-specific covariance matrices, Σj arbitrary

Discriminant functions are quadratic functions

With P features, estimate μi (a 1×P vector) and Σi, a symmetric P×P matrix with P(P+1)/2 unique elements

Classes can be thought of as hyperellipsoids

nmc

qdc

The ”curse” of dimensionality

Very simple example, three class classification problem, 9

samples

Divide the space into bins and classify to majority

Page 2: Statistics and classification


The ”curse” of dimensionality

Keep the bin resolution and increase the dimensionality: 3 bins in 1D become 3² = 9 bins in 2D

Roughly 3 examples per bin in 1D; if we want to preserve the density of examples we now need 27 samples!

The ”curse” of dimensionality

The problem escalates rapidly!

In 3D there are 3³ = 27 bins, so 81 examples are needed to preserve the density

If we use the original amount of

samples (9), 2/3 of the feature

space is empty!

The ”curse” of dimensionality

In practice, the curse means that, for a given sample size,

there is a maximum number of features one can add before

any classifier starts to degrade.

How do we beat the ”curse of dimensionality”?

Use simpler parameter estimates for the Gaussian case

Use diagonal covariance matrices

Apply regularized covariance estimation (INF 5300)

Generate few, but informative features

Careful feature design given the application

Reducing the dimensionality

Feature selection

Feature transforms (INF 5300)

We are faced with model choices, and need performance metrics (error measures) and methods to evaluate them.

Feature evaluation and selection

Why reduce the number of features we use to describe the

data?

Countering overfitting: more room for samples to reside in when the dataset dimensionality increases

Reducing variance in parameter estimates: we want to use as many samples as possible to estimate each parameter in the model

In the extreme case – keeping estimates numerically stable: when the number of samples is comparable to the dimensionality of the data, e.g. the covariance matrix is at risk of being singular and impossible to invert

Common rule of thumb: to get reasonable estimates we need a number of samples 5-10 times the dimensionality

Evaluating classification performance

To choose the best classifier for a task, we need to define

some metrics.

There is no superior classifier for all kinds of problems, so

we're stuck with using heuristics to make a choice.

Some classification approaches have parameters we can

tune using these heuristics

We would like to know what kind of performance we can

expect on new (unseen) data

Page 3: Statistics and classification


Overall error

One common way of defining the quality is overall error rate

Usually, it is a weighted average of errors from each class weighted by class prior

What would happen if?

We define error simply as the number of incorrectly classified samples divided by the total number of samples?

Our prior estimates are ”wrong”?

E.g. “my classifier gives right answers 80% of the time” Is it

good? Why (not)? What happens if 80% of the data has the

“N” label and my classifier always says “N”?

overall error = Σ_j (# incorrect samples from class j) · P(ωj) / (# total samples)

testc

Confusion matrix

A convenient way of evaluating classifiers – avoiding such

pitfalls - is the confusion matrix

Plot the true class labels versus the class labels assigned

by the classifier

From this we can read the distribution of incorrectly

classified samples

confmat

Confusion matrix

                       True class labels
                    ω1     ω2     ω3    Total
Classified as ω1    80     15      5     100
Classified as ω2     5    140      5     150
Classified as ω3    25     50    125     200
Total              110    205    135     450

confmat

Confusion matrix – derived measures of classification accuracy

(Using the same confusion matrix as on the previous slide.)

Overall accuracy

It makes sense to evaluate normalized by the (true) class size

!! Many researchers do not, however!

Precision: how accurate (precise) is the classification on each

class?

#”correct label ω”/#”total classified label ω”

Look at rows

Recall

What is the chance of choosing correctly within each class?

#”correct label ω”/#”total true label ω”

Calculate from the columns

Kappa

How much better is the classifier than random guessing?

Compare the diagonal of your confusion matrix with one due to random chance

confmat

With n_ij denoting the confusion-matrix entry for classified label ωi and true label ωj, and N the total number of samples:

Accuracy = Σ_i n_ii / Σ_{i,j} n_ij

Precision(ωi) = n_ii / Σ_j n_ij   (row i)

Recall(ωi) = n_ii / Σ_j n_ji   (column i)

Kappa = ( P(c) − P(r) ) / ( 1 − P(r) ),  where P(c) = Σ_i n_ii / N and P(r) = Σ_i (Σ_j n_ij)(Σ_j n_ji) / N²
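A small numpy sketch of these derived measures, computed on the confusion matrix from the slide (rows = assigned labels, columns = true labels); the variable names are mine.

import numpy as np

# Confusion matrix from the slide: rows = assigned class, columns = true class
C = np.array([[ 80,  15,   5],
              [  5, 140,   5],
              [ 25,  50, 125]])

N = C.sum()
accuracy  = np.trace(C) / N                # sum of the diagonal over the total
precision = np.diag(C) / C.sum(axis=1)     # per row (classified label)
recall    = np.diag(C) / C.sum(axis=0)     # per column (true label)

# Cohen's kappa: agreement beyond what random assignment would give
p_c = accuracy
p_r = (C.sum(axis=1) * C.sum(axis=0)).sum() / N**2
kappa = (p_c - p_r) / (1 - p_r)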

Outliers and doubt

Two rather vague error types in a classification problem are outliers and doubt samples

We might want an ideal classifier to report

'this sample is from class l' (usual case)

'this sample is not from any of the classes' (outlier)

'this sample is too hard for me' (doubt/reject)

The two last cases should lead to a rejection of the sample

rejectc

Outliers

Outliers are heuristically defined as ”..samples which did not (or are thought not to have) come from the assumed population of samples”

The outliers can result from some breakdown in preprocessing (or even before we acquire an image)

One way to deal with outliers is to model them as a class of their own, for example a Gaussian with a very large variance, and estimate its prior probability from the training data

Another approach is to decide on some threshold on the a posteriori probability – and if a sample falls below this threshold for all classes, then declare it an outlier.

rejectc

Page 4: Statistics and classification


Doubt samples

Doubt samples are samples for which the class with the highest probability is not significantly more probable than some of the other classes (e.g. two classes have essentially equal probability).

Classify as doubt if p(x|ωi)P(ωi) < 1 − c for all classes ωi, where c is given by the user.

c must be in the range [0, (K−1)/K] if we have K classes.

Some classification software can allow the user to specify thresholds for doubt

Other software chooses the simpler solution of just guessing

rejectc
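One possible reading of the doubt rule above, sketched in numpy with normalized posteriors; the exact thresholding is an assumption on my part rather than taken verbatim from the slide.

import numpy as np

def classify_with_doubt(posteriors, c):
    """posteriors: array of P(omega_i | x) summing to 1; c: user-chosen value in [0, (K-1)/K]."""
    i = int(np.argmax(posteriors))
    if posteriors[i] < 1.0 - c:      # no class is confident enough -> reject as doubt
        return "doubt"
    return i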

Training and test dataset, and how to

efficiently use your data.

In the ideal case we want to maximize the size of the

training and test dataset

Obviously there is a fixed amount of available data with

known labels

A very simple approach is to separate the dataset in two

random subsets, but we can do better!

The number of features for each object is an important

factor with regard to the amount of available data (more on this in the next lecture)

Back to good use of training data

“Hold out”, ok for large (>1000 objects) datasets

Simply put away a part of the training data, say 1/3 of the

samples chosen randomly – train on the 2/3 remaining,

and evaluate classifier performance (error and so on) on

the 1/3.

Can repeat this a couple of times, and report the average

of repetitions.

Problem: repeated draws overlap

gendat

(Figure: the data table of objects o1 … on with features x1 … xp is split into a Train part and a Test part.)
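A minimal sketch of the hold-out idea (the role of gendat in prtools), assuming a numpy feature matrix X and label vector y; the 1/3 test fraction follows the slide.

import numpy as np

def holdout_split(X, y, test_fraction=1/3, seed=0):
    """Randomly put aside a fraction of the labelled data for testing, train on the rest."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]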

Crossvalidation / Leave – n - Out

A very simple (but computationally complex) improvement

on the hold-out

Train the classifier on a set of N-n samples

Divide the dataset into blocks of n samples

Test the classifier on the n remaining samples

Repeat N/n times (depending on the subsampling), rotating through the data

Report average performance on the repeated experiments

crossval

(Figure: as above, the data table is split into a Train part and a Test part, rotating which block is held out for testing.)

How many blocks to divide the data in?

More is usually better, but there is a trade-off with computational complexity

Usually five or ten blocks is used, often denoted 5-CV, 10-CV

Average, and spread, of classification error can be reported

Can be designed to guarantee samples from each class.

(Stratification)

The logical extreme of crossvalidation is to leave only one

sample out each repetition

Extremely time consuming

Since all samples are visited once, there is no bias from random

subsampling

Stratification impossible

crossval
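A sketch of n-block crossvalidation as described above (what crossval does in prtools); train_and_score is a placeholder for whatever classifier training and error estimation one plugs in, and stratification is not handled here.

import numpy as np

def cross_validate(X, y, train_and_score, n_blocks=5, seed=0):
    """n_blocks-fold crossvalidation; train_and_score(Xtr, ytr, Xte, yte) returns an error rate."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_blocks)
    errors = []
    for k in range(n_blocks):
        test = folds[k]
        train = np.hstack([folds[j] for j in range(n_blocks) if j != k])
        errors.append(train_and_score(X[train], y[train], X[test], y[test]))
    return np.mean(errors), np.std(errors)   # average and spread of the error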

Exploratory data analysis

For a small number of features,

manual data analysis to study

the features is recommended.

Choose intelligent features.

Evaluate e.g.

Error rates for single-feature

classification

Scatter plots

Scatter plots of feature combinations

scatterdui

Page 5: Statistics and classification


What are good features?

Clearly, we need to choose good features

How do we quantify feature quality?

A good feature is simple to ”learn”

This is often related to class separation

Class separation

Measure distance between all points or just class means?

Many distance measures are ”pairwise”

Use average or minimum?

All these distance measures can be represented as a scalar J, also called an ”objective function”

Typical class separation measures

Euclidean distance: the distance between a pair of class means

Mahalanobis distance (sometimes called statistical distance): the distance between a pair of classes weighed by the probability density

Inter/intra class distance: the ratio of the distance between class means to the class ”size”

Classifier accuracy: how well does a classifier perform on the dataset? Evaluate with hold-out or cross-validation

D = | 0    d12  d13 |
    | d21  0    d23 |
    | d31  d32  0   |

Distance matrices

Once a distance measure is defined, we can calculate the distance between objects. These objects could be individual observations, groups of observations (classes)

For N objects, we then have a symmetric distance matrix D whose elements are distances between objects i and j.

(Figure: three objects 1, 2, 3, with the distance d12 between objects 1 and 2 marked.)

feateval

Euclidean distance

A possible distance measure for spaces equipped with a Euclidean metric

For two dimensions (variables), this is just the hypotenuse of a right-angle triangle…

…while for p dimensions, it is the hypotenuse of a hyper-triangle.

(Figure: two points (xi1, xi2) and (xj1, xj2) in the (x1, x2) plane, with the distance dij between them.)

d_ij = sqrt( Σ_{k=1}^{2} (x_ik − x_jk)² )   (two dimensions)

d_ij = sqrt( Σ_{k=1}^{p} (x_ik − x_jk)² )   (p dimensions)

feateval

Multivariate distances between classes: the Euclidean distance

Calculates the Euclidean distance between two “points” defined by the multivariate means of two classes of p variables.

Does not take into account differences among classes in within-class variability nor

correlations among variables.

(Figure: Class 1 and Class 2 as point clouds in the (X1, X2) plane, with the distance d between their means.)

J = d_ij = sqrt( Σ_{k=1}^{p} (μ_ik − μ_jk)² )

feateval
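A short numpy sketch of these two Euclidean measures: the pairwise distance d_ij between points, and the distance between two class means; the function names are illustrative.

import numpy as np

def euclidean(a, b):
    """d_ij = sqrt( sum_k (a_k - b_k)^2 ), valid for any number of dimensions p."""
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def class_mean_distance(X1, X2):
    """Distance between two classes, measured between their multivariate means."""
    return euclidean(X1.mean(axis=0), X2.mean(axis=0))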

Page 6: Statistics and classification


Inter/intra class distance

A simple measure of class separation is inter/intra class

distance

Assumptions

discriminative information in mean differences

class scatter distribution similar for all classes

J_inter/intra = tr{ S_w⁻¹ S_b },  where S_w is the within-class scatter matrix and S_b the between-class scatter matrix

feateval

Mahalanobis distance

J_Mahalanobis = (μ1 − μ2)^T Σ⁻¹ (μ1 − μ2),  with pooled covariance Σ = (N1 Σ1 + N2 Σ2) / (N1 + N2)

Similar to inter/intra is a distance measure based on the Gaussian distribution

Assumptions

weigh mean distance by covariance estimate

pooled covariance estimate

Natural extension allowing different covariances

(Bhattacharyya distance)

feateval
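A sketch of the Mahalanobis class distance with a pooled covariance estimate, assuming numpy arrays of class samples; note that the exact normalization of the pooled estimate can be defined in slightly different ways.

import numpy as np

def mahalanobis_between_classes(X1, X2):
    """Squared Mahalanobis distance between two class means, using a pooled covariance."""
    n1, n2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    S = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)   # pooled covariance estimate
    d = mu1 - mu2
    return float(d @ np.linalg.solve(S, d))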

Distances between observations and objects

We can also calculate a distance between an individual observation and some object, where the object may be another observation or a group mean.

The distance between an observation and a group can be used to define the probability that the observation belongs to the group (e.g. when using the Mahalanobis distance).

(Figure: Group 1 and Group 2 in the (X1, X2) plane, with a group mean and an individual observation marked.)

feateval

Feature selection

Given a feature set x={x1, x2,…,xn} find a subset

ym={xi1,xi2,…,xim} with m<n which optimizes an objective

function J(Y)

featselm

featselm

Feature selection – Search strategy

Exhaustive search implies evaluating (n choose m) subsets if we fix m, and 2^n subsets if we need to search over all possible m as well.

Choosing 10 out of 100 features results in roughly 10^13 queries to J

Obviously we need to guide the search!

Objective function (J)

”Predict” classifier performance

”Predicting” is faster than actual classification

Naïve feature search (individual selection)

Goal: select the two best features individually

Easy to devise a breakdown case: any reasonable objective J will rank the features

J(x1) > J(x2) ≈ J(x3) > J(x4)

Features chosen will be [x1, x2] or [x1, x3]

However – the only feature that provides complementary information to x1 is x4

Search is ”too greedy”

We need to compare choice with reference to already chosen features

featseli

Page 7: Statistics and classification


Forward feature selection

Starting from the empty set, sequentially add the feature x+ that results in the highest objective function J(Yk + x+) when combined with the features Yk that have already been selected

Algorithm (a sketch in Python follows below):
1. Start with the empty set Y0 = Ø
2. Select the next best feature x+
3. Update Yk+1 = Yk + x+; k = k + 1
4. If k is less than the number of features wanted, go to 2

Forward selection performs best when the optimal subset has a small number of features

Forward selection cannot discard features that become obsolete when adding other features

x+ = argmax_{x ∉ Yk} J(Yk + x)

prtools
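A sketch of sequential forward selection as in the algorithm above (featself in prtools); J is any objective function the user passes in, for example a class-separation criterion or a crossvalidated accuracy.

import numpy as np

def forward_selection(X, y, J, n_wanted):
    """Greedily add the feature that maximizes J(Y + x) until n_wanted features are selected."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_wanted:
        scores = [(J(X[:, selected + [f]], y), f) for f in remaining]
        best_score, best_f = max(scores)       # argmax over the candidate features
        selected.append(best_f)
        remaining.remove(best_f)
    return selected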

featself

Backward feature selection

Starting from the full set, sequentially remove the feature x- that results in the smallest decrease in objective function J(Yk - x-) when combined with the features Yk that are already in the set

Algorithm:
1. Start with the full set Yk = X
2. Remove the worst feature x−
3. Update Yk−1 = Yk − x−; k = k − 1
4. If k is more than the number of features wanted, go to 2

Backward selection performs best when the optimal subset has a large number of features

Backward selection cannot re-include features that become necessary when removing other features

Note that the decrease can also be an increase

x− = argmin_{x ∈ Yk} [ J(Yk) − J(Yk − x) ]   (remove the feature whose removal decreases J the least)

prtools

featselb

Floating search (Pudil’s forward)

Starting from the empty set, include features by forward search, then backtrack using backward search until criterion decreases

Algorithm:
1. Start with the empty set Y0 = Ø
2. Do a forward step: Yk+1 = Yk + x+; k = k + 1
3. While we can increase the criterion J, do backward steps: Yk−1 = Yk − x−; k = k − 1
4. If k is less than the number of features wanted, go to 2

Can be extremely time-consuming. The improvement over other methods is somewhat dependent on the feature set

prtools

featselp

Feature selection as dimension reduction

In some cases, a linear (or nonlinear) combination of features might be a better choice than using a subset of features

Consider however, that not all transforms are appropriate for dimension reduction for classification

However, feature selection has one

interesting property – we represent the data on a set of dimensions that retain their meaning

Using distance as a criterion

It might be tempting to rescale the features

Seems reasonable to make features scale-invariant?

For example, scale the data cloud to zero mean and unit

variance

When using Euclidean distance as a criterion this might change the clustering result – which one is the one we want?

Rescaling is not always a good idea, but should be

considered if Euclidean distance is used

What is Cluster Analysis?

Finding groups of objects such that the objects in a

group will be similar (or related) to one another and

different from (or unrelated to) the objects in other

groups

(Figure: inter-cluster distances are maximized, intra-cluster distances are minimized.)

Page 8: Statistics and classification


Notion of a Cluster can be Ambiguous

How many clusters?

(Figure: the same set of points interpreted as two, four, or six clusters.)

Types of Clusterings

A clustering is a set of clusters

Important distinction between hierarchical and partitional sets of clusters

Partitional Clustering

A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset

Soft partitioning allows objects to participate in several

subsets (clusters)

Hierarchical clustering: a set of nested clusters organized as a hierarchical tree

Hierarchical Clustering

Consider a sequence of partitions of the n samples into c clusters

The first is a partition into n clusters, each one containing exactly one sample

The second is a partition into n-1 clusters, the third into n-2, and so on, until the n-th in which there is only one cluster containing all of the

samples

At the level k in the sequence, c = n-k+1.

(Figure: data with clustering order and distances, and the corresponding dendrogram representation.)

hclust

Hierarchical Clustering

Two main types of hierarchical clustering

Agglomerative:

Start with the points as individual clusters

At each step, merge the closest pair of clusters until only one cluster

(or k clusters) left

Divisive:

Start with one, all-inclusive cluster

At each step, split a cluster until each cluster contains a point (or

there are k clusters)

Traditional hierarchical algorithms use a similarity or

distance matrix

Merge or split one cluster at a time

hclust

Generic algorithm – hierarchical clustering

Input: x={x1, x2,…, xn }

Choice of distance metric Δ
Merging criterion (also called linkage criterion)

Output: tree (dendrogram) of cluster merges

Algorithm (a sketch in Python follows below):
1. Put each datapoint xn in its own cluster
2. Join the two closest clusters according to the merging criterion
3. If there is more than one cluster left, go to 2

hclust

plotdg
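A compact sketch of the generic agglomerative algorithm above, using Euclidean distances and a pluggable linkage (min gives single link, max complete link); it records the merges so a dendrogram could be drawn from them. It is quadratic in memory and far from the most efficient formulation.

import numpy as np

def agglomerative(X, n_clusters=1, linkage=min):
    """Start with singleton clusters and repeatedly merge the closest pair according to the linkage."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise point distances
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))              # record the merge for the dendrogram
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges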

Hierarchical clustering hclust

Page 9: Statistics and classification


Hierarchical clustering   hclust

Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters

Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level

They may correspond to meaningful taxonomies

Very popular in the life sciences: related genes, similar plants, etc.

How to Define Inter-Cluster Similarity

Similarity?
MIN (single link)
MAX (complete link)
Group Average
Distance Between Centroids
Other methods driven by an objective function
Ward’s Method uses squared error


Page 10: Statistics and classification


Cluster Similarity: MIN or Single Link

Similarity of two clusters is based on the two most similar

(closest) points in the different clusters

Determined by one pair of points, i.e., by one link in the proximity

graph.

       I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

Hierarchical Clustering: MIN

(Figures: nested clusters and the corresponding dendrogram for single-link (MIN) clustering of six points.)

Strength of MIN

Original Points Two Clusters

• Can handle non-elliptical shapes

Limitations of MIN

Original Points Two Clusters

• Sensitive to noise and outliers

Cluster Similarity: MAX or Complete Linkage

Similarity of two clusters is based on the two least similar (most distant) points in the different clusters

Determined by all pairs of points in the two clusters

(Same similarity matrix as on the single-link slide.)

Page 11: Statistics and classification


Hierarchical Clustering: MAX

(Figures: nested clusters and the corresponding dendrogram for complete-link (MAX) clustering of the same six points.)

Strength of MAX

Original Points Two Clusters

• Less susceptible to noise and outliers

Limitations of MAX

Original Points Two Clusters

•Tends to break large clusters

•Biased towards globular clusters

Cluster Similarity: Group Average

Proximity of two clusters is the average of pairwise proximity

between points in the two clusters.

Need to use average connectivity for scalability, since total proximity favors large clusters

proximity(Cluster_i, Cluster_j) = Σ_{p_i ∈ Cluster_i, p_j ∈ Cluster_j} proximity(p_i, p_j) / ( |Cluster_i| · |Cluster_j| )

(Same similarity matrix as on the single-link slide.)

Hierarchical Clustering: Group Average

(Figures: nested clusters and the corresponding dendrogram for group-average clustering of the same six points.)

Hierarchical Clustering: Group Average

Compromise between Single and Complete Link

Strengths

Less susceptible to noise and outliers

Limitations

Biased towards globular clusters

Page 12: Statistics and classification


Cluster Similarity: Ward’s Method

Similarity of two clusters is based on the increase in

squared error when two clusters are merged

Similar to group average if distance between points is distance

squared

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Can be used to initialize K-means

Hierarchical Clustering: Comparison

(Figure: the same six points clustered with MIN, MAX, Group Average and Ward’s Method, for comparison.)

Hierarchical Clustering: Time and

memory requirements

O(N²) space, since it uses the proximity matrix.

N is the number of points.

O(N³) time in many cases

There are N steps, and at each step the proximity matrix of size N² must be updated and searched

Complexity can be reduced to O(N² log N) time for some approaches

Hierarchical Clustering: Problems

and Limitations

Once a decision is made to combine two clusters, it

cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the

following:

Sensitivity to noise and outliers

Difficulty handling different sized clusters and convex shapes

Breaking large clusters

Partition clustering

Assume we want k classes.

Assume we start with randomly

located cluster centers

n datapoints into k classes means roughly k^n possible allocations to test, hence an iterative algorithm

General algorithm alternates:

Assignment step: Assign each

datapoint to the closest cluster.

Refitting step: Move each cluster

center to the center of gravity of

the data assigned to it.

(Figure: assignments and refitted means.)

k-means Clustering

Each cluster is associated with a centroid (center point)

Each point is assigned to the cluster with the closest centroid

Number of clusters, k, must be specified

The basic algorithm is very simple

kmeans
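A minimal numpy sketch of the basic algorithm (steps 1–4 on the following slides); it assumes no cluster ends up empty, which a real implementation such as prtools' kmeans has to handle.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means: pick k centres from the data, then alternate assignment and refitting."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]        # step 1: random initial centres
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1)
        labels = d.argmin(axis=1)                                 # step 2: assign to the nearest centre
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):                     # step 4: stop when the centres settle
            break
        centres = new_centres                                     # step 3: refit the centres
    return labels, centres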

Page 13: Statistics and classification


k-means example

(Figure: eight datapoints X1–X8 in the plane.)

k-means example

(Figure: the datapoints with three initial cluster centres μ1, μ2, μ3.)

Step 1:

Choose k cluster centres, μk(0),

randomly from the available datapoints

kcentres

k-means example

(Figure: each datapoint assigned to its nearest cluster centre.)

Step 2:

Assign each of the objects in x to

the nearest cluster center μk(i)

xn ∈ c_j^(i),  where j = argmin_{k=1..K} || xn − μ_k^(i) ||

kmeans

k-means example

(Figure: the cluster centres moved to the centre of gravity of their assigned points.)

Step 3:

Recalculate cluster centres μk(i+1)

based on the clustering in iteration i

μ_j^(i+1) = (1 / N_j) Σ_{xn ∈ c_j^(i)} xn

kmeans

k-means example

(Figure: the datapoints and the updated centres; cluster memberships are checked again.)

Step 4:

If the clusters don't change, μk(i+1) ≈ μk(i) (or a prespecified number of iterations i is reached), terminate; else reassign – increase the iteration counter i and go to step 2.

kmeans

k-means example

(Figure: the datapoints and centres after reassignment; some points have changed cluster.)

Step 3 in next iteration:

Recalculate cluster centres.

kmeans

Page 14: Statistics and classification


Quality of K-means clustering

Most common measure is Sum of Squared Error (SSE)

For each point, the error is the distance to the nearest cluster

To get SSE, we square these errors and sum them.

x is a data point in cluster Ci and mi is the representative point for cluster Ci

One can show that mi corresponds to the center (mean) of the cluster

Given two clusterings, we can choose the one with the smallest error

One easy way to reduce SSE is to increase K, the number of clusters. A good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K

SSE = Σ_{i=1}^{K} Σ_{x ∈ Ci} dist(m_i, x)²
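The SSE above in a few lines of numpy, given the labels and centres returned by a k-means run (for instance the sketch earlier):

import numpy as np

def sse(X, labels, centres):
    """Sum of squared distances from each point to the centre of its own cluster."""
    return sum(np.sum((X[labels == i] - m) ** 2) for i, m in enumerate(centres))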

Two different K-means

Clusterings

(Figure: the same original points clustered two ways: a sub-optimal clustering and the optimal clustering.)

Importance of Choosing Initial

Centroids

(Figure: a k-means run shown over iterations 1–6 for one choice of initial centroids.)

Importance of Choosing Initial

Centroids

(Figure: the same run, iterations 1–6 shown individually.)

Importance of Choosing Initial

Centroids …

(Figure: a k-means run shown over iterations 1–5 for another choice of initial centroids.)

Importance of Choosing Initial

Centroids …

(Figure: that run, iterations 1–5 shown individually.)

Page 15: Statistics and classification


Limitations of K-means: Differing

Sizes

Original Points K-means (3 Clusters)

Limitations of K-means: Differing

Density

Original Points K-means (3 Clusters)

Limitations of K-means: Non-

globular Shapes

Original Points K-means (2 Clusters)

Problems with Selecting Initial

Points

If there are K 'real' clusters then the chance of selecting one centroid from each cluster is small.

Chance is relatively small when K is large

If the clusters are the same size, n, then the probability is K! · n^K / (Kn)^K = K!/K^K

For example, if K = 10, then the probability = 10!/10^10 ≈ 0.00036

Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't

Solutions to Initial Centroids Problem

Multiple runs

Helps, but probability is not on your side

Sample and use hierarchical clustering to determine initial centroids

Select more than k initial centroids and then select among these initial centroids

Select most widely separated

Improved k-means with cluster merge and split (ISODATA)

Soft k-means

Instead of making hard assignments of data-points to

clusters, we can make soft assignments.

One cluster may have a responsibility of .7 for a data-point

and another may have a responsibility of .3.

Allows a cluster to use more information about the data in the

refitting step.

How do we decide on the soft assignments?

emclust

Page 16: Statistics and classification


Probabilistic clustering

Assume a probability distribution for each cluster

We have a dataset that is a mixture of components

Common choice of components is Gaussians, θk={μk,Σk}

Exhaustive search of mixing parameters impossible, usually the EM-algorithm is used

p(x) = Σ_{k=1}^{K} π_k p(x; θ_k)

emclust

The mixture of Gaussians model

First pick one of the k Gaussians with a probability that is

called its “mixing proportion”.

Then generate a random point from the chosen Gaussian.

The probability of generating the exact data we observed

is zero, but we can still try to maximize the probability

density.

Adjust the means of the Gaussians

Adjust the variances of the Gaussians on each

dimension (or use a full covariance Gaussian).

Adjust the mixing proportions of the Gaussians.

emclust

Computing responsibilities

In order to adjust the

parameters, we must first solve the inference problem: which Gaussian generated each datapoint x?

We cannot be sure, so it's a distribution over all

possibilities.

Use Bayes' theorem to get the posterior probabilities

p(i | x) = p(i) p(x | i) / p(x)          (posterior for Gaussian i; the prior p(i) is its mixing proportion)

p(x) = Σ_j p(j) p(x | j)

p(x | i) = Π_d  1 / (√(2π) σ_{i,d}) · exp( −(x_d − μ_{i,d})² / (2 σ²_{i,d}) )    (product over all data dimensions d)

Computing the new mixing proportions

Each Gaussian gets a

certain amount of

posterior probability for

each datapoint.

The optimal mixing

proportion to use (given

these posterior

probabilities) is just the

fraction of the data that

the Gaussian gets

responsibility for.

π_i^new = (1/N) Σ_{c=1}^{N} p(i | x^c)

where x^c is the data for training case c, N is the number of training cases, and p(i | x^c) is the posterior for Gaussian i.

Computing the new means

We just take the center of gravity of the data that the Gaussian is responsible for.

Just like in K-means, except the data is weighted by the posterior probability of

the Gaussian.

Guaranteed to lie in the convex hull of the data

Could be big initial jump

μ_i^new = Σ_c p(i | x^c) x^c / Σ_c p(i | x^c)

Computing the new variances

For axis-aligned Gaussians, we just fit the variance of the

Gaussian on each dimension to the posterior-weighted

data

Its more complicated if we use a full-covariance Gaussian that is

not aligned with the axes.

σ²_{i,d}^new = Σ_c p(i | x^c) (x^c_d − μ^new_{i,d})² / Σ_c p(i | x^c)
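Putting the three update rules above together, here is a sketch of one EM iteration for a mixture of axis-aligned Gaussians; the array shapes and the log-domain computation of the responsibilities are my own choices, not part of the lecture.

import numpy as np

def em_step(X, mu, var, pi):
    """One EM iteration. X: (n, d) data; mu, var: (k, d) means and per-dimension variances;
    pi: (k,) mixing proportions."""
    # E-step: posterior responsibilities p(i | x^c) via Bayes' theorem (in the log domain)
    log_p = (-0.5 * np.sum((X[:, None, :] - mu) ** 2 / var, axis=2)
             - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
             + np.log(pi))
    resp = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: refit mixing proportions, means and variances using the responsibilities
    Nk = resp.sum(axis=0)
    pi_new = Nk / len(X)                                        # fraction of the data each Gaussian "owns"
    mu_new = (resp.T @ X) / Nk[:, None]                         # responsibility-weighted means
    var_new = (resp.T @ (X ** 2)) / Nk[:, None] - mu_new ** 2   # weighted variances around the new means
    return mu_new, var_new, pi_new, resp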

Page 17: Statistics and classification


How many Gaussians do we use?

Hold back a validation set.

Try various numbers of Gaussians

Pick the number that gives the highest density to the

validation set.

Refinements:

We could make the validation set smaller by using

several different validation sets and averaging the

performance.

We should use all of the data for a final training of the

parameters once we have decided on the best number

of Gaussians.

Non-parametric methods

Arguably, a Gaussian blob-shape might not be appropriate

for class description of all classes

”Let the data describe the model”

We might want to skip modeling the conditional density and model the a posteriori probability directly

In effect estimate p(x|ω) or even p(ω|x) without assuming

a specific model

k-Nearest Neighbors

Parzen windows

k-Nearest-Neighbor classification

Allocate a sample to the same class as the majority of the k nearest neighbors in the training set

Classification of a new sample x is done as follows: out of the N training vectors, identify the k nearest neighbors (measured by Euclidean distance) in the training set, irrespective of the class label. k should be odd.

Out of these k samples, identify the number of vectors ki that belong to class ωi, i = 1, 2, ..., M (if we have M classes)

Assign x to the class ωi with the maximum number ki of samples.

k must be set by the user (crossvalidate).

kNN tessellates the data space, i.e., the decision boundaries are usually polyhedra

knnc

k-Nearest-Neighbor classification

Using only the closest example to determine the

categorization is subject to errors due to:

A single atypical example.

Noise (i.e. error) in the category label of a single training example.

More robust alternative is to find the k most-similar

examples and return the majority category of these k

examples.

Value of k is typically odd to avoid ties; 3 and 5 are most common

Tradeoffs: we want the neighborhood around x to be as small as possible while keeping k as large as possible

Optimality is guaranteed only as k → ∞, but this is impossible in a small neighborhood unless the number of samples is infinite

knnc
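A minimal sketch of the k-NN rule described above (knnc in prtools); ties are broken implicitly by np.argmax, which a more careful implementation would handle explicitly.

import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Assign x to the majority class among its k nearest training samples (Euclidean distance).
    X_train and y_train are numpy arrays."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]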

Probabilistic interpretation of k-NN

We estimate the a posteriori probability P(ωi|x) by using the k neighbors in a small region x' around x: P(ωi|x')

Bias and variance tradeoff: a small neighborhood → large variance → unreliable estimation; a large neighborhood → large bias → inaccurate estimation

knnc

When to Consider k-Nearest Neighbor?

Lots of training data

Less than 20 features per object

Advantages:

Training is very fast

Can learn complex decision boundary functions

Disadvantages:

Slow at query time

Will be disturbed by irrelevant attributes, sensitive to

scaling

What is the optimal k?

knnc

Page 18: Statistics and classification


Parzen windows

Instead of using k samples as (”sort of”) a density estimate – weigh the influence of each xi by its distance to x

We observe a d-dim. window (covering n samples), but

the method needs a choice of window-width hn

If we make sure that our weight function, φ , is a proper

probability density – the estimate will be one as well.

p_n(x) = (1/n) Σ_{i=1}^{n} (1/V_n) φ( (x − x_i) / h_n ),   with V_n = h_n^d
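A sketch of the Parzen estimate following the formula above, using a Gaussian window φ so that the estimate is itself a proper density; the choice of window is my assumption, since the lecture leaves φ general.

import numpy as np

def parzen_density(x, X_train, h):
    """Parzen window estimate p(x) = 1/n * sum_i (1/h^d) * phi((x - x_i)/h), Gaussian phi."""
    n, d = X_train.shape
    u = (X_train - x) / h
    phi = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return phi.sum() / (n * h ** d)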

parzenc

Parzen windows

•The density estimate at point x is a sum of window functions

centered at xi

•Window width defines the ”focus” – when width goes toward

zero – the window function goes toward a delta function

parzenc

Effect of window width

• Toy example – 5 samples in the box
• Too wide: lack of resolution in describing the density
• Too small: too much variability in the estimate
• Convergence to the true probability density is possible with an infinite number of samples

parzenc

Effect of window width

parzenc

Parzen classifiers

Estimate the a posteriori density by the Parzen window method

Arbitrarily complex densities possible to estimate

Usually, a LOT of samples are needed to avoid overfitting of the training set

The number of samples needed grows roughly exponentially with the dataset dimension

Need to decide window width hn

parzendc

What to remember from this lecture

K-means

Understand basic algorithm, overview of soft k-means / probabilistic clustering, sense of some of the pitfalls

Outline of hierarchical clustering: be able to describe the algorithmic idea, and have an idea of its strengths and weaknesses

Distance measures, feature selection algorithms

Understand basic search strategies

Classification can be done by estimating decisions directly instead of modeling data

For example k-NN and Parzen

Training data is a scarce resource, and should be used efficiently. Crossvalidation is usually a good approach for using data to decide the parameters of classifiers

Performance of a classifier can be affected by different types of errors

Samples that should be rejected (doubt and outliers), model mismatch to the data, and poor model parameter estimates all play a role; one needs to be more specific than the overall error when analyzing results