INF4300
Unsupervised classification, classifier evaluation and data exploration
Asbjørn Berge 16-09-2009

Today's plan
Motivation for exploring data and reducing dimensionality
Feature selection
Distance and performance measures
Search strategies
Unsupervised classification / clustering
k-means
Hierarchical clustering
Probabilistic clustering (Mixture of Gaussians)
Classifier performance and errors
Estimating error
Confusion matrix
Training and test datasets, and how to use your data efficiently
Comment on the complexity/generalization performance trade-off
Outliers / rejection and doubt
Other classification techniques (non-parametric methods)
k-NN
Parzen windows
prtools

Bayesian statistics – decision making
P(ω_j | x) ∝ p(x | ω_j) P(ω_j)
Example posterior: P(class 1 | data) = 0.3, P(class 2 | data) = 0.2, P(class 3 | data) = 0.5

Classification using a Gaussian model
Train the classifier by estimating μ_s and Σ_s for each class s:
μ̂_s = (1/M_s) Σ_{m=1..M_s} x_m
Σ̂_s = (1/M_s) Σ_{m=1..M_s} (x_m − μ̂_s)(x_m − μ̂_s)^T
where the sums are over all training samples belonging to class s.
Classifying a new sample x:
Compute the class-conditional probability density for each class,
p(x | ω_s) = (2π)^(−P/2) |Σ_s|^(−1/2) exp( −½ (x − μ_s)^T Σ_s^(−1) (x − μ_s) )
Compute the posterior probability P(ω_s | x) ∝ p(x | ω_s) P(ω_s)
Assign the label corresponding to the class with the highest posterior probability.
normc qdc

Two special cases for covariance
Diagonal covariance matrices, Σ_j = σ²I
Discriminant functions are linear functions
With P features, estimate μ_j (a 1×P vector) for each class
Classes can be thought of as hyperspheres
Class-specific covariance matrices, Σ_j arbitrary
Discriminant functions are quadratic functions
With P features, estimate μ_j (a 1×P vector); Σ_j is a P×P matrix with P(P+1)/2 unique elements
Classes can be thought of as hyperellipsoids
nmc qdc

The "curse" of dimensionality
Very simple example: a three-class classification problem with 9 samples.
Divide the space into bins and classify each bin to the majority class.

Motivation for exploring data and reducing dimensionality
Feature selection
Faced with model choices. Need performance metrics
(error measures) and methods to evaluate these.
Feature evaluation and selection
Why reduce the number of features we use to describe the data?
Countering overfitting: there is more room for samples to reside in when the dataset dimensionality increases
Reducing variance in parameter estimates: we want to use as many samples as possible to estimate each parameter in the model
In the extreme case – keeping estimates numerically stable: when the number of samples is comparable to the dimensionality of the data, e.g. the covariance matrix is at risk of being singular and impossible to invert
Common rule of thumb: to get reasonable estimates we need a number of samples 5-10 times the dimensionality
Evaluating classification performance
To choose the best classifier for a task, we need to define
some metrics.
There is no superior classifier for all kinds of problems, so
we're stuck with using heuristics to make a choice.
Some classification approaches have parameters we can
tune using these heuristics
We would like to know what kind of performance we can
expect on new (unseen) data
Overall error
One common way of defining quality is the overall error rate
Usually, it is an average of the per-class errors, weighted by the class priors
What would happen if?
We define error as the number of correct samples divided by the total number of samples?
Our prior estimates are "wrong"?
E.g. "my classifier gives the right answer 80% of the time" – is that good? Why (not)? What happens if 80% of the data has the "N" label and my classifier always says "N"?

error = Σ_j P(ω_j) · (# incorrect samples from class ω_j) / (# samples from class ω_j)
testc

Confusion matrix
A convenient way of evaluating classifiers – avoiding such
pitfalls - is the confusion matrix
Plot the true class labels versus the class labels assigned
by the classifier
From this we can read the distribution of incorrectly
classified samples
confmat
Confusion matrix (rows: class labels assigned by the classifier, columns: true class labels)

                True ω1   True ω2   True ω3   Total
Assigned ω1        80        15         5      100
Assigned ω2         5       140         5      150
Assigned ω3        25        50       125      200
Total             110       205       135      450

confmat
Confusion matrix – derived measures of classification accuracy
(using the confusion matrix from the previous slide)
Overall accuracy
It makes sense to evaluate normalized by (true) class size
!! Many researchers do not, however!
Precision
How accurate (precise) is the classification on each class?
# "correct label ω" / # "total classified as label ω"
Look at the rows
Recall
What is the chance of choosing correctly within each class?
# "correct label ω" / # "total true label ω"
Calculate from the columns
Kappa
How much better is the classifier than random guessing?
Compare the diagonal of your confusion matrix with one due to random chance
confmat
With c_ij the number of samples assigned to class ω_i whose true class is ω_j:

Accuracy = Σ_i c_ii / Σ_{i,j} c_ij

Precision(ω_i) = c_ii / Σ_j c_ij   (divide by the row sum, i.e. everything classified as ω_i)

Recall(ω_i) = c_ii / Σ_j c_ji   (divide by the column sum, i.e. everything truly from ω_i)

Kappa: κ = (P_c − P_r) / (1 − P_r), where P_c is the observed accuracy above and
P_r = Σ_i (Σ_j c_ij · Σ_j c_ji) / (Σ_{i,j} c_ij)² is the accuracy expected from random chance
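To make these measures concrete, here is a minimal sketch in Python/NumPy (rather than the prtools confmat routine named on the slides), using the confusion matrix from the table above; the variable names are just illustrative.

```python
import numpy as np

# Rows: class assigned by the classifier, columns: true class (as in the table above).
C = np.array([[ 80,  15,   5],
              [  5, 140,   5],
              [ 25,  50, 125]], dtype=float)

N = C.sum()                               # total number of samples (450)
accuracy  = np.trace(C) / N               # sum of the diagonal over all samples
precision = np.diag(C) / C.sum(axis=1)    # divide by row sums: everything classified as omega_i
recall    = np.diag(C) / C.sum(axis=0)    # divide by column sums: everything truly from omega_i

# Cohen's kappa: compare observed accuracy with the accuracy expected by chance.
P_r   = (C.sum(axis=1) * C.sum(axis=0)).sum() / N**2
kappa = (accuracy - P_r) / (1 - P_r)

print(accuracy, precision, recall, kappa)
```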
Outliers and doubt
Two rather vague error types in a classification problem are outliers and doubt samples
We might want an ideal classifier to report
'this sample is from class l' (usual case)
'this sample is not from any of the classes' (outlier)
'this sample is too hard for me' (doubt/reject)
The two last cases should lead to a rejection of the sample
rejectc

Outliers
Outliers are heuristically defined as "samples which did not (or are thought not to have) come from the assumed population of samples"
Outliers can result from some breakdown in preprocessing (or even before we acquire an image)
One way to deal with outliers is to model them as a class of their own, for example a Gaussian with a very large variance, and estimate the prior probability from the training data
Another approach is to decide on some threshold on the a posteriori probability – if a sample falls below this threshold for all classes, declare it an outlier
rejectc
Doubt samples
Doubt samples are samples for which the class with the highest probability is not significantly more probable than some of the other classes (e.g. two classes have essentially equal probability).
Classify as doubt if p(x|ω_i)P(ω_i) < 1 − c for all classes i, where c is given by the user.
c must be in the range [0, (K−1)/K] if we have K classes.
Some classification software can allow the user to specify thresholds for doubt
Other software choose the simpler solution of just guessing
rejectc
Training and test dataset, and how to
efficiently use your data.
In the ideal case we want to maximize the size of the
training and test dataset
Obviously there is a fixed amount of available data with
known labels
A very simple approach is to separate the dataset in two
random subsets, but we can do better!
The number of features for each object is an important
factor with regards to the amount of available data (further
on this next lecture)
Back to good use of training data
“Hold out”, ok for large (>1000 objects) datasets
Simply put away a part of the training data, say 1/3 of the
samples chosen randomly – train on the 2/3 remaining,
and evaluate classifier performance (error and so on) on
the 1/3.
Can repeat this a couple of times, and report the average
of repetitions.
Problem: repeated draws overlap
gendat
(Illustration: a table of objects o1…on described by features x1…xp and a class label, split into a Train part and a Test part)
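A minimal sketch of the hold-out split described above, in Python/NumPy rather than prtools' gendat; the function and array names are made up for illustration.

```python
import numpy as np

def holdout_split(X, y, test_fraction=1/3, seed=None):
    """Randomly put aside a fraction of the labelled data for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))               # shuffle the object indices
    n_test = int(round(test_fraction * len(y)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Train on the 2/3 part, estimate the error on the 1/3 part;
# repeat with different seeds and report the average error.
```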
Cross-validation / Leave-n-out
A very simple (but computationally complex) improvement on the hold-out
Divide the dataset into blocks of n samples
Train the classifier on the remaining N − n samples
Test the classifier on the n samples in the held-out block
Repeat N/n times (depending on the subsampling), rotating through the data
Report the average performance over the repeated experiments
crossval
(Illustration: the same data table, with the train/test split rotating through the blocks)
How many blocks to divide the data in?
More is usually better, but there is a trade-off with computational complexity
Usually five or ten blocks is used, often denoted 5-CV, 10-CV
Average, and spread, of classification error can be reported
Can be designed to guarantee samples from each class.
(Stratification)
The logical extreme of crossvalidation is to leave only one
sample out each repetition
Extremely time consuming
Since all samples are visited exactly once, there is no bias from random
subsampling
Stratification impossible
crossval
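A possible sketch of the block-wise cross-validation described above (Python/NumPy, not prtools' crossval); `train` and `classify` are placeholders for whatever classifier is being evaluated.

```python
import numpy as np

def cross_validation_error(X, y, train, classify, n_blocks=5, seed=None):
    """Average test error over n_blocks rotations through the data."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(len(y)), n_blocks)
    errors = []
    for k in range(n_blocks):
        test_idx  = blocks[k]
        train_idx = np.concatenate([blocks[j] for j in range(n_blocks) if j != k])
        model = train(X[train_idx], y[train_idx])      # train on N - n samples
        y_hat = classify(model, X[test_idx])           # test on the held-out block
        errors.append(np.mean(y_hat != y[test_idx]))
    return np.mean(errors), np.std(errors)             # average and spread of the error
```

Setting n_blocks equal to the number of samples gives leave-one-out; a stratified variant would additionally keep the class proportions in every block.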
Exploratory data analysis
For a small number of features,
manual data analysis to study
the features is recommended.
Choose intelligent features.
Evaluate e.g.
Error rates for single-feature classification
Scatter plots
Scatter plots of feature combinations
scatterdui
What are good features?
Clearly, we need to choose good features
How do we quantify feature quality?
A good feature is simple to ”learn”
This is often related to class separation
Class separation
Measure distance between all points or just class means?
Many distance measures are ”pairwise”
Use average or minimum?
All these distance measures can be represented as a scalar J, also called an ”objective function”
Typical class separation measures
Euclidean distance: distance between a pair of class means
Mahalanobis distance: sometimes called statistical distance; distance between a pair of classes, weighted by the probability density
Inter/intra class distance: the ratio of the distance between class means to the class "size"
Classifier accuracy: how well does a classifier perform on the dataset?
Evaluate with hold-out or cross-validation
Distance matrices
Once a distance measure is defined, we can calculate the distance between objects. These objects could be individual observations or groups of observations (classes).
For N objects, we then have a symmetric distance matrix D whose elements are the distances between objects i and j, e.g. for three objects:

D = [  0    d12   d13
      d21    0    d23
      d31   d32    0  ]

(Illustration: three objects, with the distance d12 marked between objects 1 and 2)
feateval
Euclidean distance
A possible distance measure for spaces equipped with a Euclidean metric
For two dimensions (variables), this is just the hypotenuse of a right-angle triangle…
…while for p dimensions, it is the hypotenuse of a hyper-triangle.
(Illustration: two points (x_i1, x_i2) and (x_j1, x_j2) in the (x1, x2) plane, with the Euclidean distance d_ij between them)

In two dimensions:
d_ij = sqrt( Σ_{k=1..2} (x_ik − x_jk)² )

In p dimensions:
d_ij = sqrt( Σ_{k=1..p} (x_ik − x_jk)² )
feateval

Multivariate distances between classes: the Euclidean distance
Calculates the Euclidean distance between two “points” defined by the multivariate means of two classes of p variables.
Does not take into account differences among classes in within-class variability nor
correlations among variables.
(Illustration: Class 1 and Class 2 in the (X1, X2) plane, with the distance d between their means)

J = d_ij = sqrt( Σ_{k=1..p} (μ_ik − μ_jk)² ),   where μ_ik is the mean of variable k in class i

feateval
Inter/intra class distance
A simple measure of class separation is inter/intra class
distance
Assumptions
discriminative information in mean differences
class scatter distribution similar for all classes
J_inter/intra = tr{ S_w⁻¹ S_b }

(S_w: within-class scatter matrix, S_b: between-class scatter matrix)
feateval

Mahalanobis distance

J_Mahalanobis = (μ1 − μ2)^T Σ⁻¹ (μ1 − μ2),   with pooled covariance Σ = (N1 Σ1 + N2 Σ2) / (N1 + N2)
Similar to inter/intra is a distance measure based on the Gaussian distribution
Assumptions
weigh mean distance by covariance estimate
pooled covariance estimate
Natural extension allowing different covariances
(Bhattacharyya distance)
feateval
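To make the Euclidean and Mahalanobis class-separation measures concrete, a small Python/NumPy sketch (not prtools' feateval); each class is assumed to be given as an array with one sample per row.

```python
import numpy as np

def euclidean_separation(X1, X2):
    """Euclidean distance between the means of two classes."""
    return np.linalg.norm(X1.mean(axis=0) - X2.mean(axis=0))

def mahalanobis_separation(X1, X2):
    """Squared Mahalanobis distance between the class means, using a pooled covariance."""
    d = X1.mean(axis=0) - X2.mean(axis=0)
    N1, N2 = len(X1), len(X2)
    pooled = (N1 * np.cov(X1, rowvar=False) + N2 * np.cov(X2, rowvar=False)) / (N1 + N2)
    return d @ np.linalg.solve(pooled, d)    # d^T * pooled^-1 * d
```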
Distances between observations and objects
We can also calculate a distance between an individual observation and some object, where the object may be another observation or a group mean.
The distance between an observation and a group can be used to define the probability that the observation belongs to the group (e.g. when using the Mahalanobis distance).
(Illustration: Group 1 and Group 2 in the (X1, X2) plane, with the distance from an observation to a group mean)
feateval

Feature selection
Given a feature set x={x1, x2,…,xn} find a subset
ym={xi1,xi2,…,xim} with m<n which optimizes an objective
function J(Y)
featselm
featselm

Feature selection – search strategy
Exhaustive search implies evaluating "n choose m" subsets if we fix m, and 2^n subsets if we need to search over all possible m as well.
Choosing 10 out of 100 features will result in ~10^13 queries to J
Obviously we need to guide the search!
Objective function (J)
”Predict” classifier performance
”Predicting” is faster than actual classification
Naïve feature search (individual selection)
Goal: select the two best features individually
Easy to devise a breakdown case: any reasonable objective J will rank the features
J(x1) > J(x2) ≈ J(x3) > J(x4)
The features chosen will be [x1, x2] or [x1, x3]
However – the only feature that provides complementary information to x1 is x4
Search is ”too greedy”
We need to compare choice with reference to already chosen features
featseli
Forward feature selection
Starting from the empty set, sequentially add the feature x+ that results in the highest objective function J(Yk + x+) when combined with the features Yk that have already been selected
Algorithm
1. Start with the empty set Y0 = Ø
2. Select the next best feature
3. Update Yk+1 = Yk + x+; k = k + 1
4. If k is less than the number of features wanted, go to 2

Forward selection performs best when the optimal subset has a small number of features
Forward selection cannot discard features that become obsolete when adding other features

x+ = arg max_{x ∉ Yk} J(Yk + x)

prtools
featself
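A sketch of the greedy forward search above in Python (not prtools' featself); the objective J is assumed to be any user-supplied function of a feature subset, for example a cross-validated accuracy or a class-separation measure computed on those feature columns.

```python
def forward_selection(n_features, J, n_wanted):
    """Greedy forward search: grow the subset one feature at a time."""
    selected = []                               # Y0 is the empty set
    remaining = list(range(n_features))
    while len(selected) < n_wanted:
        # pick the feature that maximizes J when added to the current subset
        best = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```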
Backward feature selection
Starting from the full set, sequentially remove the feature x- that results in the smallest decrease in objective function J(Yk - x-) when combined with the features Yk that are already in the set
Algorithm
1. Start with the full set Yk = X
2. Remove the worst feature
3. Update Yk−1 = Yk − x−; k = k − 1
4. If k is greater than the number of features wanted, go to 2
Backward selection performs best when the optimal subset has a large number of features
Backward selection cannot re-include features that become necessary when removing other features
Note that the decrease can also be an increase
x− = arg min_{x ∈ Yk} [ J(Yk) − J(Yk − x) ]
prtools
featselb
Floating search (Pudil’s forward)
Starting from the empty set, include features by forward search, then backtrack using backward search until criterion decreases
Algorithm
1. Start with the empty set Y0 = Ø
2. Do a forward step: Yk+1 = Yk + x+; k = k + 1
3. While we can increase the criterion J, do backward steps: Yk−1 = Yk − x−; k = k − 1
4. If k is less than the number of features wanted, go to 2
Can be extremely time-consuming
The improvement over other methods is somewhat dependent on the feature set
prtools
featselp

Feature selection as dimension reduction
In some cases, a linear (or nonlinear combination) of features might be a better choice than using a subset of features
Consider however, that not all transforms are appropriate for dimension reduction for classification
However, feature selection has one
interesting property – we represent the data on a set of dimensions that retain their meaning
Using distance as a criterion
It might be tempting to rescale the features
It seems reasonable to make features scale-invariant?
For example, scale the data cloud to zero mean and unit variance
When using Euclidean distance as a criterion this might change the clustering result – which one is the one we want?
Rescaling is not always a good idea, but should be
considered if Euclidean distance is used
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
Inter-cluster distances are maximized
Intra-cluster distances are minimized
Notion of a Cluster can be
Ambiguous
How many clusters?
(Panels: the same points interpreted as two clusters, four clusters, or six clusters)
Types of Clusterings
A clustering is a set of clusters
Important distinction between hierarchical and partitional sets of clusters
Partitional Clustering
A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
Soft partitioning allows objects to participate in several
subsets (clusters)
Hierarchical clustering: a set of nested clusters organized as a hierarchical tree
Hierarchical Clustering
Consider a sequence of partitions of the n samples into c clusters
The first is a partition into n clusters, each one containing exactly one sample
The second is a partition into n-1 clusters, the third into n-2, and so on, until the n-th in which there is only one cluster containing all of the
samples
At the level k in the sequence, c = n-k+1.
(Figure: data with clustering order and distances, and the corresponding dendrogram representation)
hclust

Hierarchical Clustering
Two main types of hierarchical clustering
Agglomerative:
Start with the points as individual clusters
At each step, merge the closest pair of clusters until only one cluster
(or k clusters) left
Divisive:
Start with one, all-inclusive cluster
At each step, split a cluster until each cluster contains a point (or
there are k clusters)
Traditional hierarchical algorithms use a similarity or
distance matrix
Merge or split one cluster at a time
hclust
Generic algorithm – hierarchical clustering
Input: x={x1, x2,…, xn }
Choice of distance metric Δ
Merging criterion (also called linkage criterion)
Output: tree (dendrogram) of cluster merges
Algorithm
1. Put each datapoint xi in its own cluster
2. Join the two closest clusters according to the merging criterion
3. If there is more than one cluster left, go to 2
hclust
plotdg
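A naive Python/NumPy sketch of the generic algorithm above (not prtools' hclust); the linkage argument chooses between single link (min) and complete link (max), and the sketch only returns the final clusters rather than the full dendrogram.

```python
import numpy as np

def agglomerative_clustering(X, n_clusters, linkage=min):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(X))]                       # each datapoint in its own cluster
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)    # pairwise distance matrix
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # inter-cluster distance according to the chosen linkage criterion
                d = linkage(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)                            # join the two closest clusters
    return clusters
```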
(Example figures: hierarchical clustering)
hclust
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
They may correspond to meaningful taxonomies
Very popular in the life sciences, related genes, similar plants etc
How to Define Inter-Cluster Similarity
(Figure: two clusters – how similar are they?)
MIN (single link)
MAX (complete link)
Group Average
Distance Between centroids
Other methods driven by an objective
function
Ward’s Method uses squared error
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar
(closest) points in the different clusters
Determined by one pair of points, i.e., by one link in the proximity
graph.
      I1     I2     I3     I4     I5
I1   1.00   0.90   0.10   0.65   0.20
I2   0.90   1.00   0.70   0.60   0.50
I3   0.10   0.70   1.00   0.40   0.30
I4   0.65   0.60   0.40   1.00   0.80
I5   0.20   0.50   0.30   0.80   1.00
Hierarchical Clustering: MIN
(Figure: nested clusters and the corresponding dendrogram for single-link clustering of six points)
Strength of MIN
Original Points Two Clusters
• Can handle non-elliptical shapes
Limitations of MIN
Original Points Two Clusters
• Sensitive to noise and outliers
Cluster Similarity: MAX or Complete Linkage
Similarity of two clusters is based on the two least similar
(most distant) points in the different clusters
Determined by all pairs of points in the two clusters
Hierarchical Clustering: MAX
(Figure: nested clusters and the corresponding dendrogram for complete-link clustering of the same six points)
Strength of MAX
Original Points Two Clusters
• Less susceptible to noise and outliers
Limitations of MAX
Original Points Two Clusters
• Tends to break large clusters
• Biased towards globular clusters
Cluster Similarity: Group Average
Proximity of two clusters is the average of pairwise proximity
between points in the two clusters.
Need to use average connectivity for scalability, since total proximity
favors large clusters
proximity(Cluster_i, Cluster_j) = Σ_{p_i ∈ Cluster_i, p_j ∈ Cluster_j} proximity(p_i, p_j) / ( |Cluster_i| · |Cluster_j| )
Hierarchical Clustering: Group Average
(Figure: nested clusters and the corresponding dendrogram for group-average clustering of the same six points)
Hierarchical Clustering: Group Average
Compromise between Single and Complete Link
Strengths
Less susceptible to noise and outliers
Limitations
Biased towards globular clusters
Cluster Similarity: Ward’s Method
Similarity of two clusters is based on the increase in
squared error when two clusters are merged
Similar to group average if distance between points is distance
squared
Less susceptible to noise and outliers
Biased towards globular clusters
Hierarchical analogue of K-means
Can be used to initialize K-means
Hierarchical Clustering: Comparison
(Figure: the same six points clustered with MIN, MAX, group average and Ward's method, showing the nested clusters each method produces)
Hierarchical Clustering: Time and
memory requirements
O(N²) space, since it uses the proximity matrix (N is the number of points)
O(N³) time in many cases
There are N steps, and at each step the proximity matrix (of size N²) must be updated and searched
Complexity can be reduced to O(N² log N) time for some approaches
Hierarchical Clustering: Problems
and Limitations
Once a decision is made to combine two clusters, it
cannot be undone
No objective function is directly minimized
Different schemes have problems with one or more of the
following:
Sensitivity to noise and outliers
Difficulty handling different sized clusters and convex shapes
Breaking large clusters
Partition clustering
Assume we want k classes.
Assume we start with randomly located cluster centers
Assigning n datapoints to k classes means ~kⁿ possible allocations to test, hence an iterative algorithm
The general algorithm alternates between:
Assignment step: assign each datapoint to the closest cluster
Refitting step: move each cluster center to the center of gravity of the data assigned to it
(Illustration: assignments, then refitted means)
k-means Clustering
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, k, must be specified
The basic algorithm is very simple
kmeans
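A minimal Python/NumPy sketch of the basic algorithm (not prtools' kmeans/kcentres); empty clusters and convergence tolerances are not handled carefully here.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=None):
    """Basic k-means: alternate the assignment and refitting steps."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]        # step 1: random initial centres
    for _ in range(n_iter):
        # step 2 (assignment): each point goes to the nearest centre
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        # step 3 (refitting): move each centre to the mean of its assigned points
        # (empty clusters are not handled in this sketch)
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop when the centres no longer change
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres
```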
k-means example
(Figure: eight datapoints X1–X8)
k-means example
(Figure: the datapoints with randomly chosen initial cluster centres μ1, μ2, μ3)
Step 1:
Choose k cluster centres, μk(0),
randomly from the available datapoints
kcentres
k-means example
(Figure: the datapoints and the current cluster centres μ1, μ2, μ3)
Step 2:
Assign each of the objects in x to
the nearest cluster center μk(i)
x_n ∈ c_j^(i),   where   j = arg min_{j'=1..k} || x_n − μ_j'^(i) ||
kmeans

k-means example
(Figure: each datapoint assigned to its nearest cluster centre)
Step 3:
Recalculate cluster centres μk(i+1)
based on the clustering in iteration i
μ_j^(i+1) = (1/N_j) Σ_{x_n ∈ c_j^(i)} x_n
kmeans
k-means example
(Figure: the recalculated cluster centres)
Step 4:
If the clusters don't change, μk(i+1) ≈ μk(i) (or a prespecified number of iterations i is reached), terminate; else increase the iteration counter i and go to step 2.
kmeans

k-means example
(Figure: the clustering at the start of the next iteration)
Step 3 in the next iteration:
Recalculate cluster centres.
kmeans
Quality of K-means clustering
Most common measure is Sum of Squared Error (SSE)
For each point, the error is the distance to the nearest cluster
To get SSE, we square these errors and sum them.
x is a data point in cluster Ci and mi is the representative point for cluster Ci
can show that mi corresponds to the center (mean) of the cluster
Given two clusterings, we can choose the one with the smallest error
One easy way to reduce SSE is to increase K, the number of clusters
A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
SSE = Σ_{i=1..K} Σ_{x ∈ Ci} dist(mi, x)²
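A small sketch of the SSE computation for a clustering such as the one produced by the k-means sketch earlier (plain NumPy; names are illustrative). Running k-means from several random initializations and keeping the result with the smallest SSE is one simple way to use this measure.

```python
import numpy as np

def sse(X, labels, centres):
    """Sum of squared errors: squared distance from each point to its cluster centre."""
    diffs = X - centres[labels]      # vector from each point to the centre it is assigned to
    return float(np.sum(diffs ** 2))
```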
Two different K-means
Clusterings
(Figure: the original points and two different K-means clusterings of them – one optimal, one sub-optimal)
Importance of Choosing Initial
Centroids
(Figure: six iterations of k-means from one choice of initial centroids)
Importance of Choosing Initial
Centroids …
(Figure: five iterations of k-means from a different choice of initial centroids)
Limitations of K-means: Differing
Sizes
Original Points K-means (3 Clusters)
Limitations of K-means: Differing
Density
Original Points K-means (3 Clusters)
Limitations of K-means: Non-
globular Shapes
Original Points K-means (2 Clusters)
Problems with Selecting Initial
Points
If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small.
The chance is relatively small when K is large
If clusters are the same size, n, then
P = (number of ways to pick one centroid per cluster) / (number of ways to pick K centroids) = K! · nᴷ / (K·n)ᴷ = K! / Kᴷ
For example, if K = 10, then the probability = 10!/10¹⁰ ≈ 0.00036
Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't
Solutions to Initial Centroids Problem
Multiple runs
Helps, but probability is not on your side
Sample and use hierarchical clustering to determine initial centroids
Select more than k initial centroids and then select among these initial centroids
Select most widely separated
Improved k-means with cluster merge and split (ISODATA)
Soft k-means
Instead of making hard assignments of data-points to clusters, we can make soft assignments.
One cluster may have a responsibility of .7 for a data-point
and another may have a responsibility of .3.
Allows a cluster to use more information about the data in the
refitting step.
How do we decide on the soft assignments?
emclust
Probabilistic clustering
Assume a probability distribution for each cluster
We have a dataset that is a mixture of components
Common choice of components is Gaussians, θk={μk,Σk}
Exhaustive search of mixing parameters impossible, usually the EM-algorithm is used
p(x) = Σ_{k=1..K} π_k p(x; θ_k),   where π_k are the mixing proportions

emclust

The mixture of Gaussians model
First pick one of the k Gaussians with a probability that is
called its “mixing proportion”.
Then generate a random point from the chosen Gaussian.
The probability of generating the exact data we observed
is zero, but we can still try to maximize the probability
density.
Adjust the means of the Gaussians
Adjust the variances of the Gaussians on each
dimension (or use a full covariance Gaussian).
Adjust the mixing proportions of the Gaussians.
emclust
Computing responsibilities
In order to adjust the parameters, we must first solve the inference problem: which Gaussian generated each datapoint x?
We cannot be sure, so it's a distribution over all possibilities.
Use Bayes' theorem to get the posterior probabilities.
p(i | x) = p(i) p(x | i) / p(x),   where   p(x) = Σ_j p(j) p(x | j)

p(x | i) = Π_d ( 1 / sqrt(2π σ²_{i,d}) ) · exp( −(x_d − μ_{i,d})² / (2σ²_{i,d}) )

p(i | x): posterior for Gaussian i
p(i): prior for Gaussian i, i.e. its mixing proportion
The product runs over all data dimensions d (axis-aligned Gaussians).
Computing the new mixing proportions
Each Gaussian gets a
certain amount of
posterior probability for
each datapoint.
The optimal mixing
proportion to use (given
these posterior
probabilities) is just the
fraction of the data that
the Gaussian gets
responsibility for.
π_i^new = (1/N) Σ_{c=1..N} p(i | x^c)

x^c: data for training case c
N: number of training cases
p(i | x^c): posterior for Gaussian i
Computing the new means
We just take the center-of gravity of the data that the Gaussian is responsible for.
Just like in K-means, except the data is weighted by the posterior probability of
the Gaussian.
Guaranteed to lie in the convex hull of the data
Could be big initial jump
μ_i^new = Σ_c p(i | x^c) x^c / Σ_c p(i | x^c)
Computing the new variances
For axis-aligned Gaussians, we just fit the variance of the
Gaussian on each dimension to the posterior-weighted
data
Its more complicated if we use a full-covariance Gaussian that is
not aligned with the axes.
σ²_{i,d}^new = Σ_c p(i | x^c) (x^c_d − μ^new_{i,d})² / Σ_c p(i | x^c)
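Putting the three update steps together, a compact EM sketch for axis-aligned Gaussians in Python/NumPy (not prtools' emclust); initialization and numerical safeguards are deliberately minimal.

```python
import numpy as np

def em_gaussian_mixture(X, k, n_iter=50, seed=None):
    """EM for a mixture of k axis-aligned (diagonal covariance) Gaussians."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu  = X[rng.choice(n, size=k, replace=False)]     # initial means: random datapoints
    var = np.tile(X.var(axis=0), (k, 1))              # initial per-dimension variances
    pi  = np.ones(k) / k                              # initial mixing proportions
    for _ in range(n_iter):
        # E-step: responsibilities p(i | x^c) via Bayes' theorem (product over dimensions)
        log_px = (-0.5 * np.log(2 * np.pi * var)[None, :, :]
                  - 0.5 * (X[:, None, :] - mu[None, :, :]) ** 2 / var[None, :, :]).sum(axis=2)
        p = pi[None, :] * np.exp(log_px)              # no log-sum-exp safeguard in this sketch
        resp = p / p.sum(axis=1, keepdims=True)
        # M-step: new mixing proportions, means and variances (the three updates above)
        Nk  = resp.sum(axis=0)
        pi  = Nk / n
        mu  = (resp.T @ X) / Nk[:, None]
        var = (resp.T @ X ** 2) / Nk[:, None] - mu ** 2
        var = np.maximum(var, 1e-8)                   # keep variances strictly positive
    return pi, mu, var
```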
How many Gaussians do we use?
Hold back a validation set.
Try various numbers of Gaussians
Pick the number that gives the highest density to the
validation set.
Refinements:
We could make the validation set smaller by using
several different validation sets and averaging the
performance.
We should use all of the data for a final training of the
parameters once we have decided on the best number
of Gaussians.
Non-parametric methods
Arguably, a Gaussian blob-shape might not be appropriate
for class description of all classes
”Let the data describe the model”
We might want to skip modeling the conditional density
and model the a posteriori probability directly
In effect estimate p(x|ω) or even p(ω|x) without assuming
a specific model
k-Nearest Neighbors
Parzen windows
k-Nearest-Neighbor classification
Allocate a sample to the same class as the majority of the k nearest neighbors in the training set
Classification of a new sample x is done as follows:
Out of N training vectors, identify the k nearest neighbors (measured e.g. by Euclidean distance) in the training set, irrespective of the class label. k should be odd.
Out of these k samples, identify the number of vectors k_i that belong to class ω_i, i = 1, 2, ..., M (if we have M classes)
Assign x to the class ω_i with the maximum number k_i of samples
k must be set by the user (cross-validate)
k-NN tessellates the data space, i.e., the decision boundaries are usually polyhedra
knnc

k-Nearest-Neighbor classification
Using only the closest example to determine the
categorization is subject to errors due to:
A single atypical example.
Noise (i.e. error) in the category label of a single training example.
More robust alternative is to find the k most-similar
examples and return the majority category of these k
examples.
Value of k is typically odd to avoid ties; 3 and 5 are most common
Tradeoffs: want neighborhood x’ to be as small as
possible while k as large as possible
Optimality guaranteed iff k→∞ - but this is impossible in a small
neighborhood unless the number of samples is infinite
knnc
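A minimal Python/NumPy sketch of the k-NN rule above (not prtools' knnc); ties between classes are broken arbitrarily here.

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=3):
    """Assign x to the majority class among its k nearest training samples."""
    dist = np.linalg.norm(X_train - x, axis=1)     # Euclidean distance to all N training vectors
    nearest = np.argsort(dist)[:k]                 # indices of the k nearest neighbors
    classes, counts = np.unique(y_train[nearest], return_counts=True)
    return classes[counts.argmax()]                # class with the maximum number k_i of votes
```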
Probabilistic interpretation of k-NN
We estimate the a posteriori probability P(ωi|x) by using k neighbors, P(ωi|x')
Bias and variance tradeoff
A small neighborhood gives large variance and unreliable estimation
A large neighborhood gives large bias and inaccurate estimation
knnc

When to Consider k-Nearest Neighbor?
Lots of training data
Less than 20 features per object
Advantages:
Training is very fast
Learn complex decision boundaries
Disadvantages:
Slow at query time
Will be disturbed by irrelevant attributes, sensitive to
scaling
What is optimal k?
knnc
Parzen windows
Instead of using k samples as (”sort of”) a density
estimate – weigh influence by distance of xi to x
We observe a d-dim. window (covering n samples), but
the method needs a choice of window-width hn
If we make sure that our weight function, φ , is a proper
probability density – the estimate will be one as well.
p_n(x) = (1/n) Σ_{i=1..n} (1/V_n) φ( (x − x_i) / h_n ),   with V_n = h_n^d

parzenc

Parzen windows
• The density estimate at point x is a sum of window functions centered at the samples xi
• The window width defines the "focus" – when the width goes toward zero, the window function goes toward a delta function
parzenc
Effect of window width
• Toy example – 5 samples in the box
• Too wide: lack of resolution in describing the density
• Too small: too much variability in the estimate
• Convergence to the true probability density is possible with an infinite number of samples
(Figure: the resulting estimates for window widths ranging from very small to ∞)
parzenc
Parzen classifiers
Estimate the aposteriori density by the Parzen window method
Arbitrarily complex densities possible to estimate
Usually, a LOT of samples are needed to avoid overfitting of the training set
The number of samples needed grows ~exponentially with the dataset dimension
Need to decide window width hn
parzendc
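A possible Python/NumPy sketch of the Parzen-window estimate above with a Gaussian window (not prtools' parzenc/parzendc); X_by_class and priors are assumed to be supplied per class.

```python
import numpy as np

def parzen_density(X_train, x, h):
    """Parzen-window estimate p_n(x) with a Gaussian window of width h."""
    n, d = X_train.shape
    u = (x - X_train) / h                                        # scaled offsets to every sample
    phi = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return phi.sum() / (n * h ** d)                              # (1/n) * sum of (1/V_n) * phi(.)

def parzen_classify(X_by_class, priors, x, h):
    """Pick the class maximizing the estimated p(x | class) * P(class)."""
    scores = [parzen_density(Xc, x, h) * P for Xc, P in zip(X_by_class, priors)]
    return int(np.argmax(scores))
```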
What to remember from this lecture
K-means
Understand basic algorithm, overview of soft k-means / probabilistic clustering, sense of some of the pitfalls
Outline of hierarchical clustering: be able to describe the algorithmic idea, and have an idea of its strengths and weaknesses
Distance measures, feature selection algorithms
Understand basic search strategies
Classification can be done by estimating decisions directly instead of modeling data
For example k-NN and Parzen
Training data is a sparse resource, and should be used efficiently
Cross-validation is usually a good approach for using data to decide the parameters of classifiers
Performance of a classifier can be affected by different types of errors
Samples that should be rejected (doubt and outliers), model mismatch to the data, poor model parameter estimates – we need to be more specific than the overall error when analyzing results