Last lecture summary. Test-data and Cross Validation.

Page 1: Last lecture summary. Test-data and Cross Validation.

Last lecture summary

Page 2: Last lecture summary. Test-data and Cross Validation.

Test-data and Cross Validation

Page 3: Last lecture summary. Test-data and Cross Validation.

training error

testing error

model complexity

Page 4: Last lecture summary. Test-data and Cross Validation.

Test set method

• Split the data set into training and test data sets.
• Common ratio – 70:30
• Train the algorithm on the training set, assess its performance on the test set.
• Disadvantages
  – This is simple, however it wastes data.
  – The test set estimator of performance has high variance.

adopted from Cross Validation tutorial, Andrew Moore, http://www.autonlab.org/tutorials/overfit.html

[Diagram: the data is split into a Train part and a Test part]
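As a rough sketch, the test set method might look like this with scikit-learn (the data X, y and the k-NN classifier are placeholders, not from the slides):

```python
# Sketch: hold-out (test set) method with a 70:30 split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = np.random.rand(100, 2), np.random.randint(0, 2, 100)  # placeholder data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)      # common 70:30 ratio

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```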

Page 5: Last lecture summary. Test-data and Cross Validation.

• stratified division
  – the same proportion of each class in the training and test sets
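A minimal sketch of a stratified division with scikit-learn, assuming the same X and y arrays as in the previous snippet (the stratify argument keeps the class proportions equal in both parts):

```python
from sklearn.model_selection import train_test_split

# Stratified split: class proportions in the training and test sets match those in y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
```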

Page 6: Last lecture summary. Test-data and Cross Validation.

• Training error cannot be used as an indicator of the model's performance due to overfitting.

• Training data set – train a range of models, or a given model with a range of values for its parameters.

• Compare them on independent data – the Validation set.
  – If the model design is iterated many times, then some overfitting to the validation data can occur, and so it may be necessary to keep aside a third set.

• Test set – the set on which the performance of the selected model is finally evaluated.

Page 7: Last lecture summary. Test-data and Cross Validation.

LOOCV (leave-one-out cross validation)

1. choose one data point
2. remove it from the set
3. fit the remaining data points
4. note your error using the removed data point as test

Repeat these steps for all points. When you are done, report the mean square error (in case of regression).
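A numpy-only sketch of these LOOCV steps for a simple 1-D regression (the data and the straight-line fit are placeholders, just to illustrate the loop):

```python
# Sketch of LOOCV for a simple 1-D regression.
import numpy as np

x = np.linspace(0, 1, 20)                      # placeholder data
y = 2 * x + 0.1 * np.random.randn(20)

errors = []
for i in range(len(x)):                        # 1. choose one data point
    mask = np.arange(len(x)) != i              # 2. remove it from the set
    coeffs = np.polyfit(x[mask], y[mask], 1)   # 3. fit the remaining data points
    pred = np.polyval(coeffs, x[i])            # 4. error on the removed point
    errors.append((y[i] - pred) ** 2)

print("LOOCV mean square error:", np.mean(errors))
```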

Page 8: Last lecture summary. Test-data and Cross Validation.

k-fold cross validation

1. randomly break the data into k partitions
2. remove one partition from the set
3. fit the remaining data points
4. note your error using the removed partition as the test data set

Repeat these steps for all partitions. When you are done, report the mean square error (in case of regression).
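A sketch of 10-fold cross validation with scikit-learn's KFold (placeholder data and a linear regression model, only for illustration):

```python
# Sketch of k-fold cross validation (k = 10).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.random.rand(100, 3)                                     # placeholder data
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)

errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # fit on k-1 partitions
    pred = model.predict(X[test_idx])                           # test on the held-out partition
    errors.append(mean_squared_error(y[test_idx], pred))

print("10-fold CV mean square error:", np.mean(errors))
```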

Page 9: Last lecture summary. Test-data and Cross Validation.

Selection and testing

• Complete procedure for algorithm selection and estimation of its quality:

1. Divide the data into Train/Test.

2. By cross validation on the Train set choose the algorithm.

3. Use this algorithm to construct a classifier using the Train set.

4. Estimate its quality on the Test set.

[Diagram: the data is split into Train and Test; within the Train part, cross validation further splits the data into Train and Val(idation) parts]
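A sketch of the whole procedure, assuming scikit-learn and several candidate values of k for k-NN as the "algorithms" being compared (names and data are placeholders):

```python
# Sketch: model selection by CV on Train, final quality estimate on Test.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = np.random.rand(200, 2), np.random.randint(0, 2, 200)    # placeholder data

# 1. divide the data into Train/Test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. choose the algorithm (here: the value of k) by cross validation on Train only
candidates = [1, 3, 5, 15]
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=10).mean()
             for k in candidates]
best_k = candidates[int(np.argmax(cv_scores))]

# 3. construct the classifier on the whole Train set
clf = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)

# 4. estimate its quality on the untouched Test set
print("best k:", best_k, " test accuracy:", clf.score(X_te, y_te))
```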

Page 10: Last lecture summary. Test-data and Cross Validation.

Model selection via CV

polynomial regression

[Table: degree (1–6) vs. MSE_train, MSE_10-fold, and the chosen degree]

adopted from Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
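A sketch of this degree selection with scikit-learn (the data here is synthetic; the slide's actual MSE values are not reproduced):

```python
# Sketch: choosing the polynomial degree by 10-fold cross validation.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

x = np.random.rand(50, 1)                                   # placeholder data
y = np.sin(2 * np.pi * x).ravel() + 0.1 * np.random.randn(50)

for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse_train = np.mean((model.fit(x, y).predict(x) - y) ** 2)
    mse_cv = -cross_val_score(model, x, y, cv=10,
                              scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: MSE_train = {mse_train:.4f}, MSE_10-fold = {mse_cv:.4f}")
```

The training MSE keeps decreasing with the degree; the chosen degree is the one with the smallest 10-fold CV error.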

Page 11: Last lecture summary. Test-data and Cross Validation.
Page 12: Last lecture summary. Test-data and Cross Validation.

Nearest Neighbors Classification

Page 13: Last lecture summary. Test-data and Cross Validation.

instances

Page 14: Last lecture summary. Test-data and Cross Validation.

• Similarity sij is a quantity that reflects the strength of the relationship between two objects or two features.

• Distance dij measures dissimilarity
  – Dissimilarity measures the discrepancy between the two objects based on several features.
  – Distance satisfies the following conditions:
    • distance is always positive or zero (dij ≥ 0)
    • distance is zero if and only if it is measured from an object to itself
    • distance is symmetric (dij = dji)
  – In addition, if the distance satisfies the triangle inequality |x + y| ≤ |x| + |y|, then it is called a metric.

Page 15: Last lecture summary. Test-data and Cross Validation.

Distances for quantitative variables

• Minkowski distance (Lp norm)

• distance matrix – matrix with all pairwise distances

$L_p: \; d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$

       p1      p2      p3      p4
p1     0       2.828   3.162   5.099
p2     2.828   0       1.414   3.162
p3     3.162   1.414   0       2
p4     5.099   3.162   2       0
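For illustration, the matrix above can be reproduced with SciPy; the slide does not give the coordinates of p1–p4, so the points below are just one choice that yields the same distances:

```python
# Sketch: pairwise Minkowski (p = 2, i.e. Euclidean) distance matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0, 0],    # p1
                   [2, 2],    # p2
                   [3, 1],    # p3
                   [5, 1]])   # p4  (one set of points consistent with the matrix above)

D = squareform(pdist(points, metric="minkowski", p=2))
print(np.round(D, 3))
```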

Page 16: Last lecture summary. Test-data and Cross Validation.

Manhattan distance

$L_1: \; d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$

[Figure: two points x = (x1, x2) and y = (y1, y2); the L1 distance is the sum of the axis-parallel differences]

Page 17: Last lecture summary. Test-data and Cross Validation.

Euclidean distance

$L_2: \; d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

[Figure: two points x = (x1, x2) and y = (y1, y2); the L2 distance is the straight-line distance between them]

Page 18: Last lecture summary. Test-data and Cross Validation.
Page 19: Last lecture summary. Test-data and Cross Validation.

k-NN

• supervised learning
• target function f may be
  – discrete-valued (classification)
  – real-valued (regression)

• We assign the given point to the class of the instance that is most similar to it.
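A minimal k-NN classification sketch with scikit-learn (toy data, k = 3):

```python
# Sketch: k-NN classification.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])  # placeholder data
y_train = np.array([0, 0, 0, 1, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3)     # k = 3, Euclidean distance by default
clf.fit(X_train, y_train)                     # "lazy": essentially just stores the data
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # -> [0 1]
```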

Page 20: Last lecture summary. Test-data and Cross Validation.

• k-NN is a lazy learner
• lazy learning
  – generalization beyond the training data is delayed until a query is made to the system
  – opposed to eager learning – the system tries to generalize the training data before receiving queries

Page 21: Last lecture summary. Test-data and Cross Validation.

Which k is best?

Hastie et al., Elements of Statistical Learning

[Figure: k-NN decision boundaries for k = 1 and k = 15]

k = 1 – fits noise and outliers, i.e. overfitting; the value of k should not be too small
k = 15 – a large k smooths out distinctive behavior

→ choose k by cross validation

Page 22: Last lecture summary. Test-data and Cross Validation.

Real-valued target function

• Algorithm calculates the mean value of the k nearest training examples.

[Example, k = 3: the three nearest neighbors have values 12, 14 and 10, so the predicted value = (12 + 14 + 10) / 3 = 12]
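A sketch of this averaging with scikit-learn's k-NN regressor, using the values 12, 14 and 10 from the slide as the three nearest neighbors (the feature values are placeholders):

```python
# Sketch: k-NN regression -- predict the mean value of the k nearest training examples.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([12.0, 14.0, 10.0, 50.0])

reg = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
print(reg.predict([[2.0]]))   # neighbors with values 12, 14, 10 -> (12 + 14 + 10) / 3 = 12
```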

Page 23: Last lecture summary. Test-data and Cross Validation.

Distance-weighted NN

• Give greater weight to closer neighbors.

k = 4, unweighted: 2 votes vs. 2 votes (a tie)

weighted (weight = 1/d²), with neighbors at distances 1, 2, 4 and 5:
• the two nearer neighbors: 1/1² + 1/2² = 1.25 votes
• the two farther neighbors: 1/4² + 1/5² ≈ 0.10 votes
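A small numpy sketch of the 1/d² vote computation from the slide (the split into the "nearer" and "farther" group of neighbors corresponds to the two classes in the figure):

```python
# Sketch: distance-weighted k-NN votes with weight 1/d**2 (numbers from the slide).
import numpy as np

d_nearer = np.array([1.0, 2.0])   # distances of the two nearer neighbors
d_farther = np.array([4.0, 5.0])  # distances of the two farther neighbors

votes_nearer = np.sum(1.0 / d_nearer ** 2)    # 1/1^2 + 1/2^2 = 1.25
votes_farther = np.sum(1.0 / d_farther ** 2)  # 1/4^2 + 1/5^2 = 0.1025
print(votes_nearer, votes_farther)            # the nearer class wins the weighted vote
```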

Page 24: Last lecture summary. Test-data and Cross Validation.

k-NN issues

• Curse of dimensionality is a problem.
• Significant computation may be required to process each new query.
• To find nearest neighbors one has to evaluate the full distance matrix.
• Efficient indexing of stored training examples helps
  – kd-tree

Page 25: Last lecture summary. Test-data and Cross Validation.
Page 26: Last lecture summary. Test-data and Cross Validation.

Cluster Analysis

Page 27: Last lecture summary. Test-data and Cross Validation.

• We have data, we don’t know classes.

• Assign data objects into groups (called clusters) so that data objects from the same cluster are more similar to each other than objects from different clusters.


Page 29: Last lecture summary. Test-data and Cross Validation.

On clustering validation techniques, M. Halkidi, Y. Batistakis, M. Vazirgiannis

Stages of clustering process

Page 30: Last lecture summary. Test-data and Cross Validation.

How would you solve the problem?

• How to find clusters?
• Group together the most similar patterns.

Page 31: Last lecture summary. Test-data and Cross Validation.

Single linkage (nearest neighbor method)

based on A Tutorial on Clustering Algorithms http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html

Page 32: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

       BA   FL   MI/TO  NA   RM
MI/TO             0

Page 33: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 877 and 996 → single linkage: d(MI/TO, BA) = min(877, 996) = 877

       BA   FL   MI/TO  NA   RM
MI/TO  877        0

Page 34: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 295 and 400 → d(MI/TO, FL) = min(295, 400) = 295

       BA   FL   MI/TO  NA   RM
MI/TO  877  295   0

Page 35: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 754 and 869 → d(MI/TO, NA) = min(754, 869) = 754

       BA   FL   MI/TO  NA   RM
MI/TO  877  295   0     754

Page 36: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 564 and 669 → d(MI/TO, RM) = min(564, 669) = 564

       BA   FL   MI/TO  NA   RM
MI/TO  877  295   0     754  564

Page 37: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

BA FL MI/TO NA RM

BA 0 662 877 255 412

FL 662 0 295 468 268

MI/TO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

Page 38: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

BA FL MI/TO NA/RM

BA 0 662 877 255

FL 662 0 295 268

MI/TO 877 295 0 564

NA/RM 255 268 564 0

Page 39: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

BA/NA/RM FL MI/TO

BA/NA/RM 0 268 564

FL 268 0 295

MI/TO 564 295 0

Page 40: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

BA/FL/NA/RM MI/TO

BA/FL/NA/RM 0 295

MI/TO 295 0

Page 41: Last lecture summary. Test-data and Cross Validation.

Dendrogram

Merge order:
Torino → Milano
Rome → Naples
  → Bari
  → Florence
Join Torino–Milano and Rome–Naples–Bari–Florence

Page 42: Last lecture summary. Test-data and Cross Validation.

Dendrogram

[Dendrogram over the leaves MI, TO, BA, NA, RM, FL; vertical axis: dissimilarity]

Torino → Milano (138)
Rome → Naples (219)
  → Bari (255)
  → Florence (268)
Join Torino–Milano and Rome–Naples–Bari–Florence (295)
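This whole single-linkage example can be reproduced with SciPy; a sketch using the city distance matrix from the previous slides:

```python
# Sketch: single-linkage clustering of the six-city distance matrix.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

labels = ["BA", "FL", "MI", "NA", "RM", "TO"]
D = np.array([[  0, 662, 877, 255, 412, 996],
              [662,   0, 295, 468, 268, 400],
              [877, 295,   0, 754, 564, 138],
              [255, 468, 754,   0, 219, 869],
              [412, 268, 564, 219,   0, 669],
              [996, 400, 138, 869, 669,   0]])

Z = linkage(squareform(D), method="single")   # condensed distances, single linkage
dendrogram(Z, labels=labels)                  # merge heights: 138, 219, 255, 268, 295
plt.ylabel("dissimilarity")
plt.show()
```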

Page 43: Last lecture summary. Test-data and Cross Validation.

[Dendrogram over the leaves MI, TO, BA, NA, RM, FL (vertical axis: dissimilarity), shown alongside maps of Italy illustrating how the clusters are formed step by step]

Page 44: Last lecture summary. Test-data and Cross Validation.

Complete linkage (farthest neighbor method)

Page 45: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

       BA   FL   MI/TO  NA   RM
MI/TO             0

Page 46: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 877 and 996 → complete linkage: d(MI/TO, BA) = max(877, 996) = 996

       BA   FL   MI/TO  NA   RM
MI/TO  996        0

Page 47: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 295 and 400 → d(MI/TO, FL) = max(295, 400) = 400

       BA   FL   MI/TO  NA   RM
MI/TO  996  400   0

Page 48: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 754 and 869 → d(MI/TO, NA) = max(754, 869) = 869

       BA   FL   MI/TO  NA   RM
MI/TO  996  400   0     869

Page 49: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 564 and 669 → d(MI/TO, RM) = max(564, 669) = 669

       BA   FL   MI/TO  NA   RM
MI/TO  996  400   0     869  669

Page 50: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

BA FL MI/TO NA RM

BA 0 662 996 255 412

FL 662 0 400 468 268

MI/TO 996 400 0 869 669

NA 255 468 869 0 219

RM 412 268 669 219 0

Page 51: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

BA FL MI/TO NA/RM

BA 0 662 996 412

FL 662 0 400 468

MI/TO 996 400 0 869

NA/RM 412 468 869 0

Page 52: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

BA MI/TO/FL NA/RM

BA 0 996 412

MI/TO/FL 996 0 869

NA/RM 412 869 0

Page 53: Last lecture summary. Test-data and Cross Validation.

[Two dendrograms over the leaves MI, TO, BA, NA, RM, FL: complete linkage vs. single linkage]

Page 54: Last lecture summary. Test-data and Cross Validation.

Average linkage

Page 55: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 877 and 996 → average linkage: d(MI/TO, BA) = (996 + 877) / 2 = 936.5

       BA     FL   MI/TO  NA   RM
MI/TO  936.5        0

Page 56: Last lecture summary. Test-data and Cross Validation.

Centroid linkage

Page 57: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

the cluster is represented by its centroid → d(MI/TO, BA) = 895 (distance from BA to the centroid of MI/TO)

       BA   FL   MI/TO  NA   RM
MI/TO  895        0

Page 58: Last lecture summary. Test-data and Cross Validation.

Summary

Similarity?
• single linkage (MIN)
• complete linkage (MAX)
• average linkage
• centroids


Page 63: Last lecture summary. Test-data and Cross Validation.

Ward’s linkage (method)

Page 64: Last lecture summary. Test-data and Cross Validation.

In Ward's method, distance metrics are not used and do not have to be chosen. Instead, sums of squares (i.e. squared Euclidean distances) between the centroids of clusters are computed.

Page 65: Last lecture summary. Test-data and Cross Validation.

• Ward's method says that the distance between two clusters, A and B, is how much the sum of squares will increase when we merge them.

• At the beginning of clustering, the sum of squares starts out at zero (because every point is in its own cluster) and then grows as we merge clusters.

• Ward's method keeps this growth as small as possible.
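A sketch of Ward's method with SciPy's hierarchical clustering (placeholder data; linkage computes the merges that minimize the increase in the sum of squares, fcluster cuts the tree):

```python
# Sketch: Ward's method on raw observations.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(30, 2)                          # placeholder data

Z = linkage(X, method="ward")                      # Ward merges: smallest sum-of-squares growth
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
print(labels)
```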

Page 66: Last lecture summary. Test-data and Cross Validation.
Page 67: Last lecture summary. Test-data and Cross Validation.

Types of clustering

• hierarchical
  – groups data with a sequence of nested partitions
  • agglomerative
    – bottom-up
    – Start with each data point as one cluster, join the clusters up to the situation when all points form one cluster.
  • divisive
    – top-down
    – Initially all objects are in one cluster, then the cluster is subdivided into smaller and smaller pieces.

• partitional
  – divides data points into some prespecified number of clusters without the hierarchical structure
  – i.e. divides the space

Page 68: Last lecture summary. Test-data and Cross Validation.

Hierarchical clustering

• Agglomerative methods are used more widely.
• Divisive methods need to consider $2^{N-1} - 1$ possible subset divisions, which is very computationally intensive.
  – computational difficulties of finding the optimum partitions
• Divisive clustering methods are better at finding large clusters than agglomerative methods.

Page 69: Last lecture summary. Test-data and Cross Validation.

Hierarchical clustering

• Disadvantages
  – High computational complexity – at least O(N²).
    • Needs to calculate all mutual distances.
  – Inability to adjust once the splitting or merging is performed
    • no undo

Page 70: Last lecture summary. Test-data and Cross Validation.

k-means

• How to avoid computing all mutual distances?

• Calculate distances from representatives (centroids) of clusters.

• Advantage: number of centroids is much lower than the number of data points.

• Disadvantage: number of centroids k must be given in advance

Page 71: Last lecture summary. Test-data and Cross Validation.

k-means – kids algorithm

• Once there was a land with N houses.
• One day K kings arrived to this land.
• Each house was taken by the nearest king.
• But the community wanted their king to be at the center of the village, so the throne was moved there.
• Then the kings realized that some houses were closer to them now, so they took those houses, but they lost some... This went on and on…
• Until one day they couldn't move anymore, so they settled down and lived happily ever after in their village.

Page 72: Last lecture summary. Test-data and Cross Validation.

k-means – adults algorithm

• decide on the number of clusters k
• randomly initialize k centroids
• repeat until convergence (centroids do not move)
  – assign each point to the cluster represented by the centroid it is nearest to
  – move the centroids to the position given as the mean of all points in the cluster
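A numpy-only sketch of exactly this loop (placeholder data; empty clusters are not handled in this sketch):

```python
# Sketch: the k-means loop described above.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # randomly initialize k centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign each point to the cluster of the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):          # convergence: centroids do not move
            break
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])  # placeholder data
centroids, labels = kmeans(X, k=2)
print(centroids)
```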

Page 73: Last lecture summary. Test-data and Cross Validation.

k-means applet

http://www.kovan.ceng.metu.edu.tr/~maya/kmeans/index.html

Page 74: Last lecture summary. Test-data and Cross Validation.

• Disadvantages:
  – k must be determined in advance.
  – Sensitive to initial conditions. The algorithm minimizes the following "energy" function, but may be trapped in a local minimum:

    $E = \sum_{l=1}^{K} \sum_{\mathbf{x}_i \in X_l} \| \mathbf{x}_i - \boldsymbol{\mu}_l \|^2$

    where $\boldsymbol{\mu}_l$ is the centroid of cluster $X_l$.

  – Applicable only when the mean is defined; then what about categorical data? E.g. replace the mean with the mode (k-modes).
  – The arithmetic mean is not robust to outliers (use the median – k-medoids).
  – Clusters are spherical because the algorithm is based on distance.