Last lecture summary. Test-data and Cross Validation.

Page 1: Last lecture summary. Test-data and Cross Validation.

Last lecture summary

Page 2: Last lecture summary. Test-data and Cross Validation.

Test-data and Cross Validation

Page 3: Last lecture summary. Test-data and Cross Validation.

training error

testing error

model complexity

Page 4: Last lecture summary. Test-data and Cross Validation.

Test set method

• Split the data set into training and test data sets.
• Common ratio – 70:30
• Train the algorithm on the training set, assess its performance on the test set.
• Disadvantages
  – This is simple, however it wastes data.
  – The test set estimator of performance has high variance.

adopted from Cross Validation tutorial, Andrew Moore, http://www.autonlab.org/tutorials/overfit.html

[Diagram: the data is split into a Train part and a Test part]
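As a rough sketch, the test set method might look like this with scikit-learn (the data X, y and the k-NN classifier are placeholders, not from the slides):

```python
# Sketch: hold-out (test set) method with a 70:30 split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = np.random.rand(100, 2), np.random.randint(0, 2, 100)  # placeholder data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)      # common 70:30 ratio

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```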

Page 5: Last lecture summary. Test-data and Cross Validation.

• stratified division
  – the same proportion of each class in the training and test sets
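A minimal sketch of a stratified division with scikit-learn, assuming the same X and y arrays as in the previous snippet (the stratify argument keeps the class proportions equal in both parts):

```python
from sklearn.model_selection import train_test_split

# Stratified split: class proportions in the training and test sets match those in y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
```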

Page 6: Last lecture summary. Test-data and Cross Validation.

• Training error cannot be used as an indicator of the model's performance due to overfitting.

• Training data set – train a range of models, or a given model with a range of values for its parameters.

• Compare them on independent data – the Validation set.
  – If the model design is iterated many times, then some overfitting to the validation data can occur, and so it may be necessary to keep aside a third set.

• Test set – the set on which the performance of the selected model is finally evaluated.

Page 7: Last lecture summary. Test-data and Cross Validation.

LOOCV (leave-one-out cross validation)

1. choose one data point
2. remove it from the set
3. fit the remaining data points
4. note your error using the removed data point as test

Repeat these steps for all points. When you are done, report the mean square error (in case of regression).
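A numpy-only sketch of these LOOCV steps for a simple 1-D regression (the data and the straight-line fit are placeholders, just to illustrate the loop):

```python
# Sketch of LOOCV for a simple 1-D regression.
import numpy as np

x = np.linspace(0, 1, 20)                      # placeholder data
y = 2 * x + 0.1 * np.random.randn(20)

errors = []
for i in range(len(x)):                        # 1. choose one data point
    mask = np.arange(len(x)) != i              # 2. remove it from the set
    coeffs = np.polyfit(x[mask], y[mask], 1)   # 3. fit the remaining data points
    pred = np.polyval(coeffs, x[i])            # 4. error on the removed point
    errors.append((y[i] - pred) ** 2)

print("LOOCV mean square error:", np.mean(errors))
```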

Page 8: Last lecture summary. Test-data and Cross Validation.

k-fold cross validation

1. randomly break the data into k partitions
2. remove one partition from the set
3. fit the remaining data points
4. note your error using the removed partition as the test data set

Repeat these steps for all partitions. When you are done, report the mean square error (in case of regression).
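A sketch of 10-fold cross validation with scikit-learn's KFold (placeholder data and a linear regression model, only for illustration):

```python
# Sketch of k-fold cross validation (k = 10).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = np.random.rand(100, 3)                                     # placeholder data
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)

errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # fit on k-1 partitions
    pred = model.predict(X[test_idx])                           # test on the held-out partition
    errors.append(mean_squared_error(y[test_idx], pred))

print("10-fold CV mean square error:", np.mean(errors))
```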

Page 9: Last lecture summary. Test-data and Cross Validation.

Selection and testing

• Complete procedure for algorithm selection and estimation of its quality:

1. Divide the data into Train/Test.

2. By cross validation on the Train set choose the algorithm.

3. Use this algorithm to construct a classifier using the Train set.

4. Estimate its quality on the Test set.

[Diagram: the data is split into Train and Test; within the Train part, cross validation further splits the data into Train and Val(idation) parts]
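A sketch of the whole procedure, assuming scikit-learn and several candidate values of k for k-NN as the "algorithms" being compared (names and data are placeholders):

```python
# Sketch: model selection by CV on Train, final quality estimate on Test.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = np.random.rand(200, 2), np.random.randint(0, 2, 200)    # placeholder data

# 1. divide the data into Train/Test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. choose the algorithm (here: the value of k) by cross validation on Train only
candidates = [1, 3, 5, 15]
cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=10).mean()
             for k in candidates]
best_k = candidates[int(np.argmax(cv_scores))]

# 3. construct the classifier on the whole Train set
clf = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)

# 4. estimate its quality on the untouched Test set
print("best k:", best_k, " test accuracy:", clf.score(X_te, y_te))
```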

Page 10: Last lecture summary. Test-data and Cross Validation.

Model selection via CV

polynomial regression

[Table: degree (1–6) vs. MSE_train, MSE_10-fold, and the chosen degree]

adopted from Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
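A sketch of this degree selection with scikit-learn (the data here is synthetic; the slide's actual MSE values are not reproduced):

```python
# Sketch: choosing the polynomial degree by 10-fold cross validation.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

x = np.random.rand(50, 1)                                   # placeholder data
y = np.sin(2 * np.pi * x).ravel() + 0.1 * np.random.randn(50)

for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse_train = np.mean((model.fit(x, y).predict(x) - y) ** 2)
    mse_cv = -cross_val_score(model, x, y, cv=10,
                              scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: MSE_train = {mse_train:.4f}, MSE_10-fold = {mse_cv:.4f}")
```

The training MSE keeps decreasing with the degree; the chosen degree is the one with the smallest 10-fold CV error.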

Page 11: Last lecture summary. Test-data and Cross Validation.
Page 12: Last lecture summary. Test-data and Cross Validation.

Nearest Neighbors Classification

Page 13: Last lecture summary. Test-data and Cross Validation.

instances

Page 14: Last lecture summary. Test-data and Cross Validation.

• Similarity sij is a quantity that reflects the strength of the relationship between two objects or two features.

• Distance dij measures dissimilarity
  – Dissimilarity measures the discrepancy between the two objects based on several features.
  – Distance satisfies the following conditions:
    • distance is always positive or zero (dij ≥ 0)
    • distance is zero if and only if it is measured from an object to itself
    • distance is symmetric (dij = dji)
  – In addition, if the distance satisfies the triangle inequality |x + y| ≤ |x| + |y|, then it is called a metric.

Page 15: Last lecture summary. Test-data and Cross Validation.

Distances for quantitative variables

• Minkowski distance (Lp norm)

• distance matrix – matrix with all pairwise distances

$L_p: \; d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$

       p1      p2      p3      p4
p1     0       2.828   3.162   5.099
p2     2.828   0       1.414   3.162
p3     3.162   1.414   0       2
p4     5.099   3.162   2       0
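For illustration, the matrix above can be reproduced with SciPy; the slide does not give the coordinates of p1–p4, so the points below are just one choice that yields the same distances:

```python
# Sketch: pairwise Minkowski (p = 2, i.e. Euclidean) distance matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0, 0],    # p1
                   [2, 2],    # p2
                   [3, 1],    # p3
                   [5, 1]])   # p4  (one set of points consistent with the matrix above)

D = squareform(pdist(points, metric="minkowski", p=2))
print(np.round(D, 3))
```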

Page 16: Last lecture summary. Test-data and Cross Validation.

Manhattan distance

$L_1: \; d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$

[Figure: two points x = (x1, x2) and y = (y1, y2); the L1 distance is the sum of the axis-parallel differences]

Page 17: Last lecture summary. Test-data and Cross Validation.

Euclidean distance

$L_2: \; d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

[Figure: two points x = (x1, x2) and y = (y1, y2); the L2 distance is the straight-line distance between them]

Page 18: Last lecture summary. Test-data and Cross Validation.
Page 19: Last lecture summary. Test-data and Cross Validation.

k-NN

• supervised learning
• target function f may be
  – discrete-valued (classification)
  – real-valued (regression)

• We assign the given point to the class of the instance that is most similar to it.
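A minimal k-NN classification sketch with scikit-learn (toy data, k = 3):

```python
# Sketch: k-NN classification.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])  # placeholder data
y_train = np.array([0, 0, 0, 1, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3)     # k = 3, Euclidean distance by default
clf.fit(X_train, y_train)                     # "lazy": essentially just stores the data
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # -> [0 1]
```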

Page 20: Last lecture summary. Test-data and Cross Validation.

• k-NN is a lazy learner
• lazy learning
  – generalization beyond the training data is delayed until a query is made to the system
  – opposed to eager learning – the system tries to generalize the training data before receiving queries

Page 21: Last lecture summary. Test-data and Cross Validation.

Which k is best?

Hastie et al., Elements of Statistical Learning

[Figure: k-NN decision boundaries for k = 1 and k = 15]

k = 1 – fits noise and outliers, i.e. overfitting; the value of k should not be too small
k = 15 – a large k smooths out distinctive behavior

→ choose k by cross validation

Page 22: Last lecture summary. Test-data and Cross Validation.

Real-valued target function

• Algorithm calculates the mean value of the k nearest training examples.

[Example, k = 3: the three nearest neighbors have values 12, 14 and 10, so the predicted value = (12 + 14 + 10) / 3 = 12]
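A sketch of this averaging with scikit-learn's k-NN regressor, using the values 12, 14 and 10 from the slide as the three nearest neighbors (the feature values are placeholders):

```python
# Sketch: k-NN regression -- predict the mean value of the k nearest training examples.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([12.0, 14.0, 10.0, 50.0])

reg = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
print(reg.predict([[2.0]]))   # neighbors with values 12, 14, 10 -> (12 + 14 + 10) / 3 = 12
```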

Page 23: Last lecture summary. Test-data and Cross Validation.

Distance-weighted NN

• Give greater weight to closer neighbors.

k = 4, unweighted: 2 votes vs. 2 votes (a tie)

weighted (weight = 1/d²), with neighbors at distances 1, 2, 4 and 5:
• the two nearer neighbors: 1/1² + 1/2² = 1.25 votes
• the two farther neighbors: 1/4² + 1/5² ≈ 0.10 votes
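A small numpy sketch of the 1/d² vote computation from the slide (the split into the "nearer" and "farther" group of neighbors corresponds to the two classes in the figure):

```python
# Sketch: distance-weighted k-NN votes with weight 1/d**2 (numbers from the slide).
import numpy as np

d_nearer = np.array([1.0, 2.0])   # distances of the two nearer neighbors
d_farther = np.array([4.0, 5.0])  # distances of the two farther neighbors

votes_nearer = np.sum(1.0 / d_nearer ** 2)    # 1/1^2 + 1/2^2 = 1.25
votes_farther = np.sum(1.0 / d_farther ** 2)  # 1/4^2 + 1/5^2 = 0.1025
print(votes_nearer, votes_farther)            # the nearer class wins the weighted vote
```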

Page 24: Last lecture summary. Test-data and Cross Validation.

k-NN issues

• Curse of dimensionality is a problem.
• Significant computation may be required to process each new query.
• To find nearest neighbors one has to evaluate the full distance matrix.
• Efficient indexing of stored training examples helps
  – kd-tree

Page 25: Last lecture summary. Test-data and Cross Validation.
Page 26: Last lecture summary. Test-data and Cross Validation.

Cluster Analysis

Page 27: Last lecture summary. Test-data and Cross Validation.

• We have data, we don’t know classes.

• Assign data objects into groups (called clusters) so that data objects from the same cluster are more similar to each other than objects from different clusters.


Page 29: Last lecture summary. Test-data and Cross Validation.

On clustering validation techniques, M. Halkidi, Y. Batistakis, M. Vazirgiannis

Stages of clustering process

Page 30: Last lecture summary. Test-data and Cross Validation.

How would you solve the problem?

• How to find clusters?
• Group together the most similar patterns.

Page 31: Last lecture summary. Test-data and Cross Validation.

Single linkage (nearest neighbor method)

based on A Tutorial on Clustering Algorithms http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html

Page 32: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

       BA   FL   MI/TO  NA   RM
MI/TO             0

Page 33: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 877 and 996 → single linkage: d(MI/TO, BA) = min(877, 996) = 877

       BA   FL   MI/TO  NA   RM
MI/TO  877        0

Page 34: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 295 and 400 → d(MI/TO, FL) = min(295, 400) = 295

       BA   FL   MI/TO  NA   RM
MI/TO  877  295   0

Page 35: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 754 and 869 → d(MI/TO, NA) = min(754, 869) = 754

       BA   FL   MI/TO  NA   RM
MI/TO  877  295   0     754

Page 36: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 564 and 669 → d(MI/TO, RM) = min(564, 669) = 564

       BA   FL   MI/TO  NA   RM
MI/TO  877  295   0     754  564

Page 37: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

BA FL MI/TO NA RM

BA 0 662 877 255 412

FL 662 0 295 468 268

MI/TO 877 295 0 754 564

NA 255 468 754 0 219

RM 412 268 564 219 0

Page 38: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

BA FL MI/TO NA/RM

BA 0 662 877 255

FL 662 0 295 268

MI/TO 877 295 0 564

NA/RM 255 268 564 0

Page 39: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

BA/NA/RM FL MI/TO

BA/NA/RM 0 268 564

FL 268 0 295

MI/TO 564 295 0

Page 40: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

BA/FL/NA/RM MI/TO

BA/FL/NA/RM 0 295

MI/TO 295 0

Page 41: Last lecture summary. Test-data and Cross Validation.

Dendrogram

Merge order:
Torino → Milano
Rome → Naples
  → Bari
  → Florence
Join Torino–Milano and Rome–Naples–Bari–Florence

Page 42: Last lecture summary. Test-data and Cross Validation.

Dendrogram

[Dendrogram over the leaves MI, TO, BA, NA, RM, FL; vertical axis: dissimilarity]

Torino → Milano (138)
Rome → Naples (219)
  → Bari (255)
  → Florence (268)
Join Torino–Milano and Rome–Naples–Bari–Florence (295)
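This whole single-linkage example can be reproduced with SciPy; a sketch using the city distance matrix from the previous slides:

```python
# Sketch: single-linkage clustering of the six-city distance matrix.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

labels = ["BA", "FL", "MI", "NA", "RM", "TO"]
D = np.array([[  0, 662, 877, 255, 412, 996],
              [662,   0, 295, 468, 268, 400],
              [877, 295,   0, 754, 564, 138],
              [255, 468, 754,   0, 219, 869],
              [412, 268, 564, 219,   0, 669],
              [996, 400, 138, 869, 669,   0]])

Z = linkage(squareform(D), method="single")   # condensed distances, single linkage
dendrogram(Z, labels=labels)                  # merge heights: 138, 219, 255, 268, 295
plt.ylabel("dissimilarity")
plt.show()
```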

Page 43: Last lecture summary. Test-data and Cross Validation.

[Dendrogram over the leaves MI, TO, BA, NA, RM, FL (vertical axis: dissimilarity), shown alongside maps of Italy illustrating how the clusters are formed step by step]

Page 44: Last lecture summary. Test-data and Cross Validation.

Complete linkage (farthest neighbor method)

Page 45: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

       BA   FL   MI/TO  NA   RM
MI/TO             0

Page 46: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 877 and 996 → complete linkage: d(MI/TO, BA) = max(877, 996) = 996

       BA   FL   MI/TO  NA   RM
MI/TO  996        0

Page 47: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 295 and 400 → d(MI/TO, FL) = max(295, 400) = 400

       BA   FL   MI/TO  NA   RM
MI/TO  996  400   0

Page 48: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 754 and 869 → d(MI/TO, NA) = max(754, 869) = 869

       BA   FL   MI/TO  NA   RM
MI/TO  996  400   0     869

Page 49: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 564 and 669 → d(MI/TO, RM) = max(564, 669) = 669

       BA   FL   MI/TO  NA   RM
MI/TO  996  400   0     869  669

Page 50: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

BA FL MI/TO NA RM

BA 0 662 996 255 412

FL 662 0 400 468 268

MI/TO 996 400 0 869 669

NA 255 468 869 0 219

RM 412 268 669 219 0

Page 51: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

BA FL MI/TO NA/RM

BA 0 662 996 412

FL 662 0 400 468

MI/TO 996 400 0 869

NA/RM 412 468 869 0

Page 52: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

BA MI/TO/FL NA/RM

BA 0 996 412

MI/TO/FL 996 0 869

NA/RM 412 869 0

Page 53: Last lecture summary. Test-data and Cross Validation.

[Two dendrograms over the leaves MI, TO, BA, NA, RM, FL: complete linkage vs. single linkage]

Page 54: Last lecture summary. Test-data and Cross Validation.

Average linkage

Page 55: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

highlighted: 877 and 996 → average linkage: d(MI/TO, BA) = (996 + 877) / 2 = 936.5

       BA     FL   MI/TO  NA   RM
MI/TO  936.5        0

Page 56: Last lecture summary. Test-data and Cross Validation.

Centroid linkage

Page 57: Last lecture summary. Test-data and Cross Validation.

[Map of Italy: Torino, Milano, Florence, Rome, Naples, Bari]

     BA   FL   MI   NA   RM   TO

BA 0 662 877 255 412 996

FL 662 0 295 468 268 400

MI 877 295 0 754 564 138

NA 255 468 754 0 219 869

RM 412 268 564 219 0 669

TO 996 400 138 869 669 0

the cluster is represented by its centroid → d(MI/TO, BA) = 895 (distance from BA to the centroid of MI/TO)

       BA   FL   MI/TO  NA   RM
MI/TO  895        0

Page 58: Last lecture summary. Test-data and Cross Validation.

Summary

Similarity?
• single linkage (MIN)
• complete linkage (MAX)
• average linkage
• centroids


Page 63: Last lecture summary. Test-data and Cross Validation.

Ward’s linkage (method)

Page 64: Last lecture summary. Test-data and Cross Validation.

In Ward's method, distance metrics are not used and do not have to be chosen. Instead, sums of squares (i.e. squared Euclidean distances) between the centroids of clusters are computed.

Page 65: Last lecture summary. Test-data and Cross Validation.

• Ward's method says that the distance between two clusters, A and B, is how much the sum of squares will increase when we merge them.

• At the beginning of clustering, the sum of squares starts out at zero (because every point is in its own cluster) and then grows as we merge clusters.

• Ward's method keeps this growth as small as possible.
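A sketch of Ward's method with SciPy's hierarchical clustering (placeholder data; linkage computes the merges that minimize the increase in the sum of squares, fcluster cuts the tree):

```python
# Sketch: Ward's method on raw observations.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(30, 2)                          # placeholder data

Z = linkage(X, method="ward")                      # Ward merges: smallest sum-of-squares growth
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
print(labels)
```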

Page 66: Last lecture summary. Test-data and Cross Validation.
Page 67: Last lecture summary. Test-data and Cross Validation.

Types of clustering

• hierarchical
  – groups data with a sequence of nested partitions
  • agglomerative
    – bottom-up
    – Start with each data point as one cluster, join the clusters up to the situation when all points form one cluster.
  • divisive
    – top-down
    – Initially all objects are in one cluster, then the cluster is subdivided into smaller and smaller pieces.

• partitional
  – divides data points into some prespecified number of clusters without the hierarchical structure
  – i.e. divides the space

Page 68: Last lecture summary. Test-data and Cross Validation.

Hierarchical clustering

• Agglomerative methods are used more widely.
• Divisive methods need to consider $2^{N-1} - 1$ possible subset divisions, which is very computationally intensive.
  – computational difficulties of finding the optimum partitions
• Divisive clustering methods are better at finding large clusters than agglomerative methods.

Page 69: Last lecture summary. Test-data and Cross Validation.

Hierarchical clustering

• Disadvantages
  – High computational complexity – at least O(N²).
    • Needs to calculate all mutual distances.
  – Inability to adjust once the splitting or merging is performed
    • no undo

Page 70: Last lecture summary. Test-data and Cross Validation.

k-means

• How to avoid computing all mutual distances?

• Calculate distances from representatives (centroids) of clusters.

• Advantage: number of centroids is much lower than the number of data points.

• Disadvantage: number of centroids k must be given in advance

Page 71: Last lecture summary. Test-data and Cross Validation.

k-means – kids algorithm

• Once there was a land with N houses.
• One day K kings arrived to this land.
• Each house was taken by the nearest king.
• But the community wanted their king to be at the center of the village, so the throne was moved there.
• Then the kings realized that some houses were closer to them now, so they took those houses, but they lost some... This went on and on…
• Until one day they couldn't move anymore, so they settled down and lived happily ever after in their village.

Page 72: Last lecture summary. Test-data and Cross Validation.

k-means – adults algorithm

• decide on the number of clusters k
• randomly initialize k centroids
• repeat until convergence (centroids do not move)
  – assign each point to the cluster represented by the centroid it is nearest to
  – move the centroids to the position given as the mean of all points in the cluster
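A numpy-only sketch of exactly this loop (placeholder data; empty clusters are not handled in this sketch):

```python
# Sketch: the k-means loop described above.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # randomly initialize k centroids
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign each point to the cluster of the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):          # convergence: centroids do not move
            break
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])  # placeholder data
centroids, labels = kmeans(X, k=2)
print(centroids)
```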

Page 73: Last lecture summary. Test-data and Cross Validation.

k-means applet

http://www.kovan.ceng.metu.edu.tr/~maya/kmeans/index.html

Page 74: Last lecture summary. Test-data and Cross Validation.

• Disadvantages:
  – k must be determined in advance.
  – Sensitive to initial conditions. The algorithm minimizes the following "energy" function, but may be trapped in a local minimum:

    $E = \sum_{l=1}^{K} \sum_{\mathbf{x}_i \in X_l} \| \mathbf{x}_i - \boldsymbol{\mu}_l \|^2$

    where $\boldsymbol{\mu}_l$ is the centroid of cluster $X_l$.

  – Applicable only when the mean is defined; then what about categorical data? E.g. replace the mean with the mode (k-modes).
  – The arithmetic mean is not robust to outliers (use the median – k-medoids).
  – Clusters are spherical because the algorithm is based on distance.