Page 1: 11-Learning-5

Statistical Learning Part 2
Nonparametric Learning: A Survey

R. Moeller, Hamburg University of Technology

Page 2: 11-Learning-5

Instance-Based Learning

• So far we saw statistical learning as parameter learning, i.e., given a specific parameter-dependent family of probability models, fit it to the data by tweaking parameters
  – Often simple and effective
  – Fixed complexity
  – May be good when only very little data is available

Page 3: 11-Learning-5

Instance-Based Learning

• So far we saw statistical learning as parameter learning

• Nonparametric learning methods allow hypothesis complexity to grow with the data: “The more data we have, the ‘wigglier’ the hypothesis can be”

Page 4: 11-Learning-5

Characteristics

• An instance-based learner is a lazy learner and does all the work when the test example is presented. This is opposed to so-called eager learners, which build a parameterised compact model of the target.

• It produces a local approximation to the target function (different for each test instance)

Page 5: 11-Learning-5

Nearest Neighbor Classifier

• Basic idea: Store all labelled instances (i.e., the training set) and compare new unlabelled instances (i.e., the test set) to the stored ones to assign them an appropriate label. Comparison is performed by means of the Euclidean distance; the labels of the k nearest neighbors of a new instance determine the assigned label.

• Parameter: k (the number of nearest neighbors)

Page 6: 11-Learning-5

Nearest Neighbor Classifier

• 1-Nearest neighbor: Given a query instance x_q,
  – first locate the nearest training example x_n
  – then f(x_q) := f(x_n)

• k-Nearest neighbor: Given a query instance x_q,
  – first locate the k nearest training examples
  – if the target function is discrete-valued, take a vote among its k nearest neighbors
  – else, if the target function is real-valued, take the mean of the f values of the k nearest neighbors:

$$f(x_q) := \frac{1}{k} \sum_{i=1}^{k} f(x_i)$$
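As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of the prediction rule above; the function name and interface are made up, and it covers both the discrete (majority vote) and real-valued (mean) case. With k = 1 it reduces to f(x_q) := f(x_n) for the single nearest x_n.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=1, discrete=True):
    """Predict the label/value of a single query instance with k-NN."""
    # Euclidean distance from the query to every stored training instance
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbors
    if discrete:
        # discrete-valued target: majority vote among the k nearest neighbors
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # real-valued target: mean of the neighbors' f values
    return float(np.mean(y_train[nearest]))
```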

Page 7: 11-Learning-5

Distance Between Examples

• We need a measure of distance in order to know who the neighbours are

• Assume that we have T attributes for the learning problem. Then one example point x has elements x_t ∈ ℜ, t = 1, …, T.

• The distance between two points x_i and x_j is often defined as the Euclidean distance:

$$d(x_i, x_j) = \sqrt{\sum_{t=1}^{T} \left( x_{ti} - x_{tj} \right)^2}$$
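Written out as code (a trivial sketch added for completeness; the helper name is invented), the Euclidean distance between two T-dimensional examples is:

```python
import math

def euclidean_distance(x_i, x_j):
    # d(x_i, x_j) = sqrt( sum over t of (x_ti - x_tj)^2 ), t = 1..T
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))
```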

Page 8: 11-Learning-5

Difficulties

• There may be irrelevant attributes amongst the attributes – curse of dimensionality

• Have to calculate the distance of the test case from all training cases

Page 9: 11-Learning-5
Page 10: 11-Learning-5

kNN vs 1NN: Voronoi Diagram

Page 11: 11-Learning-5
Page 12: 11-Learning-5

When to Consider kNN Algorithms?

• Instances map to points in ℜⁿ
• Not more than, say, 20 attributes per instance
• Lots of training data
• Advantages:
  – Training is very fast
  – Can learn complex target functions
  – Don't lose information
• Disadvantages: ? (will see them shortly…)

Page 13: 11-Learning-5

[Figure: eight example images, labelled one to eight; image eight is marked “?”]

Page 14: 11-Learning-5

Training data

Number  Lines  Line types  Rectangles  Colours  Mondrian?
1       6      1           10          4        No
2       4      2           8           5        No
3       5      2           7           4        Yes
4       5      1           8           4        Yes
5       5      1           10          5        No
6       6      1           8           6        Yes
7       7      1           14          5        No

Test instance

Number  Lines  Line types  Rectangles  Colours  Mondrian?
8       7      2           9           4        ?

Page 15: 11-Learning-5

Keep Data in Normalized Form

One way to normalise the data $a_r(x)$ to $a'_r(x)$ is

$$x'_t = \frac{x_t - \bar{x}_t}{\sigma_t}$$

where $\bar{x}_t$ is the mean of the $t$-th attribute and $\sigma_t$ is the standard deviation of the $t$-th attribute (the average distance of the data values from their mean).
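A minimal sketch of this normalisation in NumPy (the helper name is invented; it assumes the population form of the standard deviation, which matches the worked example on the next slides):

```python
import numpy as np

def normalise(X):
    """Scale each attribute (column) to zero mean and unit standard deviation."""
    mean = X.mean(axis=0)    # per-attribute mean (x̄_t)
    std = X.std(axis=0)      # per-attribute standard deviation (σ_t), population form
    return (X - mean) / std, mean, std

# A test instance is scaled with the *training* statistics:
#   x_test_normalised = (x_test - mean) / std
```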

Page 16: 11-Learning-5

Normalised Training Data

Number  Lines   Line types  Rectangles  Colours  Mondrian?
1        0.632  -0.632       0.327      -1.021   No
2       -1.581   1.581      -0.588       0.408   No
3       -0.474   1.581      -1.046      -1.021   Yes
4       -0.474  -0.632      -0.588      -1.021   Yes
5       -0.474  -0.632       0.327       0.408   No
6        0.632  -0.632      -0.588       1.837   Yes
7        1.739  -0.632       2.157       0.408   No

Test instance

Number  Lines   Line types  Rectangles  Colours  Mondrian?
8        1.739   1.581      -0.131      -1.021   ?

Page 17: 11-Learning-5

Distances of Test Instance From Training Data

Example  Distance of test from example  Mondrian?
1        2.517                          No
2        3.644                          No
3        2.395                          Yes
4        3.164                          Yes
5        3.472                          No
6        3.808                          Yes
7        3.490                          No

Classification
1-NN  Yes
3-NN  Yes
5-NN  No
7-NN  No
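The table can be reproduced with a short script (a sketch, not from the slides; it assumes z-score normalisation with the population standard deviation, which matches the normalised values above, and should agree with the distances and votes up to rounding):

```python
import numpy as np
from collections import Counter

# Raw training data from the earlier table: Lines, Line types, Rectangles, Colours
X = np.array([[6, 1, 10, 4],
              [4, 2,  8, 5],
              [5, 2,  7, 4],
              [5, 1,  8, 4],
              [5, 1, 10, 5],
              [6, 1,  8, 6],
              [7, 1, 14, 5]], dtype=float)
y = np.array(["No", "No", "Yes", "Yes", "No", "Yes", "No"])
x_test = np.array([7, 2, 9, 4], dtype=float)      # instance 8

# Normalise with the training-set statistics only
mean, std = X.mean(axis=0), X.std(axis=0)
Xn, xn_test = (X - mean) / std, (x_test - mean) / std

dists = np.linalg.norm(Xn - xn_test, axis=1)      # distances of instance 8 from 1..7
order = np.argsort(dists)
for k in (1, 3, 5, 7):
    vote = Counter(y[order[:k]].tolist()).most_common(1)[0][0]
    print(f"{k}-NN -> {vote}")                    # Yes, Yes, No, No
```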

Page 18: 11-Learning-5

What if the target function is real valued?

• The k-nearest neighbor algorithm would just calculate the mean of the k nearest neighbours

Page 19: 11-Learning-5

Variant of kNN: Distance-Weighted kNN

• We might want to weight nearer neighbors more heavily

• Then it makes sense to use all training examples instead of just k (Shepard's method)

$$f(x_q) := \frac{\sum_{i=1}^{k} w_i \, f(x_i)}{\sum_{i=1}^{k} w_i}, \qquad \text{where } w_i = \frac{1}{d(x_q, x_i)^2}$$
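A sketch of this distance-weighted rule for a real-valued target (the function name is invented; if the query coincides with a stored instance, its stored value is returned directly to avoid dividing by zero):

```python
import numpy as np

def distance_weighted_knn(X_train, f_train, x_query, eps=1e-12):
    """Predict f(x_query) as a distance-weighted mean over *all* training examples."""
    d2 = np.sum((X_train - x_query) ** 2, axis=1)   # squared distances d(x_q, x_i)^2
    if np.any(d2 < eps):                            # query equals a stored instance
        return float(f_train[np.argmin(d2)])
    w = 1.0 / d2                                    # weights w_i = 1 / d(x_q, x_i)^2
    return float(np.sum(w * f_train) / np.sum(w))
```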

Page 20: 11-Learning-5

Parzen classifier

• Basic idea: Estimate the densities of the distributions of instances for each class by summing the distance-weighted contributions of each instance in a class within a pre-defined window function.

• Parameter h: size of the window over which the “interpolation” takes place. For h → 0 one obtains a Dirac delta function; for increasing values of h more “smoothing” is achieved.

Page 21: 11-Learning-5
Page 22: 11-Learning-5

[Figure: n Gaussian Parzen windows, with “smoothing factor” h, applied to Gaussian-distributed points]

Page 23: 11-Learning-5
Page 24: 11-Learning-5
Page 25: 11-Learning-5
Page 26: 11-Learning-5

Parzen classification

Parzen classifiers estimate the densities for each category and classify a test instance by the label corresponding to the maximum posterior.

The decision region for a Parzen-window classifier depends upon the choice of window function.
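A minimal sketch of this idea with Gaussian windows (the Gaussian choice matches the figures above; names and the one-dimensional setting are illustrative assumptions): each class density is the sum of one window per stored instance, and the test instance gets the label with the largest prior-weighted density.

```python
import numpy as np

def parzen_density(x, samples, h):
    """Parzen-window density estimate at x using Gaussian windows of width h."""
    u = (x - samples) / h                                   # scaled distances to each sample
    windows = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)    # one Gaussian bump per instance
    return windows.sum() / (len(samples) * h)               # average window contribution

def parzen_classify(x, class_samples, h):
    """Return the label whose estimated (prior-weighted) density at x is largest."""
    n_total = sum(len(s) for s in class_samples.values())
    scores = {label: (len(s) / n_total) * parzen_density(x, np.asarray(s, dtype=float), h)
              for label, s in class_samples.items()}
    return max(scores, key=scores.get)
```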

Page 27: 11-Learning-5
Page 28: 11-Learning-5
Page 29: 11-Learning-5
Page 30: 11-Learning-5

XOR problem

Page 31: 11-Learning-5
Page 32: 11-Learning-5
Page 33: 11-Learning-5

XOR problem

Page 34: 11-Learning-5

XOR problem

• Solution: Use multiple layers

Page 35: 11-Learning-5
Page 36: 11-Learning-5
Page 37: 11-Learning-5
Page 38: 11-Learning-5

Backpropagation classifier

• Basic idea: Mapping the instances via a non-linear transformation into a space where they can be separated by a hyperplane that minimizes the classification error

• Parameters: learning rate and number of non-linear basis functions
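As an illustration of both the XOR problem above and the backpropagation idea, here is a small NumPy sketch of a one-hidden-layer network trained by backpropagation on XOR; the architecture, learning rate, and iteration count are illustrative choices, not taken from the slides.

```python
import numpy as np

# XOR: not separable by a single hyperplane, but learnable with one hidden layer
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)     # input -> 8 tanh hidden units
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)     # hidden -> sigmoid output
lr = 0.5                                          # learning rate (illustrative)

for _ in range(10000):
    # forward pass
    h = np.tanh(X @ W1 + b1)
    out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    # backward pass: gradients of the cross-entropy loss for both layers
    d_out = (out - y) / len(X)
    d_hid = (d_out @ W2.T) * (1.0 - h ** 2)
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_hid)
    b1 -= lr * d_hid.sum(axis=0)

print(np.round(out.ravel(), 2))   # should approach [0, 1, 1, 0]
```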

Page 39: 11-Learning-5
Page 40: 11-Learning-5
Page 41: 11-Learning-5
Page 42: 11-Learning-5

Radial Basis Function Classifier

• Basic idea: Representing the instances by a number of prototypes. Each prototype is activated by weighting its nearest instances. The prototype activations may be taken as input for a perceptron or other classifier.

• Parameters: number of prototypes (basis functions)
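A sketch of the RBF idea under illustrative assumptions (Gaussian basis functions centred on a random subset of the training instances, followed by a linear least-squares readout; the slides leave the choice of final classifier open):

```python
import numpy as np

def rbf_activations(X, prototypes, width):
    """Activation of each Gaussian prototype (basis function) for each instance."""
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def fit_rbf_classifier(X, y, n_prototypes=10, width=1.0, seed=0):
    """Choose prototypes from the data, then fit a linear readout on their activations."""
    rng = np.random.default_rng(seed)
    prototypes = X[rng.choice(len(X), size=n_prototypes, replace=False)]
    Phi = rbf_activations(X, prototypes, width)
    weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # linear output layer
    return prototypes, weights

# Prediction: sign of rbf_activations(X_new, prototypes, width) @ weights  (for labels ±1)
```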

Page 43: 11-Learning-5

Example of a basis function

Page 44: 11-Learning-5

Support Vector Machine Classifier

• Basic idea: Mapping the instances from the two classes into a space where they become linearly separable. The mapping is achieved using a kernel function that operates on the instances near to the margin of separation.

• Parameter: kernel type
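A hedged sketch using scikit-learn's SVC (the library choice is an assumption, not something the slides prescribe), applied to an XOR-style toy problem with a quadratic polynomial kernel:

```python
import numpy as np
from sklearn.svm import SVC   # assumed dependency; any SVM implementation would do

# A toy two-class problem that is not linearly separable in the original input space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# Quadratic polynomial kernel; compare the example kernel K(x, y) = (x . y)^2 below
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=10.0)
clf.fit(X, y)
print(clf.predict(X))          # ideally recovers [-1, 1, 1, -1]
print(clf.support_vectors_)    # the instances that define the margin
```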

Page 45: 11-Learning-5

Nonlinear Separation

[Figure: two classes, labelled y = +1 and y = −1, separated by a nonlinear boundary]

Page 46: 11-Learning-5

Feature map: $(x_1, x_2) \mapsto \left( x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2 \right)$

Page 47: 11-Learning-5

Support Vectors

[Figure: margin and separator (decision line), with the support vectors marked]

Page 48: 11-Learning-5

Kernels

• Support Vector Machines are generally known as “kernel machines”

Page 49: 11-Learning-5

Example Kernel

$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^2$$
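A short check (added here; the two vectors are arbitrary) that this kernel computes the inner product in the feature space of the earlier slide, i.e. K(x, y) = φ(x) · φ(y) with φ(x) = (x₁², x₂², √2·x₁x₂):

```python
import numpy as np

def phi(x):
    """Feature map from the earlier slide: (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def K(x, y):
    """Example kernel K(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(x, y), float(np.dot(phi(x), phi(y))))   # both print 1.0
```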

Page 50: 11-Learning-5

Literature

Mitchell (1997). Machine Learning. http://www.cs.cmu.edu/~tom/mlbook.html

Duda, Hart, & Stork (2000). Pattern Classification. http://rii.ricoh.com/~stork/DHS.html

Hastie, Tibshirani, & Friedman (2001). The Elements of Statistical Learning. http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Page 51: 11-Learning-5

Literature (cont.)

Russell & Norvig (2004). Artificial Intelligence. http://aima.cs.berkeley.edu/

Shawe-Taylor & Cristianini. Kernel Methods for Pattern Analysis. http://www.kernel-methods.net/