

Oct 14, 2014




Statistical Learning Part 2
Nonparametric Learning: A Survey
R. Moeller, Hamburg University of Technology

Instance-Based Learning

So far we saw statistical learning as parameter learning, i.e., given a specific parameter-dependent family of probability models, fit it to the data by tweaking parameters.
Often simple and effective.
Fixed complexity.
Maybe good for very little data.

Instance-Based Learning

So far we saw statistical learning as parameter learning.
Nonparametric learning methods allow hypothesis complexity to grow with the data.
The more data we have, the wigglier the hypothesis can be.

Characteristics
An instance-based learner is a lazy learner and does all the work when the test example is presented. This is opposed to so-called eager learners, which build a compact parameterised model of the target.
It produces a local approximation to the target function (different for each test instance).

Nearest Neighbor Classifier

Basic idea: Store all labelled instances (i.e., the training set) and compare new unlabeled instances (i.e., the test set) to the stored ones to assign them an appropriate label. Comparison is performed by means of the Euclidean distance; the labels of the k nearest neighbors of a new instance determine the assigned label.

Parameter: k (the number of nearest neighbors)

Nearest Neighbor Classifier

1-Nearest neighbor: Given a query instance $x_q$, first locate the nearest training example $x_n$, then set $\hat{f}(x_q) := f(x_n)$.

K-Nearest neighbor: Given a query instance $x_q$, first locate the k nearest training examples. If the target function is discrete-valued, take a vote among the k nearest neighbors; if it is real-valued, take the mean of the f values of the k nearest neighbors:

$$ \hat{f}(x_q) := \frac{1}{k} \sum_{i=1}^{k} f(x_i) $$
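The two cases of the k-nearest-neighbor rule (vote for discrete targets, mean for real-valued targets) can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names and the toy data are invented for this example, and ties in the vote are broken by whichever label was seen first.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to the query."""
    return sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]

def knn_classify(train, query, k):
    """Discrete target: majority vote among the k nearest neighbours."""
    votes = Counter(target for _, target in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k):
    """Real-valued target: mean of f over the k nearest neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(target for _, target in neighbours) / len(neighbours)

# Toy data, made up for illustration only.
toy_labels = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
              ((5.0, 5.0), "b"), ((5.0, 6.0), "b")]
toy_values = [((0.0,), 1.0), ((1.0,), 3.0), ((10.0,), 50.0)]

print(knn_classify(toy_labels, (0.5, 0.5), k=3))  # a
print(knn_regress(toy_values, (0.4,), k=2))       # 2.0
```

Note that there is no training step: all the work happens at query time, which is what makes the learner lazy.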


Distance Between Examples

We need a measure of distance in order to know who the neighbours are. Assume that we have T attributes for the learning problem. Then one example point x has elements $x_t$, t = 1, ..., T. The distance between two points $x_i$ and $x_j$ is often defined as the Euclidean distance:

$$ d(x_i, x_j) = \sqrt{\sum_{t=1}^{T} \left[ x_{ti} - x_{tj} \right]^2} $$



There may be irrelevant attributes amongst the attributes (curse of dimensionality).
We have to calculate the distance of the test case from all training cases.

kNN vs 1NN: Voronoi Diagram

When to Consider kNN Algorithms?

Instances map to points in $\mathbb{R}^n$.
Not more than, say, 20 attributes per instance.
Lots of training data.

Advantages:
Training is very fast.
Can learn complex target functions.
Don't lose information.

Disadvantages: ? (will see them shortly)








Eight ?

Training data

Number  Lines  Line types  Rectangles  Colours  Mondrian?
1       6      1           10          4        No
2       4      2           8           5        No
3       5      2           7           4        Yes
4       5      1           8           4        Yes
5       5      1           10          5        No
6       6      1           8           6        Yes
7       7      1           14          5        No

Test instance

Number  Lines  Line types  Rectangles  Colours  Mondrian?
8       7      2           9           4        ?

Keep Data in Normalized Form

One way to normalise the data $a_r(x)$ to $a'_r(x)$ is

$$ x_t' = \frac{x_t - \bar{x}_t}{\sigma_t} $$

where $\bar{x}_t$ is the mean of the t-th attribute and $\sigma_t$ is the standard deviation of the t-th attribute (roughly, the average distance of the data values from their mean).

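The normalisation can be checked on the "Lines" attribute of the training data. The sketch below assumes the population standard deviation (dividing by the number of examples), since that is the variant that reproduces the normalised values given in these slides; the function name is chosen for this example.

```python
import statistics

def normalise(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std: divides by n, not n - 1
    return [(v - mean) / std for v in values]

# "Lines" attribute of the seven training examples.
lines = [6, 4, 5, 5, 5, 6, 7]
print([round(z, 3) for z in normalise(lines)])
# [0.632, -1.581, -0.474, -0.474, -0.474, 0.632, 1.739]
```

A test instance is normalised with the mean and standard deviation of the training data, not with statistics that include the test instance itself.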

Normalised Training Data

Number  Lines   Line types  Rectangles  Colours  Mondrian?
1        0.632  -0.632       0.327      -1.021   No
2       -1.581   1.581      -0.588       0.408   No
3       -0.474   1.581      -1.046      -1.021   Yes
4       -0.474  -0.632      -0.588      -1.021   Yes
5       -0.474  -0.632       0.327       0.408   No
6        0.632  -0.632      -0.588       1.837   Yes
7        1.739  -0.632       2.157       0.408   No

Test instance

Number  Lines   Line types  Rectangles  Colours  Mondrian?
8        1.739   1.581      -0.131      -1.021   ?

Distances of Test Instance From Training Data

Example  Distance of test from example  Mondrian?
1        2.517                          No
2        3.644                          No
3        2.395                          Yes
4        3.164                          Yes
5        3.472                          No
6        3.808                          Yes
7        3.490                          No

Classification

k-NN   Mondrian?
1-NN   Yes
3-NN   Yes
5-NN   No
7-NN   No
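The whole worked example can be reproduced from the raw training data. The sketch below assumes normalisation with the population standard deviation and a plain majority vote; the variable and function names are chosen for this example. The computed distances agree with the tabulated ones to within about 0.001 (small differences come from rounding).

```python
import math
import statistics
from collections import Counter

# Raw attributes per example: lines, line types, rectangles, colours.
train_x = [(6, 1, 10, 4), (4, 2, 8, 5), (5, 2, 7, 4), (5, 1, 8, 4),
           (5, 1, 10, 5), (6, 1, 8, 6), (7, 1, 14, 5)]
train_y = ["No", "No", "Yes", "Yes", "No", "Yes", "No"]
test_x = (7, 2, 9, 4)

# Per-attribute mean and population standard deviation of the training data.
columns = list(zip(*train_x))
means = [statistics.fmean(c) for c in columns]
stds = [statistics.pstdev(c) for c in columns]

def normalise(x):
    """Normalise one example with the training means and deviations."""
    return [(v - m) / s for v, m, s in zip(x, means, stds)]

def distance(a, b):
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

query = normalise(test_x)
dists = [distance(normalise(x), query) for x in train_x]
ranked = sorted(zip(dists, train_y))  # nearest example first

def classify(k):
    """Majority vote among the k nearest training examples."""
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

for k in (1, 3, 5, 7):
    print(f"{k}-NN: {classify(k)}")
# 1-NN: Yes
# 3-NN: Yes
# 5-NN: No
# 7-NN: No
```

The answer flips between k = 3 and k = 5, which illustrates how sensitive the classifier can be to the choice of k on a small training set.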

What if the target function is real valued?

The k-nearest neighbor algorithm would just calculate the mean of the k nearest neighbours

Variant of kNN: Distance-Weighted kNN

We might want to weight nearer neighbors more heavily:

$$ \hat{f}(x_q) := \frac{\sum_{i=1}^{k} w_i \, f(x_i)}{\sum_{i=1}^{k} w_i} \quad \text{where} \quad w_i = \frac{1}{d(x_q, x_i)^2} $$

Then it makes sense to use all training examples instead of just k (Shepard's method).
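A minimal sketch of the distance-weighted rule for a real-valued target follows. The function name, the `k=None` convention for "use all training examples", and the handling of a query that coincides with a training point (where the weight would be infinite) are assumptions made for this example.

```python
import math

def distance(a, b):
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def weighted_knn_regress(train, query, k=None):
    """Distance-weighted mean of f; k=None uses every training example."""
    scored = sorted(((distance(x, query), fx) for x, fx in train),
                    key=lambda pair: pair[0])
    if k is not None:
        scored = scored[:k]
    # Assumption: if the query coincides with training points, return
    # the mean of their f values instead of dividing by zero.
    exact = [fx for d, fx in scored if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / d ** 2 for d, _ in scored]
    weighted_sum = sum(w * fx for w, (_, fx) in zip(weights, scored))
    return weighted_sum / sum(weights)

# Toy data, made up for illustration only.
toy = [((0.0,), 0.0), ((2.0,), 10.0), ((10.0,), 100.0)]
print(weighted_knn_regress(toy, (1.0,), k=2))  # 5.0 (two equidistant neighbours)
print(weighted_knn_regress(toy, (0.5,)))       # all examples, dominated by the nearest
```

Because the weights fall off with the squared distance, far-away examples contribute very little, which is why using the whole training set becomes reasonable.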

Parzen classifier

Basic idea: Estimates the densities of the distributions of instances for each class by summing the distance-weighted contributions of each instance in a class within a predefined window function.

Parameter h: size of the window over which the interpolation takes place. As h approaches 0, each window approaches a Dirac delta function; for increasing values of h, more smoothing is achieved.

n Gaussian Parzen windows, with smoothing factor h applied to Gaussian distributed points

Parzen classification

Parzen classifiers estimate the densities for each category and classify a test instance by the label corresponding to the maximum posterior.
The decision region for a Parzen-window classifier depends upon the choice of window function.
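A Parzen-window classifier with a Gaussian window can be sketched as follows. This is an illustration under stated assumptions: the window is an isotropic Gaussian of width h, the class priors are estimated from class frequencies, and the function names and toy data are invented for this example.

```python
import math

def gaussian_window(sq_dist, h, dim):
    """Isotropic Gaussian window of width h, evaluated at a squared distance."""
    norm = (2.0 * math.pi * h * h) ** (dim / 2.0)
    return math.exp(-sq_dist / (2.0 * h * h)) / norm

def parzen_density(x, samples, h):
    """Density estimate at x: average of the windows centred on the samples."""
    dim = len(x)
    total = sum(
        gaussian_window(sum((a - b) ** 2 for a, b in zip(x, s)), h, dim)
        for s in samples
    )
    return total / len(samples)

def parzen_classify(x, samples_by_class, h):
    """Pick the label with the largest prior times class-conditional density."""
    n_total = sum(len(s) for s in samples_by_class.values())
    scores = {
        label: (len(s) / n_total) * parzen_density(x, s, h)
        for label, s in samples_by_class.items()
    }
    return max(scores, key=scores.get)

# Toy data, made up for illustration only.
data = {
    "a": [(0.0, 0.0), (0.5, 0.2), (-0.3, 0.4)],
    "b": [(5.0, 5.0), (4.6, 5.3)],
}
print(parzen_classify((0.2, 0.1), data, h=1.0))  # a
print(parzen_classify((4.8, 5.1), data, h=1.0))  # b
```

A small h makes the estimate spiky around the stored instances, while a large h smooths it out; in practice h is usually chosen by validation.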

XOR problem


Solution: Use multiple layers
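The point of the XOR slides can be illustrated with threshold units. In the sketch below, a coarse grid search over single-unit weights finds no setting that reproduces XOR (no such setting exists, because XOR is not linearly separable), while a hand-wired two-layer network does. The weights, the grid, and the function names are assumptions chosen for this example; nothing here is learned.

```python
import itertools

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step(z):
    """Threshold activation: 1 if the input is positive, else 0."""
    return 1 if z > 0 else 0

def single_unit(w1, w2, b, x1, x2):
    """One linear threshold unit (a single-layer perceptron)."""
    return step(w1 * x1 + w2 * x2 + b)

def two_layer(x1, x2):
    """Hand-wired network: hidden OR and AND units, output = OR and not AND."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)

# Try every single-unit weight setting on a coarse grid from -4 to 4.
grid = [i / 2 for i in range(-8, 9)]
single_layer_solves_xor = any(
    all(single_unit(w1, w2, b, *x) == y for x, y in XOR.items())
    for w1, w2, b in itertools.product(grid, repeat=3)
)
two_layer_solves_xor = all(two_layer(*x) == y for x, y in XOR.items())

print(single_layer_solves_xor)  # False
print(two_layer_solves_xor)     # True
```

The hidden layer re-represents the inputs so that the output unit only has to draw a single line, which is the idea the backpropagation classifier builds on.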

Backpropagation classifier

Basic idea: Mapping the instances via a non-linear transformation into a space where they can be separated by a hyperplane that minimizes the classification error.

Parameters: learning rate and number of non-linear basis functions,

Radial Basis Function Classifier

Basic idea: Representing the instances by a number of prototypes. Each prototype is activated by weighting its nearest instances. The prototype activations may be taken as input for a perceptron or other classifier.

Parameters: Number of prototypes (basis functions)

Example of a basis function
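One common choice of basis function is a Gaussian centred on a prototype. The sketch below assumes that choice and uses two hand-picked prototypes and a hand-picked width to show how prototype activations can feed a simple linear threshold unit; here they make the XOR problem linearly separable. All names, prototypes, and constants are assumptions for illustration, not values from the slides.

```python
import math

def rbf(x, centre, sigma):
    """Gaussian basis function: activation decays with distance from the centre."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, centre))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Hand-picked prototypes and width (assumptions for this example).
PROTOTYPES = [(0.0, 0.0), (1.0, 1.0)]
SIGMA = 0.5

def activations(x):
    """Activation of every prototype for input x."""
    return [rbf(x, c, SIGMA) for c in PROTOTYPES]

def classify(x):
    """Linear threshold unit on the activations: 1 if their sum is below 0.6."""
    return 1 if 0.6 - sum(activations(x)) > 0 else 0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, [round(a, 3) for a in activations(x)], classify(x))
```

Inputs close to a prototype give a summed activation near 1, while the two mixed inputs give about 0.27, so a single threshold on the activations separates the classes.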

Support Vector Machine Classifier

Basic idea: Mapping the instances from the two classes into a space where they become linearly separable. The mapping is achieved using a kernel function that operates on the instances near to the margin of separation.

Parameter: kernel type

Nonlinear Separation

[Figure: points of the two classes, y = -1 and y = +1, under the mapping]

$$ (x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2) $$

Support Vectors

[Figure labels: margin, separator (decision line), support vectors]

Kernels

Support Vector Machines are generally known as kernel machines.

Example Kernel

$$ K(x, y) = (x \cdot y)^2 $$
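For two-dimensional inputs, this kernel equals an ordinary dot product in the three-dimensional feature space $(x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$, so the mapping never has to be computed explicitly. The short check below illustrates this numerically; the function names and the sample points are chosen for this example.

```python
import math

def kernel(x, y):
    """Example kernel: squared dot product of two 2-D vectors."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map matching the kernel for 2-D inputs."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2.0) * x[0] * x[1])

def dot(a, b):
    """Ordinary dot product."""
    return sum(p * q for p, q in zip(a, b))

x, y = (1.0, 2.0), (3.0, -1.0)
print(kernel(x, y))         # 1.0
print(dot(phi(x), phi(y)))  # equal to the kernel value up to floating-point rounding
```

Expanding $(x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2$ gives the same identity algebraically.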
