MIRA, SVM, k-NN
Lirong Xia
Linear Classifiers (perceptrons)
• Inputs are feature values
• Each feature has a weight
• Sum is the activation
• If the activation is:
  • Positive: output +1
  • Negative: output -1

$\text{activation}_w(x) = \sum_i w_i \cdot f_i(x) = w \cdot f(x)$
Classification: Weights
• Binary case: compare features to a weight vector
• Learning: figure out the weight vector from examples
Binary Decision Rule
• In the space of feature vectors:
  • Examples are points
  • Any weight vector is a hyperplane
  • One side corresponds to Y = +1
  • The other corresponds to Y = -1
Learning: Binary Perceptron
• Start with weights = 0
• For each training instance:
  • Classify with current weights

$y = \begin{cases} +1 & \text{if } w \cdot f(x) \ge 0 \\ -1 & \text{if } w \cdot f(x) < 0 \end{cases}$

  • If correct (i.e. y = y*), no change!
  • If wrong: adjust the weight vector by adding or subtracting the feature vector (subtract if y* is -1):

$w = w + y^* \cdot f(x)$
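A minimal sketch of this update rule in Python (the data format, feature representation, and number of passes are assumptions for illustration, not part of the slides):

```python
import numpy as np

def train_binary_perceptron(features, labels, epochs=10):
    """Binary perceptron: features is an (n, d) array, labels are +1 / -1."""
    w = np.zeros(features.shape[1])              # start with weights = 0
    for _ in range(epochs):                      # several passes over the data
        for f, y_star in zip(features, labels):
            y = 1 if np.dot(w, f) >= 0 else -1   # classify with current weights
            if y != y_star:                      # if wrong...
                w = w + y_star * f               # ...add or subtract the feature vector
    return w
```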
Multiclass Decision Rule
• If we have multiple classes:
  • A weight vector for each class: $w_y$
  • Score (activation) of a class y: $w_y \cdot f(x)$
  • Prediction: highest score wins

$y = \arg\max_y w_y \cdot f(x)$

• Binary = multiclass where the negative class has weight zero
Learning: Multiclass Perceptron
• Start with all weights = 0
• Pick up training examples one by one
• Predict with current weights:

$y = \arg\max_y w_y \cdot f(x) = \arg\max_y \sum_i w_{y,i} \, f_i(x)$

• If correct, no change!
• If wrong: lower the score of the wrong answer, raise the score of the right answer:

$w_y = w_y - f(x)$
$w_{y^*} = w_{y^*} + f(x)$
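A minimal sketch of the multiclass update (class indices, data format, and epoch count are assumptions for illustration):

```python
import numpy as np

def train_multiclass_perceptron(features, labels, num_classes, epochs=10):
    """Multiclass perceptron: labels are class indices 0..num_classes-1."""
    w = np.zeros((num_classes, features.shape[1]))   # one weight vector per class
    for _ in range(epochs):
        for f, y_star in zip(features, labels):
            y = int(np.argmax(w @ f))    # predict: highest score wins
            if y != y_star:              # if wrong...
                w[y] -= f                # ...lower the score of the wrong answer
                w[y_star] += f           # ...raise the score of the right answer
    return w
```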
Today
• Fixing the Perceptron: MIRA
• Support Vector Machines
• k-nearest neighbor (KNN)
Properties of Perceptrons
• Separability: some setting of the parameters gets the training set perfectly correct
• Convergence: if the training set is separable, the perceptron will eventually converge (binary case)
Examples: Perceptron
• Non-Separable Case
Problems with the Perceptron
• Noise: if the data isn't separable, weights might thrash
  • Averaging weight vectors over time can help (averaged perceptron; a sketch follows below)
• Mediocre generalization: finds a "barely" separating solution
• Overtraining: test / held-out accuracy usually rises, then falls
  • Overtraining is a kind of overfitting
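A minimal sketch of the averaged-perceptron idea mentioned above: keep a running sum of the weight vector and return its average instead of the final weights (the data format and epoch count are assumptions for illustration):

```python
import numpy as np

def train_averaged_perceptron(features, labels, epochs=10):
    """Binary perceptron whose output is the average of the weight vector over time."""
    w = np.zeros(features.shape[1])
    w_sum = np.zeros_like(w)                     # running sum of weight vectors
    for _ in range(epochs):
        for f, y_star in zip(features, labels):
            if (1 if np.dot(w, f) >= 0 else -1) != y_star:
                w = w + y_star * f               # ordinary perceptron update
            w_sum += w                           # accumulate after every example
    return w_sum / (epochs * len(features))      # averaged weights thrash less on noisy data
```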
Fixing the Perceptron
• Idea: adjust the weight update to mitigate these effects
• MIRA*: choose an update size that fixes the current mistake…
• …but minimizes the change to w

$\min_w \frac{1}{2} \sum_y \| w_y - w'_y \|^2 \quad \text{s.t. } w_{y^*} \cdot f(x) \ge w_y \cdot f(x) + 1$

• The +1 helps to generalize
• Guessed y instead of y* on example x with features f(x):

$w_y = w'_y - \tau f(x)$
$w_{y^*} = w'_{y^*} + \tau f(x)$

*Margin Infused Relaxed Algorithm
Minimum Correcting Update
$\min_w \frac{1}{2} \sum_y \| w_y - w'_y \|^2 \quad \text{s.t. } w_{y^*} \cdot f \ge w_y \cdot f + 1$

• Update, with step size τ:

$w_y = w'_y - \tau f(x)$
$w_{y^*} = w'_{y^*} + \tau f(x)$

• Substituting the update, the problem becomes

$\min_\tau \| \tau f \|^2 \quad \text{s.t. } (w'_{y^*} + \tau f) \cdot f \ge (w'_y - \tau f) \cdot f + 1$

• The min is not at τ = 0 (or we would not have made an error), so the min is where equality holds:

$\tau = \frac{(w'_y - w'_{y^*}) \cdot f + 1}{2 \, f \cdot f}$
Maximum Step Size
• In practice, it's also bad to make updates that are too large
  • Example may be labeled incorrectly
  • You may not have enough features
• Solution: cap the maximum possible value of τ with some constant C
• Corresponds to an optimization that assumes non-separable data
• Usually converges faster than the perceptron
• Usually better, especially on noisy data

$\tau^* = \min\left( \frac{(w'_y - w'_{y^*}) \cdot f + 1}{2 \, f \cdot f}, \; C \right)$
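A minimal sketch of one MIRA step under these definitions (the value of C, the data format, and the outer training loop are assumptions for illustration):

```python
import numpy as np

def mira_update(w, f, y_star, C=0.01):
    """One MIRA step: w is a (num_classes, d) float array, f a feature vector, y_star the true class."""
    y = int(np.argmax(w @ f))             # predict with current weights
    if y != y_star:
        # minimum correcting step size, capped at C
        tau = (np.dot(w[y] - w[y_star], f) + 1) / (2 * np.dot(f, f))
        tau = min(tau, C)
        w[y] -= tau * f                   # lower the score of the wrong answer
        w[y_star] += tau * f              # raise the score of the right answer
    return w
```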
Outline
• Fixing the Perceptron: MIRA
• Support Vector Machines
• k-nearest neighbor (KNN)
Linear Separators
• Which of these linear separators is optimal?
Support Vector Machines
• Maximizing the margin: good according to intuition, theory, practice
• Only support vectors matter; other training examples are ignorable
• Support vector machines (SVMs) find the separator with max margin
• Basically, SVMs are MIRA where you optimize over all examples at once

MIRA:
$\min_w \frac{1}{2} \sum_y \| w_y - w'_y \|^2 \quad \text{s.t. } w_{y^*} \cdot f(x_i) \ge w_y \cdot f(x_i) + 1$

SVM:
$\min_w \frac{1}{2} \sum_y \| w_y \|^2 \quad \text{s.t. } \forall i, y: \; w_{y^*} \cdot f(x_i) \ge w_y \cdot f(x_i) + 1$
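A short illustration of a max-margin linear separator; the use of scikit-learn and the toy data are assumptions for illustration, not something the slides prescribe:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2-D data
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.5]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1.0)    # linear max-margin classifier
clf.fit(X, y)

print(clf.support_vectors_)          # only the support vectors define the separator
print(clf.predict([[2.0, 2.5]]))
```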
Classification: Comparison
• Naive Bayes:
  • Builds a model of the training data
  • Gives prediction probabilities
  • Strong assumptions about feature independence
  • One pass through the data (counting)
• Perceptrons / MIRA:
  • Make fewer assumptions about the data
  • Mistake-driven learning
  • Multiple passes through the data (prediction)
  • Often more accurate
Outline
• Fixing the Perceptron: MIRA
• Support Vector Machines
• k-nearest neighbor (KNN)
Case-Based Reasoning
• Similarity for classification
  • Case-based reasoning
  • Predict an instance's label using similar instances
• Nearest-neighbor classification
  • 1-NN: copy the label of the most similar data point
  • k-NN: let the k nearest neighbors vote (have to devise a weighting scheme); a sketch follows below
• Key issue: how to define similarity
• Trade-off:
  • Small k gives relevant neighbors
  • Large k gives smoother functions
[Figure: generated data and the resulting 1-NN decision regions]
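A minimal k-NN sketch under these definitions (Euclidean distance, an unweighted majority vote, and the data format are assumptions for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by letting its k nearest neighbors vote."""
    dists = np.linalg.norm(train_X - query, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest examples
    votes = Counter(train_y[i] for i in nearest)      # unweighted majority vote
    return votes.most_common(1)[0][0]
```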
Parametric / Non-parametric
• Parametric models:
  • Fixed set of parameters
  • More data means better settings
• Non-parametric models:
  • Complexity of the classifier increases with data
  • Better in the limit, often worse in the non-limit
• (K)NN is non-parametric
Nearest-Neighbor Classification
• Nearest neighbor for digits:
  • Take a new image
  • Compare to all training images
  • Assign based on closest example
• Encoding: image is a vector of intensities, e.g. (0.0, 0.0, 0.3, 0.8, 0.7, 0.1, 0.0)
• What's the similarity function?
  • Dot product of two image vectors?

$\text{sim}(x, x') = x \cdot x' = \sum_i x_i x'_i$

  • Usually normalize vectors so ||x|| = 1
  • min = 0 (when?), max = 1 (when?)
Basic Similarity
• Many similarities based on feature dot products:

$\text{sim}(x, x') = f(x) \cdot f(x') = \sum_i f_i(x) f_i(x')$

• If features are just the pixels:

$\text{sim}(x, x') = x \cdot x' = \sum_i x_i x'_i$

• Note: not all similarities are of this form
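A minimal sketch of this dot-product similarity with the unit-norm normalization mentioned above, plus 1-NN on top of it (treating images as flat intensity vectors is an assumption for illustration):

```python
import numpy as np

def normalized_dot_similarity(x, x_prime):
    """Dot product of two intensity vectors after normalizing to ||x|| = 1."""
    x = x / np.linalg.norm(x)
    x_prime = x_prime / np.linalg.norm(x_prime)
    return float(np.dot(x, x_prime))   # 0 for non-overlapping images, 1 for identical up to scale

def nearest_neighbor_label(train_X, train_y, query):
    """1-NN: copy the label of the most similar training image."""
    sims = [normalized_dot_similarity(x, query) for x in train_X]
    return train_y[int(np.argmax(sims))]
```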
Invariant Metrics
• Better distances use knowledge about vision
• Invariant metrics:
  • Similarities are invariant under certain transformations
  • Rotation, scaling, translation, stroke-thickness, …
• E.g.: a 16×16 image has 256 pixels; it is a point in 256-dim space
  • Small similarity in R^256 (why?)
• How to incorporate invariance into similarities?
This and next few slides adapted from Xiao Hu, UIUC
Invariant Metrics
• Each example is now a curve in R^256
• Rotation-invariant similarity:

$s' = \max \; s(r(x), r(x'))$

• E.g. the highest similarity between the images' rotation curves
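A brute-force sketch of a rotation-invariant similarity in this spirit (searching over a small grid of rotation angles with scipy, and rotating only one of the two images, are simplifying assumptions; the slides describe the idea geometrically, not this implementation):

```python
import numpy as np
from scipy.ndimage import rotate

def cosine_sim(a, b):
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rotation_invariant_similarity(img, img_prime, angles=range(-15, 16, 5)):
    """Approximate s' = max over rotations r of s(r(img), img_prime)."""
    return max(cosine_sim(rotate(img, a, reshape=False), img_prime) for a in angles)
```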