MIRA, SVM, k-NN
Lirong Xia
Linear Classifiers (perceptrons)
• Inputs are feature values
• Each feature has a weight
• Sum is the activation
• If the activation is:
  • Positive: output +1
  • Negative: output -1

$\text{activation}_w(x) = \sum_i w_i \cdot f_i(x) = w \cdot f(x)$
Classification: Weights
• Binary case: compare features to a weight vector
• Learning: figure out the weight vector from examples
Binary Decision Rule
• In the space of feature vectors:
  • Examples are points
  • Any weight vector is a hyperplane
  • One side corresponds to Y = +1
  • The other corresponds to Y = -1
Learning: Binary Perceptron
• Start with weights = 0
• For each training instance:
  • Classify with current weights

$y = \begin{cases} +1 & \text{if } w \cdot f(x) \ge 0 \\ -1 & \text{if } w \cdot f(x) < 0 \end{cases}$

  • If correct (i.e. y = y*), no change!
  • If wrong: adjust the weight vector by adding or subtracting the feature vector (subtract if y* is -1):

$w = w + y^* \cdot f(x)$
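A minimal sketch of this update rule in Python (the data format, feature representation, and number of passes are assumptions for illustration, not part of the slides):

```python
import numpy as np

def train_binary_perceptron(features, labels, epochs=10):
    """Binary perceptron: features is an (n, d) array, labels are +1 / -1."""
    w = np.zeros(features.shape[1])              # start with weights = 0
    for _ in range(epochs):                      # several passes over the data
        for f, y_star in zip(features, labels):
            y = 1 if np.dot(w, f) >= 0 else -1   # classify with current weights
            if y != y_star:                      # if wrong...
                w = w + y_star * f               # ...add or subtract the feature vector
    return w
```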
Multiclass Decision Rule
• If we have multiple classes:
  • A weight vector for each class: $w_y$
  • Score (activation) of a class y: $w_y \cdot f(x)$
  • Prediction: highest score wins

$y = \arg\max_y w_y \cdot f(x)$

• Binary = multiclass where the negative class has weight zero
Learning: Multiclass Perceptron
• Start with all weights = 0
• Pick up training examples one by one
• Predict with current weights:

$y = \arg\max_y w_y \cdot f(x) = \arg\max_y \sum_i w_{y,i} \, f_i(x)$

• If correct, no change!
• If wrong: lower the score of the wrong answer, raise the score of the right answer:

$w_y = w_y - f(x)$
$w_{y^*} = w_{y^*} + f(x)$
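A minimal sketch of the multiclass update (class indices, data format, and epoch count are assumptions for illustration):

```python
import numpy as np

def train_multiclass_perceptron(features, labels, num_classes, epochs=10):
    """Multiclass perceptron: labels are class indices 0..num_classes-1."""
    w = np.zeros((num_classes, features.shape[1]))   # one weight vector per class
    for _ in range(epochs):
        for f, y_star in zip(features, labels):
            y = int(np.argmax(w @ f))    # predict: highest score wins
            if y != y_star:              # if wrong...
                w[y] -= f                # ...lower the score of the wrong answer
                w[y_star] += f           # ...raise the score of the right answer
    return w
```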
Today
• Fixing the Perceptron: MIRA
• Support Vector Machines
• k-nearest neighbor (KNN)
Properties of Perceptrons
• Separability: some setting of the parameters gets the training set perfectly correct
• Convergence: if the training set is separable, the perceptron will eventually converge (binary case)
Examples: Perceptron
• Non-Separable Case
Problems with the Perceptron
• Noise: if the data isn't separable, weights might thrash
  • Averaging weight vectors over time can help (averaged perceptron; a sketch follows below)
• Mediocre generalization: finds a "barely" separating solution
• Overtraining: test / held-out accuracy usually rises, then falls
  • Overtraining is a kind of overfitting
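A minimal sketch of the averaged-perceptron idea mentioned above: keep a running sum of the weight vector and return its average instead of the final weights (the data format and epoch count are assumptions for illustration):

```python
import numpy as np

def train_averaged_perceptron(features, labels, epochs=10):
    """Binary perceptron whose output is the average of the weight vector over time."""
    w = np.zeros(features.shape[1])
    w_sum = np.zeros_like(w)                     # running sum of weight vectors
    for _ in range(epochs):
        for f, y_star in zip(features, labels):
            if (1 if np.dot(w, f) >= 0 else -1) != y_star:
                w = w + y_star * f               # ordinary perceptron update
            w_sum += w                           # accumulate after every example
    return w_sum / (epochs * len(features))      # averaged weights thrash less on noisy data
```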
Fixing the Perceptron
• Idea: adjust the weight update to mitigate these effects
• MIRA*: choose an update size that fixes the current mistake…
• …but minimizes the change to w

$\min_w \frac{1}{2} \sum_y \| w_y - w'_y \|^2 \quad \text{s.t. } w_{y^*} \cdot f(x) \ge w_y \cdot f(x) + 1$

• The +1 helps to generalize
• Guessed y instead of y* on example x with features f(x):

$w_y = w'_y - \tau f(x)$
$w_{y^*} = w'_{y^*} + \tau f(x)$

*Margin Infused Relaxed Algorithm
Minimum Correcting Update
$\min_w \frac{1}{2} \sum_y \| w_y - w'_y \|^2 \quad \text{s.t. } w_{y^*} \cdot f \ge w_y \cdot f + 1$

• Update, with step size τ:

$w_y = w'_y - \tau f(x)$
$w_{y^*} = w'_{y^*} + \tau f(x)$

• Substituting the update, the problem becomes

$\min_\tau \| \tau f \|^2 \quad \text{s.t. } (w'_{y^*} + \tau f) \cdot f \ge (w'_y - \tau f) \cdot f + 1$

• The min is not at τ = 0 (or we would not have made an error), so the min is where equality holds:

$\tau = \frac{(w'_y - w'_{y^*}) \cdot f + 1}{2 \, f \cdot f}$
Maximum Step Size
• In practice, it's also bad to make updates that are too large
  • Example may be labeled incorrectly
  • You may not have enough features
• Solution: cap the maximum possible value of τ with some constant C
• Corresponds to an optimization that assumes non-separable data
• Usually converges faster than the perceptron
• Usually better, especially on noisy data

$\tau^* = \min\left( \frac{(w'_y - w'_{y^*}) \cdot f + 1}{2 \, f \cdot f}, \; C \right)$
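A minimal sketch of one MIRA step under these definitions (the value of C, the data format, and the outer training loop are assumptions for illustration):

```python
import numpy as np

def mira_update(w, f, y_star, C=0.01):
    """One MIRA step: w is a (num_classes, d) float array, f a feature vector, y_star the true class."""
    y = int(np.argmax(w @ f))             # predict with current weights
    if y != y_star:
        # minimum correcting step size, capped at C
        tau = (np.dot(w[y] - w[y_star], f) + 1) / (2 * np.dot(f, f))
        tau = min(tau, C)
        w[y] -= tau * f                   # lower the score of the wrong answer
        w[y_star] += tau * f              # raise the score of the right answer
    return w
```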
Outline
• Fixing the Perceptron: MIRA
• Support Vector Machines
• k-nearest neighbor (KNN)
Linear Separators
• Which of these linear separators is optimal?
Support Vector Machines
• Maximizing the margin: good according to intuition, theory, practice
• Only support vectors matter; other training examples are ignorable
• Support vector machines (SVMs) find the separator with max margin
• Basically, SVMs are MIRA where you optimize over all examples at once

MIRA:
$\min_w \frac{1}{2} \sum_y \| w_y - w'_y \|^2 \quad \text{s.t. } w_{y^*} \cdot f(x_i) \ge w_y \cdot f(x_i) + 1$

SVM:
$\min_w \frac{1}{2} \sum_y \| w_y \|^2 \quad \text{s.t. } \forall i, y: \; w_{y^*} \cdot f(x_i) \ge w_y \cdot f(x_i) + 1$
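A short illustration of a max-margin linear separator; the use of scikit-learn and the toy data are assumptions for illustration, not something the slides prescribe:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2-D data
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.5]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1.0)    # linear max-margin classifier
clf.fit(X, y)

print(clf.support_vectors_)          # only the support vectors define the separator
print(clf.predict([[2.0, 2.5]]))
```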
Classification: Comparison
• Naive Bayes:
  • Builds a model of the training data
  • Gives prediction probabilities
  • Strong assumptions about feature independence
  • One pass through the data (counting)
• Perceptrons / MIRA:
  • Make fewer assumptions about the data
  • Mistake-driven learning
  • Multiple passes through the data (prediction)
  • Often more accurate
Outline
• Fixing the Perceptron: MIRA
• Support Vector Machines
• k-nearest neighbor (KNN)
Case-Based Reasoning
• Similarity for classification
  • Case-based reasoning
  • Predict an instance's label using similar instances
• Nearest-neighbor classification
  • 1-NN: copy the label of the most similar data point
  • k-NN: let the k nearest neighbors vote (have to devise a weighting scheme); a sketch follows below
• Key issue: how to define similarity
• Trade-off:
  • Small k gives relevant neighbors
  • Large k gives smoother functions
[Figure: generated data and the resulting 1-NN decision regions]
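A minimal k-NN sketch under these definitions (Euclidean distance, an unweighted majority vote, and the data format are assumptions for illustration):

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by letting its k nearest neighbors vote."""
    dists = np.linalg.norm(train_X - query, axis=1)   # distance to every training point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest examples
    votes = Counter(train_y[i] for i in nearest)      # unweighted majority vote
    return votes.most_common(1)[0][0]
```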
Parametric / Non-parametric
• Parametric models:
  • Fixed set of parameters
  • More data means better settings
• Non-parametric models:
  • Complexity of the classifier increases with data
  • Better in the limit, often worse in the non-limit
• (K)NN is non-parametric
Nearest-Neighbor Classification
• Nearest neighbor for digits:
  • Take a new image
  • Compare to all training images
  • Assign based on closest example
• Encoding: image is a vector of intensities, e.g. (0.0, 0.0, 0.3, 0.8, 0.7, 0.1, 0.0)
• What's the similarity function?
  • Dot product of two image vectors?

$\text{sim}(x, x') = x \cdot x' = \sum_i x_i x'_i$

  • Usually normalize vectors so ||x|| = 1
  • min = 0 (when?), max = 1 (when?)
Basic Similarity
• Many similarities based on feature dot products:

$\text{sim}(x, x') = f(x) \cdot f(x') = \sum_i f_i(x) f_i(x')$

• If features are just the pixels:

$\text{sim}(x, x') = x \cdot x' = \sum_i x_i x'_i$

• Note: not all similarities are of this form
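A minimal sketch of this dot-product similarity with the unit-norm normalization mentioned above, plus 1-NN on top of it (treating images as flat intensity vectors is an assumption for illustration):

```python
import numpy as np

def normalized_dot_similarity(x, x_prime):
    """Dot product of two intensity vectors after normalizing to ||x|| = 1."""
    x = x / np.linalg.norm(x)
    x_prime = x_prime / np.linalg.norm(x_prime)
    return float(np.dot(x, x_prime))   # 0 for non-overlapping images, 1 for identical up to scale

def nearest_neighbor_label(train_X, train_y, query):
    """1-NN: copy the label of the most similar training image."""
    sims = [normalized_dot_similarity(x, query) for x in train_X]
    return train_y[int(np.argmax(sims))]
```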
Invariant Metrics
• Better distances use knowledge about vision
• Invariant metrics:
  • Similarities are invariant under certain transformations
  • Rotation, scaling, translation, stroke-thickness, …
• E.g.: a 16×16 image has 256 pixels; it is a point in 256-dim space
  • Small similarity in R^256 (why?)
• How to incorporate invariance into similarities?
This and next few slides adapted from Xiao Hu, UIUC
Invariant Metrics
• Each example is now a curve in R^256
• Rotation-invariant similarity:

$s' = \max \; s(r(x), r(x'))$

• E.g. the highest similarity between the images' rotation curves
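A brute-force sketch of a rotation-invariant similarity in this spirit (searching over a small grid of rotation angles with scipy, and rotating only one of the two images, are simplifying assumptions; the slides describe the idea geometrically, not this implementation):

```python
import numpy as np
from scipy.ndimage import rotate

def cosine_sim(a, b):
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rotation_invariant_similarity(img, img_prime, angles=range(-15, 16, 5)):
    """Approximate s' = max over rotations r of s(r(img), img_prime)."""
    return max(cosine_sim(rotate(img, a, reshape=False), img_prime) for a in angles)
```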