Online Learning Algorithms
Dec 23, 2015
Outline
• Online learning framework
• Design principles of online learning algorithms (additive updates)
  – Perceptron, Passive-Aggressive, and Confidence-Weighted classification
  – Classification: binary, multi-class, and structured prediction
  – Hypothesis averaging and regularization
• Multiplicative updates
  – Weighted Majority, Winnow, and connections to Gradient Descent (GD) and Exponentiated Gradient Descent (EGD)
Formal Setting – Classification
• Instances: images, sentences
• Labels: parse trees, names
• Prediction rule: a linear prediction rule
• Loss: the number of mistakes
Predictions
• Continuous predictions ŷ ∈ ℝ:
  – Label: sign(ŷ)
  – Confidence: |ŷ|
• Linear classifiers with weight vector w:
  – Prediction: ŷ = sign(w · x)
  – Confidence: |w · x|
Loss Functions
• Natural loss – zero-one loss: ℓ(y, ŷ) = 1 if ŷ ≠ y, and 0 otherwise
• Losses for real-valued predictions:
  – Hinge loss: ℓ(w; (x, y)) = max(0, 1 − y(w · x))
  – Exponential loss (boosting): ℓ(w; (x, y)) = exp(−y(w · x))
Loss Functions
[Figure: zero-one loss and hinge loss plotted against the margin y(w · x); the hinge loss passes through 1 at margin 0 and upper-bounds the zero-one loss.]
Online Framework
• Initialize the classifier w_1
• The algorithm works in rounds; on round t the online algorithm:
  – receives an input instance x_t
  – outputs a prediction ŷ_t
  – receives a feedback label y_t
  – computes the loss ℓ(w_t; (x_t, y_t))
  – updates the prediction rule w_t → w_{t+1}
• Goal: suffer small cumulative loss Σ_t ℓ(w_t; (x_t, y_t))
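A minimal sketch of this protocol in Python (NumPy); the `update` argument stands for any of the update rules discussed below, and tracking the zero-one loss is an assumption of the sketch:

```python
import numpy as np

def online_learn(update, data, dim):
    """Generic online protocol: predict, receive feedback, update.
    `data` is an iterable of (x, y) pairs with y in {-1, +1}."""
    w = np.zeros(dim)                          # initialize classifier
    cumulative_loss = 0.0
    for x, y in data:                          # rounds t = 1, 2, ...
        y_hat = np.sign(w.dot(x))              # output a prediction
        cumulative_loss += float(y_hat != y)   # zero-one loss on this round
        w = update(w, x, y)                    # update the prediction rule
    return w, cumulative_loss
```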
Margin
• Margin of an example (x, y) with respect to the classifier w: y(w · x)
• Note: the margin is positive iff the classifier predicts y correctly
• The set {(x_t, y_t)} is separable iff there exists a u such that y_t(u · x_t) > 0 for all t
Geometrical Interpretation
[Figure: points on either side of a separating hyperplane, annotated Margin > 0, Margin >> 0, Margin < 0, and Margin << 0; the farther a point lies from the boundary, the larger the magnitude of its margin.]
Hinge Loss
ℓ(w; (x, y)) = max(0, 1 − y(w · x))
[Figure: the hinge loss is zero for margins of at least 1 and grows linearly as the margin decreases.]
Why Online Learning?
• Fast
• Memory efficient – processes one example at a time
• Simple to implement
• Formal guarantees – mistake bounds
• Online-to-batch conversions
• No statistical assumptions
• Adaptive
Update Rules
• Online algorithms are based on an update rule, which defines w_{t+1} from w_t (and possibly other information)
• Linear classifiers: find w_{t+1} from w_t based on the input (x_t, y_t)
• Some update rules:
  – Perceptron (Rosenblatt)
  – ALMA (Gentile)
  – ROMMA (Li & Long)
  – NORMA (Kivinen et al.)
  – MIRA (Crammer & Singer)
  – EG (Littlestone & Warmuth)
  – Bregman-based (Warmuth)
  – CW learning (Dredze et al.)
Design Principles of Algorithms
• If the learner suffers non-zero loss at any round, then we want to balance two goals:
  (1) Corrective: change the weights enough so that we don't make this error again
  (2) Conservative: don't change the weights too much
• How do we define "too much"?
Design Principles of Algorithms
• If we use Euclidean distance to measure the change between the old and new weights, enforcing (1) while minimizing (2) gives
  w_{t+1} = argmin_w ‖w − w_t‖²  subject to no loss on (x_t, y_t)
  e.g., the Perceptron; for squared loss, Widrow-Hoff (Least Mean Squares)
• Passive-Aggressive algorithms do exactly the same, except that (1) is much stronger – we want a correct classification with a margin of at least 1
• Confidence-Weighted classifiers:
  – maintain a distribution over weight vectors
  – (1) is the same as Passive-Aggressive, with a probabilistic notion of margin
  – the change is measured by the KL divergence between the two distributions
Design Principles of Algorithms
• If we assume all weights are positive, we can use the (unnormalized) KL divergence to measure the change
  – this yields a multiplicative update: the EG algorithm (Kivinen and Warmuth)
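A minimal sketch of the multiplicative flavor, assuming hinge loss and weights kept on the probability simplex (the learning rate `eta` and the renormalization step are standard choices, not taken from the slides):

```python
import numpy as np

def eg_update(w, x, y, eta=0.1):
    """Exponentiated Gradient step for positive, normalized weights.
    For hinge loss, the gradient at a margin violation is -y * x."""
    if y * w.dot(x) < 1.0:             # margin violation
        w = w * np.exp(eta * y * x)    # multiplicative (exponentiated) update
        w = w / w.sum()                # project back onto the simplex
    return w
```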
The Perceptron Algorithm
• If no mistake (y_t(w_t · x_t) > 0):
  – do nothing
• If mistake (y_t(w_t · x_t) ≤ 0):
  – update: w_{t+1} = w_t + y_t x_t
• Margin after the update:
  y_t(w_{t+1} · x_t) = y_t(w_t · x_t) + ‖x_t‖²
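A minimal sketch of one Perceptron round (function and variable names are mine):

```python
import numpy as np

def perceptron_update(w, x, y):
    """One round of the Perceptron: additive update only on a mistake."""
    if y * np.dot(w, x) <= 0:   # mistake (or zero margin)
        w = w + y * x           # corrective: move towards classifying (x, y)
    return w                    # conservative: unchanged when correct
```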
Passive-Aggressive Algorithms
Passive-Aggressive: Motivation
• Perceptron: no guarantee on the margin after the update
• PA: enforce a minimal non-zero margin after the update
• In particular:
  – if the margin is at least 1, do nothing
  – if the margin is less than 1, update so that the margin after the update is exactly 1
Aggressive Update Step
• Set w_{t+1} to be the solution of the following optimization problem:
  w_{t+1} = argmin_w ½‖w − w_t‖²   (2)
  s.t.  y_t(w · x_t) ≥ 1           (1)
• Closed-form update:
  w_{t+1} = w_t + τ_t y_t x_t,  where  τ_t = max(0, 1 − y_t(w_t · x_t)) / ‖x_t‖²
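The closed form translates directly into code; a sketch:

```python
import numpy as np

def pa_update(w, x, y):
    """One Passive-Aggressive round: smallest change giving margin >= 1."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss at the current w
    if loss > 0.0:
        tau = loss / np.dot(x, x)             # step size from the closed form
        w = w + tau * y * x                   # margin after update is exactly 1
    return w
```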
Passive-Aggressive Update
[Figure: geometrically, the PA update projects w_t onto the half-space of weight vectors that classify (x_t, y_t) with a margin of at least 1.]
Unrealizable Case
• When the data are not separable, a slack term relaxes the margin constraint; the aggressiveness parameter C caps the step size:
  PA-I:  τ_t = min(C, ℓ_t / ‖x_t‖²)
  PA-II: τ_t = ℓ_t / (‖x_t‖² + 1/(2C))
Confidence-Weighted Classification
Confidence-Weighted Classification: Motivation
• Many positive reviews contain the word "best", so w_best grows
• A later negative review: "boring book – best if you want to sleep in seconds"
• A linear update will reduce both w_best and w_boring
• But "best" has appeared far more often than "boring"
• How can we adjust different weights at different rates?
Update Rules
• The weight vector is a linear combination of the examples: w_t = Σ_i α_i y_i x_i
• Two rate schedules (among others):
  – Perceptron algorithm, conservative: α_i = 1 on mistakes, 0 otherwise
  – Passive-Aggressive: α_i = τ_i = ℓ_i / ‖x_i‖²
Distributions in Version Space
[Figure: a Gaussian distribution over weight vectors in version space, showing an example constraint and the mean weight vector.]
Margin as a Random Variable
• With weights drawn from a Gaussian, w ~ N(μ, Σ), the signed margin M = y(w · x) is a Gaussian-distributed variable
• Thus: M ~ N(y(μ · x), xᵀΣx), and the probability of a correct classification is Pr[M ≥ 0]
PA-like Update
• PA:
  w_{t+1} = argmin_w ½‖w − w_t‖²  s.t.  y_t(w · x_t) ≥ 1
• New update:
  (μ_{t+1}, Σ_{t+1}) = argmin_{μ,Σ} D_KL(N(μ, Σ) ‖ N(μ_t, Σ_t))  s.t.  Pr_{w~N(μ,Σ)}[y_t(w · x_t) ≥ 0] ≥ η
Weight Vector (Version) Space
[Figure: the constraint defines a half-space of weight vectors; the update places most of the probability mass in this region.]
Passive Step
[Figure: nothing to do – most weight vectors already classify the example correctly.]
Aggressive Step
[Figure: the current Gaussian distribution is projected onto the half-space. The covariance is shrunk in the direction of the new example, and the mean is moved past the mistake line (large margin).]
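The exact CW projection solves a small quadratic. As a concrete illustration, here is a sketch of the closely related AROW update (Crammer, Kulesza & Dredze), which keeps the mean-plus-covariance state and the "confident weights move less" behavior but has a simpler closed form; the regularizer `r` is a free parameter, and this is not the exact update from the slides:

```python
import numpy as np

def arow_update(mu, Sigma, x, y, r=1.0):
    """AROW-style round: mu is the mean weight vector, Sigma its covariance."""
    margin = y * mu.dot(x)                     # mean margin
    if margin < 1.0:                           # margin violated in expectation
        v = x.dot(Sigma).dot(x)                # margin variance x^T Sigma x
        beta = 1.0 / (v + r)
        alpha = (1.0 - margin) * beta          # step scaled by confidence
        Sigma_x = Sigma.dot(x)
        mu = mu + alpha * y * Sigma_x          # uncertain weights move more
        Sigma = Sigma - beta * np.outer(Sigma_x, Sigma_x)  # shrink along x
    return mu, Sigma
```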
Extensions: Multi-class and Structured Prediction
Multiclass Representation I
• k prototypes: w_1, …, w_k
• New instance: x
• Compute a score per class: score(r) = w_r · x
• Prediction: the class achieving the highest score, ŷ = argmax_r (w_r · x)

  Class r | Score
  --------+------
     1    | -1.08
     2    |  1.66
     3    |  0.37
     4    | -2.09

  (here the prediction is class 2)
Multiclass Representation II
• Map all inputs and labels into a joint vector space: F(x, y)
• Score labels by projecting the corresponding feature vector: score(y) = w · F(x, y)
• Example (sequence tagging):
  Sentence: Estimated volume was a light 2.4 million ounces .
  Tags:     B I O B I I I I O
  F(x, y) = (0 1 1 0 …)
Multiclass Representation II
• Predict the label with the highest score (inference): ŷ = argmax_y w · F(x, y)
• Naïve search is expensive if the set of possible labels is large:
  with 3 tags, the number of labelings of "Estimated volume was a light 2.4 million ounces ." is 3^(number of words)
• Efficient Viterbi decoding for sequences! (a sketch follows)
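A minimal Viterbi sketch, assuming the score w · F(x, y) decomposes into per-position emission scores plus tag-to-tag transition scores (this factorization, and all names, are assumptions of the sketch):

```python
import numpy as np

def viterbi(emit, trans):
    """Highest-scoring tag sequence by dynamic programming.
    emit:  (n_words, n_tags) array of per-position scores
    trans: (n_tags, n_tags) array, trans[prev, cur] = transition score."""
    n, k = emit.shape
    score = emit[0].copy()                       # best score ending in each tag
    back = np.zeros((n, k), dtype=int)           # backpointers
    for t in range(1, n):
        cand = score[:, None] + trans + emit[t]  # cand[prev, cur]
        back[t] = cand.argmax(axis=0)            # best previous tag per tag
        score = cand.max(axis=0)
    tags = [int(score.argmax())]                 # recover the best path
    for t in range(n - 1, 0, -1):
        tags.append(int(back[t][tags[-1]]))
    return tags[::-1]                            # indices into e.g. ("B","I","O")
```

This runs in O(n·k²) time rather than enumerating all 3^n labelings.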
Two Representations
• Weight-vector per class (Representation I):
  – intuitive
  – improved algorithms
• Single weight-vector (Representation II):
  – generalizes Representation I, e.g., F(x, 4) = (0, 0, 0, x, 0) places x in the block belonging to class 4
  – allows complex interactions between input and output
Margin for Multi-class
• Binary: y(w · x)
• Multi-class: w · F(x, y) − max_{y' ≠ y} w · F(x, y')
Margin for Multi-class
• But different mistakes cost differently (captured by a loss function ρ(y, y')) – so use it!
• Margin scaled by the loss function: require the score gap w · F(x, y) − w · F(x, y') to grow with ρ(y, y')
Perceptron Multiclass Online Algorithm
• Initialize w_1 = 0
• For t = 1, 2, …:
  – receive an input instance x_t
  – output a prediction ŷ_t = argmax_y w_t · F(x_t, y)
  – receive the feedback label y_t
  – compute the loss
  – update the prediction rule: on a mistake, w_{t+1} = w_t + F(x_t, y_t) − F(x_t, ŷ_t)
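A sketch of one round, where `feat(x, y)` stands for the joint feature map F(x, y); its exact form, and exhaustive argmax inference over `labels`, are assumptions here:

```python
import numpy as np

def multiclass_perceptron_update(w, feat, x, y_true, labels):
    """One multiclass Perceptron round in the joint representation."""
    y_hat = max(labels, key=lambda y: w.dot(feat(x, y)))  # inference
    if y_hat != y_true:                                   # mistake
        w = w + feat(x, y_true) - feat(x, y_hat)          # promote true, demote predicted
    return w
```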
PA Multiclass Online Algorithm
• Initialize w_1 = 0
• For t = 1, 2, …:
  – receive an input instance x_t
  – output a prediction ŷ_t = argmax_y w_t · F(x_t, y)
  – receive the feedback label y_t
  – compute the loss against the highest-scoring wrong label ỹ_t = argmax_{y ≠ y_t} w_t · F(x_t, y):
    ℓ_t = max(0, 1 − w_t · Δ_t),  where Δ_t = F(x_t, y_t) − F(x_t, ỹ_t)
  – update the prediction rule: w_{t+1} = w_t + τ_t Δ_t,  with τ_t = ℓ_t / ‖Δ_t‖²
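The same sketch with the PA step size (again, `feat` and the explicit loop over labels are assumptions):

```python
import numpy as np

def multiclass_pa_update(w, feat, x, y_true, labels):
    """One multiclass PA round: enforce a margin of 1 over the runner-up."""
    y_wrong = max((y for y in labels if y != y_true),
                  key=lambda y: w.dot(feat(x, y)))   # highest-scoring wrong label
    delta = feat(x, y_true) - feat(x, y_wrong)
    loss = max(0.0, 1.0 - w.dot(delta))              # multiclass hinge loss
    if loss > 0.0:
        w = w + (loss / delta.dot(delta)) * delta    # closed-form PA step
    return w
```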
Regularization
• Key idea: if an online algorithm works well on a sequence of i.i.d. examples, then an ensemble of the online hypotheses should generalize well
• Popular choices:
  – the averaged hypothesis (see the sketch below)
  – the majority vote
  – use a validation set to make the choice
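A minimal sketch of online-to-batch conversion by averaging, reusing any of the update rules above (all names are mine; `data` is assumed to be a list of (x, y) pairs):

```python
import numpy as np

def averaged_hypothesis(update, data, dim):
    """Train online, then return the average of all intermediate hypotheses."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    for x, y in data:
        w = update(w, x, y)     # one online round
        w_sum += w              # accumulate w_t after every round
    return w_sum / len(data)    # the averaged hypothesis
```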