Virtual Vector Machine for Bayesian Online Classification
Yuan (Alan) Qi, CS & Statistics, Purdue
June 2009
Joint work with T.P. Minka and R. Xiang
Motivation
• Ubiquitous data streams: emails, stock prices, images from satellites, video surveillance
How to process a data stream using a small memory buffer and make accurate predictions?
Outline
• Introduction
• Virtual Vector Machine
• Experimental Results
• Summary
Introduction
• Online learning:
– Update the model and make predictions based on data points received sequentially
– Use a fixed-size memory buffer
Classical online learning
• Classification:
– Perceptron
• Linear regression:
– Kalman filtering
Bayesian treatment
• Monte Carlo methods (e.g., particle filters):
– Difficult for classification models due to high dimensionality
• Deterministic methods:
– Assumed density filtering for Gaussian process classification models (Csato, 2002)
Virtual Vector Machine preview
• Two parts:
– Gaussian approximation factors
– Virtual points for non-Gaussian factors
• Virtual points summarize multiple real data points
• Flexible functional forms
• Stored in a data cache with a user-defined size
Outline
• Introduction
• Virtual Vector Machine
• Experimental Results
• Summary
Online Bayesian classification
• Model parameters: w
• Data from time 1 to T: D_T = {(x_t, y_t)}, t = 1, …, T
• Likelihood function at time t: p(y_t | x_t, w)
• Prior distribution: p_0(w)
• Posterior at time T: p(w | D_T) ∝ p_0(w) ∏_{t=1}^{T} p(y_t | x_t, w)
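A toy numerical illustration (not from the talk) of the sequential posterior update: a single scalar weight, a grid approximation to the posterior, and a logistic likelihood standing in for the noisy classification likelihood. All names and settings here are illustrative.

```python
import numpy as np

# Sequential Bayesian updating on a 1-D grid, illustrating
#   p(w | D_T)  ∝  p0(w) * prod_t p(y_t | x_t, w)
# applied one data point at a time, as in online learning.
w_grid = np.linspace(-5, 5, 1001)       # grid over the scalar weight
post = np.exp(-0.5 * w_grid**2)         # N(0, 1) prior, unnormalized
post /= post.sum()

rng = np.random.default_rng(0)
w_true = 1.5                             # weight generating the data
for _ in range(50):                      # data points arrive sequentially
    x = rng.normal()
    y = 1 if rng.random() < 1 / (1 + np.exp(-w_true * x)) else -1
    lik = 1 / (1 + np.exp(-y * x * w_grid))  # likelihood at each grid point
    post *= lik                          # one-step Bayesian update
    post /= post.sum()                   # renormalize

w_mean = (w_grid * post).sum()
print(w_mean)  # posterior mean moves toward w_true
```

This grid trick only works in one dimension; the point of the methods in the talk is to avoid exactly this blow-up in high dimensions.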
Flipping noise model
• Labeling error rate ε: each label is flipped with probability ε
• s_t = y_t x_t: feature vector scaled by +1 or −1 depending on the label
• Posterior distribution: planes cutting a sphere in the 3-D case
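The flipping-noise likelihood can be written down directly. A small sketch, assuming error rate ε and the step-function form p(y | x, w) = ε + (1 − 2ε)·1[y·wᵀx > 0]; the function and variable names are mine:

```python
import numpy as np

# Flipping-noise likelihood: with labeling error rate eps, the observed
# label is flipped with probability eps, giving
#   p(y | x, w) = eps + (1 - 2*eps) * step(y * w^T x)
def flipping_noise_likelihood(y, x, w, eps=0.1):
    s = y * np.dot(x, w)          # feature vector scaled by the +/-1 label
    return eps + (1.0 - 2.0 * eps) * (s > 0)

w = np.array([1.0, -0.5])
x = np.array([2.0, 1.0])
print(flipping_noise_likelihood(+1, x, w))  # correctly classified: 0.9
print(flipping_noise_likelihood(-1, x, w))  # misclassified: 0.1
```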
Gaussian approximation by EP
• Each approximate factor approximates one likelihood term p(y_t | x_t, w)
• Both the prior and the approximate factors have the form of a Gaussian; therefore, the approximate posterior is a Gaussian.
VVM enlarges the approximation family
• Virtual point: keeps the exact form of the original likelihood function (could be more flexible)
• Residue: a Gaussian residual factor
Reduction to Gaussian
• From the augmented representation, we can reduce to a Gaussian by EP smoothing on the virtual points with a Gaussian prior:
• The resulting approximation is Gaussian too.
Cost function for finding virtual points
• Minimize a cost function in the ADF spirit:
• The new approximation contains one more nonlinear factor than the old one.
• Maximize a surrogate function instead:
• Keep the informative (non-Gaussian) content in the virtual points.
Computationally intractable…
Two basic operations
• Searching over all possible locations for the virtual points is computationally expensive!
• For efficiency, consider only two operations to generate virtual points:
– Eviction: delete the least informative point
– Merging: merge two similar points into one
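The two operations can be mocked up as a toy fixed-size buffer. Everything here (the class name, the margin-based eviction score, the 0.5 merge threshold) is my own illustrative choice, not the talk's actual criteria:

```python
import numpy as np

# Toy cache keeping at most `capacity` virtual points. When full, either
# merge the two closest same-label points or evict the point with the
# largest margin (treated here as least informative).
class VirtualPointCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.points = []                  # list of (x, y) virtual points

    def add(self, x, y, w):
        self.points.append((x, y))
        if len(self.points) > self.capacity:
            self._evict_or_merge(w)

    def _evict_or_merge(self, w):
        # merging candidates: the two closest points with the same label
        best, pair = np.inf, None
        for i in range(len(self.points)):
            for j in range(i + 1, len(self.points)):
                if self.points[i][1] == self.points[j][1]:
                    d = np.linalg.norm(self.points[i][0] - self.points[j][0])
                    if d < best:
                        best, pair = d, (i, j)
        if pair is not None and best < 0.5:   # arbitrary similarity threshold
            i, j = pair                       # merge: average the two points
            merged = ((self.points[i][0] + self.points[j][0]) / 2,
                      self.points[i][1])
            self.points = [p for k, p in enumerate(self.points)
                           if k not in (i, j)]
            self.points.append(merged)
        else:                                 # evict the largest-margin point
            margins = [yi * xi.dot(w) for xi, yi in self.points]
            self.points.pop(int(np.argmax(margins)))

rng = np.random.default_rng(0)
w = np.array([1.0, 0.0])
cache = VirtualPointCache(capacity=3)
for _ in range(6):
    cache.add(rng.normal(size=2), 1, w)
print(len(cache.points))  # never exceeds the capacity of 3
```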
Eviction
• After adding the new point into the virtual point set:
– Select the point to evict by maximizing the eviction criterion
– Remove it from the cache
– Update the residual to absorb the evicted point's Gaussian approximation
Version space for 3-D case
Version space: brown area
EP approximation: red ellipse
Four data points: hyperplanes
Version space with three points after deleting one point (with the largest margin)
Merging
• Remove the two selected points from the cache
• Insert the merged point into the cache
• Update the residual with a Gaussian residual term that captures the information lost from the original two factors
• Equivalent to replacing the two original factors by the merged factor times the Gaussian residual
Version space for 3-D case
Version space: brown area
EP approximation: red ellipse
Four data points: hyperplanes
Version space with three points after merging two similar points
Compute residue term
• Inverse ADF: match the moments of the left and right distributions:
• Efficiently solved by the Gauss-Newton method as a one-dimensional problem
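As a stand-in for the Gauss-Newton solve, the one-dimensional moment match can be checked numerically on a dense grid (the function name and the grid approach are mine; the talk's method solves this iteratively):

```python
import numpy as np

# Match a Gaussian to a tilted 1-D distribution p(s) ∝ N(s; m, v) * t(s)
# by computing its first two moments on a dense, uniform grid.
def match_moments(m, v, tilt):
    s = np.linspace(m - 10 * np.sqrt(v), m + 10 * np.sqrt(v), 20001)
    p = np.exp(-0.5 * (s - m) ** 2 / v) * tilt(s)
    p /= p.sum()                      # normalize on the uniform grid
    mean = (s * p).sum()
    var = ((s - mean) ** 2 * p).sum()
    return mean, var

# example tilt: the flipping-noise step likelihood with eps = 0.1
eps = 0.1
mean, var = match_moments(0.0, 1.0, lambda s: eps + (1 - 2 * eps) * (s > 0))
print(mean, var)   # mean shifts positive, variance shrinks below 1
```

Because the moment match reduces to a scalar problem in s = wᵀx, a one-dimensional solver is enough even when w is high-dimensional.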
Algorithm Summary
• Random feature expansion (Rahimi & Recht, 2007)
• For RBF kernels, we use random Fourier features: φ(x) ∝ cos(ωᵀx + b)
• The frequencies ω are sampled from the Fourier transform of the kernel, which is a Gaussian for RBF kernels.
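A minimal sketch of the random Fourier feature construction for an RBF kernel, following Rahimi & Recht (2007); the function name and parameter choices are illustrative:

```python
import numpy as np

# Random Fourier features for the RBF kernel
#   k(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)):
# phi(x) = sqrt(2/D) * cos(W^T x + b) with W ~ N(0, 1/sigma^2) entrywise
# and b ~ Uniform(0, 2*pi), so that phi(x) . phi(x') ≈ k(x, x').
def rff(X, D=100, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / sigma, size=(d, D))
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.random.default_rng(1).normal(size=(5, 3))
Phi = rff(X, D=2000)
approx = Phi @ Phi.T                       # ≈ RBF kernel matrix
exact = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))
print(np.abs(approx - exact).max())        # small approximation error
```

With such features, a nonlinear kernel classifier reduces to a linear model in φ(x), which is what lets the online linear machinery above handle nonlinear problems.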
Classification with random features
Outline
• Introduction• Virtual Vector Machine• Experimental Results• Summary
Estimation accuracy of posterior mean
Mean squared error of the estimated posterior mean obtained by EP, the virtual vector machine (VVM), ADF, and window-EP (W-EP). The exact posterior mean is obtained via a Monte Carlo method. The results are averaged over 20 runs.
Online classification (1)
Cumulative prediction error rates of VVM, the sparse online Gaussian process classifier (SOGP), the Passive-Aggressive (PA) algorithm, and the Topmoumoute online natural gradient (NG) algorithm on the Spambase dataset. The size of the virtual point set used by VVM is 30, while the online Gaussian process model has 143 basis points.
Online nonlinear classification (2)
Cumulative prediction error rates of VVM and competing methods on the Thyroid dataset. VVM, PA, and NG use the same random Fourier-Gaussian feature expansion (dimension 100). NG and VVM both use a buffer caching 10 points, while the online Gaussian process model and the Passive-Aggressive algorithm have 12 and 91 basis points, respectively.
Online nonlinear classification (3)
Cumulative prediction error rates of VVM and the competing methods on the Ionosphere dataset. VVM, PA, and NG use the same random Fourier-Gaussian feature expansion (dimension 100). NG and VVM both use a buffer caching 30 points, while the online Gaussian process model and the Passive-Aggressive algorithm have 279 and 189 basis points, respectively.
Summary
• Efficient Bayesian online classification
• A small constant space cost
• A smooth trade-off between prediction accuracy and computational cost
• Improved prediction accuracy over alternative methods
• More flexible functional forms for virtual points, and other applications