Virtual Vector Machine for Bayesian Online Classification
Yuan (Alan) Qi, CS & Statistics, Purdue
June 2009
Joint work with T.P. Minka and R. Xiang
Motivation
• Ubiquitous data streams: emails, stock prices, images from satellites, video surveillance
How to process a data stream using a small memory buffer and make accurate predictions?
Outline
• Introduction
• Virtual Vector Machine
• Experimental Results
• Summary
Introduction
• Online learning:
– Update the model and make predictions based on data points received sequentially
– Use a fixed-size memory buffer
Classical online learning
• Classification:
– Perceptron
• Linear regression:
– Kalman filtering
Bayesian treatment
• Monte Carlo methods (e.g., particle filters):
– Difficult for classification models due to high dimensionality
• Deterministic methods:
– Assumed density filtering for Gaussian process classification models (Csato, 2002)
Virtual Vector Machine preview
• Two parts:
– Gaussian approximation factors
– Virtual points for non-Gaussian factors
• Virtual points summarize multiple real data points
• Flexible functional forms
• Stored in a data cache with a user-defined size
Outline
• Introduction
• Virtual Vector Machine
• Experimental Results
• Summary
Online Bayesian classification
• Model parameters: w
• Data from time 1 to T: D_T = {(x_t, y_t)}, t = 1, …, T
• Likelihood function at time t: p(y_t | x_t, w)
• Prior distribution: p_0(w)
• Posterior at time T: p(w | D_T) ∝ p_0(w) ∏_{t=1}^{T} p(y_t | x_t, w)
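A toy numerical illustration (not from the talk) of the sequential posterior update: a single scalar weight, a grid approximation to the posterior, and a logistic likelihood standing in for the noisy classification likelihood. All names and settings here are illustrative.

```python
import numpy as np

# Sequential Bayesian updating on a 1-D grid, illustrating
#   p(w | D_T)  ∝  p0(w) * prod_t p(y_t | x_t, w)
# applied one data point at a time, as in online learning.
w_grid = np.linspace(-5, 5, 1001)       # grid over the scalar weight
post = np.exp(-0.5 * w_grid**2)         # N(0, 1) prior, unnormalized
post /= post.sum()

rng = np.random.default_rng(0)
w_true = 1.5                             # weight generating the data
for _ in range(50):                      # data points arrive sequentially
    x = rng.normal()
    y = 1 if rng.random() < 1 / (1 + np.exp(-w_true * x)) else -1
    lik = 1 / (1 + np.exp(-y * x * w_grid))  # likelihood at each grid point
    post *= lik                          # one-step Bayesian update
    post /= post.sum()                   # renormalize

w_mean = (w_grid * post).sum()
print(w_mean)  # posterior mean moves toward w_true
```

This grid trick only works in one dimension; the point of the methods in the talk is to avoid exactly this blow-up in high dimensions.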
Flipping noise model
• Labeling error rate ε: each label is flipped with probability ε
• s_t = y_t x_t: feature vector scaled by +1 or −1 depending on the label
• Posterior distribution: planes cutting a sphere in the 3-D case
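The flipping-noise likelihood can be written down directly. A small sketch, assuming error rate ε and the step-function form p(y | x, w) = ε + (1 − 2ε)·1[y·wᵀx > 0]; the function and variable names are mine:

```python
import numpy as np

# Flipping-noise likelihood: with labeling error rate eps, the observed
# label is flipped with probability eps, giving
#   p(y | x, w) = eps + (1 - 2*eps) * step(y * w^T x)
def flipping_noise_likelihood(y, x, w, eps=0.1):
    s = y * np.dot(x, w)          # feature vector scaled by the +/-1 label
    return eps + (1.0 - 2.0 * eps) * (s > 0)

w = np.array([1.0, -0.5])
x = np.array([2.0, 1.0])
print(flipping_noise_likelihood(+1, x, w))  # correctly classified: 0.9
print(flipping_noise_likelihood(-1, x, w))  # misclassified: 0.1
```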
Gaussian approximation by EP
• Each approximate factor approximates one likelihood term p(y_t | x_t, w)
• Both the prior and the approximate factors have the form of a Gaussian; therefore, the approximate posterior is a Gaussian.
VVM enlarges the approximation family
• Virtual point: keeps the exact form of the original likelihood function (could be more flexible)
• Residue: a Gaussian residual factor
Reduction to Gaussian
• From the augmented representation, we can reduce to a Gaussian by EP smoothing on the virtual points with a Gaussian prior:
• The resulting approximation is Gaussian too.
Cost function for finding virtual points
• Minimize a cost function in the ADF spirit:
• The new approximation contains one more nonlinear factor than the old one.
• Maximize a surrogate function instead:
• Keep the informative (non-Gaussian) content in the virtual points.
Computationally intractable…
Two basic operations
• Searching over all possible locations for the virtual points is computationally expensive!
• For efficiency, consider only two operations to generate virtual points:
– Eviction: delete the least informative point
– Merging: merge two similar points into one
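The two operations can be mocked up as a toy fixed-size buffer. Everything here (the class name, the margin-based eviction score, the 0.5 merge threshold) is my own illustrative choice, not the talk's actual criteria:

```python
import numpy as np

# Toy cache keeping at most `capacity` virtual points. When full, either
# merge the two closest same-label points or evict the point with the
# largest margin (treated here as least informative).
class VirtualPointCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.points = []                  # list of (x, y) virtual points

    def add(self, x, y, w):
        self.points.append((x, y))
        if len(self.points) > self.capacity:
            self._evict_or_merge(w)

    def _evict_or_merge(self, w):
        # merging candidates: the two closest points with the same label
        best, pair = np.inf, None
        for i in range(len(self.points)):
            for j in range(i + 1, len(self.points)):
                if self.points[i][1] == self.points[j][1]:
                    d = np.linalg.norm(self.points[i][0] - self.points[j][0])
                    if d < best:
                        best, pair = d, (i, j)
        if pair is not None and best < 0.5:   # arbitrary similarity threshold
            i, j = pair                       # merge: average the two points
            merged = ((self.points[i][0] + self.points[j][0]) / 2,
                      self.points[i][1])
            self.points = [p for k, p in enumerate(self.points)
                           if k not in (i, j)]
            self.points.append(merged)
        else:                                 # evict the largest-margin point
            margins = [yi * xi.dot(w) for xi, yi in self.points]
            self.points.pop(int(np.argmax(margins)))

rng = np.random.default_rng(0)
w = np.array([1.0, 0.0])
cache = VirtualPointCache(capacity=3)
for _ in range(6):
    cache.add(rng.normal(size=2), 1, w)
print(len(cache.points))  # never exceeds the capacity of 3
```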
Eviction
• After adding the new point into the virtual point set:
– Select the point to evict by maximizing the eviction criterion
– Remove it from the cache
– Update the residual to absorb the evicted point's Gaussian approximation
Version space for 3-D case
Version space: brown area
EP approximation: red ellipse
Four data points: hyperplanes
Version space with three points after deleting one point (with the largest margin)
Merging
• Remove the two selected points from the cache
• Insert the merged point into the cache
• Update the residual with a Gaussian residual term that captures the information lost from the original two factors
• Equivalent to replacing the two original factors by the merged factor times the Gaussian residual
Version space for 3-D case
Version space: brown area
EP approximation: red ellipse
Four data points: hyperplanes
Version space with three points after merging two similar points
Compute residue term
• Inverse ADF: match the moments of the left and right distributions:
• Efficiently solved by the Gauss-Newton method as a one-dimensional problem
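As a stand-in for the Gauss-Newton solve, the one-dimensional moment match can be checked numerically on a dense grid (the function name and the grid approach are mine; the talk's method solves this iteratively):

```python
import numpy as np

# Match a Gaussian to a tilted 1-D distribution p(s) ∝ N(s; m, v) * t(s)
# by computing its first two moments on a dense, uniform grid.
def match_moments(m, v, tilt):
    s = np.linspace(m - 10 * np.sqrt(v), m + 10 * np.sqrt(v), 20001)
    p = np.exp(-0.5 * (s - m) ** 2 / v) * tilt(s)
    p /= p.sum()                      # normalize on the uniform grid
    mean = (s * p).sum()
    var = ((s - mean) ** 2 * p).sum()
    return mean, var

# example tilt: the flipping-noise step likelihood with eps = 0.1
eps = 0.1
mean, var = match_moments(0.0, 1.0, lambda s: eps + (1 - 2 * eps) * (s > 0))
print(mean, var)   # mean shifts positive, variance shrinks below 1
```

Because the moment match reduces to a scalar problem in s = wᵀx, a one-dimensional solver is enough even when w is high-dimensional.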
Algorithm Summary
• Random feature expansion (Rahimi & Recht, 2007)
• For RBF kernels, we use random Fourier features: φ(x) ∝ cos(ωᵀx + b)
• The frequencies ω are sampled from the Fourier transform of the kernel, which is a Gaussian for RBF kernels.
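A minimal sketch of the random Fourier feature construction for an RBF kernel, following Rahimi & Recht (2007); the function name and parameter choices are illustrative:

```python
import numpy as np

# Random Fourier features for the RBF kernel
#   k(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)):
# phi(x) = sqrt(2/D) * cos(W^T x + b) with W ~ N(0, 1/sigma^2) entrywise
# and b ~ Uniform(0, 2*pi), so that phi(x) . phi(x') ≈ k(x, x').
def rff(X, D=100, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / sigma, size=(d, D))
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.random.default_rng(1).normal(size=(5, 3))
Phi = rff(X, D=2000)
approx = Phi @ Phi.T                       # ≈ RBF kernel matrix
exact = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))
print(np.abs(approx - exact).max())        # small approximation error
```

With such features, a nonlinear kernel classifier reduces to a linear model in φ(x), which is what lets the online linear machinery above handle nonlinear problems.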
Classification with random features
Outline
• Introduction• Virtual Vector Machine• Experimental Results• Summary
Estimation accuracy of posterior mean
Mean squared error of the estimated posterior mean obtained by EP, the virtual vector machine (VVM), ADF, and window-EP (W-EP). The exact posterior mean is obtained via a Monte Carlo method. The results are averaged over 20 runs.
Online classification (1)
Cumulative prediction error rates of VVM, the sparse online Gaussian process classifier (SOGP), the Passive-Aggressive (PA) algorithm, and the Topmoumoute online natural gradient (NG) algorithm on the Spambase dataset. The size of the virtual point set used by VVM is 30, while the online Gaussian process model has 143 basis points.
Online nonlinear classification (2)
Cumulative prediction error rates of VVM and competing methods on the Thyroid dataset. VVM, PA, and NG use the same random Fourier-Gaussian feature expansion (dimension 100). NG and VVM both use a buffer caching 10 points, while the online Gaussian process model and the Passive-Aggressive algorithm have 12 and 91 basis points, respectively.
Online nonlinear classification (3)
Cumulative prediction error rates of VVM and the competing methods on the Ionosphere dataset. VVM, PA, and NG use the same random Fourier-Gaussian feature expansion (dimension 100). NG and VVM both use a buffer caching 30 points, while the online Gaussian process model and the Passive-Aggressive algorithm have 279 and 189 basis points, respectively.
Summary
• Efficient Bayesian online classification
• A small constant space cost
• A smooth trade-off between prediction accuracy and computational cost
• Improved prediction accuracy over alternative methods
• More flexible functional forms for virtual points, and other applications