Page 1: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

John C. Platt

Presented by: Travis Desell

Page 2: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Overview

• Introduction
  – Motivation
  – General SVMs
  – General SVM training
  – Related Work
• Sequential Minimal Optimization (SMO)
  – Choosing the smallest optimization problem
  – Solving the smallest optimization problem
• Benchmarks
• Conclusion
• Remarks & Future Work
• References

Page 3: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Motivation

• Traditional SVM training algorithms
  – Require a quadratic programming (QP) package
  – SVM training is slow, especially for large problems
• Sequential Minimal Optimization (SMO)
  – Requires no QP package
  – Easy to implement
  – Often faster
  – Good scalability properties

Page 4: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

General SVMs

u = Σi αi yi K(xi, x) – b   (1)

• u : SVM output
• αi : weights blending the kernel evaluations of the training examples
• yi in {-1, +1} : desired output
• b : threshold
• xi : stored training example (vector)
• x : input (vector)
• K : kernel function measuring the similarity of xi to x
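A minimal Python sketch of Eq. (1), assuming a Gaussian kernel; the names (`rbf_kernel`, `svm_output`, `gamma`) are illustrative, not from the paper or the slides.

```python
import numpy as np

def rbf_kernel(xi, x, gamma=0.1):
    # One common choice of K: the Gaussian (RBF) kernel.
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def svm_output(alphas, ys, Xs, x, b, kernel=rbf_kernel):
    # Eq. (1): u = sum_i alpha_i * y_i * K(x_i, x) - b
    return sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, ys, Xs)) - b
```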

Page 5: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

General SVMs (2)

• For linear SVMs, K is linear, so (1) can be expressed as the dot product of w and x minus the threshold:

u = w · x – b   (2)
w = Σi αi yi xi   (3)

• where w, x, and the xi are vectors
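A sketch of this linear fold, assuming NumPy arrays: once w is computed via (3), each evaluation of (2) costs a single dot product instead of one kernel call per training example.

```python
import numpy as np

def fold_weights(alphas, ys, Xs):
    # Eq. (3): w = sum_i alpha_i * y_i * x_i
    # alphas, ys: shape (n,); Xs: shape (n, d)
    return (alphas[:, None] * ys[:, None] * Xs).sum(axis=0)

def linear_output(w, x, b):
    # Eq. (2): u = w . x - b
    return np.dot(w, x) - b
```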

Page 6: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

General SVM Training

• Training an SVM means finding the αi, expressed as minimizing a dual quadratic form:

min Ψ(α) = min ½ Σi Σj yi yj K(xi, xj) αi αj – Σi αi   (4)

• subject to the box constraints:

0 ≤ αi ≤ C, for all i   (5)

• and the linear equality constraint:

Σi yi αi = 0   (6)

• The αi are Lagrange multipliers of a primal QP problem: there is a one-to-one correspondence between each αi and each training example xi
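A direct transcription of the dual objective (4), assuming a precomputed n × n kernel matrix with K[i, j] = K(xi, xj); purely illustrative.

```python
import numpy as np

def dual_objective(alphas, ys, K):
    # Eq. (4): Psi(alpha) = 1/2 sum_ij y_i y_j K_ij alpha_i alpha_j - sum_i alpha_i
    yKy = np.outer(ys, ys) * K
    return 0.5 * alphas @ yKy @ alphas - alphas.sum()
```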

Page 7: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

General SVM Training (2)

• SMO solves the QP expressed in (4-6)
• Terminates when all of the Karush-Kuhn-Tucker (KKT) optimality conditions are fulfilled:

αi = 0 ⇒ yi ui ≥ 1   (7)
0 < αi < C ⇒ yi ui = 1   (8)
αi = C ⇒ yi ui ≤ 1   (9)

• where ui is the SVM output for the ith training example
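A hedged sketch of testing (7)-(9) within the tolerance ε introduced on a later slide; the exact comparison scheme here is an assumption, not Platt's code.

```python
def violates_kkt(alpha_i, y_i, u_i, C, eps=1e-3):
    # r = y_i * u_i is the margin of example i under the current SVM.
    r = y_i * u_i
    if alpha_i < eps:        # effectively alpha_i == 0: (7) requires y_i u_i >= 1
        return r < 1 - eps
    if alpha_i > C - eps:    # effectively alpha_i == C: (9) requires y_i u_i <= 1
        return r > 1 + eps
    return abs(r - 1) > eps  # non-bound: (8) requires y_i u_i == 1
```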

Page 8: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Related Work

• “Chunking” [9]
  – Removing training examples with αi = 0 does not change the solution.
  – Breaks the large QP problem down into smaller QP sub-problems in order to identify the non-zero αi.
  – Each QP sub-problem consists of every non-zero αi from the previous sub-problem combined with the M worst examples that violate (7-9), for some M [1].
  – The last step solves the entire QP problem, as all of the non-zero αi have been found.
  – Cannot handle large-scale training problems if standard QP techniques are used; Kaufman [3] describes a QP algorithm to overcome this.

Page 9: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Related Work (2)

• Decomposition [6]:
  – Breaks the large QP problem into smaller QP sub-problems.
  – Osuna et al. [6] suggest using a fixed-size matrix for every sub-problem, which allows very large training sets.
  – Joachims [2] suggests adding and subtracting examples according to heuristics for rapid convergence.
  – Until SMO, decomposition required a numerical QP library, which can be costly or slow.

Page 10: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Sequential Minimal Optimization

• SMO decomposes the overall QP problem (4-6) into fixed-size QP sub-problems.
• It chooses the smallest optimization problem (SOP) at each step.
  – Because of the linear equality constraint (6), a single αi cannot be changed alone, so the smallest problem involves two elements of α.
• SMO repeatedly chooses two elements of α to jointly optimize until the overall QP problem is solved (the loop is sketched below).
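A high-level sketch of that loop, modeled on the alternation between full passes and non-bound passes in Platt's pseudocode; `examine_example` and `non_bound_indices` stand in for the heuristics on the next two slides.

```python
def smo_outer_loop(n, examine_example, non_bound_indices):
    # examine_example(i) tries to jointly optimize alpha_i with a partner
    # alpha_j and returns True on progress; non_bound_indices() yields the
    # indices with 0 < alpha_i < C.
    num_changed, examine_all = 0, True
    while num_changed > 0 or examine_all:
        scan = range(n) if examine_all else non_bound_indices()
        num_changed = sum(bool(examine_example(i)) for i in scan)
        # After a full pass, scan only the non-bound subset; if a non-bound
        # pass changes nothing, fall back to one more full pass.
        examine_all = (not examine_all) and num_changed == 0
    # Exits when a full pass changes nothing: (7)-(9) hold everywhere
    # within the tolerance.
```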

Page 11: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Choosing the SOP

• Heuristic-based approach
• Terminates when the entire training set obeys (7-9) within ε (typically ε ≤ 10^-3)
• Repeatedly finds α1 and α2 and optimizes them until termination

Page 12: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Finding α1

• “First-choice heuristic”
  – Searches through the examples most likely to violate the KKT conditions (the non-bound subset)
  – αi at the bounds are likely to stay there; non-bound αi will move as others are optimized
• “Shrinking heuristic”
  – Finds examples which fulfill (7-9) by more than the worst example failed them
  – Ignores these examples until a final pass at the end, to ensure all examples fulfill (7-9)
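A minimal sketch of the non-bound subset that the first-choice heuristic scans; the shrinking bookkeeping (tracking how far each example over-fulfills (7-9)) is omitted here.

```python
def non_bound_indices(alphas, C, eps=1e-3):
    # Multipliers strictly inside (0, C) are the ones likely to move;
    # those at the bounds are likely to stay there.
    return [i for i, a in enumerate(alphas) if eps < a < C - eps]
```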

Page 13: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Finding α2

• Chosen to maximize the size of the step taken during the joint optimization of α1 and α2
• SMO keeps a cached error value E for each non-bound example, and approximates the step size by |E1 – E2|:
  – if E1 is positive, it chooses the α2 with minimum E2
  – if E1 is negative, it chooses the α2 with maximum E2
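With cached errors, maximizing |E1 – E2| reduces to taking the minimum or maximum cached E2 depending on the sign of E1. A sketch, with illustrative names:

```python
def choose_second(E1, errors, non_bound):
    # errors[j] caches E_j = u_j - y_j (Eq. 11) for each non-bound example j.
    if not non_bound:
        return None
    if E1 > 0:
        return min(non_bound, key=lambda j: errors[j])  # most negative E2
    return max(non_bound, key=lambda j: errors[j])      # most positive E2
```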

Page 14: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Solving the SOP

• Computes the minimum along the direction of the linear equality constraint:

α2new = α2 + y2(E1 – E2) / (K(x1,x1) + K(x2,x2) – 2K(x1,x2))   (10)
Ei = ui – yi   (11)

• Clips α2new to lie within [L, H]:

L = max(0, α2 + sα1 – 0.5(s+1)C)   (12)
H = min(C, α2 + sα1 – 0.5(s–1)C)   (13)
s = y1y2   (14)

• Calculates α1new:

α1new = α1 + s(α2 – α2new,clipped)   (15)
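The whole slide as one function, a direct transcription of (10)-(15); it assumes the denominator η = K11 + K22 – 2K12 is positive (the paper treats the degenerate η ≤ 0 case separately).

```python
def analytic_step(a1, a2, y1, y2, E1, E2, K11, K22, K12, C):
    s = y1 * y2                                     # Eq. (14)
    eta = K11 + K22 - 2.0 * K12                     # curvature along the constraint line
    a2_new = a2 + y2 * (E1 - E2) / eta              # Eq. (10): unconstrained minimum
    L = max(0.0, a2 + s * a1 - 0.5 * (s + 1) * C)   # Eq. (12)
    H = min(C,   a2 + s * a1 - 0.5 * (s - 1) * C)   # Eq. (13)
    a2_clipped = min(max(a2_new, L), H)             # clip into [L, H]
    a1_new = a1 + s * (a2 - a2_clipped)             # Eq. (15)
    return a1_new, a2_clipped
```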

Page 15: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Benchmarks

• UCI Adult: the SVM is given 14 attributes of a census record and asked to predict whether household income is greater than $50K. The 8 categorical and 6 continuous attributes are discretized into 123 binary attributes.

• Web: classify whether a web page belongs to a category or not; 300 sparse binary keyword attributes.

• MNIST: a single classifier (one of the ten digit recognizers) is trained; inputs are 784-dimensional, non-binary vectors stored as sparse vectors.

Page 16: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Description of Benchmarks

• Web and Adult are trained with both linear and Gaussian SVMs.

• Runs are performed with and without sparse inputs, and with and without kernel caching.

• PCG chunking always uses a kernel cache.

Page 17: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Benchmarking SMO

Page 18: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Conclusions

• PCG chunking is slower than SMO, partly because SMO ignores examples whose Lagrange multipliers are at C.

• The overhead of PCG chunking is not in the kernel: kernel optimizations do not greatly affect its time.

Page 19: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Conclusions (2)

• SVMlight solves 10-dimensional QP sub-problems.
• Differences are mostly due to kernel optimizations and numerical QP overhead.
• SMO is faster on linear problems due to linear SVM folding, but SVMlight could potentially use this as well.
• SVMlight benefits from its complex kernel cache at large problem sizes, while SMO does not have one and therefore does not benefit.

Page 20: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

Remarks & Future Work

• Heuristic-based approach to finding the α1 and α2 to optimize:
  – Is it possible to determine an optimal choice strategy that minimizes the number of steps?

• Is there a proof that SMO always minimizes the QP problem?

Page 21: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

References

• [1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 1998.

• [2] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184. MIT Press, 1998.

Page 22: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

References (2)

• [3] L. Kaufman. Solving the quadratic programming problem arising in support vector classification. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 147–168. MIT Press, 1998.

• [6] E. Osuna, R. Freund, and F. Girosi. Improved training algorithm for support vector machines. In Proc. IEEE Neural Networks in Signal Processing ’97, 1997.

Page 23: Using Analytic QP and Sparseness to Speed Training of Support Vector Machines

References (3)

• [9] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.