ICML 2004, Banff, Alberta, Canada

Learning Larger Margin Machine Locally and Globally

Kaizhu Huang ([email protected])
Haiqin Yang, Irwin King, Michael R. Lyu
Dept. of Computer Science and Engineering
The Chinese University of Hong Kong

July 5, 2004
Learning Larger Margin Machine Locally and Globally
Contributions
Background:
– Linear Binary Classification
– Motivation
Maxi-Min Margin Machine (M4)
– Model Definition
– Geometrical Interpretation
– Solving Methods
– Connections with Other Models
– Nonseparable Case
– Kernelization
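Before the model definition, the core quantity M4 optimizes can be made concrete with a small numerical sketch: each point's distance to the hyperplane wT z + b = 0 is normalized by its own class's covariance along w, and M4 maximizes the worst such margin. The toy data, candidate hyperplane, and function name below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two toy classes; class y is more spread out along its second coordinate.
X = rng.normal(loc=[3.0, 0.0], scale=[0.5, 0.5], size=(50, 2))   # class x
Y = rng.normal(loc=[-3.0, 0.0], scale=[0.5, 2.0], size=(50, 2))  # class y

def m4_margin(w, b, X, Y):
    """Worst-case covariance-normalized margin rho of hyperplane w^T z + b = 0.

    M4's constraints:  (w^T x_i + b) / sqrt(w^T Sigma_x w) >= rho  for all i,
                      -(w^T y_j + b) / sqrt(w^T Sigma_y w) >= rho  for all j.
    """
    sx = np.sqrt(w @ np.cov(X.T) @ w)   # sqrt(w^T Sigma_x w)
    sy = np.sqrt(w @ np.cov(Y.T) @ w)   # sqrt(w^T Sigma_y w)
    return min(np.min((X @ w + b) / sx), np.min(-(Y @ w + b) / sy))

# Margin achieved by one hand-picked candidate hyperplane.
rho = m4_margin(np.array([1.0, 0.0]), 0.0, X, Y)
```

M4 then searches for the (w, b) that maximize this rho, rather than evaluating it for a fixed candidate as above.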
Theory: A unified model of Support Vector Machine (SVM), Minimax Probability Machine (MPM), and Linear Discriminant Analysis (LDA).
Practice: A sequential Conic Programming Problem.
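A rough illustration of the sequential scheme (not the paper's exact algorithm): fix a target margin rho, check whether the margin constraints are feasible, and bisect on rho. In the paper each feasibility subproblem is solved exactly as a second-order cone program; the sketch below substitutes a crude search over randomly sampled directions, and all data and names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=[3.0, 0.0], scale=[0.5, 0.5], size=(40, 2))   # class x
Y = rng.normal(loc=[-3.0, 0.0], scale=[0.5, 2.0], size=(40, 2))  # class y

def best_margin(w, X, Y):
    # For a fixed direction w, the bias b that balances the two classes'
    # worst normalized margins gives
    #   rho(w) = (min_i w^T x_i - max_j w^T y_j) / (sqrt(w^T Sx w) + sqrt(w^T Sy w)).
    sx = np.sqrt(w @ np.cov(X.T) @ w)
    sy = np.sqrt(w @ np.cov(Y.T) @ w)
    return (np.min(X @ w) - np.max(Y @ w)) / (sx + sy)

def feasible(rho, dirs, X, Y):
    # Toy stand-in for the feasibility oracle: does any sampled direction
    # achieve margin >= rho?  (The paper solves this step as an SOCP.)
    return any(best_margin(w, X, Y) >= rho for w in dirs)

dirs = rng.normal(size=(200, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

lo, hi = 0.0, 10.0
for _ in range(30):                 # bisection on the target margin rho
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if feasible(mid, dirs, X, Y) else (lo, mid)
rho_star = lo                       # approximate maxi-min margin
```

The bisection structure is the point here; replacing the random direction search with an exact SOCP solve at each step recovers the sequential conic programming approach named above.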
Learning Locally and Globally
[Figure: two classes, x and y, separated by the hyperplane wT z + b = 0]
Along the dashed axis, the y data have a larger spread than the x data. Therefore, a more reasonable hyperplane may lie closer to the x data rather than locating itself in the middle of the two classes, as SVM does.
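This intuition can be checked with a one-dimensional toy example (all numbers are illustrative): the Euclidean midpoint ignores the spreads, while equalizing margins measured in each class's own standard deviation pushes the boundary toward the compact class:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(1.0, 0.2, size=100)    # compact class x
y = rng.normal(-4.0, 1.0, size=100)   # class y, with a larger spread

# SVM-style boundary: Euclidean midpoint between the closest points.
t_svm = (x.min() + y.max()) / 2
# Spread-aware boundary: equal margins in units of each class's std,
#   (x.min() - t) / std(x) = (t - y.max()) / std(y).
sx, sy = x.std(), y.std()
t_m4 = (sy * x.min() + sx * y.max()) / (sx + sy)

# Because sy > sx, t_m4 is weighted toward x.min(): the boundary sits
# closer to the compact class, giving the widely spread class more room.
closer_to_x = abs(t_m4 - x.min()) < abs(t_svm - x.min())
```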
M4: Links with MPM (Cont')
MPM:  max rho  s.t.  (wT x̄ + b) / sqrt(wT Σx w) >= rho,
                    -(wT ȳ + b) / sqrt(wT Σy w) >= rho
M4:   max rho  s.t.  (wT xi + b) / sqrt(wT Σx w) >= rho  for all i,
                    -(wT yj + b) / sqrt(wT Σy w) >= rho  for all j
Remarks: Averaging M4's constraints over each class yields MPM's mean-based constraints, but the procedure is not reversible: MPM is a special case of M4.
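The remark can be verified numerically: averaging M4's per-point constraints over one class yields exactly MPM's mean-based constraint, so the margin of the class mean can never be smaller than M4's worst-case margin. The toy data below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(loc=[3.0, 0.0], scale=[0.5, 0.5], size=(60, 2))  # one class
w, b = np.array([1.0, 0.0]), 0.0

s = np.sqrt(w @ np.cov(X.T) @ w)          # sqrt(w^T Sigma_x w)
rho_m4 = np.min((X @ w + b) / s)          # M4: worst single-point margin
rho_mpm = (X.mean(axis=0) @ w + b) / s    # MPM: margin of the class mean

# The mean of the per-point margins equals rho_mpm and is >= their minimum,
# so MPM's mean constraint is implied by (i.e., weaker than) M4's.
```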
MPM builds the decision boundary GLOBALLY, i.e., it depends exclusively on the means and covariances of the two classes. However, the means and covariances may not be accurately estimated.
In linear cases, M4 outperforms SVM and MPM.
In Gaussian-kernel cases, M4 is slightly better than or comparable to SVM:
(1) Sparsity in the feature space results in inaccurate estimation of the covariance matrices.
(2) Kernelization may not keep the topology of the original data: maximizing the margin in the feature space does not necessarily maximize the margin in the original space.
Future Work
Speeding up M4:
– The solutions contain support vectors; can we employ this sparsity as has been done in SVM?
– Can we reduce redundant points?
How to impose constraints on the kernelization for keeping the topology of the data?
Generalization error bound? SVM and MPM both have error bounds.
How to extend to multi-category classifications?
Conclusion
Proposed a new large margin classifier, M4, which learns the decision boundary both locally and globally.
Built theoretical connections with other models: a unified model of SVM, MPM, and LDA.
Developed a sequential Second Order Cone Programming algorithm for M4.
Experimental results demonstrated the advantages of our new model.