Research Article

A Robust Probability Classifier Based on the Modified $\chi^2$-Distance

Yongzhi Wang et al.

1 College of Instrumentation & Electrical Engineering, Jilin University, Changchun 130061, China
2 Department of Automation, TNList, Tsinghua University, Beijing 100084, China
3 Development and Research Center of China Geological Survey, Beijing 100037, China
4 Key Laboratory of Geological Information Technology, Ministry of Land and Resources, Beijing 100037, China

Correspondence should be addressed to Yongzhi Wang; iamwangyongzhi@126.com

Received 9 January 2014; Revised 5 April 2014; Accepted 7 April 2014; Published 30 April 2014

Academic Editor: Hua-Peng Chen

Mathematical Problems in Engineering, Volume 2014, Article ID 621314, 11 pages. http://dx.doi.org/10.1155/2014/621314

Copyright © 2014 Yongzhi Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
We propose a robust probability classifier model to address classification problems with data uncertainty. A class-conditional probability distributional set is constructed based on the modified $\chi^2$-distance. Based on a "linear combination assumption" for the posterior class-conditional probabilities, we consider a classification criterion using the weighted sum of the posterior probabilities. An optimal robust minimax classifier is defined as the one with the minimal worst-case absolute error loss function value over all possible distributions belonging to the constructed distributional set. Based on the conic duality theorem, we show that the resulting optimization problem can be reformulated into a second order cone programming problem, which can be efficiently solved by interior algorithms. The robustness of the proposed model can avoid the "overlearning" phenomenon on training sets and thus keep a comparable accuracy on test sets. Numerical experiments validate the effectiveness of the proposed model and further show that it also provides promising results on multiple classification problems.
1. Introduction
Statistical classification has been extensively studied in the fields of machine learning and statistics. A typical classification problem is to design a linear or nonlinear classifier, based on a known training set, such that a new observation can be assigned to one of the known classes. Many classification models have been proposed, such as naive Bayes classifiers (NBC) [1, 2], artificial neural networks [3], and support vector machines (SVM) [4].
In real-world classification problems, it is often the case that the data of the training set are imprecise, due to unavoidable observational noises in the process of data collection or data approximation from incomplete samples. One way to handle the data uncertainty is to design a robust classifier, in the sense that it has the minimal worst-case misclassification probability for the training sets. The idea of robustness has been widely applied in many traditional machine learning and statistics techniques, such as robust Bayes classifiers [5], robust support vector machines [6], and robust quadratic regressions [7]. Robust classifiers are closely related to the recently flourishing research on robust optimization. For recent developments on robust optimization, we refer the readers to the excellent book [8] and the reviews [9, 10].
Recently, [11, 12] have proposed a robust minimax approach, called the minimax probability machine, to design a binary classifier. Unlike the traditional methods, they make no assumption on the class-conditional distributions; only the mean and covariance matrix of each class are assumed to be known. Under this assumption, the designed classifier is determined by minimizing the worst-case probability of misclassification under all possible choices of class-conditional distributions with the given mean and covariance matrix. By reformulating the classifier design problem into second order cone programming, they show that the computational complexity of the proposed approach is similar to that of SVM. Because of its computational advantage and competitive performance compared with other current methods, this approach has been further extended to incorporate other features.
El Ghaoui et al. [13] propose a robust classification model by minimizing the worst-case value of a given loss function over all possible choices of the data in bounded hyperrectangles. Three loss functions, from SVM, logistic regression, and minimax probability machines, are studied in [13]. Based on the same assumption of known mean and covariance matrix, [14, 15] propose the biased minimax probability machine to address the biased classification problem and further generalize it to obtain the minimum error minimax probability machine. Hoi and Lyu [16] study a quadratic classifier with positive definite covariance matrices and further consider the problem of finding a convex set to cover known sampled data in one class while minimizing the worst-case misclassification probability. The minimax probability machines have also been extended to solve multiple classification problems; see [17, 18].
In this paper we propose a robust probability classifier (RPC) based on the modified $\chi^2$-distance. Specifically, for a given training set, we first estimate the probability of each sample belonging to each class based on a feature, which is called a nominal class-conditional distribution. Then an $\epsilon$-confidence probability distributional set $P_\epsilon$ is constructed based on the nominal class-conditional distributions and the modified $\chi^2$-distance, where the parameter $\epsilon$ controls the size of the constructed set. Unlike the "conditional independence assumption" in NBC, we introduce a "linear combination assumption" for the posterior class-conditional probabilities; the proposed classifier takes a linear combination of these probabilities based on different features and assigns the sample to the class with the maximal posterior probability. To obtain a robust classifier, we minimize the worst-case loss function value over all possible choices of class-conditional distributions in the distributional set $P_\epsilon$. The underlying assumption is that, due to observational noises, we cannot obtain the true probability distribution of each class, but it can be well estimated by the nominal distribution, so that it belongs to the distributional set $P_\epsilon$.
Our two major contributions are as follows. First, in our model the proposed distributional set $P_\epsilon$ is based on the nominal distribution and the modified $\chi^2$-distance. As pointed out in [19], such a distributional set can make use of more of the information conveyed in the training set, compared with traditional robust approaches, which only use the information of the mean and covariance matrix. To the best of our knowledge, this is among the first studies of classification models considering complex distribution information. Although [20] considers an $\epsilon$-contaminated robust support vector machine model, its distributional set is defined by easily handled linear constraints, and its analysis is highly dependent on the characterization of the extreme points of this set. Here, our proposed distributional set is defined by a nonlinear quadratic function and is analyzed by the conic duality theorem. Second, by taking the absolute error function as the loss function, we show how to transform our robust minimax optimization problem into computable second order cone programming. The absolute error function in the objective also distinguishes our model from other existing models, such as the soft-margin support vector machine, which uses the Hinge loss function [21, 22], and logistic regression, which uses the negative log likelihood function [23]. Note that the absolute error function is essential in our model to obtain a tractable optimization problem. Numerical experiments on a real-world application validate the effectiveness of the proposed classifier and further show that it also performs well on multiple classification problems.
The paper proceeds as follows. Section 2 introduces the proposed robust minimax probability classifier based on the modified $\chi^2$-distance and discusses how to construct the desired distributional set $P_\epsilon$. Section 3 provides an equivalent reformulation by handling the robust constraints and the robust objective separately. Numerical experiments on real-world data sets are carried out in Section 4 to validate the effectiveness of the proposed classifier. Section 5 concludes the paper and gives future research directions.
2. Classifier Models
In this section, a simple probability classifier is first presented, and then it is extended to handle data uncertainty by introducing a distributional set $P_\epsilon$. We also discuss how to construct this distributional set based on the training data set.

Consider a multiclass, multifeature classification problem in which each sample contains $|L|$ features and there are $|J|$ classes and $|I|$ samples. Specifically, we are given a training set $(X, Y) \in \mathbb{R}^{|I| \times |L|} \times \{0, 1\}^{|I| \times |J|}$, where $x_{il}$ denotes the $l$th feature of the $i$th sample, and $y_{ij} = 1$ if the $i$th sample belongs to the $j$th class; otherwise $y_{ij} = 0$. In the following context we will also use $x_i$ to denote the $i$th sample, that is, $x_i = (x_{i1}, \ldots, x_{i|L|})$.
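To make the notation concrete, the following MATLAB snippet builds a toy training set in exactly this format; the sizes, the variable names `X`, `Y`, `labels`, and the random data are illustrative choices of ours, not part of the paper.

```matlab
% Toy training set in the paper's notation: |I| samples, |L| features, |J| classes.
nI = 200; nL = 3; nJ = 2;                   % |I|, |L|, |J| (illustrative sizes)
X = randn(nI, nL);                          % X(i,l) = x_il, the l-th feature of sample i
labels = randi(nJ, nI, 1);                  % class index of each sample
Y = zeros(nI, nJ);                          % Y(i,j) = y_ij in {0,1}
Y(sub2ind([nI, nJ], (1:nI)', labels)) = 1;  % y_ij = 1 iff sample i is in class j
```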
2.1. Probability Classifier. Bayes classifiers assign an observation $x$ to the $j^*(x)$th class, which has the maximal posterior probability; that is,

$$j^*(x) = \arg\max_{j \in J} P(j \mid x), \qquad (1)$$

where $P(j \mid x)$ is the posterior probability function, that is, the conditional probability that the sample belongs to the $j$th class given that we know it has feature vector $x$. By Bayes' rule,

$$P(j \mid x) = \frac{P(x \mid j)\, P(j)}{P(x)}, \qquad (2)$$

where $P(j)$ is the prior probability of the $j$th class, $P(x \mid j)$ is the class-conditional probability for the $j$th class, and $P(x)$ is the probability that a sample has feature vector $x$. Note that $P(x)$ is a constant if the values of the feature variables are known and thus can be omitted. To design an effective Bayes classifier, the key issue is estimating the class-conditional probability $P(x \mid j)$ or the joint probability $P(x, j)$. Theoretically, using the chain rule, we have

$$P(x \mid j) = P(x_1 \mid j)\, P(x_2 \mid x_1, j) \cdots P(x_{|L|} \mid x_1, \ldots, x_{|L|-1}, j). \qquad (3)$$

However, such an estimation method leads to the problem of "dimension disaster" (the curse of dimensionality).
To address this issue, the naive Bayes classifier makes the following "conditional independence assumption":

$$P(x \mid j) = \prod_{l=1}^{|L|} p^l_j(x), \qquad (4)$$

where $p^l_j(x) = P(x_l \mid j)$ is the class-conditional probability that the observation $x$ belongs to the $j$th class based on the $l$th feature. Here we introduce another "linear combination assumption" for the class-conditional probability:

$$P(x \mid j) = \sum_{l=1}^{|L|} \beta^l_j\, p^l_j(x), \qquad (5)$$

where $\beta^l_j$ is a coefficient. Compared with the "conditional independence assumption," which uses the probabilistic information in terms of multiplication, the proposed "linear combination assumption" uses the probabilistic information in terms of a weighted sum. We will further discuss the rationality of this assumption at the end of this subsection.
Under this assumption, the posterior probability of the $i$th sample belonging to the $j$th class is modeled as the weighted sum

$$f(j \mid x_i) = \sum_{l \in L} \alpha^l_j\, p^l_{ij}, \qquad (6)$$

where $p^l_{ij} = p^l_j(x_i)$, and the classifier weights $\alpha$ are determined by minimizing the total loss on the training set,

$$\min_{\alpha}\ \sum_{j \in J} \sum_{i \in I} L\left( f(j \mid x_i),\, y_{ij} \right), \qquad (7)$$

where $L(\cdot, \cdot): \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ is a prespecified loss function. In the following context we take the absolute error function as our loss function, that is, $L(x, y) = |x - y|$. In view of its probability property, it is straightforward to impose the following constraints on the posterior probability:

$$0 \le f(j \mid x_i) \le 1, \quad \forall i \in I,\ j \in J. \qquad (8)$$
There is no doubt that the "linear combination assumption" may not always hold. However, we justify the proposed classifier by the following facts.

(1) As an intuitive interpretation, note that $p^l_j(x)$ estimates the probability of the observation $x$ belonging to the $j$th class based only on the $l$th feature; thus it provides partial probabilistic information about the sample. Hence we can interpret the weight $\alpha^l_j$ as a certain degree of trust in that information, and in this sense the "linear combination assumption" is a way of combining evidence from different sources. Similar ideas can also be found in the theory of evidence; see the Dempster-Shafer theory [24, 25].

(2) In terms of classification performance, in the worst case the proposed classifier may put all weight on one feature; in such a case it is equivalent to a Bayes classifier based on a well-selected feature. If each class has its "typical" feature, which distinguishes it from the other classes, the proposed classifier has the ability to learn this property by putting different weights on different features for different classes and thus provides better classification performance. A real-life application to lithology classification problems also validates its classification performance in comparison with support vector machines and the naive Bayes classifier.

(3) Another advantage of the proposed classifier is its high computability. As we show in Section 3, the proposed classifier and its robust counterpart can be reformulated as second order cone programming problems and thus can be solved by interior algorithms in polynomial time.
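To illustrate assumption (5) and rule (1) together, the following MATLAB sketch scores each sample by the weighted sum $\sum_l \alpha^l_j p^l_{ij}$ and assigns it to the class with the maximal score. This is a minimal sketch of the nominal (nonrobust) classifier; the probability array `P` and the weight matrix `alpha` are assumed to be given, for example by the procedure of Section 2.3 and by solving (RPC).

```matlab
function pred = lc_classify(P, alpha)
% Linear-combination classifier (nominal version, a sketch).
% P:     |I| x |J| x |L| array with P(i,j,l) = p^l_ij.
% alpha: |L| x |J| matrix of weights alpha^l_j.
    [nI, nJ, nL] = size(P);
    f = zeros(nI, nJ);                        % f(i,j) = sum_l alpha^l_j * p^l_ij
    for l = 1:nL
        f = f + squeeze(P(:, :, l)) .* repmat(alpha(l, :), nI, 1);
    end
    [fmax, pred] = max(f, [], 2);             % assign to the class with maximal score
end
```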
2.2. Robust Probability Classifier. Due to observational noises, the true class-conditional probability distribution is often difficult to obtain. Instead, we can construct a confidence distributional set that contains the true distribution. Unlike the traditional distributional sets in minimax probability machines, which only utilize the mean and covariance matrix, we construct our class-conditional probability distributional set based on the modified $\chi^2$-distance, which uses more information from the samples.
The modified $\chi^2$-distance $d(\cdot, \cdot): \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ is used in statistics to measure the distance between two discrete probability distribution vectors. For given $p = (p_1, \ldots, p_m)^T$ and $q = (q_1, \ldots, q_m)^T$, it is defined as

$$d(q, p) = \sum_{j=1}^{m} \frac{(q_j - p_j)^2}{p_j}. \qquad (11)$$
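A direct MATLAB implementation of (11) is a one-liner (a sketch; `p` must have strictly positive entries):

```matlab
function v = chi2dist(q, p)
% Modified chi-square distance d(q,p) = sum_j (q_j - p_j)^2 / p_j, see (11).
% q, p: probability vectors of equal length; p strictly positive.
    v = sum((q - p).^2 ./ p);
end
```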
Based on the modified $\chi^2$-distance, we present the following $\epsilon$-confidence distributional set:

$$P_\epsilon = \left\{ \{q^l_{ij}\} : \sum_{j \in J} q^l_{ij} = 1,\ q^l_{ij} \ge 0,\ \sum_{j \in J} \frac{(q^l_{ij} - p^l_{ij})^2}{p^l_{ij}} \le \epsilon,\ \forall i \in I,\ l \in L \right\}, \qquad (12)$$

where $p^l_{ij}$ is the nominal class-conditional probability of the $i$th sample belonging to the $j$th class based on the $l$th feature, and the prespecified parameter $\epsilon$ is used to control the size of the set.
To design a robust classifier, we need to consider the effect of data uncertainty on the objective function and on the constraints. The robust objective function minimizes the worst-case loss function value over all possible distributions in the distributional set $P_\epsilon$; the robust constraints ensure that all the original constraints are satisfied for any distribution in $P_\epsilon$. Thus the robust probability classifier problem takes the following form:

$$\text{(RPC)} \quad \min_{\alpha}\ \max_{\{q^l_{ij}\} \in P_\epsilon}\ \sum_{j \in J} \sum_{i \in I} (1 - 2 y_{ij}) \sum_{l \in L} \alpha^l_j q^l_{ij} + |I|$$
$$\text{s.t.} \quad 0 \le \sum_{l \in L} \alpha^l_j q^l_{ij} \le 1, \quad \forall \{q^l_{ij}\} \in P_\epsilon,\ \forall i \in I,\ j \in J. \qquad (13)$$
Note that the above optimization problem has an infinite number of robust constraints, and its objective function contains an embedded subproblem. We show how to solve such a minimax optimization problem in Section 3.
2.3. Constructing the Distributional Set. To obtain the distributional set $P_\epsilon$, we need to define the parameter $\epsilon$ and the nominal probabilities $p^l_{ij}$. The selection of $\epsilon$ is application dependent, and we discuss this issue in the numerical experiment section; here we provide a procedure to calculate $p^l_{ij}$.

For the $l$th feature, the following procedure takes an integer $K_l$, indicating the number of data intervals, as input and outputs the estimated probability $p^l_{ij}$ of the $i$th sample belonging to the $j$th class (a sketch is given after the list).

(1) Sort the samples in increasing order of the $l$th feature and divide them into $K_l$ intervals such that each interval contains at least $\lfloor |I| / K_l \rfloor$ samples. Denote the $k$th interval by $\Delta_{lk}$.

(2) Calculate the total number of samples in the $j$th class, $N_j$; the total number of samples in the $k$th interval, $N_{lk}$; and the total number of samples of the $j$th class in the $k$th interval, $N_{lkj}$.

(3) For the $i$th sample, if it falls into the $k$th interval, set $p^l_{ij} = N_{lkj} / N_{lk}$, the empirical fraction of interval-$k$ samples that belong to class $j$ (equivalently, by Bayes' rule, $p^l_{ij} = (N_{lkj}/N_j)(N_j/|I|)/(N_{lk}/|I|)$).
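The following MATLAB sketch implements these three steps for a single feature. The small additive smoothing term is an addition of this sketch, not part of the stated procedure; it keeps every $p^l_{ij}$ strictly positive, as the $\chi^2$-distance requires.

```matlab
function P_l = nominal_prob(x_l, labels, nJ, K_l)
% Nominal probabilities p^l_ij for one feature (steps (1)-(3), a sketch).
% x_l: |I|x1 feature values; labels: |I|x1 class indices in 1..nJ.
    nI = numel(x_l);
    xs = sort(x_l);                                 % step (1): sort and split
    cuts = xs(floor((1:K_l - 1) * nI / K_l));
    edges = [-inf; cuts(:); inf];
    [cnt, k] = histc(x_l, edges);                   % k(i) = interval of sample i
    k = min(max(k, 1), K_l);
    P_l = zeros(nI, nJ);
    for kk = 1:K_l                                  % steps (2)-(3): count, normalize
        idx = (k == kk);
        Nlk = sum(idx);                             % N_lk, samples in interval kk
        Nlkj = accumarray(labels(idx), 1, [nJ, 1]); % N_lkj, per-class counts
        pk = (Nlkj + 1e-6) / (Nlk + nJ * 1e-6);     % smoothed frequency (our addition)
        P_l(idx, :) = repmat(pk', Nlk, 1);
    end
end
```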
Note that from the definition of $P_\epsilon$ we can easily compute the upper bound $\bar{q}^l_{ij}$ and the lower bound $\underline{q}^l_{ij}$ for the true class-conditional probability $q^l_{ij}$ as follows:

$$\bar{q}^l_{ij} = \max\left\{ q^l_{ij} : \sum_{s \in J} q^l_{is} = 1,\ \sum_{s \in J} \frac{(q^l_{is} - p^l_{is})^2}{p^l_{is}} \le \epsilon,\ q^l_{is} \ge 0,\ \forall s \in J \right\}, \qquad (15)$$

$$\underline{q}^l_{ij} = \min\left\{ q^l_{ij} : \sum_{s \in J} q^l_{is} = 1,\ \sum_{s \in J} \frac{(q^l_{is} - p^l_{is})^2}{p^l_{is}} \le \epsilon,\ q^l_{is} \ge 0,\ \forall s \in J \right\}. \qquad (16)$$

The above problems can be efficiently solved by a second order cone solver such as SeDuMi [26] or SDPT3 [27].
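As an illustration of how (15) can be passed to such a solver, the sketch below uses the CVX modeling package as a front end (an assumed dependency of this sketch; the paper calls the solvers directly). It computes the upper bound $\bar{q}^l_{ij}$ for one fixed pair $(i, l)$ and class $j$; replacing `maximize` with `minimize` gives (16).

```matlab
function qbar = upper_bound(p, eps_, j)
% Upper bound (15) for fixed i, l. p: |J|x1 nominal distribution p^l_{i,.};
% eps_: radius of the chi^2 ball; j: class index. Requires CVX.
    nJ = numel(p);
    cvx_begin quiet
        variable q(nJ)
        maximize( q(j) )
        subject to
            sum(q) == 1;
            sum((q - p).^2 ./ p) <= eps_;   % modified chi^2 ball, see (12)
            q >= 0;
    cvx_end
    qbar = q(j);
end
```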
3. Solution Methods for RPC

In this section, we first reduce the infinite number of robust constraints to a finite set of linear constraints and then transform the inner robust objective function into a minimization problem by the conic duality theorem. Finally, we obtain an equivalent, computable second order cone program for the RPC problem. The following analysis is based on the strong duality result in [8].
Consider a conic program of the following form:

$$\text{(CP)} \quad \min\ c^T x \quad \text{s.t.}\quad A_i x - b_i \in C_i,\ \forall i = 1, \ldots, m,\quad A x = b, \qquad (17)$$

and its dual problem

$$\text{(DP)} \quad \max\ b^T z + \sum_{i=1}^{m} b_i^T y_i \quad \text{s.t.}\quad A^* z + \sum_{i=1}^{m} A_i^* y_i = c,\quad y_i \in C_i^*,\ \forall i = 1, \ldots, m, \qquad (18)$$

where $C_i$ is a cone in $\mathbb{R}^{n_i}$ and $C_i^*$ is its dual cone, defined by

$$C_i^* = \left\{ y \in \mathbb{R}^{n_i} : y^T x \ge 0,\ \forall x \in C_i \right\}. \qquad (19)$$

A conic program is called strictly feasible if it admits a feasible solution $x$ such that $A_i x - b_i \in \operatorname{int} C_i$, $\forall i = 1, \ldots, m$, where $\operatorname{int} C_i$ denotes the interior of $C_i$.

Lemma 1 (see [8]). If one of the problems (CP) and (DP) is strictly feasible and bounded, then the other problem is solvable, and (CP) = (DP) in the sense that both have the same optimal objective function value.
3.1. Robust Constraints. The following lemma provides an equivalent characterization of the infinite number of robust constraints in terms of a finite set of linear constraints, which can be handled efficiently.
Lemma 2. For given $i, j$, the robust constraint

$$0 \le \sum_{l \in L} \alpha^l_j q^l_{ij} \le 1, \quad \forall \{q^l_{ij}\} \in P_\epsilon, \qquad (20)$$

is equivalent to the following constraints:

$$\sum_{l \in L} \left( \underline{q}^l_{ij} u^{l0}_{ij} - \bar{q}^l_{ij} v^{l0}_{ij} \right) \ge 0,$$
$$\alpha^l_j - u^{l0}_{ij} + v^{l0}_{ij} \ge 0, \quad u^{l0}_{ij}, v^{l0}_{ij} \ge 0, \quad \forall l \in L,$$
$$1 + \sum_{l \in L} \left( \underline{q}^l_{ij} u^{l1}_{ij} - \bar{q}^l_{ij} v^{l1}_{ij} \right) \ge 0,$$
$$v^{l1}_{ij} - \alpha^l_j - u^{l1}_{ij} \ge 0, \quad u^{l1}_{ij}, v^{l1}_{ij} \ge 0, \quad \forall l \in L. \qquad (21)$$
Proof. First note that the distributional set $P_\epsilon$ can be represented as the Cartesian product of a series of projected subsets,

$$P_\epsilon = \prod_{i \in I} P_{\epsilon i}, \qquad (22)$$

where the projected subset for index $i$ is defined by

$$P_{\epsilon i} = \left\{ \{q^l_{ij}\} : \sum_{j \in J} q^l_{ij} = 1,\ q^l_{ij} \ge 0,\ \sum_{j \in J} \frac{(q^l_{ij} - p^l_{ij})^2}{p^l_{ij}} \le \epsilon,\ \forall l \in L,\ j \in J \right\}. \qquad (23)$$
Then, for given $i, j$, since the robust constraint is only associated with the variables $q^l_{ij}$, $l \in L$, we can further project $P_{\epsilon i}$ onto these coordinates, which yields the box

$$P_{\epsilon i j} = \left\{ \{q^l_{ij}\}_{l \in L} : \underline{q}^l_{ij} \le q^l_{ij} \le \bar{q}^l_{ij},\ \forall l \in L \right\}, \qquad (24)$$

where $\bar{q}^l_{ij}$ and $\underline{q}^l_{ij}$ are computed by (15) and (16), respectively.
For the constraint $\sum_{l \in L} \alpha^l_j q^l_{ij} \ge 0$, $\forall \{q^l_{ij}\} \in P_\epsilon$, we have the chain of equivalences

$$\sum_{l \in L} \alpha^l_j q^l_{ij} \ge 0, \quad \forall \{q^l_{ij}\} \in P_{\epsilon i}$$
$$\iff \sum_{l \in L} \alpha^l_j q^l_{ij} \ge 0, \quad \forall \{q^l_{ij}\} \in P_{\epsilon i j}$$
$$\iff \min\left\{ \sum_{l \in L} \alpha^l_j q^l_{ij} : \underline{q}^l_{ij} \le q^l_{ij} \le \bar{q}^l_{ij},\ \forall l \in L \right\} \ge 0$$
$$\iff \max\left\{ \sum_{l \in L} \left( \underline{q}^l_{ij} u^{l0}_{ij} - \bar{q}^l_{ij} v^{l0}_{ij} \right) : \alpha^l_j - u^{l0}_{ij} + v^{l0}_{ij} \ge 0,\ u^{l0}_{ij}, v^{l0}_{ij} \ge 0,\ \forall l \in L \right\} \ge 0$$
$$\iff \sum_{l \in L} \left( \underline{q}^l_{ij} u^{l0}_{ij} - \bar{q}^l_{ij} v^{l0}_{ij} \right) \ge 0, \quad \alpha^l_j - u^{l0}_{ij} + v^{l0}_{ij} \ge 0,\ u^{l0}_{ij}, v^{l0}_{ij} \ge 0,\ \forall l \in L, \qquad (25)$$

where the last equivalence comes from the strong duality between the two linear programs.
For the constraint $\sum_{l \in L} \alpha^l_j q^l_{ij} \le 1$, $\forall \{q^l_{ij}\} \in P_\epsilon$, the same technique applies; thus we complete the proof.
3.2. Robust Objective Function. In the RPC problem, the robust objective function is defined by an inner maximization problem. The following proposition shows that it can be transformed into a minimization problem over second order cones. To prove this result, we utilize the conjugate function $d^*$ of the one-dimensional modified $\chi^2$-distance $d(t) = (t - 1)^2$:

$$d^*(s) = \sup_{t \ge 0}\ \{ s t - d(t) \} = \frac{[s + 2]_+^2}{4} - 1, \qquad (26)$$

where $[\cdot]_+$ is defined by $[x]_+ = x$ if $x \ge 0$ and $[x]_+ = 0$ otherwise. For more details about conjugate functions, see [28].
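The conjugate (26) has a simple closed form; the sketch below implements it and checks it against a brute-force maximization of $st - (t - 1)^2$ over a grid of $t \ge 0$ (the grid bounds and test values are our own choices):

```matlab
dstar = @(s) max(s + 2, 0).^2 / 4 - 1;      % conjugate function (26)

t = linspace(0, 50, 1e6);                   % brute-force grid for the sup
for s = [-5, -2, 0, 1.7, 3]
    brute = max(s * t - (t - 1).^2);
    fprintf('s = %5.2f: d*(s) = %8.4f, grid sup = %8.4f\n', s, dstar(s), brute);
end
```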
Proposition 3. The inner maximization problem in (RPC) can be reformulated as a minimization problem whose constraints involve second order cones, where a second order cone $L^{n+1}$ is defined as

$$L^{n+1} = \left\{ x \in \mathbb{R}^{n+1} : x_{n+1} \ge \sqrt{ \sum_{i=1}^{n} x_i^2 } \right\}. \qquad (29)$$
Proof. For any feasible $\alpha$ satisfying the robust constraints, it is straightforward to show that the inner maximization problem equals the following minimization problem (MP):

$$\text{(MP)} \quad \min\ t \quad \text{s.t.}\quad t \ge \sum_{j \in J} \sum_{i \in I} (1 - 2 y_{ij}) \sum_{l \in L} \alpha^l_j q^l_{ij} + |I|, \quad \forall \{q^l_{ij}\} \in P_\epsilon. \qquad (30)$$

The above constraint can be further reduced to the constraint

$$\max_{\{q^l_{ij}\} \in P_\epsilon}\ \left\{ \sum_{j \in J} \sum_{i \in I} (1 - 2 y_{ij}) \sum_{l \in L} \alpha^l_j q^l_{ij} \right\} + |I| - t \le 0. \qquad (31)$$
By assigning Lagrange multipliers $\theta^l_i \in \mathbb{R}$ and $\lambda^l_i \in \mathbb{R}_+$ to the constraints of the left-hand optimization problem, we obtain the Lagrange function

$$L(q, \theta, \lambda) = \sum_{i \in I} \sum_{l \in L} \left( \epsilon \lambda^l_i - \theta^l_i \right) + \sum_{i \in I} \sum_{l \in L} \sum_{j \in J} \left( r^l_{ij} q^l_{ij} - \lambda^l_i\, \frac{(q^l_{ij} - p^l_{ij})^2}{p^l_{ij}} \right) + |I| - t, \qquad (32)$$

where $r^l_{ij} = \alpha^l_j (1 - 2 y_{ij}) + \theta^l_i$. Its dual function is given as
$$\begin{aligned} D(\theta, \lambda) &= \max_{q \ge 0}\ L(q, \theta, \lambda) \\ &= \sum_{i \in I} \sum_{l \in L} \left( \epsilon \lambda^l_i - \theta^l_i \right) + \sum_{i \in I} \sum_{l \in L} \sum_{j \in J}\ \max_{q^l_{ij} \ge 0} \left( r^l_{ij} q^l_{ij} - \lambda^l_i p^l_{ij} \left( \frac{q^l_{ij} - p^l_{ij}}{p^l_{ij}} \right)^2 \right) + |I| - t \\ &= \sum_{i \in I} \sum_{l \in L} \left( \epsilon \lambda^l_i - \theta^l_i \right) + \sum_{i \in I} \sum_{l \in L} \sum_{j \in J} p^l_{ij}\ \max_{t' \ge 0} \left( r^l_{ij} t' - \lambda^l_i (t' - 1)^2 \right) + |I| - t \\ &= \sum_{i \in I} \sum_{l \in L} \left( \epsilon \lambda^l_i - \theta^l_i \right) + \sum_{i \in I} \sum_{l \in L} \sum_{j \in J} p^l_{ij} \lambda^l_i\ \max_{t' \ge 0} \left( \frac{r^l_{ij}}{\lambda^l_i}\, t' - (t' - 1)^2 \right) + |I| - t \\ &= \sum_{i \in I} \sum_{l \in L} \left( \epsilon \lambda^l_i - \theta^l_i \right) + \sum_{i \in I} \sum_{l \in L} \sum_{j \in J} p^l_{ij} \lambda^l_i\, d^*\!\left( \frac{r^l_{ij}}{\lambda^l_i} \right) + |I| - t. \end{aligned} \qquad (33)$$
Note that for any feasible $\alpha$, the primal maximization problem (31) is bounded and has a strictly feasible solution $p^l_{ij}$; thus there is no duality gap between (31) and its Lagrangian dual problem $\min_{\theta,\, \lambda \ge 0} D(\theta, \lambda)$. Since each term $\lambda^l_i\, d^*(r^l_{ij} / \lambda^l_i)$ can be represented by second order cone constraints via (26) and (29), the claimed second order cone reformulation follows.
4. Numerical Experiments

In this section, numerical experiments on real-world applications are carried out to verify the effectiveness of the proposed robust probability classifier model. Specifically, we consider lithology classification data sets from our practical application. We compare our model with the regularized SVM (RSVM) and the naive Bayes classifier (NBC) on both binary and multiple classification problems.

All the numerical experiments are implemented in Matlab 7.7.0 and run on an Intel(R) Core(TM) i5-4570 CPU. The SDPT3 solver [27] is called to solve the second order cone programs in our proposed method and in the regularized SVM.
4.1. Data Sets. Lithology classification is one of the basic tasks of geological investigation. To discriminate the lithology of the underground strata, various electromagnetic techniques are applied to the same strata to obtain different features, such as Gamma coefficients, acoustic wave, striation density, and fusibility.
Here, numerical experiments are carried out on a series of data sets from the boreholes T1, Y4, Y5, and Y6. All boreholes are located in the Tarim Basin, China. In total, 12 data sets are used for binary classification problems and 8 data sets are used for multiple classification problems. For each data set, based on a prespecified training rate $\gamma \in [0, 1]$, the data are randomly partitioned into two subsets, a training set and a test set, such that the training set accounts for a fraction $\gamma$ of the total number of samples.
4.2. Experiment Design. The parameters in our models are chosen based on the size of the data set. The parameter $\epsilon$ depends on the number of classes and is defined as $\epsilon = \delta^2 / |J|$, where $\delta \in (0, 1)$. The choice of $\epsilon$ can be explained in this way: if there are $|J|$ classes and the training data are uniformly distributed, then for each probability $p^l_{ij} = 1/|J|$, its maximal variation range is between $p^l_{ij}(1 - \delta)$ and $p^l_{ij}(1 + \delta)$. The number of data intervals $K_l$ is defined as $K_l = |I| / (|J| \times K)$, so that if the training data are uniformly distributed, then each data interval contains $K$ samples of each class. In the following context we set $\delta = 0.2$ and $K = 8$.
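A quick numerical check of this choice (the concrete numbers below are illustrative): for the uniform nominal distribution, varying a single probability by its full relative range $\delta$ contributes exactly $\delta^2 / |J|$ to the $\chi^2$-distance, so such a variation exhausts the budget $\epsilon$; the compensation needed to renormalize the distribution is ignored here, as in the heuristic above.

```matlab
nJ = 4; delta = 0.2;
p = ones(nJ, 1) / nJ;                 % uniform nominal distribution
eps_ = delta^2 / nJ;                  % the paper's choice of epsilon
q = p; q(1) = p(1) * (1 + delta);     % one probability at its maximal variation
contrib = (q(1) - p(1))^2 / p(1);     % its contribution to the chi^2 distance
fprintf('contribution = %.6f, epsilon = %.6f\n', contrib, eps_);  % both 0.01
```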
We compare the performance of the proposed RPC model with the following regularized support vector machine model [6] (taking the $j$th class as an example):

$$\text{(RSVM)} \quad \min\ \sum_{i \in I} \xi_{ij} + \lambda_j \left\| w_j \right\| \quad \text{s.t.}\quad \tilde{y}_{ij} \left( \sum_{l \in L} w^l_j x_{il} + b_j \right) \ge 1 - \xi_{ij},\quad \xi_{ij} \ge 0,\quad i \in I, \qquad (38)$$
where $\tilde{y}_{ij} = 2 y_{ij} - 1$ and $\lambda_j \ge 0$ is a regularization parameter. As pointed out in [8], $\lambda_j$ represents a trade-off between the number of training-set errors and the amount of robustness with respect to spherical perturbations of the data points. To make a fair comparison, in the following experiments we test a series of $\lambda$ values and choose the one with the best performance. Note that if $\lambda_j = 0$, we refer to this model as the classic support vector machine (SVM). See also [6] for more details on RSVM and its applications to multiple classification problems.

Table 1: Performances of RSVM, NBC, and RPC for binary classification problems on the Y5 data set. Columns: tr (%); Train (%) and Test (%) accuracy for each of RSVM, NBC, and RPC.
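A CVX-style sketch of training (38) for one class follows; CVX is an assumed dependency of this sketch (the paper solves these programs with SDPT3 directly), and the norm is taken here as the Euclidean norm.

```matlab
function [w, b] = rsvm_train(X, ytilde, lambda)
% RSVM (38) for one class: X is |I|x|L|, ytilde in {-1,+1}^{|I|}, lambda >= 0.
    [nI, nL] = size(X);
    cvx_begin quiet
        variables w(nL) b xi(nI)
        minimize( sum(xi) + lambda * norm(w) )
        subject to
            ytilde .* (X * w + b) >= 1 - xi;
            xi >= 0;
    cvx_end
end
```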
4.3. Test on Binary Classification. In this subsection, RSVM, NBC, and RPC are implemented on 12 data sets for binary classification problems using cross-validation methods. To improve the performance of RSVM, we transform the original data by the popularly used polynomial kernels [6].

Tables 1 and 2 show the averaged classification performances of RSVM, NBC, and the proposed RPC (over 10 randomly generated instances) for binary classification problems on the Y5 and T1 data sets, respectively. For each data set, we randomly partition it into a training set and a test set based on the parameter tr, which varies from 50% to 90%. The highest classification accuracy on a training set among the three methods is highlighted in bold, while the best classification accuracy on a test set is marked with an asterisk.
Tables 1 and 2 validate the effectiveness of the proposed RPC for binary classification problems compared with NBC and RSVM. Specifically, in most cases RSVM has the highest classification accuracy on training sets, but its performance on test sets is unsatisfactory. In most cases, the proposed RPC provides the highest classification accuracy on test sets. NBC provides better performance on test sets as the training rate increases. The experimental results also show that, for a given training rate, RPC can provide better performance on test sets than on training sets; thus it can avoid the "overlearning" phenomenon.
To further validate the effectiveness of the proposed RPC, we test it on an additional 10 data sets, namely, T41-T45 and T61-T65. Table 3 reports the averaged performances of the three methods over 10 randomly generated instances when the training rate is set to 70%. Except for data sets T45, T63, and T64, RPC provides the highest accuracy on the test sets, and for all the data sets its accuracy is higher than 80%. As shown in Tables 1 and 2, the robustness of the proposed RPC guarantees its scalability on the test sets.
4.4. Test on Multiple Classification. In this subsection, we test the performance of RPC on multiple classification problems by comparison with RSVM and NBC. Since the performance of RSVM is determined by its regularization parameter $\lambda$, we run a set of RSVMs with $\lambda$ varying from 0 to a large enough number and select the one with the best performance on test sets.

Figures 1 and 3 plot the performances of the three methods on the Y5 and T1 training sets, respectively. Unlike the case of binary classification problems, we can see that RPC provides a competitive performance even on the training sets. One explanation is that RSVM can outperform the proposed RPC on training sets by finding the optimal separating hyperplane for binary classification problems, while RPC is more robust when extended to multiple classification problems, since it uses the nonlinear probability information of the data sets. The accuracy of NBC on the training sets also improves as the training rate increases.

Table 3: Performances of RSVM, NBC, and RPC for binary classification problems on the other data sets when tr = 70%. Columns: Data set; Train (%) and Test (%) accuracy for each of RSVM, NBC, and RPC.

Figure 1: Performances of RSVM, NBC, and RPC on the Y5 training set (training-set accuracy, %, versus training rate).
Figures 2 and 4 show the performances of the three methods on the Y5 and T1 test sets, respectively. We can see that in most cases RPC provides the highest accuracy among the three methods. The accuracy of RSVM outperforms that of NBC on the Y5 test set, while the latter outperforms the former on the T1 test set.

Figure 2: Performances of RSVM, NBC, and RPC on the Y5 test set (test-set accuracy, %, versus training rate).
To further test the performance of RPC on multiple classification problems, we carry out more experiments on data sets M1-M6. Table 4 reports the averaged performances of the three methods on these data sets when the training rate is set to 70%. Except for the M5 data set, RPC always provides the highest classification performance among the three methods, and even for the M5 data set its accuracy (88.0%) is very close to the best one (88.1%).

Figure 3: Performances of RSVM, NBC, and RPC on the T1 training set (training-set accuracy, %, versus training rate).

Figure 4: Performances of RSVM, NBC, and RPC on the T1 test set (test-set accuracy, %, versus training rate).
From the tested real-life application, we conclude that the proposed RPC has the robustness to provide better performance on both binary and multiple classification problems compared with RSVM and NBC. The robustness of RPC enables it to avoid the "overlearning" phenomenon, especially for binary classification problems.
5. Conclusion
In this paper, we propose a robust probability classifier model to address data uncertainty in classification problems. To quantitatively describe the data uncertainty, a class-conditional distributional set is constructed based on the modified $\chi^2$-distance. We assume that the true distribution lies in the constructed distributional set, centered at the nominal probability distribution. Based on the "linear combination assumption" for the posterior class-conditional probabilities, we consider a classification criterion using the weighted sum of the posterior probabilities. The optimal robust probability classifier is determined by minimizing the worst-case absolute error value over all possible distributions belonging to the distributional set.

Our proposed model introduces the recently developed distributionally robust optimization method into classifier design problems. To obtain a computable model, we transform the resulting optimization problem into an equivalent second order cone program based on the conic duality theorem. Thus our model has the same computational complexity as the classic support vector machine, and numerical experiments on a real-life application validate its effectiveness. On the one hand, the proposed robust probability classifier provides higher accuracy compared with RSVM and NBC by avoiding overlearning on training sets for binary classification problems; on the other hand, it also shows promising performance on multiple classification problems.

There are still many important extensions of our model. Other forms of loss function, such as the mean squared error function and the Hinge loss function, should be studied to obtain tractable reformulations, and the resulting models may provide better performance. Probability models considering joint probability distribution information are also an interesting research direction.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
[1] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, New York, NY, USA, 1973.
[2] P. Langley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in Proceedings of the 10th National Conference on Artificial Intelligence (AAAI '92), vol. 90, pp. 223–228, AAAI Press, Menlo Park, Calif, USA, July 1992.
[3] B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, UK, 2007.
[4] V. Vapnik, The Nature of Statistical Learning Theory, Springer, Berlin, Germany, 2000.
[5] M. Ramoni and P. Sebastiani, "Robust Bayes classifiers," Artificial Intelligence, vol. 125, no. 1-2, pp. 209–226, 2001.
[6] Y. Shi, Y. Tian, G. Kou, and Y. Peng, "Robust support vector machines," in Optimization Based Data Mining: Theory and Applications, Springer, London, UK, 2011.
[7] Y. Z. Wang, Y. L. Zhang, F. L. Zhang, and J. N. Yi, "Robust quadratic regression and its application to energy-growth consumption problem," Mathematical Problems in Engineering, vol. 2013, Article ID 210510, 10 pages, 2013.
[8] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski, Robust Optimization, Princeton University Press, Princeton, NJ, USA, 2009.
[9] A. Ben-Tal and A. Nemirovski, "Robust optimization: methodology and applications," Mathematical Programming, vol. 92, no. 3, pp. 453–480, 2002.
[10] D. Bertsimas, D. B. Brown, and C. Caramanis, "Theory and applications of robust optimization," SIAM Review, vol. 53, no. 3, pp. 464–501, 2011.
[11] G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan, "Minimax probability machine," in Advances in Neural Information Processing Systems, pp. 801–807, 2001.
[12] G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan, "A robust minimax approach to classification," Journal of Machine Learning Research, vol. 3, no. 3, pp. 555–582, 2003.
[13] L. El Ghaoui, G. R. G. Lanckriet, and G. Natsoulis, "Robust classification with interval data," Tech. Rep. UCB/CSD-03-1279, Computer Science Division, University of California, 2003.
[14] K. Huang, H. Yang, I. King, and M. R. Lyu, "Learning classifiers from imbalanced data based on biased minimax probability machine," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. 558–563, IEEE, July 2004.
[15] K. Huang, H. Yang, I. King, M. R. Lyu, and L. Chan, "The minimum error minimax probability machine," The Journal of Machine Learning Research, vol. 5, pp. 1253–1286, 2004.
[16] C.-H. Hoi and M. R. Lyu, "Robust face recognition using minimax probability machine," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '04), vol. 2, pp. 1175–1178, June 2004.
[17] T. Kitahara, S. Mizuno, and K. Nakata, "Quadratic and convex minimax classification problems," Journal of the Operations Research Society of Japan, vol. 51, no. 2, pp. 191–201, 2008.
[18] T. Kitahara, S. Mizuno, and K. Nakata, "An extension of a minimax approach to multiple classification," Journal of the Operations Research Society of Japan, vol. 50, no. 2, pp. 123–136, 2007.
[19] D. Klabjan, D. Simchi-Levi, and M. Song, "Robust stochastic lot-sizing by means of histograms," Production and Operations Management, vol. 22, no. 3, pp. 691–710, 2013.
[20] L. V. Utkin, "A framework for imprecise robust one-class classification models," International Journal of Machine Learning and Cybernetics, 2012.
[21] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.
[22] B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.
[23] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning, Springer, New York, NY, USA, 2001.
[24] L. A. Zadeh, "A simple view of the Dempster-Shafer theory of evidence and its implication for the rule of combination," AI Magazine, vol. 7, no. 2, pp. 85–90, 1986.
[25] R. Yager, M. Fedrizzi, and J. Kacprzyk, Advances in the Dempster-Shafer Theory of Evidence, John Wiley & Sons, New York, NY, USA, 1994.
[26] J. F. Sturm, "Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones," Optimization Methods and Software, vol. 11, no. 1, pp. 625–653, 1999.
[27] K. C. Toh, R. H. Tütüncü, and M. J. Todd, "On the implementation and usage of SDPT3: a Matlab software package for semidefinite-quadratic-linear programming, version 4.0," 2006, http://www.math.nus.edu.sg/~mattohkc/sdpt3/guide4-0-draft.pdf.
[28] A. Ben-Tal, D. den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen, "Robust solutions of optimization problems affected by uncertain probabilities," Management Science, vol. 59, no. 2, pp. 341–357, 2013.
approach has been further extended to incorporating otherfeatures El Ghaoui et al [13] propose a robust classificationmodel by minimizing the worst-case value of a given lossfunction over all possible choices of the data in a boundedhyperrectangles Three loss functions from SVM logisticregressions and minimax probability machines are studiedin [13] Based on the same assumption of known meanand covariance matrix [14 15] propose the biased minimaxprobability machines to address the biased classificationproblem and further generalize it to obtain the minimumerrorminimax probabilitymachinesHoi and Lyu [16] study aquadratic classifier with positive definite covariance matricesand further consider the problem of finding a convex set tocover known sampled data in one class while minimizing theworst-case misclassification probability The minimax prob-ability machines have also been extended to solve multipleclassification problems see [17 18]
In this paper we propose a robust probability classifier(RPC) based on the modified 120594
2-distance Specifically for agiven training set we first estimate the probability of eachsample belonging to each class based on a feature whichis called a nominal class-conditional distribution Then a 120598-confidence probability distributional set 119875
120598is constructed
based on the nominal class-conditional distributions and themodified 120594
2-distance where parameter 120598 controls the size ofthe constructed set Unlike the ldquoconditional independenceassumptionrdquo in NBC we introduce a ldquolinear combinationassumptionrdquo for the posterior class-conditional probabilitiesand the proposed classifier takes a linear combination formof these probabilities based on different features and it willassign the sample to the class with the maximal posteriorprobability To get a robust classifier we minimize the worst-case loss function value over all possible choices of class-conditional distributions over the distributional set 119875
120598 The
underlying assumption is that due to observational noiseswe cannot obtain the true probability distribution of eachclass but it can be well estimated by the nominal distributionsuch that it belongs to the distributional set119875
120598
Our two major contributions are as follows First inour model the proposed distributional set 119875
120598is based on
the nominal distribution and the modified 1205942-distance As
pointed in [19] such distributional set can make use of moreinformation conveyed in the training set compared withtraditional robust approaches which only use the informationofmean and covariancematrix To the best of our knowledgethis is among the first study of classification models con-sidering complex distribution information Although [20]considers a 120598-contaminated robust support vector machinemodel its distributional set is defined by easily handledlinear constraints and its analysis is highly dependent oncharacterization of the extreme points of this set Here ourproposed distributional set is defined by nonlinear quadraticfunction and is analyzed by the conic duality theorem Secondby taking the absolute error function as the loss functionwe show how to transform our robust minimax optimizationproblem into computable second order cone programmingThe absolute error function in the objective function alsodistinguishes our model from other existing models such
as the soft-margin support vector machine which uses theHinge loss function [21 22] and the logistic regression whichuses the negative log likelihood function [23] Note that theabsolute error function is essential in our model to obtaina tractable optimization problem for the proposed modelNumerical experiments on real-world application validatethe effectiveness of the proposed classifier and further showthat the proposed classifier also performs well for multipleclassification problems
The paper proceeds as follows Section 2 introduces theproposed robust minimax probability classifier based on themodified 120594
2-distance and discusses how to construct thedesired distributional set 119875
120598 Section 3 provides an equivalent
reformulation by handling the robust constraints and robustobjective separately Numerical experiments on real-worlddata set are carried out to validate the effectiveness of theproposed classifier in Section 4 Section 5 concludes thispaper and gives future research directions
2 Classifier Models
In this section a simple probability classifier is first presentedand then it is extended to handle data uncertainty byintroducing a distributional set 119875
120598 We also discuss how to
construct this distributional set based on training data setConsider a multiclass multifeature classification problem
in which each sample contains |119871| features and there are|119869| classes and |119868| samples Specifically given a training set(119883 119884) isin R|119868|times|119871|times0 1
|119868|times|119869| where 119909119894119897denotes the 119897th feature
of the 119894th sample and 119910119894119895
= 1 if the 119894th sample belongs to119895th class otherwise 119910
119894119895= 0 In the following context we
will also use the term 119909119894 to denote the 119894th sample that is
119909119894= (1199091198941
119909119894|119871|
)
21 Probability Classifier Bayes classifiers assign an observa-tion 119909 to the 119895
lowast(119909)th class which has the maximal posterior
probability that is
119895lowast
(119909) = arg max119895isin119869
119875 (119895 | 119909) (1)
and 119875(119895 | 119909) is the posterior probability function that isthe conditional probability that the sample belongs to the 119895thclass given that we know it has feature vector 119909
where 119875(119895) is the prior probability of the 119895th class 119875(119909 | 119895)
is the conditional probability for the 119894th class and 119875(119909) isthe probability that a sample has feature vector 119909 Note that119875(119909) is a constant if the values of the feature variables areknown and thus can be omitted To design an effective Bayesclassifier the key issue is estimating the class-conditionalprobability 119875(119909 | 119895) or the joint probability 119875(119909 119895) Theoreti-cally using the chain rule we have
However such estimating method leads to the problem ofldquodimension disasterrdquo
To address this issue the naive Bayes classifier makes thefollowing ldquoconditional independence assumptionrdquo
119875 (119909 | 119895) =
|119871|
prod
119897=1
119901119897
119895(119909) (4)
where 119901119897
119895(119909) = 119875(119909
119897| 119895) is the class-conditional probability
that the observation 119909 belongs to the 119895th class based on the119897th feature Here we introduce another ldquolinear combinationassumptionrdquo for the class-conditional probability
119875 (119909 | 119895) =
|119871|
sum
119897=1
120573119897
119895119901119897
119895(119909) (5)
where 120573119897
119895is a coefficient Compared with the ldquoconditional
independence assumptionrdquo which uses the probabilisticinformation in terms of multiplication the proposed ldquolinearcombination assumptionrdquo uses the probabilistic informationin terms of weighted sum We will further discuss therationality of this assumption at the end of this subsection
where 119871(sdot sdot) R timesR rarr 119877+is a prespecified loss function In
the following context we will take the absolute error functionas our loss function that is 119871(119909 119910) = |119909 minus 119910| In view ofits probability property it is straightforward to impose thefollowing constraints on the posterior probability
0 le 119891 (119895 | 119909119894) le 1 forall119894 isin 119868 119895 isin 119869 (8)
It is no doubt that the ldquolinear combination assumptionrdquomay not work sometimes However we justify the proposedclassifier by the following facts
(1) As an intuitive interpretation note that 119901119897
119895(119909) esti-
mates the probability of the observation 119909 belongingto the 119895th class only based on the 119897th feature thusit provides partial probabilistic information of thesample Hence we can interpret the weight 120572
119897
119895as
certain degree of trust on the information and in thissense the ldquolinear combination assumptionrdquo is a wayof combining evidence fromdifferent sources Similarideas can also be found in the theory of evidence seethe Dempster-Shafer theory [24 25]
(2) In terms of the classification performance in theworst case the proposed classifier may put all weighton one feature thus in such case it is equivalent toa Bayes classifier based on a well-selected feature Ifeach class has its ldquotypicalrdquo feature which can distin-guish it from other classes the proposed classifier hasthe ability to learn this property by putting differentweights on different features for different classes andthus provides better classification performance Areal-life application on lithology classification prob-lems also validates its classification performance bycomparison with support vector machines and thenaive Bayes classifier
(3) Another advantage of the proposed classifier is itshigh computability As we show in Section 3 the pro-posed classifier and its robust counterpart problemscan be reformulated as second order cone program-ming problems and thus can be solved by interioralgorithms in polynomial time
22 Robust Probability Classifier Due to observationalnoises the true class-conditional probability distribution isoften difficult to obtain Instead we can construct a confi-dence distributional set which contains the true distributionUnlike the traditional distributional sets in minimax prob-ability machines which only utilize mean and covariancematrix we construct our class-conditional probability distri-butional set based on the modified 120594
2-distance which usesmore information from the samples
4 Mathematical Problems in Engineering
The modified 1205942-distance 119889(sdot sdot) R119898 times R119898 rarr 119877 is
used tomeasure the distance between twodiscrete probabilitydistribution vectors in statistics For given 119901 = (119901
1 119901
119898)119879
and 119902 = (1199021 119902
119898)119879 it is defined as
119889 (119902 119901) =
119898
sum
119895=1
(119902119895
minus 119901119895)2
119901119895
(11)
Based on the modified 1205942-distance we present the following
119894119895is the nominal class-conditional distribution prob-
ability for the 119894th sample belonging to the 119895th class based onthe 119897th feature and the prespecified parameter 120598 is used tocontrol the size of the set
To design a robust classifier we need to consider the effectof data uncertainty on the objective function and constraintsThe robust objective function is to minimize the worst-case loss function value over all the possible distributionsin the distributional set 119875
120598 the robust constraints ensure
that all the original constraints should also be satisfied forany distribution in 119875
120598 Thus the robust probability classifier
problem is of the following form
(RPC) min
maxsum
119895isin119869
sum
119894isin119868
(1 minus 2119910119894119895
) sum
119897isin119871
120572119897
119895119902119897
119894119895
+ |119868| 119902119897
119894119895 isin 119875120598
st 0 le sum
119897isin119871
120572119897
119895119902119897
119894119895le 1 forall 119902
119897
119894119895 isin 119875120598
forall119894 119895
(13)
Note that the above optimization problem has an infinitenumber of robust constraints and its objective function is alsoan embedded subproblem We will show how to solve suchminimax optimization problem in Section 3
23 Construct the Distributional Set To get the distributionalset 119875120598 we need to define the parameter 120598 and the nominal
probability 119901119897
119894119895 The selection of parameter 120598 is application
based and we will discuss this issue in the numerical exper-iment section next we will provide a procedure to calculate119901119897
119894119895For the 119897th feature the following procedure takes an
integer 119870119897indicating the number of data intervals as an input
andwill output the estimated probability119901119897
119894119895of the 119894th sample
belonging to the 119895th class
(1) Sort samples in the increased order and divide theminto 119870
119897intervals such that each interval has at least
lfloor|119868|119870119897rfloor number of samples Denote the 119896th interval
by Δ119897119896
(2) Calculate the total number of samples in the 119895-class119873119895 the total number of samples in the 119896th interval
119873119897119896 and the total number of samples belonging to the
119895-class in the 119896th interval 119873119897119896119895
(3) For the 119894th sample if it falls into the 119896th interval the
Note that from the definition of 119875120598 we easily compute the
upper bound 119902119897
119894119895and lower bound 119902
119897
119894119895for the true class-
conditional probability 119902119897
119894119895as follows
119902119897
119894119895= max
119902119897
119894119895 sum
119904
119902119897
119894119904= 1
sum
119904isin119869
(119902119897
119894119904minus 119901119897
119894119904)2
119901119897
119894119904
le 120598 119902119897
119894119904ge 0 forall119904 isin 119869
(15)
119902119897
119894119895= min
119902119897
119894119895 sum
119904
119902119897
119894119904= 1
sum
119904isin119869
(119902119897
119894119904minus 119901119897
119894119904)2
119901119897
119894119904
le 120598 119902119897
119894119904ge 0 forall119904 isin 119869
(16)
The above problems can be efficiently solved by a secondorder cone solver such as SeDuMi [26] or SDPT3 [27]
3 Solution Methods for RPC
In this section we first reduce the infinite number of robustconstraints to a finite set of linear constraints and then trans-form the inner robust objective function into a minimizationproblem by the conic duality theorem At last we obtainan equivalent computable second order cone programmingfor the RPC problem The following analysis is based on thestrong duality result in [8]
Mathematical Problems in Engineering 5
Consider a conic program of the following form
(CP) min 119888119879119909
st 119860119894119909 minus 119887119894isin 119862119894 forall119894 = 1 119898
119860119909 = 119887
(17)
and its dual problem
(DP) max 119887119879119911 +
119898
sum
119894=1
119887119879
119894119910119894
st 119860lowast119911 +
119898
sum
119894=1
119860lowast
119894119910119894= 119888
119910119894isin 119862lowast
119894 forall119894 = 1 119898
(18)
where 119862119894is a cone in R119899119894 and 119862
lowast
119894is its dual cone defined by
119862lowast
119894= 119910 isin R
119899119894 119910119879119909 ge forall119909 isin 119862
119894 (19)
A conic program is called strictly feasible if it admits a feasible solution $x$ such that $A_i x - b_i \in \operatorname{int} C_i$, $\forall i = 1, \dots, m$, where $\operatorname{int} C_i$ denotes the interior of $C_i$.
Lemma 1 (see [8]). If one of the problems (CP) and (DP) is strictly feasible and bounded, then the other problem is solvable and (CP) = (DP), in the sense that both have the same optimal objective function value.
3.1. Robust Constraints. The following lemma provides an equivalent characterization for the infinite number of robust constraints in terms of a finite set of linear constraints, which can be handled efficiently.
Lemma 2. For given $i, j$, the robust constraint

$$0 \le \sum_{l \in L} \alpha^l_j q^l_{ij} \le 1, \quad \forall\, \{q^l_{ij}\} \in P_\epsilon \qquad (20)$$

is equivalent to the following constraints:

$$\begin{gathered} \sum_{l \in L} \big( \underline{q}^l_{ij}\, u^{l0}_{ij} - \overline{q}^l_{ij}\, v^{l0}_{ij} \big) \ge 0, \\ \alpha^l_j - u^{l0}_{ij} + v^{l0}_{ij} \ge 0, \quad u^{l0}_{ij},\, v^{l0}_{ij} \ge 0, \quad \forall l \in L, \\ 1 + \sum_{l \in L} \big( \underline{q}^l_{ij}\, u^{l1}_{ij} - \overline{q}^l_{ij}\, v^{l1}_{ij} \big) \ge 0, \\ v^{l1}_{ij} - \alpha^l_j - u^{l1}_{ij} \ge 0, \quad u^{l1}_{ij},\, v^{l1}_{ij} \ge 0, \quad \forall l \in L, \end{gathered} \qquad (21)$$

where $u^{l0}_{ij}, v^{l0}_{ij}, u^{l1}_{ij}, v^{l1}_{ij}$ are auxiliary dual variables.
Proof. First note that the distributional set $P_\epsilon$ can be represented as the Cartesian product of a series of projected subsets:

$$P_\epsilon = \prod_{i \in I} P_{\epsilon i}, \qquad (22)$$

where the projected subset on index $i$ is defined by

$$P_{\epsilon i} = \Big\{ \{q^l_{ij}\} : \sum_{j} q^l_{ij} = 1,\ q^l_{ij} \ge 0,\ \sum_{j \in J} \frac{(q^l_{ij} - p^l_{ij})^{2}}{p^l_{ij}} \le \epsilon,\ \forall l \in L,\ j \in J \Big\}. \qquad (23)$$
Then, for given $i, j$, since the robust constraint is only associated with the variables $q^l_{ij}$, $l \in L$, we can further split the projected subset $P_{\epsilon i}$ along index $j$ into the box set

$$P_{\epsilon ij} = \big\{ \{q^l_{ij}\}_{l \in L} : \underline{q}^l_{ij} \le q^l_{ij} \le \overline{q}^l_{ij},\ \forall l \in L \big\}, \qquad (24)$$

where $\overline{q}^l_{ij}$ and $\underline{q}^l_{ij}$ are computed by (15) and (16), respectively.

For the constraint $\sum_{l \in L} \alpha^l_j q^l_{ij} \ge 0$, $\forall\, \{q^l_{ij}\} \in P_\epsilon$, it is equivalent to the following chain:

$$\begin{aligned} & \sum_{l \in L} \alpha^l_j q^l_{ij} \ge 0, \quad \forall\, \{q^l_{ij}\} \in P_{\epsilon i} \\ \Longleftrightarrow\ & \sum_{l \in L} \alpha^l_j q^l_{ij} \ge 0, \quad \forall\, \{q^l_{ij}\} \in P_{\epsilon ij} \\ \Longleftrightarrow\ & \min\Big\{ \sum_{l \in L} \alpha^l_j q^l_{ij} : \underline{q}^l_{ij} \le q^l_{ij} \le \overline{q}^l_{ij},\ \forall l \in L \Big\} \ge 0 \\ \Longleftrightarrow\ & \max\Big\{ \sum_{l \in L} \big( \underline{q}^l_{ij} u^{l0}_{ij} - \overline{q}^l_{ij} v^{l0}_{ij} \big) : \alpha^l_j - u^{l0}_{ij} + v^{l0}_{ij} \ge 0,\ u^{l0}_{ij}, v^{l0}_{ij} \ge 0,\ \forall l \in L \Big\} \ge 0 \\ \Longleftrightarrow\ & \sum_{l \in L} \big( \underline{q}^l_{ij} u^{l0}_{ij} - \overline{q}^l_{ij} v^{l0}_{ij} \big) \ge 0, \quad \alpha^l_j - u^{l0}_{ij} + v^{l0}_{ij} \ge 0, \quad u^{l0}_{ij}, v^{l0}_{ij} \ge 0, \quad \forall l \in L, \end{aligned} \qquad (25)$$

where the last equivalence comes from the strong duality between these two linear programs.

For the constraint $\sum_{l \in L} \alpha^l_j q^l_{ij} \le 1$, $\forall\, \{q^l_{ij}\} \in P_\epsilon$, the same technique applies; thus we complete the proof.
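The strong-duality step above is ordinary linear programming duality. The following self-contained check (ours, for illustration only, with made-up data) confirms numerically that the boxed minimum and its dual maximum coincide.

import numpy as np
from scipy.optimize import linprog

# Numerical check of the duality step in Lemma 2:
# min { alpha^T q : q_lo <= q <= q_hi }  equals
# max { q_lo^T u - q_hi^T v : u - v <= alpha, u, v >= 0 }.
rng = np.random.default_rng(0)
L = 4
alpha = rng.normal(size=L)
q_lo = rng.uniform(0.0, 0.3, size=L)
q_hi = q_lo + rng.uniform(0.1, 0.5, size=L)

# Primal: minimize alpha^T q over the box [q_lo, q_hi].
primal = linprog(c=alpha, bounds=list(zip(q_lo, q_hi)))

# Dual: linprog minimizes, so negate the objective; A_ub x <= b_ub encodes u - v <= alpha.
c = np.concatenate([-q_lo, q_hi])
A_ub = np.hstack([np.eye(L), -np.eye(L)])
dual = linprog(c=c, A_ub=A_ub, b_ub=alpha, bounds=[(0, None)] * (2 * L))

print(primal.fun, -dual.fun)   # the two optimal values coincide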
3.2. Robust Objective Function. In the RPC problem, the robust objective function is defined by an inner maximization problem. The following proposition shows that it can be transformed into a minimization problem over second order cones. To prove the result, we utilize the conjugate function $d^{*}$ of the modified $\chi^{2}$-distance kernel $d(t) = (t-1)^{2}$:

$$d^{*}(s) = \sup_{t \ge 0} \{ st - d(t) \} = \frac{[s+2]_{+}^{2}}{4} - 1, \qquad (26)$$
where the function $[\cdot]_{+}$ is defined as $[x]_{+} = x$ if $x \ge 0$ and $[x]_{+} = 0$ otherwise. For more details about conjugate functions, see [28].
Proposition 3. The inner maximization problem in (30) can be reformulated as an equivalent minimization problem over second order cones, where a second order cone $L^{n+1}$ is defined as

$$L^{n+1} = \Big\{ x \in \mathbb{R}^{n+1} : x_{n+1} \ge \sqrt{ \sum_{i=1}^{n} x_i^{2} } \Big\}. \qquad (29)$$
Proof. For a given feasible $\alpha$ satisfying the robust constraints, it is straightforward to show that the inner maximization problem is equal to the following minimization problem (MP):

$$\text{(MP)}\qquad \begin{aligned} \min\ & t \\ \text{s.t.}\ & t \ge \sum_{j \in J} \sum_{i \in I} (1 - 2y_{ij}) \sum_{l \in L} \alpha^l_j q^l_{ij} + |I|, \quad \forall\, \{q^l_{ij}\} \in P_\epsilon. \end{aligned} \qquad (30)$$
The above constraint can be further reduced to the following constraint:

$$\max\Big\{ \sum_{j \in J} \sum_{i \in I} (1 - 2y_{ij}) \sum_{l \in L} \alpha^l_j q^l_{ij} + |I| - t : \{q^l_{ij}\} \in P_\epsilon \Big\} \le 0. \qquad (31)$$
By assigning Lagrange multipliers $\theta^l_i \in \mathbb{R}$ and $\lambda^l_i \in \mathbb{R}_+$ to the constraints in the left optimization problem, we obtain the following Lagrange function:

$$L(q, \theta, \lambda) = \sum_{i \in I} \sum_{l \in L} (\epsilon \lambda^l_i - \theta^l_i) + \sum_{i \in I} \sum_{l \in L} \sum_{j \in J} \Big( r^l_{ij} q^l_{ij} - \lambda^l_i \frac{(q^l_{ij} - p^l_{ij})^{2}}{p^l_{ij}} \Big) + |I| - t, \qquad (32)$$

where $r^l_{ij} = \alpha^l_j (1 - 2y_{ij}) + \theta^l_i$. Its dual function is given as

$$\begin{aligned} D(\theta, \lambda) &= \max_{q \ge 0} L(q, \theta, \lambda) \\ &= \sum_{i \in I} \sum_{l \in L} (\epsilon \lambda^l_i - \theta^l_i) + \sum_{i \in I} \sum_{l \in L} \sum_{j \in J} \max_{q^l_{ij} \ge 0} \Big( r^l_{ij} q^l_{ij} - \lambda^l_i p^l_{ij} \Big( \frac{q^l_{ij} - p^l_{ij}}{p^l_{ij}} \Big)^{2} \Big) + |I| - t \\ &= \sum_{i \in I} \sum_{l \in L} (\epsilon \lambda^l_i - \theta^l_i) + \sum_{i \in I} \sum_{l \in L} \sum_{j \in J} p^l_{ij} \max_{\tau \ge 0} \big( r^l_{ij} \tau - \lambda^l_i (\tau - 1)^{2} \big) + |I| - t \\ &= \sum_{i \in I} \sum_{l \in L} (\epsilon \lambda^l_i - \theta^l_i) + \sum_{i \in I} \sum_{l \in L} \sum_{j \in J} p^l_{ij} \lambda^l_i \max_{\tau \ge 0} \Big( \frac{r^l_{ij}}{\lambda^l_i} \tau - (\tau - 1)^{2} \Big) + |I| - t \\ &= \sum_{i \in I} \sum_{l \in L} (\epsilon \lambda^l_i - \theta^l_i) + \sum_{i \in I} \sum_{l \in L} \sum_{j \in J} p^l_{ij} \lambda^l_i\, d^{*}\Big( \frac{r^l_{ij}}{\lambda^l_i} \Big) + |I| - t, \end{aligned} \qquad (33)$$

where the change of variable $\tau = q^l_{ij} / p^l_{ij}$ is used.
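To connect the conjugate terms in (33) with the cone $L^{n+1}$ from (29), note (our sketch; the intermediate equations of the proposition are lost in this transcript) that each term has the explicit form below, and quadratic-over-linear epigraphs of this kind are exactly rotated second order cone constraints:

$$p^l_{ij} \lambda^l_i\, d^{*}\!\Big( \frac{r^l_{ij}}{\lambda^l_i} \Big) = \frac{p^l_{ij} \big[ r^l_{ij} + 2\lambda^l_i \big]_{+}^{2}}{4 \lambda^l_i} - p^l_{ij} \lambda^l_i, \qquad \frac{z^{2}}{4\lambda} \le w,\ \lambda \ge 0 \ \Longleftrightarrow\ \left\| \begin{pmatrix} z \\ w - \lambda \end{pmatrix} \right\|_2 \le w + \lambda.$$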
Note that for any feasible $\alpha$, the primal maximization problem (31) is bounded and has a strictly feasible solution $\{p^l_{ij}\}$; thus there is no duality gap between (31) and its dual problem, which completes the proof.

4. Numerical Experiments

In this section, numerical experiments on real-world applications are carried out to verify the effectiveness of the proposed robust probability classifier model. Specifically, we consider lithology classification data sets from our practical application. We compare our model with the regularized SVM (RSVM) and the naive Bayes classifier (NBC) on both binary and multiple classification problems.
All the numerical experiments are implemented in Matlab 7.7.0 and run on an Intel(R) Core(TM) i5-4570 CPU. The SDPT3 solver [27] is called to solve the second order cone programs in our proposed method and the regularized SVM.
4.1. Data Sets. Lithology classification is one of the basic tasks in geological investigation. To discriminate the lithology of the underground strata, various electromagnetic techniques are applied to the same strata to obtain different features, such as Gamma coefficients, acoustic wave, striation density, and fusibility.
Here, numerical experiments are carried out on a series of data sets from the boreholes T1, Y4, Y5, and Y6. All boreholes are located in the Tarim Basin, China. In total, there are 12 data sets used for binary classification problems and 8 data sets used for multiple classification problems. For each data set, based on a prespecified training rate $\gamma \in [0, 1]$, it is randomly partitioned into two subsets, a training set and a test set, such that the size of the training set accounts for $\gamma$ of the total number of samples.
4.2. Experiment Design. The parameters in our models are chosen based on the size of the data set. The parameter $\epsilon$ depends on the number of classes and is defined as $\epsilon = \delta^{2} / |J|$, where $\delta \in (0, 1)$. The choice of $\epsilon$ can be explained in this way: if there are $|J|$ classes and the training data are uniformly distributed, then for each probability $p^l_{ij} = 1/|J|$, its maximal variation range is between $p^l_{ij}(1 - \delta)$ and $p^l_{ij}(1 + \delta)$. The number of data intervals $K_l$ is defined as $K_l = |I| / (|J| \times K)$, such that if the training data are uniformly distributed, then each data interval contains $K$ samples from each class. In the following context, we set $\delta = 0.2$ and $K = 8$.
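For concreteness, these parameter rules can be written out as a small helper (ours, not from the paper):

# Parameter choices from Section 4.2 (delta = 0.2, K = 8), written out for clarity.
def rpc_parameters(num_samples, num_classes, delta=0.2, K=8):
    eps = delta ** 2 / num_classes          # epsilon = delta^2 / |J|
    K_l = num_samples // (num_classes * K)  # K_l = |I| / (|J| * K) data intervals
    return eps, max(K_l, 1)

print(rpc_parameters(num_samples=600, num_classes=3))   # -> (0.0133..., 25)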
We compare the performance of the proposed RPC model with the following regularized support vector machine model [6] (taking the $j$th class as an example):

$$\text{(RSVM)}\qquad \begin{aligned} \min\ & \sum_{i \in I} \xi_{ij} + \lambda_j \big\| w_j \big\| \\ \text{s.t.}\ & \tilde{y}_{ij} \Big( \sum_{l \in L} w^l_j x^l_i + b_j \Big) \ge 1 - \xi_{ij}, \quad i \in I, \\ & \xi_{ij} \ge 0, \quad i \in I, \end{aligned} \qquad (38)$$

where $\tilde{y}_{ij} = 2y_{ij} - 1$ and $\lambda_j \ge 0$ is a regularization parameter. As pointed out by [8], $\lambda_j$ represents a trade-off between the number of training set errors and the amount of robustness
Table 1: Performances of RSVM, NBC, and RPC for binary classification problems on the Y5 data set (columns: tr (%); Train (%)/Test (%) for each of RSVM, NBC, and RPC).
with respect to spherical perturbations of the data points. To make a fair comparison, in the following experiments we test a series of $\lambda$ values and choose the one with the best performance. Note that if $\lambda_j = 0$, we refer to this model as the classic support vector machine (SVM). See also [6] for more details on RSVM and its applications to multiple classification problems.
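For reference, the RSVM model (38) for a single class can be prototyped directly; the sketch below uses cvxpy as a stand-in for the SDPT3 setup described above, with hypothetical function and variable names and synthetic data of our own.

import cvxpy as cp
import numpy as np

# Sketch of model (38) for one class j. X: |I| x |L| feature matrix;
# y_pm: labels 2*y_ij - 1 in {-1, +1}; lam: regularization weight lambda_j.
def train_rsvm(X, y_pm, lam):
    n, d = X.shape
    w, b = cp.Variable(d), cp.Variable()
    xi = cp.Variable(n, nonneg=True)                     # slack variables xi_ij
    constraints = [cp.multiply(y_pm, X @ w + b) >= 1 - xi]
    objective = cp.Minimize(cp.sum(xi) + lam * cp.norm(w, 2))
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y_pm = np.sign(X[:, 0] + 0.1 * rng.normal(size=40))      # toy separable-ish labels
w, b = train_rsvm(X, y_pm, lam=1.0)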
4.3. Test on Binary Classification. In this subsection, RSVM, NBC, and RPC are implemented on 12 data sets for binary classification problems using cross-validation methods. To improve the performance of RSVM, we transform the original data by the popularly used polynomial kernels [6].
Tables 1 and 2 show the averaged classification performances of RSVM, NBC, and the proposed RPC (over 10 randomly generated instances) for binary classification problems on the Y5 and T1 data sets, respectively. For each data set, we randomly partition it into a training set and a test set based on the parameter tr, which varies from 0.5 to 0.9. The highest classification accuracy on a training set among the three methods is highlighted in bold, while the best classification accuracy on a test set is marked with an asterisk.
Tables 1 and 2 validate the effectiveness of the proposed RPC for binary classification problems compared with NBC and RSVM. Specifically, for most of the cases RSVM has the highest classification accuracy on training sets, but its performance on test sets is unsatisfactory. For most of the cases, the proposed RPC provides the highest classification accuracy on test sets. NBC provides better performance on test sets as the training rate increases. The experimental results also show that, for a given training rate, RPC can provide better performance on test sets than on training sets; thus it can avoid the "overlearning" phenomenon.

To further validate the effectiveness of the proposed RPC, we test it on 10 additional data sets, that is, T41–T45 and T61–T65. Table 3 reports the averaged performances of the three methods over 10 randomly generated instances when the training rate is set to 70%. Except for data sets T45, T63, and T64, RPC provides the highest accuracy on the test sets, and for all the data sets its accuracy is higher than 80%. As shown in Tables 1 and 2, the robustness of the proposed RPC guarantees its scalability on the test sets.
4.4. Test on Multiple Classification. In this subsection, we test the performance of RPC on multiple classification problems by comparison with RSVM and NBC. Since the performance of RSVM is determined by its regularization parameter $\lambda$, we run a set of RSVMs with $\lambda$ varying from 0 to a sufficiently large number and select the one with the best performance on test sets.
Figures 1 and 3 plot the performances of the three methods on the Y5 and T1 training sets, respectively. Unlike the case of binary classification problems, we can see that RPC provides a competitive performance even on the training sets. One explanation is that RSVM can outperform the proposed RPC on training sets by finding the optimal separating hyperplane
Table 3: Performances of RSVM, NBC, and RPC for binary classification problems on other data sets when tr = 70% (columns: data set; Train (%)/Test (%) for each of RSVM, NBC, and RPC).
Figure 1: Performances of RSVM, NBC, and RPC on the Y5 training set.
for binary classification problems, while RPC is more robust to extend to multiple classification problems, since it uses the nonlinear probability information of the data sets. The accuracy of NBC on the training sets also improves as the training rate increases.
Figures 2 and 4 show the performances of the three methods on the Y5 and T1 test sets, respectively. We can see that for most
Figure 2: Performances of RSVM, NBC, and RPC on the Y5 test set (accuracy on the test set (%) versus training rate; curves: RSVM, NBC, RPC).
of the cases, RPC provides the highest accuracy among the three methods. The accuracy of RSVM outperforms that of NBC on the Y5 test set, while the latter outperforms the former on the T1 test set.

To further test the performance of RPC on multiple classification problems, we carry out more experiments on data sets M1–M6. Table 4 reports the averaged performances of the three methods on these data sets when the training rate is set to 70%. Except for the M5 data set, RPC always
Figure 3: Performances of RSVM, NBC, and RPC on the T1 training set (accuracy on the training set (%) versus training rate; curves: RSVM, NBC, RPC).
Figure 4: Performances of RSVM, NBC, and RPC on the T1 test set (accuracy on the test set (%) versus training rate; curves: RSVM, NBC, RPC).
provides the highest classification performance among the three methods, and even for the M5 data set its accuracy (88.0%) is very close to the best one (88.1%).

From the tested real-life application, we conclude that the proposed RPC has the robustness to provide better performance for both binary and multiple classification problems compared with RSVM and NBC. The robustness of RPC enables it to avoid the "overlearning" phenomenon, especially for binary classification problems.
5. Conclusion
In this paper, we propose a robust probability classifier model to address data uncertainty in classification problems. To quantitatively describe the data uncertainty, a class-conditional distributional set is constructed based on the modified $\chi^{2}$-distance. We assume that the true distribution lies in the constructed distributional set centered at the nominal probability distribution. Based on the "linear combination assumption" for the posterior class-conditional probabilities, we consider a classification criterion using the weighted sum of the posterior probabilities. The optimal robust probability classifier is determined by minimizing the worst-case absolute error value over all possible distributions belonging to the distributional set.

Our proposed model introduces the recently developed distributionally robust optimization method into classifier design problems. To obtain a computable model, we transform the resulting optimization problem into an equivalent second order cone program based on the conic duality theorem. Thus our model has the same computational complexity as the classic support vector machine, and numerical experiments on a real-life application validate its effectiveness. On the one hand, the proposed robust probability classifier provides higher accuracy compared with RSVM and NBC by avoiding overlearning on training sets for binary classification problems; on the other hand, it also has promising performance on multiple classification problems.

There are still many important extensions of our model. Other forms of loss function, such as the mean squared error function and the hinge loss function, should be studied to obtain tractable reformulations, and the resulting models may provide better performances. Probability models considering joint probability distribution information are also an interesting research direction.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
[1] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, New York, NY, USA, 1973.
[2] P. Langley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in Proceedings of the 10th National Conference on Artificial Intelligence (AAAI '92), vol. 90, pp. 223–228, AAAI Press, Menlo Park, Calif, USA, July 1992.
[3] B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, UK, 2007.
[4] V. Vapnik, The Nature of Statistical Learning Theory, Springer, Berlin, Germany, 2000.
[5] M. Ramoni and P. Sebastiani, "Robust Bayes classifiers," Artificial Intelligence, vol. 125, no. 1-2, pp. 209–226, 2001.
[6] Y. Shi, Y. Tian, G. Kou, and Y. Peng, "Robust support vector machines," in Optimization Based Data Mining: Theory and Applications, Springer, London, UK, 2011.
[7] Y. Z. Wang, Y. L. Zhang, F. L. Zhang, and J. N. Yi, "Robust quadratic regression and its application to energy-growth consumption problem," Mathematical Problems in Engineering, vol. 2013, Article ID 210510, 10 pages, 2013.
[8] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski, Robust Optimization, Princeton University Press, Princeton, NJ, USA, 2009.
[9] A. Ben-Tal and A. Nemirovski, "Robust optimization—methodology and applications," Mathematical Programming, vol. 92, no. 3, pp. 453–480, 2002.
[10] D. Bertsimas, D. B. Brown, and C. Caramanis, "Theory and applications of robust optimization," SIAM Review, vol. 53, no. 3, pp. 464–501, 2011.
[11] G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan, "Minimax probability machine," in Advances in Neural Information Processing Systems, pp. 801–807, 2001.
[12] G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan, "A robust minimax approach to classification," Journal of Machine Learning Research, vol. 3, no. 3, pp. 555–582, 2003.
[13] L. El Ghaoui, G. R. G. Lanckriet, and G. Natsoulis, "Robust classification with interval data," Tech. Rep. UCB/CSD-03-1279, Computer Science Division, University of California, 2003.
[14] K. Huang, H. Yang, I. King, and M. R. Lyu, "Learning classifiers from imbalanced data based on biased minimax probability machine," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. 558–563, IEEE, July 2004.
[15] K. Huang, H. Yang, I. King, M. R. Lyu, and L. Chan, "The minimum error minimax probability machine," The Journal of Machine Learning Research, vol. 5, pp. 1253–1286, 2004.
[16] C.-H. Hoi and M. R. Lyu, "Robust face recognition using minimax probability machine," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '04), vol. 2, pp. 1175–1178, June 2004.
[17] T. Kitahara, S. Mizuno, and K. Nakata, "Quadratic and convex minimax classification problems," Journal of the Operations Research Society of Japan, vol. 51, no. 2, pp. 191–201, 2008.
[18] T. Kitahara, S. Mizuno, and K. Nakata, "An extension of a minimax approach to multiple classification," Journal of the Operations Research Society of Japan, vol. 50, no. 2, pp. 123–136, 2007.
[19] D. Klabjan, D. Simchi-Levi, and M. Song, "Robust stochastic lot-sizing by means of histograms," Production and Operations Management, vol. 22, no. 3, pp. 691–710, 2013.
[20] L. V. Utkin, "A framework for imprecise robust one-class classification models," International Journal of Machine Learning and Cybernetics, 2012.
[21] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.
[22] B. Scholkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.
[23] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning, Springer, New York, NY, USA, 2001.
[24] L. A. Zadeh, "A simple view of the Dempster-Shafer theory of evidence and its implication for the rule of combination," AI Magazine, vol. 7, no. 2, pp. 85–90, 1986.
[25] R. Yager, M. Fedrizzi, and J. Kacprzyk, Advances in the Dempster-Shafer Theory of Evidence, John Wiley & Sons, New York, NY, USA, 1994.
[26] J. F. Sturm, "Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones," Optimization Methods and Software, vol. 11, no. 1, pp. 625–653, 1999.
[27] K. C. Toh, R. H. Tütüncü, and M. J. Todd, "On the implementation and usage of SDPT3—a Matlab software package for semidefinite-quadratic-linear programming, version 4.0," 2006, http://www.math.nus.edu.sg/~mattohkc/sdpt3/guide4-0-draft.pdf.
[28] A. Ben-Tal, D. den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen, "Robust solutions of optimization problems affected by uncertain probabilities," Management Science, vol. 59, no. 2, pp. 341–357, 2013.
However such estimating method leads to the problem ofldquodimension disasterrdquo
To address this issue the naive Bayes classifier makes thefollowing ldquoconditional independence assumptionrdquo
119875 (119909 | 119895) =
|119871|
prod
119897=1
119901119897
119895(119909) (4)
where 119901119897
119895(119909) = 119875(119909
119897| 119895) is the class-conditional probability
that the observation 119909 belongs to the 119895th class based on the119897th feature Here we introduce another ldquolinear combinationassumptionrdquo for the class-conditional probability
119875 (119909 | 119895) =
|119871|
sum
119897=1
120573119897
119895119901119897
119895(119909) (5)
where 120573119897
119895is a coefficient Compared with the ldquoconditional
independence assumptionrdquo which uses the probabilisticinformation in terms of multiplication the proposed ldquolinearcombination assumptionrdquo uses the probabilistic informationin terms of weighted sum We will further discuss therationality of this assumption at the end of this subsection
where 119871(sdot sdot) R timesR rarr 119877+is a prespecified loss function In
the following context we will take the absolute error functionas our loss function that is 119871(119909 119910) = |119909 minus 119910| In view ofits probability property it is straightforward to impose thefollowing constraints on the posterior probability
0 le 119891 (119895 | 119909119894) le 1 forall119894 isin 119868 119895 isin 119869 (8)
It is no doubt that the ldquolinear combination assumptionrdquomay not work sometimes However we justify the proposedclassifier by the following facts
(1) As an intuitive interpretation note that 119901119897
119895(119909) esti-
mates the probability of the observation 119909 belongingto the 119895th class only based on the 119897th feature thusit provides partial probabilistic information of thesample Hence we can interpret the weight 120572
119897
119895as
certain degree of trust on the information and in thissense the ldquolinear combination assumptionrdquo is a wayof combining evidence fromdifferent sources Similarideas can also be found in the theory of evidence seethe Dempster-Shafer theory [24 25]
(2) In terms of the classification performance in theworst case the proposed classifier may put all weighton one feature thus in such case it is equivalent toa Bayes classifier based on a well-selected feature Ifeach class has its ldquotypicalrdquo feature which can distin-guish it from other classes the proposed classifier hasthe ability to learn this property by putting differentweights on different features for different classes andthus provides better classification performance Areal-life application on lithology classification prob-lems also validates its classification performance bycomparison with support vector machines and thenaive Bayes classifier
(3) Another advantage of the proposed classifier is itshigh computability As we show in Section 3 the pro-posed classifier and its robust counterpart problemscan be reformulated as second order cone program-ming problems and thus can be solved by interioralgorithms in polynomial time
22 Robust Probability Classifier Due to observationalnoises the true class-conditional probability distribution isoften difficult to obtain Instead we can construct a confi-dence distributional set which contains the true distributionUnlike the traditional distributional sets in minimax prob-ability machines which only utilize mean and covariancematrix we construct our class-conditional probability distri-butional set based on the modified 120594
2-distance which usesmore information from the samples
4 Mathematical Problems in Engineering
The modified 1205942-distance 119889(sdot sdot) R119898 times R119898 rarr 119877 is
used tomeasure the distance between twodiscrete probabilitydistribution vectors in statistics For given 119901 = (119901
1 119901
119898)119879
and 119902 = (1199021 119902
119898)119879 it is defined as
119889 (119902 119901) =
119898
sum
119895=1
(119902119895
minus 119901119895)2
119901119895
(11)
Based on the modified 1205942-distance we present the following
119894119895is the nominal class-conditional distribution prob-
ability for the 119894th sample belonging to the 119895th class based onthe 119897th feature and the prespecified parameter 120598 is used tocontrol the size of the set
To design a robust classifier we need to consider the effectof data uncertainty on the objective function and constraintsThe robust objective function is to minimize the worst-case loss function value over all the possible distributionsin the distributional set 119875
120598 the robust constraints ensure
that all the original constraints should also be satisfied forany distribution in 119875
120598 Thus the robust probability classifier
problem is of the following form
(RPC) min
maxsum
119895isin119869
sum
119894isin119868
(1 minus 2119910119894119895
) sum
119897isin119871
120572119897
119895119902119897
119894119895
+ |119868| 119902119897
119894119895 isin 119875120598
st 0 le sum
119897isin119871
120572119897
119895119902119897
119894119895le 1 forall 119902
119897
119894119895 isin 119875120598
forall119894 119895
(13)
Note that the above optimization problem has an infinitenumber of robust constraints and its objective function is alsoan embedded subproblem We will show how to solve suchminimax optimization problem in Section 3
23 Construct the Distributional Set To get the distributionalset 119875120598 we need to define the parameter 120598 and the nominal
probability 119901119897
119894119895 The selection of parameter 120598 is application
based and we will discuss this issue in the numerical exper-iment section next we will provide a procedure to calculate119901119897
119894119895For the 119897th feature the following procedure takes an
integer 119870119897indicating the number of data intervals as an input
andwill output the estimated probability119901119897
119894119895of the 119894th sample
belonging to the 119895th class
(1) Sort samples in the increased order and divide theminto 119870
119897intervals such that each interval has at least
lfloor|119868|119870119897rfloor number of samples Denote the 119896th interval
by Δ119897119896
(2) Calculate the total number of samples in the 119895-class119873119895 the total number of samples in the 119896th interval
119873119897119896 and the total number of samples belonging to the
119895-class in the 119896th interval 119873119897119896119895
(3) For the 119894th sample if it falls into the 119896th interval the
Note that from the definition of 119875120598 we easily compute the
upper bound 119902119897
119894119895and lower bound 119902
119897
119894119895for the true class-
conditional probability 119902119897
119894119895as follows
119902119897
119894119895= max
119902119897
119894119895 sum
119904
119902119897
119894119904= 1
sum
119904isin119869
(119902119897
119894119904minus 119901119897
119894119904)2
119901119897
119894119904
le 120598 119902119897
119894119904ge 0 forall119904 isin 119869
(15)
119902119897
119894119895= min
119902119897
119894119895 sum
119904
119902119897
119894119904= 1
sum
119904isin119869
(119902119897
119894119904minus 119901119897
119894119904)2
119901119897
119894119904
le 120598 119902119897
119894119904ge 0 forall119904 isin 119869
(16)
The above problems can be efficiently solved by a secondorder cone solver such as SeDuMi [26] or SDPT3 [27]
3 Solution Methods for RPC
In this section we first reduce the infinite number of robustconstraints to a finite set of linear constraints and then trans-form the inner robust objective function into a minimizationproblem by the conic duality theorem At last we obtainan equivalent computable second order cone programmingfor the RPC problem The following analysis is based on thestrong duality result in [8]
Mathematical Problems in Engineering 5
Consider a conic program of the following form
(CP) min 119888119879119909
st 119860119894119909 minus 119887119894isin 119862119894 forall119894 = 1 119898
119860119909 = 119887
(17)
and its dual problem
(DP) max 119887119879119911 +
119898
sum
119894=1
119887119879
119894119910119894
st 119860lowast119911 +
119898
sum
119894=1
119860lowast
119894119910119894= 119888
119910119894isin 119862lowast
119894 forall119894 = 1 119898
(18)
where 119862119894is a cone in R119899119894 and 119862
lowast
119894is its dual cone defined by
119862lowast
119894= 119910 isin R
119899119894 119910119879119909 ge forall119909 isin 119862
119894 (19)
A conic program is called strictly feasible if it admits a feasiblesolution 119909 such that 119860
119894119909 minus 119887119894
isin int119862119894 forall119894 = 1 119898 where
int119862119894denotes the interior point set of 119862
119894
Lemma 1 (see [8]) If one of the problems (CP) and (DP) isstrictly feasible and bounded then the other problem is solvableand (CP) = (DP) in the sense that both have the same optimalobjective function value
31 Robust Constraints The following lemma provides anequivalent characterization for the infinite number of robustconstraints in terms of a finite set of linear constraints whichcan be solved efficiently
Lemma 2 For given 119894 119895 the robust constraint
0 le sum
119897isin119871
120572119897
119895119901119897
119894119895le 1 forall 119902
119897
119894119895 isin 119875120598 (20)
is equal to the following constraints
sum
119897isin119871
(119902119897
1198941198951199061198970
119894119895minus 119902119897
119894119895V1198970119894119895
) ge 0
120572119897
119894119895minus 1199061198970
119894119895+ V1198970119894119895
ge 0 1199061198970
119894119895 V1198970119894119895
ge 0 forall119897 isin 119871
1 + sum
119897isin119871
(119902119897
1198941198951199061198971
119894119895minus 119902119897
119894119895V1198971119894119895
) ge 0
V1198971119894119895
minus 120572119897
119894119895minus 1199061198971
119894119895ge 0 119906
1198971
119894119895 V1198971119894119895
ge 0 forall119897 isin 119871
(21)
Proof First note that the distributional set 119875120598119894can be repre-
sented as theCartesian product of a series of projected subsets
119875120598
= prod
119894isin119868
119875120598119894
(22)
where the projected subset on index 119894 is defined by
119875120598119894
=
119902119897
119894119895 sum
119895
119902119897
119894119895= 1 119902119897
119894119895ge 0
sum
119895isin119869
(119902119897
119894119895minus 119901119897
119894119895)2
119901119897
119894119895
le 120598 forall119897 isin 119871 119895 isin 119869
(23)
Then for given 119894 119895 since the robust constraint is onlyassociated with variables 119902
119897
119894119895 119897 isin 119871 we can further split the
119894119895are computed by (15) and (16) respectively
For constraint sum119897isin119871
120572119897
119895119901119897
119894119895ge 0 forall119902
119897
119894119895 isin 119875120598 it is equal to
the following constraint
sum
119897isin119871
120572119897
119895119901119897
119894119895ge 0 forall 119902
119897
119894119895 isin 119875120598119894
lArrrArr sum
119897isin119871
120572119897
119895119901119897
119894119895ge 0 forall 119902
119897
119894119895 isin 119875120598119894119895
lArrrArr minsum
119897isin119871
120572119897
119895119901119897
119894119895 119902119897
119894119895le 119902119897
119894119895le 119902119897
119894119895 forall119897 isin 119871 ge 0
lArrrArr maxsum
119897isin119871
(119902119897
1198941198951199061198970
119894119895minus 119902119897
119894119895V1198970119894119895
)
120572119897
119894119895minus 1199061198970
119894119895+ V1198970119894119895
ge 0 1199061198970
119894119895 V1198970119894119895
ge 0 forall119897 isin 119871 ge 0
lArrrArr sum
119897isin119871
(119902119897
1198941198951199061198970
119894119895minus 119902119897
119894119895V1198970119894119895
) ge 0
120572119897
119894119895minus 1199061198970
119894119895+ V1198970119894119895
ge 0 1199061198970
119894119895 V1198970119894119895
ge 0 forall119897 isin 119871
(25)
where the last equivalence comes from the strong dualitybetween these two linear programs
For the constraint sum119897isin119871
120572119897
119895119901119897
119894119895le 1 forall119902
119897
119894119895 isin 119875120598 the same
technique applies thus we complete the proof
32 Robust Objective Function In the RPC problem therobust objective function is defined by an innermaximizationproblem The following proposition shows that it can betransformed into a minimization problem over second ordercones To prove the following result we utilize the concept ofconjugate function 119889
lowast of the modified 1205942-distance
119889lowast
(119904) = sup119905ge0
119904119905 minus 119889 (119905) =[119904 + 2]
2
+
4minus 1 (26)
6 Mathematical Problems in Engineering
where the function [sdot]+is defined as [119909]
+= 119909 if 119909 ge
0 otherwise [119909]+
= 0 For more details about conjugatefunctions see [28]
Proposition 3 The following inner maximization problem
where a second order cone 119871119899+1 is defined as
119871119899+1
=
119909 isin R119899+1
119909119899+1
ge radic
119899
sum
119894=1
1199092
119894
(29)
Proof For given feasible 120572 satisfying the robust constraints itis straightforward to show that the inner maximum problemis equal to the following minimization problem (MP)
(MP) min 119905
st 119905 ge sum
119895isin119869
sum
119894isin119868
(1 minus 2119910119894119895
) sum
119897isin119871
120572119897
119895119902119897
119894119895 + |119868|
forall 119902119897
119894119895 isin 119875120598
(30)
The above constraint can be further reduced to the followingconstraint
max
sum
119895isin119869
sum
119894isin119868
(1 minus 2119910119894119895
) sum
119897isin119871
120572119897
119895119902119897
119894119895
+ |119868| minus 119905 forall 119902119897
119894119895 isin 119875120598 le 0
(31)
By assigning Lagrange multipliers 120579119897
119894isin R and 120582
119897
119894isin R+
to the constraints in the left optimization problem we obtainthe following Lagrange function
119871 (119902 120579 120582) = sum
119894isin119868
sum
119897isin119871
(120598120582119897
119894minus 120579119897
119894)
+ sum
119894isin119868
sum
119897isin119871
sum
119895isin119869
(119903119897
119894119895119902119897
119894119895minus 120582119897
119894
(119902119897
119894119895minus 119901119897
119894119895)2
119901119897
119894119895
)
+ |119868| minus 119905
(32)
where 119903119897
119894119895= 120572119897
119895(1 minus 2119910
119894119895) + 120579119897
119894 Its dual function is given as
119863 (120579 120582) = max119902ge0
119871 (119905 119902 120579 120582)
= sum
119894isin119868
sum
119897isin119871
(120598120582119897
119894minus 120579119897
119894)
+ sum
119894isin119868
sum
119897isin119871
sum
119895isin119869
max119902119897
119894119895ge0
(119903119897
119894119895119902119897
119894119895minus 120582119897
119894119901119897
119894119895(
119902119897
119894119895minus 119901119897
119894119895
119901119897
119894119895
)
2
)
+ |119868| minus 119905
= sum
119894isin119868
sum
119897isin119871
(120598120582119897
119894minus 120579119897
119894)
+ sum
119894isin119868
sum
119897isin119871
sum
119895isin119869
119901119897
119894119895max119905ge0
(119903119897
119894119895119905 minus 120582119897
119894(119905 minus 1)
2) + |119868| minus 119905
= sum
119894isin119868
sum
119897isin119871
(120598120582119897
119894minus 120579119897
119894)
+ sum
119894isin119868
sum
119897isin119871
sum
119895isin119869
119901119897
119894119895120582119897
119894max119905ge0
(
119903119897
119894119895
120582119897
119894
119905 minus (119905 minus 1)2) + |119868| minus 119905
= sum
119894isin119868
sum
119897isin119871
(120598120582119897
119894minus 120579119897
119894)
+ sum
119894isin119868
sum
119897isin119871
sum
119895isin119869
119901119897
119894119895120582119897
119894119889lowast
(
119903119897
119894119895
120582119897
119894
) + |119868| minus 119905
(33)
Note that for any feasible 120572 the primal maximizationproblem (31) is bounded and has a strictly feasible solution119901119897
119894119895 thus there is no duality gap between (31) and the
In this section numerical experiments on real-world appli-cations are carried out to verify the effectiveness of theproposed robust probability classifier model Specifically weconsider lithology classification data sets from our practicalapplication We compare our model with the regularizedSVM (RSVM) and the naive Bayes classifier (NBC) on bothbinary and multiple classification problems
All the numerical experiments are implemented in Mat-lab 770 and run on Intel(R) Core(TM) i5-4570 CPU SDPT3solver [27] is called to solve the second order cone programsin our proposed method and the regularized SVM
41 Data Sets Lithology classification is one of the basic tasksfor geological investigation To discriminate the lithology ofthe underground strata various electromagnetic techniquesare applied to the same strata to obtain different features suchas Gamma coefficients acoustic wave striation density andfusibility
Here numerical experiments are carried out on a seriesof data sets the borehole T1 Y4 Y5 and Y6 All boreholesare located in Tarim Basin China In total there are 12 datasets used for binary classification problems and 8 data setsused for multiple classification problems For each data setbased on a prespecified training rate 120574 isin [0 1] it is randomlypartitioned into two subsets a training set and a test set suchthat the size of training set accounts for 120574 of the total numberof samples
42 Experiment Design The parameters in our models arechosen based on the size of data setThe parameter 120598 dependson the number of the classes and defined as 120598 = 120575
2|119869| where
120575 isin (0 1)The choice of 120598 can be explained in this way if thereare |119869| classes and the training data are uniformly distributedthen for each probability 119901
119897
119894119895= 1|119869| its maximal variation
range is between 119901119897
119894119895(1 minus 120575) and 119901
119897
119894119895(1 + 120575) The number of
data intervals 119870119897is defined as 119870
119897= |119868|(|119869| times 119870) such that if
the training data are uniformly distributed then in each datainterval there are 119870 samples in each class In the followingcontext we set 120575 = 02 and 119870 = 8
We compare the performances of the proposed RPCmodel with the following regularized support vectormachinemodel [6] (take the 119895th class for example)
(RSVM) min sum
119894isin119868
120585119894119895
+ 120582119895
10038171003817100381710038171003817119908119895
10038171003817100381710038171003817
st 119910119894119895
(sum
119897isin119871
119908119897
119895119909119897
119894+ 119887119895) ge 1 minus 120585
119894119895 119894 isin 119868
120585119894119895
ge 0 119894 isin 119868
(38)
where 119910119894119895
= 2119910119894119895
minus1 and 120582119895
ge 0 is a regularization parameterAs pointed by [8] 120582
119895ge 0 represents a trade-off between the
number of training set errors and the amount of robustness
8 Mathematical Problems in Engineering
Table 1 Performances of RSVM NBC and RPC for binary classification problems on Y5 data set
tr () RSVM NBC RPCTrain () Test () Train () Test () Train () Test ()
with respect to spherical perturbations of the data pointsTo make a fair comparison in the following experiments wewill test a series of 120582 values and choose the one with bestperformance Note that if 120582
119895= 0 we refer to this model as the
classic support vector machine (SVM) See also [6] for moredetails onRSVMand its applications tomultiple classificationproblems
43 Test on Binary Classification In this subsection RSVMNBC and RPC are implemented on 12 data sets for the binaryclassification problems using the cross-validation methodsTo improve the performances of RSVM we transform theoriginal data by the popularly used polynomial kernels [6]
Tables 1 and 2 show the averaged classification per-formances of RSVM NBC and the proposed RPC (over10 randomly generated instances) for binary classificationproblems on Y5 and T1 data sets respectively For each dataset we randomly partition it into a training set and a testset based on the parameter tr which varies from 05 to 09The highest classification accuracy on a training set amongthese three methods is highlighted in bold while the bestclassification accuracy on a test set is marked with an asterisk
Tables 1 and 2 validate the effectiveness of the proposedRPC for binary classification problems compared with NBCand RSVM Specifically for most of the cases RSVM hasthe highest classification accuracy on training sets but itsperformance on test sets is unsatisfactory For most of thecases the proposed RPC provides the highest classification
accuracy on test sets NBC provides better performanceson test sets as the training rate increases The experimentalresults also show that for given training rate PRC can providebetter performances on test sets than that on training setsthus it can avoid the ldquooverlearningrdquo phenomenon
To further validate the effectiveness of the proposed RPCwe test it on additional 10 data sets that is T41ndashT45 andT61ndashT65 Table 3 reports the averaged performances of threemethods over 10 randomly generated instances when thetraining rate is set to 70 Except for data sets T45 T63and T64 RPC provides the highest accuracy on the test setsand for all the data sets its accuracy is higher than 80 Asshown in Tables 1 and 2 the robustness of the proposed RPCguarantees its scalability on the test sets
44 Test onMultiple Classification In this subsection we testthe performances of on multiple classification problems bycomparison with RSVM and NBC Since the performance ofRSVM is determined by its regularization parameter 120582 werun a set of RSVM with 120582 varying from 0 to a big enoughnumber and select the one with the best performance on testsets
Figures 1 and 3 plot the performances of three methodson Y5 and T1 training sets respectively Unlike the case ofbinary classification problems we can see that RPC providesa competitive performance even on the training sets Oneexplanation is that RSVM can outperform the proposed RPCon training sets by finding the optimal separation hyperplane
Mathematical Problems in Engineering 9
Table 3 Performances of RSVM NBC and RPC for binary classification problems on other data sets when tr = 70
Data set RSVM NBC RPCTrain () Test () Train () Test () Train () Test ()
Figure 1 Performances of RSVM NBC and RPC on Y5 trainingset
for binary classification problem S while RPC is more robustto extend to solve multiple classification problems since ituses the nonlinear probability information of the data setsThe accuracy of NBC on the training sets also improves asthe training rate increases
Figures 2 and 4 show the performances of both methodson Y5 and T1 test sets respectively We can see that for most
06 065 07 075 08 085 09Training rate
RSVMNBCRPC
055
06
065
07
075
08
085
09
095
1
Accu
racy
on
test
set (
)
Figure 2 Performances of RSVM NBC and RPC on Y5 test set
of the cases RPC provides the highest accuracy among threemethods The accuracy of RSVM outperforms that of NBCon Y5 test set while the latter outperforms the former on theT1 test set
To further test the performance of PRC on multipleclassification problems we carry out more experiments ondata sets M1ndashM6 Table 4 reports the averaged performancesof three methods on these data sets when the training rateis set to 70 Except for the M5 data set PRC always
10 Mathematical Problems in Engineering
06
065
07
075
08
085
Accu
racy
on
trai
ning
set (
)
06 065 07 075 08 085 09Training rate
RSVMNBCRPC
Figure 3 Performances of RSVM NBC and RPC on T1 trainingset
055
06
065
07
075
08
085
09
Accu
racy
on
test
set (
)
06 065 07 075 08 085 09Training rate
RSVMNBCRPC
Figure 4 Performances of RSVM NBC and RPC on T1 test set
provides the highest classification performances among threemethods and even for the M5 data set its accuracy (880)is very close to the best one (881)
From the tested real-life application we conclude that theproposed RPC has the robustness to provide better perfor-mance for both binary and multiple classification problemscompared with RSVM and NBC The robustness of PRCenables it to avoid the ldquooverlearningrdquo phenomenon especiallyfor the binary classification problems
5 Conclusion
In this paper we propose a robust probability classifier modelto address the data uncertainty in classification problems
To quantitatively describe the data uncertainty a class-conditional distributional set is constructed based on themodified 120594
2-distance We assume that the true distribu-tion lies in the constructed distributional set centered inthe nominal probability distribution Based on the ldquolinearcombination assumptionrdquo for the posterior class-conditionalprobabilities we consider a classification criterion using theweighted sum of the posterior probabilities The optimalrobust probability classifier is determined by minimizingthe worst-case absolute error value over all the possibledistributions belonging to the distributional set
Our proposed model introduces the recently developeddistributionally robust optimization method into the clas-sifier design problems To obtain a computable modelwe transform the resulted optimization problem into anequivalent second order cone programming based on conicduality theorem Thus our model has the same compu-tational complexity as the classic support vector machineand numerical experiments on real-life application validateits effectiveness On the one hand the proposed robustprobability classifier provides a higher accuracy comparedwith RSVM and NBC by avoiding overlearning on trainingsets for binary classification problems on the other hand italso has a promising performance for multiple classificationproblems
There are still many important extensions in our modelOther forms of loss function such as the mean squarederror function and Hinge loss functions should be studied toobtain tractable reformulations and the resulted models mayprovide better performances Probability models consideringjoint probability distribution information are also interestingresearch directions
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
References
[1] R O Duda and P E Hart Pattern Classification and SceneAnalysis John Wiley amp Sons New York NY USA 1973
[2] P Langley W Iba and K Thompson ldquoAn analysis of Bayesianclassifiersrdquo in Proceedings of the 10th National Conference onArtificial Intelligence (AAAI rsquo92) vol 90 pp 223ndash228 AAAIPress Menlo Park Calif USA July 1992
[3] B D Ripley Pattern Recognition and Neural Networks Cam-bridge University Press Cambridge UK 2007
[4] V Vapnik The Nature of Statistical Learning Theory SpringerBerlin Germany 2000
[5] M Ramoni and P Sebastiani ldquoRobust Bayes classifiersrdquo Artifi-cial Intelligence vol 125 no 1-2 pp 209ndash226 2001
[6] Y Shi Y Tian G Kou and Y Peng ldquoRobust support vectormachinesrdquo in Optimization Based Data Mining Theory andApplications Springer London UK 2011
[7] Y Z Wang Y L Zhang F L Zhang and J N Yi ldquoRobustquadratic regression and its application to energy-growth con-sumption problemrdquoMathematical Problems in Engineering vol2013 Article ID 210510 10 pages 2013
Mathematical Problems in Engineering 11
[8] A Ben-Tal L El Ghaoui and A Nemirovski Robust Optimiza-tion Princeton University Press Princeton NJ USA 2009
[9] A Ben-Tal and A Nemirovski ldquoRobust optimizationmdashmethodology and applicationsrdquo Mathematical Programmingvol 92 no 3 pp 453ndash480 2002
[10] D Bertsimas D B Brown and C Caramanis ldquoTheory andapplications of robust optimizationrdquo SIAM Review vol 53 no3 pp 464ndash501 2011
[11] G R G Lanckriet L E Ghaoui C Bhattacharyya and M IJordan ldquoMinimax probability machinerdquo in Advances in NeuralInformation Processing Systems pp 801ndash807 2001
[12] G R G Lanckriet L El Ghaoui C Bhattacharyya and M IJordan ldquoA robust minimax approach to classificationrdquo Journalof Machine Learning Research vol 3 no 3 pp 555ndash582 2003
[13] L El Ghaoui G R G Lanckriet and G Natsoulis ldquoRobustclassification with interval datardquo Tech Rep UCBCSD-03-1279Computer Science Division University of California 2003
[14] K Huang H Yang I King andM R Lyu ldquoLearning classifiersfrom imbalanced data based on biased minimax probabilitymachinerdquo in Proceedings of the IEEE Computer Society Confer-ence on Computer Vision and Pattern Recognition (CVPR rsquo04)vol 2 pp 558ndash563 IEEE July 2004
[15] K Huang H Yang I King M R Lyu and L Chan ldquoTheminimum error minimax probability machinerdquo The Journal ofMachine Learning Research vol 5 pp 1253ndash1286 2004
[16] C-H Hoi and M R Lyu ldquoRobust face recognition usingminimax probability machinerdquo in Proceedings of the IEEEInternational Conference on Multimedia and Expo (ICME rsquo04)vol 2 pp 1175ndash1178 June 2004
[17] T Kitahara S Mizuno and K Nakata ldquoQuadratic and convexminimax classification problemsrdquo Journal of the OperationsResearch Society of Japan vol 51 no 2 pp 191ndash201 2008
[18] T Kitahara S Mizuno and K Nakata ldquoAn extension of aminimax approach to multiple classificationrdquo Journal of theOperations Research Society of Japan vol 50 no 2 pp 123ndash1362007
[19] D Klabjan D Simchi-Levi and M Song ldquoRobust stochasticlot-sizing by means of histogramsrdquo Production and OperationsManagement vol 22 no 3 pp 691ndash710 2013
[20] L V Utkin ldquoA framework for imprecise robust one-class classi-fication modelsrdquo International Journal of Machine Learning andCybernetics 2012
[21] N Cristianini and J Shawe-Taylor An Introduction to SupportVector Machines and Other Kernel-Based Learning MethodsCambridge University Press Cambridge UK 2000
[22] B Scholkopf and A J Smola Learning with Kernels The MITPress Cambridge UK 2002
[23] T Hastie R Tibshirani and J J H Friedman The Elements ofStatistical Learning Springer New York NY USA 2001
[24] L A Zadeh ldquoA simple view of the Dempster-Shafer theory ofevidence and its implication for the rule of combinationrdquo AIMagazine vol 7 no 2 pp 85ndash90 1986
[25] R Yager M Fedrizzi and J Kacprzyk Advances in theDempster-Shafer Theory of Evidence John Wiley amp Sons NewYork NY USA 1994
[26] J F Sturm ldquoUsing SeDuMi 102 a MATLAB toolbox foroptimization over symmetric conesrdquoOptimizationMethods andSoftware vol 11 no 1 pp 625ndash653 1999
[27] K C Toh R H T Tutunu and M J Todd ldquoOn the implemen-tation and usage of SDPT3Cmdasha Matlab software package for
semidefinite quadratic linear programming version 40rdquo 2006httpwwwmathnusedusgsimmattohkcsdpt3guide4-0-draftpdf
[28] A Ben-Tal D D Hertog A D Waegenaere B Melenberg andG Rennen ldquoRobust solutions of optimization problems affectedby uncertain probabilitiesrdquo Management Science vol 59 no 2pp 341ndash357 2013
119894119895is the nominal class-conditional distribution prob-
ability for the 119894th sample belonging to the 119895th class based onthe 119897th feature and the prespecified parameter 120598 is used tocontrol the size of the set
To design a robust classifier we need to consider the effectof data uncertainty on the objective function and constraintsThe robust objective function is to minimize the worst-case loss function value over all the possible distributionsin the distributional set 119875
120598 the robust constraints ensure
that all the original constraints should also be satisfied forany distribution in 119875
120598 Thus the robust probability classifier
problem is of the following form
(RPC) min
maxsum
119895isin119869
sum
119894isin119868
(1 minus 2119910119894119895
) sum
119897isin119871
120572119897
119895119902119897
119894119895
+ |119868| 119902119897
119894119895 isin 119875120598
st 0 le sum
119897isin119871
120572119897
119895119902119897
119894119895le 1 forall 119902
119897
119894119895 isin 119875120598
forall119894 119895
(13)
Note that the above optimization problem has an infinitenumber of robust constraints and its objective function is alsoan embedded subproblem We will show how to solve suchminimax optimization problem in Section 3
23 Construct the Distributional Set To get the distributionalset 119875120598 we need to define the parameter 120598 and the nominal
probability 119901119897
119894119895 The selection of parameter 120598 is application
based and we will discuss this issue in the numerical exper-iment section next we will provide a procedure to calculate119901119897
119894119895For the 119897th feature the following procedure takes an
integer 119870119897indicating the number of data intervals as an input
andwill output the estimated probability119901119897
119894119895of the 119894th sample
belonging to the 119895th class
(1) Sort samples in the increased order and divide theminto 119870
119897intervals such that each interval has at least
lfloor|119868|119870119897rfloor number of samples Denote the 119896th interval
by Δ119897119896
(2) Calculate the total number of samples in the 119895-class119873119895 the total number of samples in the 119896th interval
119873119897119896 and the total number of samples belonging to the
119895-class in the 119896th interval 119873119897119896119895
(3) For the 119894th sample if it falls into the 119896th interval the
Note that from the definition of 119875120598 we easily compute the
upper bound 119902119897
119894119895and lower bound 119902
119897
119894119895for the true class-
conditional probability 119902119897
119894119895as follows
119902119897
119894119895= max
119902119897
119894119895 sum
119904
119902119897
119894119904= 1
sum
119904isin119869
(119902119897
119894119904minus 119901119897
119894119904)2
119901119897
119894119904
le 120598 119902119897
119894119904ge 0 forall119904 isin 119869
(15)
119902119897
119894119895= min
119902119897
119894119895 sum
119904
119902119897
119894119904= 1
sum
119904isin119869
(119902119897
119894119904minus 119901119897
119894119904)2
119901119897
119894119904
le 120598 119902119897
119894119904ge 0 forall119904 isin 119869
(16)
The above problems can be efficiently solved by a secondorder cone solver such as SeDuMi [26] or SDPT3 [27]
3 Solution Methods for RPC
In this section we first reduce the infinite number of robustconstraints to a finite set of linear constraints and then trans-form the inner robust objective function into a minimizationproblem by the conic duality theorem At last we obtainan equivalent computable second order cone programmingfor the RPC problem The following analysis is based on thestrong duality result in [8]
Consider a conic program of the following form:

$$
\begin{aligned}
\text{(CP)}\quad \min\ & c^T x \\
\text{s.t.}\ & A_i x - b_i \in C_i,\quad \forall i = 1,\ldots,m, \\
& Ax = b,
\end{aligned}
\tag{17}
$$

and its dual problem

$$
\begin{aligned}
\text{(DP)}\quad \max\ & b^T z + \sum_{i=1}^m b_i^T y_i \\
\text{s.t.}\ & A^* z + \sum_{i=1}^m A_i^* y_i = c, \\
& y_i \in C_i^*,\quad \forall i = 1,\ldots,m,
\end{aligned}
\tag{18}
$$

where $C_i$ is a cone in $\mathbb{R}^{n_i}$ and $C_i^*$ is its dual cone defined by

$$
C_i^* = \left\{ y \in \mathbb{R}^{n_i}:\ y^T x \ge 0,\ \forall x \in C_i \right\}. \tag{19}
$$

A conic program is called strictly feasible if it admits a feasible solution $x$ such that $A_i x - b_i \in \operatorname{int} C_i$, $\forall i = 1,\ldots,m$, where $\operatorname{int} C_i$ denotes the interior of $C_i$.
Lemma 1 (see [8]). If one of the problems (CP) and (DP) is strictly feasible and bounded, then the other problem is solvable and (CP) = (DP), in the sense that both have the same optimal objective function value.
3.1 Robust Constraints. The following lemma provides an equivalent characterization of the infinite number of robust constraints in terms of a finite set of linear constraints, which can be handled efficiently.
Lemma 2. For given $i, j$, the robust constraint

$$
0 \le \sum_{l\in L}\alpha^l_j q^l_{ij} \le 1,\quad \forall\,\{q^l_{ij}\}\in P_\epsilon \tag{20}
$$

is equivalent to the following finite set of constraints:

$$
\begin{aligned}
& \sum_{l\in L}\left( \underline{q}^l_{ij} u^{l0}_{ij} - \bar{q}^l_{ij} v^{l0}_{ij} \right) \ge 0, \\
& \alpha^l_j - u^{l0}_{ij} + v^{l0}_{ij} \ge 0,\quad u^{l0}_{ij},\, v^{l0}_{ij} \ge 0,\quad \forall l\in L, \\
& 1 + \sum_{l\in L}\left( \underline{q}^l_{ij} u^{l1}_{ij} - \bar{q}^l_{ij} v^{l1}_{ij} \right) \ge 0, \\
& v^{l1}_{ij} - \alpha^l_j - u^{l1}_{ij} \ge 0,\quad u^{l1}_{ij},\, v^{l1}_{ij} \ge 0,\quad \forall l\in L.
\end{aligned}
\tag{21}
$$
Proof. First note that the distributional set $P_\epsilon$ can be represented as the Cartesian product of a series of projected subsets,

$$
P_\epsilon = \prod_{i\in I} P_{\epsilon,i}, \tag{22}
$$

where the projected subset on index $i$ is defined by

$$
P_{\epsilon,i} = \left\{ q^l_{ij}:\ \sum_{j\in J} q^l_{ij} = 1,\ q^l_{ij}\ge 0,\ \sum_{j\in J}\frac{(q^l_{ij}-p^l_{ij})^2}{p^l_{ij}} \le \epsilon,\ \forall l\in L,\ j\in J \right\}. \tag{23}
$$

Then, for given $i, j$, since the robust constraint is only associated with the variables $q^l_{ij}$, $l\in L$, we can further restrict attention to the box set

$$
P_{\epsilon,ij} = \left\{ q^l_{ij}:\ \underline{q}^l_{ij} \le q^l_{ij} \le \bar{q}^l_{ij},\ \forall l\in L \right\}, \tag{24}
$$

where the bounds $\bar{q}^l_{ij}$ and $\underline{q}^l_{ij}$ are computed by (15) and (16), respectively.
For the constraint $\sum_{l\in L}\alpha^l_j q^l_{ij} \ge 0$, $\forall\,\{q^l_{ij}\}\in P_\epsilon$, we have the following chain of equivalences:

$$
\begin{aligned}
& \sum_{l\in L}\alpha^l_j q^l_{ij} \ge 0,\quad \forall\,\{q^l_{ij}\}\in P_{\epsilon,i} \\
\iff\ & \sum_{l\in L}\alpha^l_j q^l_{ij} \ge 0,\quad \forall\,\{q^l_{ij}\}\in P_{\epsilon,ij} \\
\iff\ & \min\left\{ \sum_{l\in L}\alpha^l_j q^l_{ij}:\ \underline{q}^l_{ij} \le q^l_{ij} \le \bar{q}^l_{ij},\ \forall l\in L \right\} \ge 0 \\
\iff\ & \max\left\{ \sum_{l\in L}\left( \underline{q}^l_{ij} u^{l0}_{ij} - \bar{q}^l_{ij} v^{l0}_{ij} \right):\ \alpha^l_j - u^{l0}_{ij} + v^{l0}_{ij} \ge 0,\ u^{l0}_{ij},\, v^{l0}_{ij} \ge 0,\ \forall l\in L \right\} \ge 0 \\
\iff\ & \sum_{l\in L}\left( \underline{q}^l_{ij} u^{l0}_{ij} - \bar{q}^l_{ij} v^{l0}_{ij} \right) \ge 0,\quad \alpha^l_j - u^{l0}_{ij} + v^{l0}_{ij} \ge 0,\quad u^{l0}_{ij},\, v^{l0}_{ij} \ge 0,\quad \forall l\in L,
\end{aligned}
\tag{25}
$$

where the last equivalence comes from the strong duality between these two linear programs.

For the constraint $\sum_{l\in L}\alpha^l_j q^l_{ij} \le 1$, $\forall\,\{q^l_{ij}\}\in P_\epsilon$, the same technique applies; thus we complete the proof.
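The strong duality step in (25) can be checked numerically on a toy instance. The sketch below (with illustrative values only) compares the closed-form minimum over the box against the value of the dual linear program at its optimal solution $u = [\alpha]_+$, $v = [-\alpha]_+$.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 5
alpha = rng.normal(size=L)                    # weights alpha^l_j for a fixed (i, j)
q_lo = rng.uniform(0.0, 0.4, size=L)          # lower bounds from (16)
q_hi = q_lo + rng.uniform(0.0, 0.4, size=L)   # upper bounds from (15)

# primal: the minimum over the box picks q_lo where alpha >= 0, q_hi where alpha < 0
primal = np.sum(np.where(alpha >= 0, alpha * q_lo, alpha * q_hi))

# dual of (25): max sum(q_lo*u - q_hi*v) s.t. u - v = alpha, u, v >= 0,
# attained at u = max(alpha, 0), v = max(-alpha, 0)
u, v = np.maximum(alpha, 0.0), np.maximum(-alpha, 0.0)
dual = np.sum(q_lo * u - q_hi * v)

assert np.isclose(primal, dual)               # strong LP duality, as used in (25)
```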
3.2 Robust Objective Function. In the RPC problem, the robust objective function is defined by an inner maximization problem. The following proposition shows that it can be transformed into a minimization problem over second order cones. To prove the result, we utilize the conjugate function $d^*$ of the modified $\chi^2$-distance:

$$
d^*(s) = \sup_{t\ge 0}\,\{st - d(t)\} = \frac{[s+2]_+^2}{4} - 1, \tag{26}
$$

where the function $[\cdot]_+$ is defined as $[x]_+ = x$ if $x \ge 0$ and $[x]_+ = 0$ otherwise. For more details about conjugate functions, see [28].
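For completeness, (26) can be verified by elementary calculus under the convention, consistent with the distributional sets (14)-(16), that the modified $\chi^2$-distance generator is $d(t) = (t-1)^2$:

$$
\begin{aligned}
d^*(s) &= \sup_{t \ge 0}\,\{\, s t - (t-1)^2 \,\}, \qquad \text{stationary point } t^* = 1 + \tfrac{s}{2}. \\
\text{If } s \ge -2:&\quad t^* \ge 0 \text{ and } d^*(s) = s\Bigl(1+\tfrac{s}{2}\Bigr) - \tfrac{s^2}{4} = \tfrac{(s+2)^2}{4} - 1; \\
\text{if } s < -2:&\quad \text{the supremum is attained at } t = 0, \text{ giving } d^*(s) = -1. \\
\text{Both cases combine into}&\quad d^*(s) = \tfrac{[s+2]_+^2}{4} - 1.
\end{aligned}
$$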
Proposition 3. For any feasible $\alpha$, the inner maximization problem in (RPC) can be equivalently reformulated as a minimization problem over second order cone constraints, where a second order cone $L^{n+1}$ is defined as

$$
L^{n+1} = \left\{ x \in \mathbb{R}^{n+1}:\ x_{n+1} \ge \sqrt{\sum_{i=1}^n x_i^2} \right\}. \tag{29}
$$
Proof. For given feasible $\alpha$ satisfying the robust constraints, it is straightforward to show that the inner maximization problem is equal to the following minimization problem (MP):

$$
\begin{aligned}
\text{(MP)}\quad \min\ & t \\
\text{s.t.}\ & t \ge \sum_{j\in J}\sum_{i\in I}\bigl(1-2y_{ij}\bigr)\sum_{l\in L}\alpha^l_j q^l_{ij} + |I|,\quad \forall\,\{q^l_{ij}\}\in P_\epsilon.
\end{aligned}
\tag{30}
$$

The above constraint can be further reduced to the following constraint:

$$
\max\left\{ \sum_{j\in J}\sum_{i\in I}\bigl(1-2y_{ij}\bigr)\sum_{l\in L}\alpha^l_j q^l_{ij}:\ \{q^l_{ij}\}\in P_\epsilon \right\} + |I| - t \le 0. \tag{31}
$$
By assigning Lagrange multipliers $\theta^l_i \in \mathbb{R}$ and $\lambda^l_i \in \mathbb{R}_+$ to the constraints of the maximization problem on the left-hand side of (31), we obtain the following Lagrange function:

$$
L(q,\theta,\lambda) = \sum_{i\in I}\sum_{l\in L}\bigl(\epsilon\lambda^l_i - \theta^l_i\bigr) + \sum_{i\in I}\sum_{l\in L}\sum_{j\in J}\left( r^l_{ij} q^l_{ij} - \lambda^l_i \frac{(q^l_{ij}-p^l_{ij})^2}{p^l_{ij}} \right) + |I| - t, \tag{32}
$$

where $r^l_{ij} = \alpha^l_j(1 - 2y_{ij}) + \theta^l_i$. Its dual function is given by
$$
\begin{aligned}
D(\theta,\lambda) &= \max_{q\ge 0}\, L(q,\theta,\lambda) \\
&= \sum_{i\in I}\sum_{l\in L}\bigl(\epsilon\lambda^l_i - \theta^l_i\bigr) + \sum_{i\in I}\sum_{l\in L}\sum_{j\in J}\max_{q^l_{ij}\ge 0}\left( r^l_{ij} q^l_{ij} - \lambda^l_i p^l_{ij}\left( \frac{q^l_{ij}-p^l_{ij}}{p^l_{ij}} \right)^{\!2}\, \right) + |I| - t \\
&= \sum_{i\in I}\sum_{l\in L}\bigl(\epsilon\lambda^l_i - \theta^l_i\bigr) + \sum_{i\in I}\sum_{l\in L}\sum_{j\in J} p^l_{ij}\max_{\tau\ge 0}\left( r^l_{ij}\tau - \lambda^l_i(\tau-1)^2 \right) + |I| - t \\
&= \sum_{i\in I}\sum_{l\in L}\bigl(\epsilon\lambda^l_i - \theta^l_i\bigr) + \sum_{i\in I}\sum_{l\in L}\sum_{j\in J} p^l_{ij}\lambda^l_i\max_{\tau\ge 0}\left( \frac{r^l_{ij}}{\lambda^l_i}\tau - (\tau-1)^2 \right) + |I| - t \\
&= \sum_{i\in I}\sum_{l\in L}\bigl(\epsilon\lambda^l_i - \theta^l_i\bigr) + \sum_{i\in I}\sum_{l\in L}\sum_{j\in J} p^l_{ij}\lambda^l_i\, d^*\!\left( \frac{r^l_{ij}}{\lambda^l_i} \right) + |I| - t,
\end{aligned}
\tag{33}
$$

where the substitution $\tau = q^l_{ij}/p^l_{ij}$ is used in the second equality.
Note that for any feasible $\alpha$, the primal maximization problem in (31) is bounded and has the strictly feasible solution $q^l_{ij} = p^l_{ij}$; thus there is no duality gap between (31) and its Lagrange dual problem $\min_{\theta,\,\lambda\ge 0} D(\theta,\lambda)$, and representing each conjugate term in (33) by second order cone constraints yields the claimed reformulation.
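To see why minimizing the dual function leads to a second order cone program, note that each term $\lambda^l_i\, d^*(r^l_{ij}/\lambda^l_i)$ in (33) is representable by the cone defined in (29). The following epigraph reformulation is a standard sketch of this step, stated under the same convention $d(t) = (t-1)^2$ as above (dropping indices for readability):

$$
\lambda\, d^*\!\left(\frac{r}{\lambda}\right) = \frac{[\,r+2\lambda\,]_+^2}{4\lambda} - \lambda,
$$

so that, for $\lambda > 0$,

$$
w \ge \lambda\, d^*\!\left(\frac{r}{\lambda}\right)
\iff \exists\, h:\ \ h \ge 0,\ \ h \ge r + 2\lambda,\ \ h^2 \le 4\lambda (w+\lambda)
\iff (h,\ w,\ w + 2\lambda) \in L^3,
$$

since $(w+2\lambda)^2 - w^2 = 4\lambda(w+\lambda)$. Minimizing the sum of such epigraph variables $w$, together with the linear constraints of Lemma 2, therefore gives a second order cone program.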
4 Numerical Experiments

In this section, numerical experiments on real-world applications are carried out to verify the effectiveness of the proposed robust probability classifier model. Specifically, we consider lithology classification data sets from our practical application. We compare our model with the regularized SVM (RSVM) and the naive Bayes classifier (NBC) on both binary and multiple classification problems.
All the numerical experiments are implemented in MATLAB 7.7.0 and run on an Intel(R) Core(TM) i5-4570 CPU. The SDPT3 solver [27] is called to solve the second order cone programs in our proposed method and in the regularized SVM.
4.1 Data Sets. Lithology classification is one of the basic tasks in geological investigation. To discriminate the lithology of the underground strata, various electromagnetic techniques are applied to the same strata to obtain different features, such as gamma coefficients, acoustic wave, striation density, and fusibility.
Here, numerical experiments are carried out on a series of data sets from the boreholes T1, Y4, Y5, and Y6. All boreholes are located in the Tarim Basin, China. In total, there are 12 data sets used for binary classification problems and 8 data sets used for multiple classification problems. For each data set, based on a prespecified training rate $\gamma \in [0,1]$, it is randomly partitioned into two subsets, a training set and a test set, such that the size of the training set accounts for $\gamma$ of the total number of samples.
4.2 Experiment Design. The parameters in our models are chosen based on the size of the data set. The parameter $\epsilon$ depends on the number of classes and is defined as $\epsilon = \delta^2/|J|$, where $\delta \in (0,1)$. The choice of $\epsilon$ can be explained in this way: if there are $|J|$ classes and the training data are uniformly distributed, then for each probability $p^l_{ij} = 1/|J|$, its maximal variation range is between $p^l_{ij}(1-\delta)$ and $p^l_{ij}(1+\delta)$. The number of data intervals $K_l$ is defined as $K_l = |I|/(|J|\times K)$ such that, if the training data are uniformly distributed, then in each data interval there are $K$ samples of each class. In the following context, we set $\delta = 0.2$ and $K = 8$.
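In code, these parameter choices amount to two one-liners; the sizes below are hypothetical and only illustrate the formulas $\epsilon = \delta^2/|J|$ and $K_l = |I|/(|J|\times K)$:

```python
# hypothetical data-set sizes, for illustration only
n_samples, n_classes = 400, 4        # |I| and |J|
delta, K = 0.2, 8                    # the values used in this section

eps = delta ** 2 / n_classes         # epsilon = delta^2 / |J|  ->  0.01
K_l = n_samples // (n_classes * K)   # K_l = |I| / (|J| * K)    ->  12
```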
We compare the performance of the proposed RPC model with the following regularized support vector machine model [6] (take the $j$th class, for example):

$$
\begin{aligned}
\text{(RSVM)}\quad \min\ & \sum_{i\in I}\xi_{ij} + \lambda_j \left\| w_j \right\| \\
\text{s.t.}\ & \tilde{y}_{ij}\left( \sum_{l\in L} w^l_j x^l_i + b_j \right) \ge 1 - \xi_{ij},\quad i\in I, \\
& \xi_{ij} \ge 0,\quad i\in I,
\end{aligned}
\tag{38}
$$
where $\tilde{y}_{ij} = 2y_{ij} - 1$ and $\lambda_j \ge 0$ is a regularization parameter. As pointed out by [8], $\lambda_j$ represents a trade-off between the number of training set errors and the amount of robustness with respect to spherical perturbations of the data points. To make a fair comparison, in the following experiments we test a series of $\lambda$ values and choose the one with the best performance. Note that if $\lambda_j = 0$, we refer to this model as the classic support vector machine (SVM). See also [6] for more details on RSVM and its applications to multiple classification problems.

Table 1: Performances of RSVM, NBC, and RPC for binary classification problems on the Y5 data set. Columns: tr (%) | RSVM: Train (%), Test (%) | NBC: Train (%), Test (%) | RPC: Train (%), Test (%).
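A one-vs-rest sketch of model (38) reads as follows; it uses cvxpy with hypothetical function and variable names, whereas the experiments themselves rely on a MATLAB/SDPT3 implementation.

```python
import cvxpy as cp

def train_rsvm(X, y_j, lam):
    """RSVM (38) for the j-th class: X is the (n, L) feature matrix,
    y_j in {0,1}^n indicates membership in class j, lam is lambda_j."""
    n, L = X.shape
    y_tilde = 2.0 * y_j - 1.0                    # tilde{y}_{ij} = 2 y_{ij} - 1
    w, b = cp.Variable(L), cp.Variable()
    xi = cp.Variable(n, nonneg=True)             # slacks xi_{ij} >= 0
    margin = cp.multiply(y_tilde, X @ w + b) >= 1 - xi
    prob = cp.Problem(cp.Minimize(cp.sum(xi) + lam * cp.norm(w, 2)), [margin])
    prob.solve()                                 # a second order cone program
    return w.value, b.value
```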
4.3 Test on Binary Classification. In this subsection, RSVM, NBC, and RPC are implemented on 12 data sets for binary classification problems using cross-validation methods. To improve the performance of RSVM, we transform the original data by the popularly used polynomial kernels [6].
Tables 1 and 2 show the averaged classification performances of RSVM, NBC, and the proposed RPC (over 10 randomly generated instances) for binary classification problems on the Y5 and T1 data sets, respectively. For each data set, we randomly partition it into a training set and a test set based on the parameter tr, which varies from 0.5 to 0.9. The highest classification accuracy on a training set among the three methods is highlighted in bold, while the best classification accuracy on a test set is marked with an asterisk.
Tables 1 and 2 validate the effectiveness of the proposed RPC for binary classification problems compared with NBC and RSVM. Specifically, for most of the cases, RSVM has the highest classification accuracy on training sets, but its performance on test sets is unsatisfactory. For most of the cases, the proposed RPC provides the highest classification accuracy on test sets. NBC provides better performance on test sets as the training rate increases. The experimental results also show that, for a given training rate, RPC can provide better performance on test sets than on training sets; thus it can avoid the "overlearning" phenomenon.

To further validate the effectiveness of the proposed RPC, we test it on 10 additional data sets, that is, T41-T45 and T61-T65. Table 3 reports the averaged performances of the three methods over 10 randomly generated instances when the training rate is set to 70%. Except for data sets T45, T63, and T64, RPC provides the highest accuracy on the test sets, and for all the data sets its accuracy is higher than 80%. As shown in Tables 1 and 2, the robustness of the proposed RPC guarantees its scalability on the test sets.
4.4 Test on Multiple Classification. In this subsection, we test the performance of RPC on multiple classification problems by comparison with RSVM and NBC. Since the performance of RSVM is determined by its regularization parameter $\lambda$, we run a set of RSVMs with $\lambda$ varying from 0 to a sufficiently large number and select the one with the best performance on test sets.
Table 3: Performances of RSVM, NBC, and RPC for binary classification problems on the other data sets when tr = 70%. Columns: Data set | RSVM: Train (%), Test (%) | NBC: Train (%), Test (%) | RPC: Train (%), Test (%).

Figure 1: Performances of RSVM, NBC, and RPC on the Y5 training set.

Figures 1 and 3 plot the performances of the three methods on the Y5 and T1 training sets, respectively. Unlike the case of binary classification problems, we can see that RPC provides a competitive performance even on the training sets. One explanation is that RSVM can outperform the proposed RPC on training sets by finding the optimal separation hyperplane for binary classification problems, while RPC is more robust to extend to multiple classification problems, since it uses the nonlinear probability information of the data sets. The accuracy of NBC on the training sets also improves as the training rate increases.
Figure 2: Performances of RSVM, NBC, and RPC on the Y5 test set (accuracy on test set (%) versus training rate).

Figure 3: Performances of RSVM, NBC, and RPC on the T1 training set (accuracy on training set (%) versus training rate).

Figure 4: Performances of RSVM, NBC, and RPC on the T1 test set (accuracy on test set (%) versus training rate).

Figures 2 and 4 show the performances of the three methods on the Y5 and T1 test sets, respectively. We can see that, for most of the cases, RPC provides the highest accuracy among the three methods. The accuracy of RSVM outperforms that of NBC on the Y5 test set, while the latter outperforms the former on the T1 test set.

To further test the performance of RPC on multiple classification problems, we carry out more experiments on data sets M1-M6. Table 4 reports the averaged performances of the three methods on these data sets when the training rate is set to 70%. Except for the M5 data set, RPC always provides the highest classification performance among the three methods, and even for the M5 data set its accuracy (88.0%) is very close to the best one (88.1%).
From the tested real-life application, we conclude that the proposed RPC has the robustness to provide better performance for both binary and multiple classification problems compared with RSVM and NBC. The robustness of RPC enables it to avoid the "overlearning" phenomenon, especially for binary classification problems.
5 Conclusion
In this paper, we propose a robust probability classifier model to address data uncertainty in classification problems.

To quantitatively describe the data uncertainty, a class-conditional distributional set is constructed based on the modified $\chi^2$-distance. We assume that the true distribution lies in the constructed distributional set, centered at the nominal probability distribution. Based on the "linear combination assumption" for the posterior class-conditional probabilities, we consider a classification criterion using the weighted sum of the posterior probabilities. The optimal robust probability classifier is determined by minimizing the worst-case absolute error loss value over all the possible distributions belonging to the distributional set.

Our proposed model introduces the recently developed distributionally robust optimization method into classifier design problems. To obtain a computable model, we transform the resulting optimization problem into an equivalent second order cone program based on the conic duality theorem. Thus our model has the same computational complexity as the classic support vector machine, and numerical experiments on a real-life application validate its effectiveness. On the one hand, the proposed robust probability classifier provides higher accuracy compared with RSVM and NBC by avoiding overlearning on training sets for binary classification problems; on the other hand, it also has promising performance for multiple classification problems.

There are still many important extensions of our model. Other forms of loss function, such as the mean squared error function and the hinge loss function, should be studied to obtain tractable reformulations, and the resulting models may provide better performance. Probability models considering joint probability distribution information are also an interesting research direction.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
[1] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, New York, NY, USA, 1973.
[2] P. Langley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in Proceedings of the 10th National Conference on Artificial Intelligence (AAAI '92), vol. 90, pp. 223–228, AAAI Press, Menlo Park, Calif, USA, July 1992.
[3] B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, UK, 2007.
[4] V. Vapnik, The Nature of Statistical Learning Theory, Springer, Berlin, Germany, 2000.
[5] M. Ramoni and P. Sebastiani, "Robust Bayes classifiers," Artificial Intelligence, vol. 125, no. 1-2, pp. 209–226, 2001.
[6] Y. Shi, Y. Tian, G. Kou, and Y. Peng, "Robust support vector machines," in Optimization Based Data Mining: Theory and Applications, Springer, London, UK, 2011.
[7] Y. Z. Wang, Y. L. Zhang, F. L. Zhang, and J. N. Yi, "Robust quadratic regression and its application to energy-growth consumption problem," Mathematical Problems in Engineering, vol. 2013, Article ID 210510, 10 pages, 2013.
[8] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski, Robust Optimization, Princeton University Press, Princeton, NJ, USA, 2009.
[9] A. Ben-Tal and A. Nemirovski, "Robust optimization – methodology and applications," Mathematical Programming, vol. 92, no. 3, pp. 453–480, 2002.
[10] D. Bertsimas, D. B. Brown, and C. Caramanis, "Theory and applications of robust optimization," SIAM Review, vol. 53, no. 3, pp. 464–501, 2011.
[11] G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan, "Minimax probability machine," in Advances in Neural Information Processing Systems, pp. 801–807, 2001.
[12] G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan, "A robust minimax approach to classification," Journal of Machine Learning Research, vol. 3, no. 3, pp. 555–582, 2003.
[13] L. El Ghaoui, G. R. G. Lanckriet, and G. Natsoulis, "Robust classification with interval data," Tech. Rep. UCB/CSD-03-1279, Computer Science Division, University of California, 2003.
[14] K. Huang, H. Yang, I. King, and M. R. Lyu, "Learning classifiers from imbalanced data based on biased minimax probability machine," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. 558–563, IEEE, July 2004.
[15] K. Huang, H. Yang, I. King, M. R. Lyu, and L. Chan, "The minimum error minimax probability machine," The Journal of Machine Learning Research, vol. 5, pp. 1253–1286, 2004.
[16] C.-H. Hoi and M. R. Lyu, "Robust face recognition using minimax probability machine," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '04), vol. 2, pp. 1175–1178, June 2004.
[17] T. Kitahara, S. Mizuno, and K. Nakata, "Quadratic and convex minimax classification problems," Journal of the Operations Research Society of Japan, vol. 51, no. 2, pp. 191–201, 2008.
[18] T. Kitahara, S. Mizuno, and K. Nakata, "An extension of a minimax approach to multiple classification," Journal of the Operations Research Society of Japan, vol. 50, no. 2, pp. 123–136, 2007.
[19] D. Klabjan, D. Simchi-Levi, and M. Song, "Robust stochastic lot-sizing by means of histograms," Production and Operations Management, vol. 22, no. 3, pp. 691–710, 2013.
[20] L. V. Utkin, "A framework for imprecise robust one-class classification models," International Journal of Machine Learning and Cybernetics, 2012.
[21] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.
[22] B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.
[23] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning, Springer, New York, NY, USA, 2001.
[24] L. A. Zadeh, "A simple view of the Dempster-Shafer theory of evidence and its implication for the rule of combination," AI Magazine, vol. 7, no. 2, pp. 85–90, 1986.
[25] R. Yager, M. Fedrizzi, and J. Kacprzyk, Advances in the Dempster-Shafer Theory of Evidence, John Wiley & Sons, New York, NY, USA, 1994.
[26] J. F. Sturm, "Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones," Optimization Methods and Software, vol. 11, no. 1, pp. 625–653, 1999.
[27] K. C. Toh, R. H. Tütüncü, and M. J. Todd, "On the implementation and usage of SDPT3 – a Matlab software package for semidefinite-quadratic-linear programming, version 4.0," 2006, http://www.math.nus.edu.sg/~mattohkc/sdpt3/guide4-0-draft.pdf.
[28] A. Ben-Tal, D. den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen, "Robust solutions of optimization problems affected by uncertain probabilities," Management Science, vol. 59, no. 2, pp. 341–357, 2013.
[7] Y Z Wang Y L Zhang F L Zhang and J N Yi ldquoRobustquadratic regression and its application to energy-growth con-sumption problemrdquoMathematical Problems in Engineering vol2013 Article ID 210510 10 pages 2013
Mathematical Problems in Engineering 11
[8] A Ben-Tal L El Ghaoui and A Nemirovski Robust Optimiza-tion Princeton University Press Princeton NJ USA 2009
[9] A Ben-Tal and A Nemirovski ldquoRobust optimizationmdashmethodology and applicationsrdquo Mathematical Programmingvol 92 no 3 pp 453ndash480 2002
[10] D Bertsimas D B Brown and C Caramanis ldquoTheory andapplications of robust optimizationrdquo SIAM Review vol 53 no3 pp 464ndash501 2011
[11] G R G Lanckriet L E Ghaoui C Bhattacharyya and M IJordan ldquoMinimax probability machinerdquo in Advances in NeuralInformation Processing Systems pp 801ndash807 2001
[12] G R G Lanckriet L El Ghaoui C Bhattacharyya and M IJordan ldquoA robust minimax approach to classificationrdquo Journalof Machine Learning Research vol 3 no 3 pp 555ndash582 2003
[13] L El Ghaoui G R G Lanckriet and G Natsoulis ldquoRobustclassification with interval datardquo Tech Rep UCBCSD-03-1279Computer Science Division University of California 2003
[14] K Huang H Yang I King andM R Lyu ldquoLearning classifiersfrom imbalanced data based on biased minimax probabilitymachinerdquo in Proceedings of the IEEE Computer Society Confer-ence on Computer Vision and Pattern Recognition (CVPR rsquo04)vol 2 pp 558ndash563 IEEE July 2004
[15] K Huang H Yang I King M R Lyu and L Chan ldquoTheminimum error minimax probability machinerdquo The Journal ofMachine Learning Research vol 5 pp 1253ndash1286 2004
[16] C-H Hoi and M R Lyu ldquoRobust face recognition usingminimax probability machinerdquo in Proceedings of the IEEEInternational Conference on Multimedia and Expo (ICME rsquo04)vol 2 pp 1175ndash1178 June 2004
[17] T Kitahara S Mizuno and K Nakata ldquoQuadratic and convexminimax classification problemsrdquo Journal of the OperationsResearch Society of Japan vol 51 no 2 pp 191ndash201 2008
[18] T Kitahara S Mizuno and K Nakata ldquoAn extension of aminimax approach to multiple classificationrdquo Journal of theOperations Research Society of Japan vol 50 no 2 pp 123ndash1362007
[19] D Klabjan D Simchi-Levi and M Song ldquoRobust stochasticlot-sizing by means of histogramsrdquo Production and OperationsManagement vol 22 no 3 pp 691ndash710 2013
[20] L V Utkin ldquoA framework for imprecise robust one-class classi-fication modelsrdquo International Journal of Machine Learning andCybernetics 2012
[21] N Cristianini and J Shawe-Taylor An Introduction to SupportVector Machines and Other Kernel-Based Learning MethodsCambridge University Press Cambridge UK 2000
[22] B Scholkopf and A J Smola Learning with Kernels The MITPress Cambridge UK 2002
[23] T Hastie R Tibshirani and J J H Friedman The Elements ofStatistical Learning Springer New York NY USA 2001
[24] L A Zadeh ldquoA simple view of the Dempster-Shafer theory ofevidence and its implication for the rule of combinationrdquo AIMagazine vol 7 no 2 pp 85ndash90 1986
[25] R Yager M Fedrizzi and J Kacprzyk Advances in theDempster-Shafer Theory of Evidence John Wiley amp Sons NewYork NY USA 1994
[26] J F Sturm ldquoUsing SeDuMi 102 a MATLAB toolbox foroptimization over symmetric conesrdquoOptimizationMethods andSoftware vol 11 no 1 pp 625ndash653 1999
[27] K C Toh R H T Tutunu and M J Todd ldquoOn the implemen-tation and usage of SDPT3Cmdasha Matlab software package for
semidefinite quadratic linear programming version 40rdquo 2006httpwwwmathnusedusgsimmattohkcsdpt3guide4-0-draftpdf
[28] A Ben-Tal D D Hertog A D Waegenaere B Melenberg andG Rennen ldquoRobust solutions of optimization problems affectedby uncertain probabilitiesrdquo Management Science vol 59 no 2pp 341ndash357 2013
4. Numerical Experiments

In this section, numerical experiments on real-world applications are carried out to verify the effectiveness of the proposed robust probability classifier model. Specifically, we consider lithology classification data sets from our practical application. We compare our model with the regularized SVM (RSVM) and the naive Bayes classifier (NBC) on both binary and multiple classification problems.
All the numerical experiments are implemented in Matlab 7.7.0 and run on an Intel(R) Core(TM) i5-4570 CPU. The SDPT3 solver [27] is called to solve the second-order cone programs in our proposed method and the regularized SVM.
4.1. Data Sets. Lithology classification is one of the basic tasks in geological investigation. To discriminate the lithology of the underground strata, various electromagnetic techniques are applied to the same strata to obtain different features, such as Gamma coefficients, acoustic wave, striation density, and fusibility.
Here numerical experiments are carried out on a series of data sets from the boreholes T1, Y4, Y5, and Y6. All boreholes are located in the Tarim Basin, China. In total, there are 12 data sets used for binary classification problems and 8 data sets used for multiple classification problems. Each data set, based on a prespecified training rate $\gamma \in [0, 1]$, is randomly partitioned into two subsets, a training set and a test set, such that the size of the training set accounts for $\gamma$ of the total number of samples.
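As a minimal sketch of this partitioning step (the helper function below and its interface are ours, not part of the paper), a random split at training rate $\gamma$ can be written as:

    import numpy as np

    def random_split(X, y, gamma, seed=0):
        """Randomly partition (X, y) into a training set holding a fraction
        gamma of the samples and a test set holding the rest (a sketch)."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))       # random order of sample indices
        cut = int(gamma * len(y))           # size of the training set
        train, test = idx[:cut], idx[cut:]
        return X[train], y[train], X[test], y[test]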
4.2. Experiment Design. The parameters in our models are chosen based on the size of the data set. The parameter $\epsilon$ depends on the number of the classes and is defined as $\epsilon = \delta / (2|J|)$, where $\delta \in (0, 1)$. The choice of $\epsilon$ can be explained in this way: if there are $|J|$ classes and the training data are uniformly distributed, then for each probability $p^{l}_{ij} = 1/|J|$, its maximal variation range is between $p^{l}_{ij}(1-\delta)$ and $p^{l}_{ij}(1+\delta)$. The number of data intervals $K_l$ is defined as $K_l = |I|/(|J| \times K)$ such that, if the training data are uniformly distributed, then in each data interval there are $K$ samples in each class. In the following context, we set $\delta = 0.2$ and $K = 8$.
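To make these parameter choices concrete, the following sketch (a hypothetical helper, not from the paper) computes $\epsilon$ and $K_l$ from the sizes of a training set, with the defaults $\delta = 0.2$ and $K = 8$ stated above; the floor division is our assumption for noninteger ratios:

    def rpc_parameters(num_samples, num_classes, delta=0.2, K=8):
        """Compute the distributional-set radius eps and the number of
        data intervals K_l as described in Section 4.2 (a sketch; the
        function name and interface are ours)."""
        eps = delta / (2 * num_classes)          # eps = delta / (2|J|)
        K_l = num_samples // (num_classes * K)   # K_l = |I| / (|J| * K)
        return eps, K_l

    # Example: 400 training samples, 2 classes -> eps = 0.05, K_l = 25
    print(rpc_parameters(400, 2))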
We compare the performances of the proposed RPC model with the following regularized support vector machine model [6] (take the $j$th class, for example):
$$(\text{RSVM}) \quad \min \; \sum_{i \in I} \xi_{ij} + \lambda_j \left\| w_j \right\|$$
$$\text{s.t.} \quad \tilde{y}_{ij} \left( \sum_{l \in L} w^{l}_{j} x^{l}_{i} + b_j \right) \geq 1 - \xi_{ij}, \quad i \in I,$$
$$\xi_{ij} \geq 0, \quad i \in I, \qquad (38)$$
where $\tilde{y}_{ij} = 2y_{ij} - 1$ and $\lambda_j \geq 0$ is a regularization parameter. As pointed out by [8], $\lambda_j$ represents a trade-off between the number of training set errors and the amount of robustness with respect to spherical perturbations of the data points.
Table 1: Performances of RSVM, NBC, and RPC for binary classification problems on the Y5 data set.
tr (%) | RSVM Train (%) | RSVM Test (%) | NBC Train (%) | NBC Test (%) | RPC Train (%) | RPC Test (%)
To make a fair comparison, in the following experiments we will test a series of $\lambda$ values and choose the one with the best performance. Note that, if $\lambda_j = 0$, we refer to this model as the classic support vector machine (SVM). See also [6] for more details on RSVM and its applications to multiple classification problems.
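To illustrate, model (38) for a single class $j$ can be handed to an off-the-shelf conic solver. The sketch below uses cvxpy (our choice for illustration; the paper's experiments call SDPT3 from Matlab) on a feature matrix X and 0/1 labels y:

    import cvxpy as cp
    import numpy as np

    def rsvm_train(X, y, lam):
        """Solve model (38) for one class: minimize sum_i xi_i + lam*||w||
        subject to ytilde_i (w'x_i + b) >= 1 - xi_i and xi_i >= 0.
        A sketch; helper name and solver choice are ours."""
        n, d = X.shape
        ytilde = 2 * y - 1                      # map labels {0,1} -> {-1,+1}
        w, b = cp.Variable(d), cp.Variable()
        xi = cp.Variable(n, nonneg=True)
        constraints = [cp.multiply(ytilde, X @ w + b) >= 1 - xi]
        objective = cp.Minimize(cp.sum(xi) + lam * cp.norm(w, 2))
        cp.Problem(objective, constraints).solve()
        return w.value, b.value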
4.3. Test on Binary Classification. In this subsection, RSVM, NBC, and RPC are implemented on 12 data sets for the binary classification problems using cross-validation methods. To improve the performance of RSVM, we transform the original data by the popularly used polynomial kernels [6].
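This kernel preprocessing can be mimicked by an explicit polynomial feature expansion; a degree-2 expansion is sketched below (the degree is our assumption, as the paper does not state it):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[1.0, 2.0], [3.0, 4.0]])   # placeholder feature matrix
    # Degree-2 polynomial expansion, an explicit form of the polynomial
    # kernel used to preprocess the data before training RSVM.
    X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
    print(X_poly)   # columns: x1, x2, x1^2, x1*x2, x2^2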
Tables 1 and 2 show the averaged classification performances of RSVM, NBC, and the proposed RPC (over 10 randomly generated instances) for binary classification problems on the Y5 and T1 data sets, respectively. For each data set, we randomly partition it into a training set and a test set based on the parameter tr, which varies from 0.5 to 0.9. The highest classification accuracy on a training set among these three methods is highlighted in bold, while the best classification accuracy on a test set is marked with an asterisk.
Tables 1 and 2 validate the effectiveness of the proposed RPC for binary classification problems compared with NBC and RSVM. Specifically, for most of the cases, RSVM has the highest classification accuracy on training sets, but its performance on test sets is unsatisfactory. For most of the cases, the proposed RPC provides the highest classification accuracy on test sets. NBC provides better performances on test sets as the training rate increases. The experimental results also show that, for a given training rate, RPC can provide better performances on test sets than on training sets; thus it can avoid the "overlearning" phenomenon.
To further validate the effectiveness of the proposed RPC, we test it on 10 additional data sets, that is, T41–T45 and T61–T65. Table 3 reports the averaged performances of the three methods over 10 randomly generated instances when the training rate is set to 70%. Except for data sets T45, T63, and T64, RPC provides the highest accuracy on the test sets, and for all the data sets its accuracy is higher than 80%. As shown in Tables 1 and 2, the robustness of the proposed RPC guarantees its scalability on the test sets.
4.4. Test on Multiple Classification. In this subsection, we test the performance of RPC on multiple classification problems by comparison with RSVM and NBC. Since the performance of RSVM is determined by its regularization parameter $\lambda$, we run a set of RSVMs with $\lambda$ varying from 0 to a sufficiently large number and select the one with the best performance on test sets.
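This $\lambda$ sweep can be sketched as follows, reusing the hypothetical rsvm_train from the earlier sketch; the grid of $\lambda$ values is our assumption, and, as described above, the selection is made on test-set accuracy:

    import numpy as np

    def best_rsvm(X_tr, y_tr, X_te, y_te, lambdas=np.linspace(0, 10, 21)):
        """Run RSVM over a grid of regularization parameters and keep the
        one with the best test-set accuracy (a sketch of the procedure in
        Section 4.4; helper names are ours)."""
        best_acc, best_lam = -np.inf, None
        for lam in lambdas:
            w, b = rsvm_train(X_tr, y_tr, lam)         # hypothetical helper
            pred = (X_te @ w + b >= 0).astype(int)     # decision rule sign(w'x + b)
            acc = np.mean(pred == y_te)
            if acc > best_acc:
                best_acc, best_lam = acc, lam
        return best_lam, best_acc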
Figures 1 and 3 plot the performances of the three methods on the Y5 and T1 training sets, respectively. Unlike the case of binary classification problems, we can see that RPC provides a competitive performance even on the training sets. One explanation is that RSVM can outperform the proposed RPC on training sets for binary classification problems by finding the optimal separating hyperplane, while RPC is more robust to extend to multiple classification problems since it uses the nonlinear probability information of the data sets. The accuracy of NBC on the training sets also improves as the training rate increases.
Table 3: Performances of RSVM, NBC, and RPC for binary classification problems on other data sets when tr = 70%.
Data set | RSVM Train (%) | RSVM Test (%) | NBC Train (%) | NBC Test (%) | RPC Train (%) | RPC Test (%)
Figure 1: Performances of RSVM, NBC, and RPC on the Y5 training set.
Figures 2 and 4 show the performances of the three methods on the Y5 and T1 test sets, respectively. We can see that, for most of the cases, RPC provides the highest accuracy among the three methods.
Figure 2: Performances of RSVM, NBC, and RPC on the Y5 test set (accuracy on test set (%) versus training rate).
The accuracy of RSVM outperforms that of NBC on the Y5 test set, while the latter outperforms the former on the T1 test set.
To further test the performance of RPC on multiple classification problems, we carry out more experiments on data sets M1–M6. Table 4 reports the averaged performances of the three methods on these data sets when the training rate is set to 70%. Except for the M5 data set, RPC always provides the highest classification performance among the three methods, and even for the M5 data set its accuracy (88.0%) is very close to the best one (88.1%).
Figure 3: Performances of RSVM, NBC, and RPC on the T1 training set (accuracy on training set (%) versus training rate).

Figure 4: Performances of RSVM, NBC, and RPC on the T1 test set (accuracy on test set (%) versus training rate).
From the tested real-life applications, we conclude that the proposed RPC has the robustness to provide better performance for both binary and multiple classification problems compared with RSVM and NBC. The robustness of RPC enables it to avoid the "overlearning" phenomenon, especially for the binary classification problems.
5. Conclusion
In this paper, we propose a robust probability classifier model to address the data uncertainty in classification problems. To quantitatively describe the data uncertainty, a class-conditional distributional set is constructed based on the modified $\chi^2$-distance. We assume that the true distribution lies in the constructed distributional set centered at the nominal probability distribution. Based on the "linear combination assumption" for the posterior class-conditional probabilities, we consider a classification criterion using the weighted sum of the posterior probabilities. The optimal robust probability classifier is determined by minimizing the worst-case absolute error value over all the possible distributions belonging to the distributional set.
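For concreteness, a distributional set of this kind can be restated in the standard notation for the modified $\chi^2$-divergence (our schematic restatement, not a formula quoted from the paper, with $\hat{p}$ the nominal distribution and $\epsilon$ the radius from Section 4.2):
$$\mathcal{P} = \left\{ p \geq 0 : \sum_{k} p_k = 1, \ \sum_{k} \frac{(p_k - \hat{p}_k)^2}{\hat{p}_k} \leq \epsilon \right\}.$$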
Our proposed model introduces the recently developed distributionally robust optimization method into classifier design problems. To obtain a computable model, we transform the resulting optimization problem into an equivalent second-order cone program based on the conic duality theorem. Thus our model has the same computational complexity as the classic support vector machine, and numerical experiments on real-life applications validate its effectiveness. On the one hand, the proposed robust probability classifier provides a higher accuracy compared with RSVM and NBC by avoiding overlearning on training sets for binary classification problems; on the other hand, it also has a promising performance for multiple classification problems.
There are still many important extensions of our model. Other forms of loss function, such as the mean squared error function and hinge loss functions, should be studied to obtain tractable reformulations, and the resulting models may provide better performances. Probability models considering joint probability distribution information are also an interesting research direction.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
[1] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, New York, NY, USA, 1973.
[2] P. Langley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in Proceedings of the 10th National Conference on Artificial Intelligence (AAAI '92), vol. 90, pp. 223–228, AAAI Press, Menlo Park, Calif, USA, July 1992.
[3] B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, UK, 2007.
[4] V. Vapnik, The Nature of Statistical Learning Theory, Springer, Berlin, Germany, 2000.
[5] M. Ramoni and P. Sebastiani, "Robust Bayes classifiers," Artificial Intelligence, vol. 125, no. 1-2, pp. 209–226, 2001.
[6] Y. Shi, Y. Tian, G. Kou, and Y. Peng, "Robust support vector machines," in Optimization Based Data Mining: Theory and Applications, Springer, London, UK, 2011.
[7] Y. Z. Wang, Y. L. Zhang, F. L. Zhang, and J. N. Yi, "Robust quadratic regression and its application to energy-growth consumption problem," Mathematical Problems in Engineering, vol. 2013, Article ID 210510, 10 pages, 2013.
[8] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski, Robust Optimization, Princeton University Press, Princeton, NJ, USA, 2009.
[9] A. Ben-Tal and A. Nemirovski, "Robust optimization - methodology and applications," Mathematical Programming, vol. 92, no. 3, pp. 453–480, 2002.
[10] D. Bertsimas, D. B. Brown, and C. Caramanis, "Theory and applications of robust optimization," SIAM Review, vol. 53, no. 3, pp. 464–501, 2011.
[11] G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan, "Minimax probability machine," in Advances in Neural Information Processing Systems, pp. 801–807, 2001.
[12] G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan, "A robust minimax approach to classification," Journal of Machine Learning Research, vol. 3, no. 3, pp. 555–582, 2003.
[13] L. El Ghaoui, G. R. G. Lanckriet, and G. Natsoulis, "Robust classification with interval data," Tech. Rep. UCB/CSD-03-1279, Computer Science Division, University of California, 2003.
[14] K. Huang, H. Yang, I. King, and M. R. Lyu, "Learning classifiers from imbalanced data based on biased minimax probability machine," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. 558–563, IEEE, July 2004.
[15] K. Huang, H. Yang, I. King, M. R. Lyu, and L. Chan, "The minimum error minimax probability machine," The Journal of Machine Learning Research, vol. 5, pp. 1253–1286, 2004.
[16] C.-H. Hoi and M. R. Lyu, "Robust face recognition using minimax probability machine," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '04), vol. 2, pp. 1175–1178, June 2004.
[17] T. Kitahara, S. Mizuno, and K. Nakata, "Quadratic and convex minimax classification problems," Journal of the Operations Research Society of Japan, vol. 51, no. 2, pp. 191–201, 2008.
[18] T. Kitahara, S. Mizuno, and K. Nakata, "An extension of a minimax approach to multiple classification," Journal of the Operations Research Society of Japan, vol. 50, no. 2, pp. 123–136, 2007.
[19] D. Klabjan, D. Simchi-Levi, and M. Song, "Robust stochastic lot-sizing by means of histograms," Production and Operations Management, vol. 22, no. 3, pp. 691–710, 2013.
[20] L. V. Utkin, "A framework for imprecise robust one-class classification models," International Journal of Machine Learning and Cybernetics, 2012.
[21] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, UK, 2000.
[22] B. Schölkopf and A. J. Smola, Learning with Kernels, The MIT Press, Cambridge, Mass, USA, 2002.
[23] T. Hastie, R. Tibshirani, and J. J. H. Friedman, The Elements of Statistical Learning, Springer, New York, NY, USA, 2001.
[24] L. A. Zadeh, "A simple view of the Dempster-Shafer theory of evidence and its implication for the rule of combination," AI Magazine, vol. 7, no. 2, pp. 85–90, 1986.
[25] R. Yager, M. Fedrizzi, and J. Kacprzyk, Advances in the Dempster-Shafer Theory of Evidence, John Wiley & Sons, New York, NY, USA, 1994.
[26] J. F. Sturm, "Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones," Optimization Methods and Software, vol. 11, no. 1, pp. 625–653, 1999.
[27] K. C. Toh, R. H. Tütüncü, and M. J. Todd, "On the implementation and usage of SDPT3 - a Matlab software package for semidefinite-quadratic-linear programming, version 4.0," 2006, http://www.math.nus.edu.sg/~mattohkc/sdpt3/guide4-0-draft.pdf.
[28] A. Ben-Tal, D. D. Hertog, A. D. Waegenaere, B. Melenberg, and G. Rennen, "Robust solutions of optimization problems affected by uncertain probabilities," Management Science, vol. 59, no. 2, pp. 341–357, 2013.