Neurocomputing 55 (2003) 109-134
www.elsevier.com/locate/neucom

Hyperparameter design criteria for support vector classifiers

Davide Anguita*, Sandro Ridella, Fabio Rivieccio, Rodolfo Zunino

DIBE, Department of Biophysical and Electronic Engineering, University of Genova, Via all'Opera Pia 11a, 16145 Genova, Italy

Received 28 February 2002; accepted 8 January 2003

Abstract

The design of a support vector machine (SVM) consists of tuning a set of hyperparameter quantities, and requires an accurate prediction of the classifier's generalization performance. The paper describes the application of the maximal-discrepancy criterion to the hyperparameter-setting process, and points out the advantages of such an approach over existing theoretical frameworks. The resulting theoretical predictions are then compared with the k-fold cross-validation empirical method, which is probably the current best-performing approach to the SVM design problem. Experimental results on a wide range of real-world testbeds show that the maximal-discrepancy method can notably narrow the gap that has so far separated theoretical and empirical estimates of a classifier's generalization error.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Generalization error estimate; Hyperparameters tuning; k-fold cross-validation; Maximal-discrepancy criterion; Support vector classifier

1. Introduction

Vapnik developed support vector machines (SVMs) [7,23] as an effective model to optimize generalization performance by suitably controlling class-separating surfaces. The opportunity to find a subset of training patterns that draws the eventual class boundaries was a notable advantage of SVM classifiers on their introduction.

* Corresponding author. Fax: +39-010-353-2175.
E-mail addresses: [email protected] (D. Anguita), [email protected] (S. Ridella), [email protected] (F. Rivieccio), [email protected] (R. Zunino).

0925-2312/03/$ - see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/S0925-2312(03)00430-2


From a computational perspective, posing the training of SVMs as a constrained quadratic-programming (QP) problem [10] boosted their practical effectiveness. The widespread diffusion of SVMs resulted mostly from their successful applications in real-world domains, where SVMs performed effectively, especially in terms of generalization error. While SVM training relies on an efficient QP algorithm, the design of an SVM requires one to tune a set of quantities ('hyperparameters') that ultimately affect the classifier's generalization error. Therefore, the underlying problem of effectively setting the hyperparameters of an SVM is of major importance.

In spite of the several approaches proposed in the literature to characterize the generalization ability of SVMs theoretically [6,7,11,23,26], the most accurate estimates of their performance in real applications still remain empirical ones [8]. The main reason for this conflict between theory and practice lies in the goal of theoretical analysis, which mostly aims to cover widely general cases, thus yielding upper bounds to the classifier's generalization error; by contrast, empirical approaches deal with actual data distributions and hence give tighter, albeit sometimes 'optimistic', estimates.

The research presented in this paper faces the problem of SVM design by tackling hyperparameter setting as an indirect result of accurately estimating generalization performance. The paper adopts the maximal-discrepancy criterion [3] for bounding the classifier's generalization error; the possibility of applying such a sample-based method to SVMs represents a somewhat novel feature.

From a theoretical standpoint, the paper compares the maximal-discrepancy approach with existing theoretical frameworks. A significant result of this analysis lies in showing that the former yields tighter generalization bounds. This is mostly due to the overall simplicity of the model, which is not affected by the additional bounding terms that typically have to be taken into account when one considers the classifier's growth function or VC-dim explicitly.

From a practical perspective, the results from the maximal-discrepancy method also seem to come closer to the empirical estimates than theoretical predictions have usually done before. The empirical method that has so far been reported as best performing [8] is compared with the maximal-discrepancy approach. To this end, the method is tested on a set of different real-world applications, including disease diagnosis, OCR, and coin recognition. The use of standard testbeds makes comparisons feasible and reliable; in all cases, the considerable number of experiments carried out indicates that empirical and theoretical estimates span a narrow range of values, which is of practical advantage.

Section 2 briefly summarizes the SVM model, mainly to introduce the notation and to frame the SVM design problem. Section 3 addresses the various approaches to hyperparameter setting and describes the maximal-discrepancy method, comparing it with existing empirical and theoretical methods for generalization prediction. Section 4 presents the experimental results obtained on the different testbeds. Some concluding remarks are made in Section 5.

2. A framework for SVM classifiers

A learning machine (LM) can be thought of as a mapping:

$$ \mathrm{LM}: X \times S \times Y \to \Pi \qquad (1) $$


where X denotes the d-dimensional input space from which data are drawn, Y is the $n_y$-dimensional space of pattern classes ('targets'), and S is the space holding all the $n_s$ variables specifying the LM. The error on the data population is a number, $\pi \in \Pi$, where $\Pi$ is the closed real-valued interval between 0 and 1, and can be expressed as [3,23]:

$$ \pi = E_{x,y}\{ L(f(s,x),\, y) \} \qquad (2) $$

where $s \in S$, $x \in X$, $L(\cdot,\cdot)$ is a $\{0,1\}$-valued loss function measuring errors, and $f(s,x)$ is the LM estimate of the actual target value, y.

The variables characterizing an LM can be divided into parameters and hyperparameters; the former are optimized by using training data, whereas the latter typically determine the classifier's regularity and are tuned on a test set. If the LM is a support vector classifier [7], the target space is $Y = \{+1, -1\}$, and the LM structure can be written as

$$ f(s,x) = \sum_{i=1}^{n_{SV}} \alpha_i y_i K(x_i, x) + b \qquad (3) $$

where $n_{SV}$ is the number of support vectors, $\alpha_i$ are positive parameters, $x_i$ are the support vectors, $K(\cdot,\cdot)$ is a kernel function and b is a bias. The support vectors are selected from the input data set, hence $n_{SV} \le n_p$, where $n_p$ is the number of input patterns.

Expression (3) shows that $f(s,x)$ is a series expansion having $K(\cdot,\cdot)$ as a basis and involving part or all of the training examples. The choice of the particular basis $K(\cdot,\cdot)$ involves a mapping, $\Phi$, of the training data into a higher-dimensional space where the kernel function supports an inner product.

As a result of this well-known kernel trick, one can handle inner products of patterns in a space different from X, while disregarding the specific mapping of each single pattern. By using this notation, expression (3) is rewritten as

$$ f(s,x) = \sum_{i=1}^{n_{SV}} \alpha_i y_i \, \Phi(x_i) \cdot \Phi(x) + b \qquad (4) $$

and the class-separating surface is a hyperplane in the $\Phi$ space (Fig. 1).

Under the hypothesis of linearly separable data, an SVM will choose the maximum-margin separating hyperplane, that is, the one having the maximum distance from the two hyperplanes that are closest to each class (Fig. 1). It has been proved [23] that using that particular hyperplane conveys remarkable generalization properties. To accomplish the classification task with the maximum-margin solution, one needs to solve the following constrained minimum problem (primal problem):

$$ \min_{w,\xi,b} \left\{ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n_p} \xi_i \right\}, \quad \text{subject to: } y_i(w \cdot \phi_i + b) \ge 1 - \xi_i \ \ \forall i \qquad (5) $$

where w is a vector perpendicular to the separating surface, $\phi_i = \Phi(x_i)$, and C is a regularization parameter weighting the relevance of classification errors. Setting $C = \infty$ forces all $\xi_i$ to vanish, and corresponds to linearly separable problems.


Fig. 1. f(s,x) as a separating hyperplane lying in a high-dimensional space. Support vectors are inside circles.

Table 1
Some valid kernel functions

Kernel type                    Kernel structure
Linear                         $K(x_1, x_2) = x_1 \cdot x_2$
Polynomial                     $K(x_1, x_2) = (x_1 \cdot x_2 + 1)^p$
Radial basis function (RBF)    $K(x_1, x_2) = e^{-\gamma \|x_1 - x_2\|^2}$
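For concreteness, the kernels in Table 1 can be evaluated with a few lines of NumPy. The sketch below is an illustration added in this edit (function names and default parameter values are arbitrary), not code from the original paper.

```python
import numpy as np

def linear_kernel(X1, X2):
    # K(x1, x2) = x1 . x2, computed for every pair of rows of X1 and X2
    return X1 @ X2.T

def polynomial_kernel(X1, X2, p=2):
    # K(x1, x2) = (x1 . x2 + 1)^p
    return (X1 @ X2.T + 1.0) ** p

def rbf_kernel(X1, X2, gamma=1.0):
    # K(x1, x2) = exp(-gamma * ||x1 - x2||^2)
    sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                + np.sum(X2 ** 2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * sq_dists)
```

Gram matrices of this kind are what the dual problem (6) below operates on.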

The above formulation can take into account non-linearly separable cases by letting C take on finite values. The solution of the corresponding dual problem (DP) can be found [10,15] by maximizing the following cost function:

$$ \mathrm{DP} \equiv \max_{\alpha} \left\{ \sum_{i=1}^{n_p} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n_p} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \right\}, \quad \text{subject to: } 0 \le \alpha_i \le C \ \ \forall i, \quad \sum_{i=1}^{n_p} y_i \alpha_i = 0 \qquad (6) $$

where $\alpha_i$ are the Lagrange multipliers for the constraints in (5). The DP problem (6) can be solved by using a quadratic-programming algorithm [13,19]. Some valid kernel functions are listed in Table 1.

In the following, we shall denote the set of Lagrange multipliers, $\{\alpha_i\}$, of an SVM classifier as 'parameters'; the set of hyperparameters includes the error-weighting value, C, and any other specific quantity that shapes the kernel functions, such as $\gamma$ for RBF kernels. Hyperparameter values ultimately affect the complexity of the resulting function. For example, higher values of C will bring about fewer errors, but the separating function will turn out to be more convoluted; at the same time, kernel-specific hyperparameters can alter the effectiveness of the eventual separating surface.

In the specific case of RBFs, it has been proved [4] that one can classify any consistent training set with zero errors by using a sufficiently large value of $\gamma$. RBF kernels will be adopted as the default throughout the paper.


When applied to classification problems, the SVM decision function (3) assigns pattern classes and yields the empirical error rate, $\nu$, on the available data set. However, one should be aware that the solution of problem (6) does not actually seek the desired minimum of $\nu$. Indeed, it is known [4] that the solution of SVM training (6) yields a set of slack values $\xi_i$ for which the following bound holds:

$$ \nu \le \frac{1}{n_p} \sum_{i=1}^{n_p} \xi_i \qquad (7) $$

As a result, the second term of the cost function (5) that is actually optimized is just an upper bound to the error, $\nu$. One should also consider that the cost function involves the regularization term $\|w\|^2$. The use of such an analog cost function is mainly justified by the availability of efficient QP algorithms; in fact, processing discrete costs would require the expensive use of integer-programming techniques.
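As an illustration of bound (7), the following sketch (an assumption of this edit; it uses scikit-learn's SVC in place of the generic QP solver referenced in the paper) trains an RBF-kernel SVM for a given {C, gamma}, measures the empirical error rate with the 0/1 loss, and compares it with the slack-based term that the cost function (5) actually penalizes.

```python
import numpy as np
from sklearn.svm import SVC

def empirical_error_and_bound(X, y, C=10.0, gamma=0.1):
    """Return (nu, mean slack): nu <= (1/n_p) * sum_i xi_i, as in (7).

    y is assumed to take values in {-1, +1}.
    """
    clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)
    f = clf.decision_function(X)           # f(s, x) evaluated on the training patterns
    nu = np.mean(y * f <= 0)               # empirical error rate (0/1 loss)
    xi = np.maximum(0.0, 1.0 - y * f)      # slack variables xi_i = max(0, 1 - y_i f(x_i))
    return nu, xi.mean()
```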

3. Hyperparameter tuning criteria for support vector classifiers

SVM training based on the cost function (6) optimizes the values of the Lagrange multipliers $\{\alpha_i\}$, but requires that the values of the hyperparameters be fixed in advance. The problem of setting hyperparameters effectively is therefore strongly felt [6,8]; it does not relate to the empirical training error, but is rather formulated in terms of generalization performance. As a consequence, the SVM design process ultimately brings about the problem of evaluating $\pi$. Several tuning criteria exist for this purpose, and they can be roughly divided into two basic approaches.

Theoretical methods aim to derive the most stringent bounds to the expected generalization error, typically by setting the accuracy of the estimator as a function of the number of training patterns. A typical approach of this kind characterizes the expected generalization performance by statistically taking into account the possible distribution of test patterns that might be encountered [23].

Empirical methods, instead, aim to squeeze all possible information from the available data set, and ultimately rely on the fact that the data distribution is well represented by the available sample. A typical empirical technique is to put aside some portion of the data set during the training process, and to use it to assess generalization performance [8].

Statistical learning theory states that a machine featuring a small value of $\pi$ should score few errors on training data and, at the same time, exhibit low complexity. In practice, this property is often estimated empirically with a test set, and one tunes the hyperparameters to shape the best-performing function among those permitted by the specific kernel model.

3.1. A theoretical approach to hyperparameter tuning

This section describes a theoretical approach, called 'penalization by maximal discrepancy', to predicting generalization performance [3]. The method adopts the empirical error rate on the training set as an estimate of generalization performance.


Therefore, one first must train a support vector classifier on the training set by solving the dual problem (6), for a given setting of the hyperparameter values $\{C, \gamma\}$. Then one measures the empirical classification error attained by the resulting machine on the training patterns; in the following, such a quantity will be denoted as $\nu$.

Clearly, $\nu$ estimates $\pi$ but is subject to a bias, and some penalty should somehow take into account the classifier complexity. To compute the required correction term, the maximal-discrepancy method requires that the training data be randomly split into two halves; recent results [5] have also explored the possibility of splitting such data into unequal portions. Under the balanced-splitting assumption, the errors made on the two subsets can be defined as

$$ \nu_1 = \frac{2}{n_p} \sum_{i=1}^{n_p/2} l(f(x_i), y_i), \qquad \nu_2 = \frac{2}{n_p} \sum_{i=n_p/2+1}^{n_p} l(f(x_i), y_i) \qquad (8) $$

where $l(f(x_i), y_i) = 1$ if $f(x_i) \cdot y_i \le 0$, and $l(f(x_i), y_i) = 0$ otherwise.

The classifier's complexity term that is associated with the empirical estimate is then attained by maximizing the difference $(\nu_2 - \nu_1)$ over the parameter configurations, $\{\alpha_i\}$; in this process, the hyperparameters C and $\gamma$ are the same as those used for estimating $\nu$. The problem to be solved is therefore

$$ \max_{\alpha} (\nu_2 - \nu_1) \qquad (9) $$

The following inequality [3] eventually relates the maximum discrepancy to a bound, $\pi_{MD}$, on the generalization error:

$$ P\left\{ \pi_{MD} - \nu \ge \max_{\alpha}(\nu_2 - \nu_1) + \varepsilon \right\} \le e^{-2 \varepsilon^2 n_p / 9} \qquad (10) $$

Expression (10) is usually rewritten to derive the generalization bound

$$ \pi_{MD} \le \nu + \max_{\alpha}(\nu_2 - \nu_1) \qquad (11) $$

with probability $1 - \delta$ and an associated confidence interval given by $\varepsilon = 3\sqrt{\ln(1/\delta)/(2 n_p)}$.
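As a worked example of the size of this confidence term (using the $n_p = 500$ model-selection samples of Section 4 and $\delta = 0.05$):

$$ \varepsilon = 3\sqrt{\frac{\ln(1/0.05)}{2 \cdot 500}} = 3\sqrt{\frac{\ln 20}{1000}} \approx 3 \times 0.0547 \approx 0.164, $$

i.e. a 95% confidence interval of roughly plus or minus 16.4 percentage points, which matches the value of 16.42% quoted for the breast-cancer experiments in Section 4.1.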

In some respects, such an approach is similar to that used for estimating the effective VC-dimension [24]. The crucial difference here is that the discrepancy is used to estimate the generalization error directly, and neither the VC-dim nor the growth function enters the eventual expression for the estimate of $\pi$.

A simple and effective procedure to solve (9) was described in [3,20,24], and can

be outlined as follows:

1. Split the input data into two halves ($n_p$ is taken even).
2. Keep the targets of the patterns in the first half unchanged, and flip those in the second half. Denote the modified data set by $(X, Y')$.
3. Train an SVM classifier on $(X, Y')$.


4. Work out the empirical loss, $\bar{\nu}$, on the modified data set in terms of the real targets as

$$ \bar{\nu} = \frac{1}{n_p} \sum_{i=1}^{n_p} l(f(X_i), Y'_i) = \frac{1}{2} + \frac{1}{n_p} \sum_{i=1}^{n_p/2} l(f(X_i), Y_i) - \frac{1}{n_p} \sum_{i=n_p/2+1}^{n_p} l(f(X_i), Y_i) = \frac{1 + \nu_1 - \nu_2}{2} \qquad (12) $$

5. Compute the maximum discrepancy value as

$$ \max_{\alpha}(\nu_2 - \nu_1) = 1 - 2\bar{\nu} \qquad (13) $$
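A minimal Python sketch of steps 1-5 (an illustrative assumption of this edit, again relying on scikit-learn's SVC rather than the authors' own QP implementation) could read as follows.

```python
import numpy as np
from sklearn.svm import SVC

def max_discrepancy(X, y, C, gamma):
    """Estimate max_alpha(nu_2 - nu_1) by the target-flip procedure (9)-(13).

    X: (n_p, d) array with n_p even; y: targets in {-1, +1}.
    """
    n_p = len(y)
    y_flip = y.copy()
    y_flip[n_p // 2:] *= -1                                      # step 2: flip the second half
    clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y_flip)     # step 3: train on (X, Y')
    f = clf.decision_function(X)
    nu_bar = np.mean(y_flip * f <= 0)                            # step 4: empirical loss (12)
    return 1.0 - 2.0 * nu_bar                                    # step 5: maximum discrepancy (13)
```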

As a result, minimizing (12) amounts to maximizing the discrepancy value in (9). The above algorithm appears quite handy from a computational perspective, as it allows one to use conventional QP algorithms and SVM training strategies for solving (9). Under the caveats noted in Section 2 about approximation (7), which is involved in using the SVM optimization process for classification purposes, the practical advantages of the above-described algorithm seem overwhelming. That approximation proves acceptable as the implemented criterion applies effectively to SVM model selection; indeed, the target-flip approach has also been used by Vapnik [24] and Cherkassky [20] in their methods to find the effective VC dimension.

The above procedure applies for a fixed configuration of the hyperparameters $\{C, \gamma\}$. To extend such an approach to model selection, one simply iterates the algorithm involving the computation of $\nu$ and the discrepancy maximization (9) for various values of C and $\gamma$. Each configuration $\{C, \gamma\}$ will have an associated generalization bound (11). The optimal hyperparameter configuration will be the one solving

$$ \min_{C,\gamma} \left\{ \nu + \max_{\alpha}(\nu_2 - \nu_1) \right\} \qquad (14) $$

The confidence interval of the bound (11) does not enter the optimization process because the quantities involved do not depend on the hyperparameter values. In practice, the procedure is repeated for all pairs $\{C, \gamma\}$ in a discrete lattice with a given step.
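Putting the pieces together, the model-selection loop (14) over a discrete {C, gamma} lattice can be sketched as below; `max_discrepancy` and `empirical_error_and_bound` are the illustrative helpers defined earlier in this edit, and the lattice simply mirrors the decade-spaced ranges scanned in Section 4.

```python
def select_hyperparameters(X, y, C_grid, gamma_grid):
    """Pick {C, gamma} minimizing nu + max_alpha(nu_2 - nu_1), as in (14)."""
    best = None
    for C in C_grid:
        for gamma in gamma_grid:
            nu, _ = empirical_error_and_bound(X, y, C=C, gamma=gamma)
            bound = nu + max_discrepancy(X, y, C, gamma)   # bound (11), epsilon omitted
            if best is None or bound < best[0]:
                best = (bound, C, gamma)
    return best                                            # (estimated bound, C, gamma)

# Decade-spaced lattice: C in [1, 1e5], gamma^(-1) in [1, 1e7].
C_grid = [10.0 ** k for k in range(0, 6)]
gamma_grid = [10.0 ** (-k) for k in range(0, 8)]
```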

3.2. The maximal-discrepancy criterion and previous theoretical results

This section discusses the relationships between the maximal-discrepancy approach and other well-known theoretical approaches to generalization-performance bounding. In particular, Vapnik [23] developed a theoretical framework for predicting a classifier's generalization performance; a basic result of that comprehensive work is that, with probability $1 - \delta$, one has

$$ \pi \le \nu + \frac{\Lambda}{2}\left( 1 + \sqrt{1 + \frac{4\nu}{\Lambda}} \right) \qquad (15) $$

where $\Lambda = (4/n_p)\,[\ln GF(2 n_p) - \ln(\delta/4)]$; $GF(m)$ is the classifier's growth function [1], which by Sauer's lemma satisfies $GF(m) \le (e \cdot m / h)^h$, where h is the classifier's VC-dim.
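For comparison, the small helper below (an assumption of this edit, not code from the paper) evaluates bound (15) once the growth function has been replaced by its Sauer-lemma bound, which is precisely the chain of over-estimates discussed next. For instance, with $n_p = 500$, $\nu = 0$, $h = 50$ and $\delta = 0.05$ the bound already exceeds 1, illustrating its looseness.

```python
import math

def vapnik_bound(nu, n_p, h, delta=0.05):
    """Evaluate bound (15) with GF(2 n_p) upper-bounded by Sauer's lemma, (e*m/h)^h."""
    m = 2 * n_p
    ln_gf = h * math.log(math.e * m / h)                  # ln GF(2 n_p) <= h * ln(e*m/h)
    lam = 4.0 / n_p * (ln_gf - math.log(delta / 4.0))     # Lambda in (15)
    return nu + lam / 2.0 * (1.0 + math.sqrt(1.0 + 4.0 * nu / lam))
```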


Expression (15) does not allow one to sharply separate the complexity and confidence-correction terms; a comparison of (11) with (15) actually becomes feasible only when $\nu = 0$. Anyway, common practice clearly indicates that the maximum-discrepancy approach yields much tighter bounds than those resulting from (15).

Those benefits can be explained by considering a few theoretical issues. Although Vapnik's proof involves the same complexity term (9), subsequent derivations penalize that term by bringing in the concept of growth function, which is further upper-bounded by using Sauer's lemma. Finally, one should consider that the value of the VC-dim is most often unknown and subject to additional upper bounds.

Several approaches have been proposed in the literature to make up for the overestimates involved in Vapnik's framework. Sample-based techniques disregard the highest generality of the obtained solutions, and derive generalization bounds by taking into account the available training set. The previously cited methods by Vapnik and Cherkassky [20,23] lie in this context. As a matter of fact, estimating an effective value of the VC-dim does not free the overall framework from the subsequent penalties brought about by the growth function and Sauer's lemma. Fat-shattering methods [2] estimate the generalization error for SVMs by using margin properties; they derive a substitute for the VC-dim, which is defined in terms of the ratio of the radius to the margin and is computed as $h' = D^2\|w\|^2$. Otherwise, the number of support vectors has been used as an approximate assessment of the VC-dim [11]; such a technique is certainly interesting when the number, $n_{SV}$, is small.

As compared with all the above methods, the maximal-discrepancy technique seems to perform better (i.e., to yield tighter bounds) thanks to some crucial features: first of all, it has the advantages shared by all sample-based methods; moreover, the complexity term (9) involves a very small number of subsequent bounding operations; in particular, it does not take into account the classifier's growth function. In this sense, the relationships between the concepts of Vapnik's theory and the maximum-discrepancy approach can be illustrated by considering the following curious properties.

Property 1. Let A and B denote two learning machines having the same VC-dim: $h_A = h_B$. Then there exists some case in which $GF_A(m) \ne GF_B(m)$.

Property 2. Let A and B be two learning machines such that $\exists m: GF_A(m) = GF_B(m)$. Then there exists some case in which $E\{\max_A(\nu_2 - \nu_1)\} \ne E\{\max_B(\nu_2 - \nu_1)\}$ over the same sample.

The proofs of both properties are given in the appendix. The relevance of these facts to the generalization-assessment problem lies in defining a sort of graded hierarchy of quantities arranged according to their descriptive powers.

Property 1 seems to suggest the use of the classifier's growth function in place of its VC-dim whenever possible, especially because bypassing Sauer's lemma yields tighter generalization bounds. In that case, however, one usually faces the problem that the GF of a classifier is most often unknown and possibly harder to bound than its VC-dim.

Conversely, Property 2 also indicates that even characterizing a classifier in terms of its growth function may yield incomplete or inaccurate results.


Indeed, the sample-based nature of the discrepancy-based criterion may be better suited to the empirical data distribution, thus better rendering the classifier's expected performance.

In fact, some researchers have recently tried to characterize a classifier's generalization ability 'qualitatively' by checking for the existence of clusters within the target configurations spanned by the GF. The overall goal is to find a descriptor of a classifier's ability that improves on the trivial configuration counting supported by the GF. An analysis of that kind was based on the concept of 'connectedness' [21]. A theoretical and established explanation for this phenomenon still seems to be lacking, which opens interesting vistas for future work.

3.3. Hyperparameter tuning via performance measures

As the actual data distributions in real problems are not known in advance, one needs some reliable estimate of the classifier's performance; in several real cases, it often happens that the results from empirical methods exhibit a greater practical effectiveness than those predicted by theoretical criteria. The comprehensive research described in [8] presents a review of several techniques, covering both empirical and theoretical methods: the comparison includes, among others, k-fold cross-validation, the Xi-Alpha bound [14], GACV [25], the Approximate Span Bound [8], the VC Bound [23], and $D^2\|w\|^2$ [23]. All of the criteria considered were tested on a wide variety of datasets, and provided different results; for example, the span rule given in [23] performs well, although the Approximate Span Bound in [8] does not.

The comparative analysis reported therein showed that k-fold cross-validation (CV) proved the best method in terms of accuracy in predicting generalization performance (and therefore in terms of effectiveness in driving model selection). Such a technique requires one to partition the original training set into k non-overlapping subsets. The learning machine is trained on the union of $k-1$ subsets; then one uses the remaining kth subset as a provisional test set and measures the associated classification performance; in the following, that test error rate will be denoted as $\nu_2^{(j)}$. The procedure is cycled over all possible k test sets, and the average test error, $\pi_{CV}$, is regarded as a final index of generalization performance:

$$ \pi_{CV} = \frac{1}{k} \sum_{j=1}^{k} \nu_2^{(j)} \qquad (16) $$

In principle, the k-fold technique just aims to point out the classifier that promises the best generalization performance. As a matter of fact, the ultimate purpose of the overall process is not an accurate prediction of the eventual generalization error; model selection stems from the empirical comparison among different classifier configurations. This is mainly due to the fact that the empirical tests on folded partitions are correlated with one another, hence the resulting estimate (16) is affected by an empirical bias.

As a consequence, a reliable comparison between this method and the theoretical criteria described in the previous sections requires some prediction of the actual generalization error. In particular, one needs to express the associated estimate of $\pi$ by means of a correction term to be added to the empirical error on the training sets.


A consistent and theoretically sound baseline for that purpose can be provided by the statistical framework of bootstrap techniques [9].

The related procedure is very similar to that adopted for $\pi_{MD}$. The basic estimator still is the empirical classification error, $\nu$, attained on the entire data set by a trained SVM. The additional correction term to such an estimate is computed according to the k-fold approach: one trains an SVM by using $k-1$ subsets and minimizes the associated training error, $\nu_1^{(j)}$; then one measures the classification error, $\nu_2^{(j)}$, on the remaining data subset. The difference between those quantities gives a discrepancy term acting as an estimate-correction penalty. Iterating this algorithm over all possible partitions and averaging leads to the predicted generalization error, $\pi_{KF}$:

$$ \pi_{KF} = \nu + \frac{1}{k} \sum_{j=1}^{k} \left[ \nu_2^{(j)} - \nu_1^{(j)} \right] \qquad (17) $$

where $\sqrt{\ln(1/\delta)/(2 n_p / k)}$ is the associated confidence interval [12]. The second term in (17) expresses the bias displacing the resubstitution error from the generalization error.
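A minimal sketch of the corrected k-fold estimator (17) follows; the use of scikit-learn's SVC and KFold utilities is an assumption of this edit, chosen only to make the procedure concrete.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def kfold_estimate(X, y, C, gamma, k=5):
    """Return pi_KF = nu + (1/k) * sum_j [nu_2^(j) - nu_1^(j)], as in (17)."""
    # Resubstitution error nu of the SVM trained on the whole basic set.
    clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)
    nu = np.mean(clf.predict(X) != y)

    correction = 0.0
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        fold_clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X[train_idx], y[train_idx])
        nu1 = np.mean(fold_clf.predict(X[train_idx]) != y[train_idx])  # training error on k-1 folds
        nu2 = np.mean(fold_clf.predict(X[test_idx]) != y[test_idx])    # test error on held-out fold
        correction += (nu2 - nu1) / k
    return nu + correction
```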

pirical approaches to estimating �. The complexity penalty in (11) features a worst-casediscrepancy term, max�(�2 − �1), that is typical of theoretical approaches; conversely,the parallel complexity-related contribution in (17) stems from a di>erent discrepancyproblem, �(k)2 − min� �(k−1)

1 , that is instead characteristic of empirical methods. ThisreKects the practical nature of the k-fold method, which requires that generalizationperformance be estimated after minimizing the empirical error, and indirectly explainswhy the associated estimates prove smaller than theoretical ones.It is also worth noting that expression (17) reduces to the conventional quantity used

in k-fold CV (16) if the average training error over the several subsets equals the em-pirical error rate on the entire training set, that is, if � ∼= (1=k)

∑kj=1 �(k)1 ; incidentally,

this result has actually been verified throughout all the experiments performed in the present research.

These properties also hold for unbalanced partitions using $k \ne 2$, provided one exploits the results presented in [5]. Clearly, no comparison is easy to perform when the two approaches use different split portions. The work presented in [8] gives the best value, $k = 5$, for the splitting strategy. Therefore, thanks to the comprehensive nature of that comparison, the k-fold method with $k = 5$ will be adopted in the following as the reference paradigm for the empirically driven design of hyperparameters, and $\pi_{KF}$ will denote the associated estimate of the classifier's generalization error.

4. Hyperparameter setting in practice

This section compares the theoretical and practical design criteria experimentally. K-fold cross-validation is used as a reference method to assess the performance of the maximal-discrepancy criterion. The experiments first compare the lowest values of the generalization-error estimates obtained by the two methods over an area of the hyperparameter space; then the two model-selection approaches are also compared by


estimating the actual generalization error they eventually exhibit. The consistency between theory and practice is analyzed by considering different real-world applications, which provide standard testbeds to make comparisons feasible. The results show that the maximal-discrepancy approach actually fits the empirical findings obtained by the k-fold method quite accurately.

The experimental procedure adopted was the following. For each testbed, the available data set was split into a 'basic' set for model selection and a 'validation' set for assessing the true generalization error of each classifier. The basic data set underwent an empirical calibration of the hyperparameters, according to the 5-fold cross-validation runs described in [8]. The constraint parameter was sampled in the range $C \in [1, 10^5]$, whereas the $\gamma$ parameter swept the range $\gamma^{-1} \in [1, 10^7]$, and the associated generalization prediction, $\pi_{KF}$, was computed according to (17). The most promising values of $\{C, \gamma\}$, yielding the optimal error estimate, were recorded. Likewise, the same intervals were scanned to search for the best-performing values of $\{C, \gamma\}$ by using the theoretical approach described in Section 3.1. For this method, the optimal values of the hyperparameters were recorded, together with the generalization estimate, $\pi_{MD}$, computed according to (11).

The comparison between the two approaches is attained both by matching the specific hyperparameter settings and by measuring the discrepancies between the individual error estimates. The former comparison gives an estimate of the robustness of the setting method, whereas the latter measures the relative accuracies of the compared approaches. The research then also validates experimentally the actual prediction performance of both methods and the overall consistency of the associated model-selection criteria.

4.1. The breast cancer testbed

Wisconsin's breast cancer database [27] collects 699 diagnostic cases; after removing 16 cases with 'missing' values, the dataset cardinality eventually amounted to $N_p = 683$ patterns. Each pattern (patient) was represented by a nine-dimensional feature vector. The discrimination problem featured an almost balanced distribution between the two classes in the original set of patients. The data set was divided into two sets of 500 and 183 patterns, respectively. The former was used for both the k-fold and the theoretical model-selection procedures, whereas the latter provided the 'validation' set for assessing the true generalization error of the designed classifiers.

When applied to the 500-pattern sample, the 5-fold procedure gave the results shown in Table 2(a). For each value of C, the smallest predicted error, $\pi_{KF}$, computed according to (17), identifies the most promising $\gamma^{-1}$ value; those row minima therefore provide the actual solutions of the hyperparameter-setting problem. For graphical purposes, the error distribution is also given as a 3-D surface (Fig. 2).

The relatively small values obtained witness the possibly simple nature of this problem. This result is confirmed by the outcomes of the theoretical generalization criterion, $\pi_{MD}$ (11), which are provided in Table 2(b) and graphically presented in Fig. 3.

The comparative analysis of the obtained results can be performed from different

viewpoints. First of all, as far as the hyperparameter-design problem is concerned,


Table 2
BCancer: percentage generalization error (rows: C; columns: gamma^-1 = 1, 10, 10^2, 10^3, 10^4, 10^5, 10^6, 10^7)

(a) Estimated empirically by 5-fold, pi_KF
C = 1       9.84   3.28   2.73   1.64   1.64   2.73   3.82   3.83
C = 10      9.84   3.28   2.73   2.19   1.64   1.64   2.73   2.73
C = 10^2    9.84   3.28   3.28   2.19   2.19   1.64   2.19   3.28
C = 10^3    9.84   3.28   3.57   3.28   2.19   2.19   1.64   2.19
C = 10^4    9.84   3.28   3.83   4.92   2.19   2.19   2.19   1.64
C = 10^5    9.84   3.28   3.83   3.28   4.37   2.19   2.19   2.19

(b) Predicted theoretically, pi_MD
C = 1       74.0   55.3   24.3   10.3    8.5   41.0   43.0   40.0
C = 10      74.0   65.0   40.8   15.8   10.0   10.5   40.0   41.0
C = 10^2    76.0   72.0   53.0   25.3   12.8   10.4    8.5   44.0
C = 10^3    75.0   74.0   61.0   32.0   16.5   10.8   10.4   10.5
C = 10^4    76.0   76.0   67.0   42.0   22.3   14.8   13.8   10.4
C = 10^5    75.0   75.0   73.0   54.0   29.3   17.5   12.8   11.8

(c) Measured on the validation set, pi
C = 1       12.6    4.0    3.4    2.3    2.9    3.4    4.6    6.9
C = 10      12.6    3.4    4.6    3.4    2.3    2.9    3.4    4.6
C = 10^2    12.6    3.4    4.0    4.0    2.9    2.3    2.9    3.4
C = 10^3    12.6    3.4    4.0    5.2    3.4    3.4    2.3    2.9
C = 10^4    12.6    3.4    4.6    5.2    4.0    3.4    3.4    2.3
C = 10^5    12.6    3.4    4.6    6.9    5.2    3.4    3.4    3.4

Fig. 2. BCancer: generalization estimates by 5-fold cross-validation.


Fig. 3. BCancer: generalization estimates by theoretical prediction.

Fig. 4. BCancer: comparative results of hyperparameter design.

experimental evidence shows a remarkable fit between the empirical and theoretical approaches in suggesting the most promising classifier configurations (Fig. 4).

Secondly, the two model-selection methods agree in predicting the overall generalization error. The 95%-confidence value of the empirical minimum is given by $\pi_{KF}^{(0.05)} = 12.38\%$ $(1.91 + 10.47)$, which well covers the theoretical prediction $\pi_{MD} = 7.2\%$.


Fig. 5. BCancer: generalization errors.

Finally, the accuracy of the two estimators was compared by sweeping the hyperparameter space again, training a set of SVMs with all of the 500 patterns available, and measuring the classification errors on the remaining validation patterns. This provided rough estimates of the classifiers' generalization performance. The obtained results are presented in Table 2(c) and graphically displayed in Fig. 5. Empirical evidence confirms the validity of both model-selection methods as estimators of the eventual value of $\pi$.

As expected, the empirical method using the k-fold technique yields a better approximation over the whole hyperparameter space, whereas the maximal-discrepancy criterion ascribes a severe penalty term to those configurations liable to overfitting. The relative advantage of the k-fold approach is mainly due to the empirical nature of the estimation procedure itself.

Conversely, it is worth stressing that both methods agree in the error predictions associated with the model-selection outcomes. If 95% confidence intervals are rated acceptable, the most promising hyperparameter setting derived from Table 2(b), $\{C = 10^2,\ \gamma^{-1} = 10^6,\ \pi_{MD} = 8.5 \pm 16.42\%\}$, predicts a generalization error that well covers the value predicted by $\pi_{KF} = 1.64 \pm 12.24\%$ and the corresponding empirical measure of $\pi = 2.9 \pm 9.05\%$ in Table 2(c). Of course, these values are affected by quite wide confidence intervals due to the relatively small number of patterns involved.
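These intervals follow directly from the confidence terms quoted in Section 3; as a quick check,

$$ \varepsilon_{MD} = 3\sqrt{\frac{\ln 20}{2 \cdot 500}} \approx 0.1642, \qquad \varepsilon_{KF} = \sqrt{\frac{\ln 20}{2 \cdot 500 / 5}} \approx 0.1224, \qquad \varepsilon_{val} = \sqrt{\frac{\ln 20}{2 \cdot 183}} \approx 0.0905, $$

which reproduce the 16.42%, 12.24% and 9.05% figures above.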

4.2. The diabetes testbed

The 'Pima Indians diabetes database' [22] provided another clinical testbed. The data set included 768 patient descriptions, each characterized by 8 features; the two classes were quite intermixed, which made this sample rather difficult to discriminate.


Table 3
Diabetes: percentage generalization error (rows: C; columns: gamma^-1 = 1, 10, 10^2, 10^3, 10^4, 10^5, 10^6, 10^7)

(a) Error estimated empirically, pi_KF
C = 1       33   33   31   26   21   22   33   33
C = 10      33   33   35   27   20   21   21   33
C = 10^2    33   33   35   30   21   20   21   22
C = 10^3    33   33   35   33   24   22   23   21
C = 10^4    33   33   35   33   22   20   20   19
C = 10^5    33   33   35   33   21   27   19   21

(b) Predicted theoretically, pi_MD
C = 1       100.0   100.0    98.0    68.5   40.3   33.5   40.8   37.8
C = 10      100.0   100.0   100.0    86.3   48.8   34.5   28.3   38.8
C = 10^2    100.0   100.0   100.0    95.5   61.5   44.0   32.8   29.0
C = 10^3    100.0   100.0   100.0   100.0   73.5   45.3   36.5   34.8
C = 10^4    100.0   100.0   100.0   100.0   84.3   52.8   42.0   35.0
C = 10^5    100.0   100.0   100.0   100.0   92.5   58.5   42.0   38.3

(c) Measured experimentally, pi
C = 1       41   41   42   35   28   32   41   41
C = 10      41   41   42   35   28   27   29   41
C = 10^2    41   41   42   47   30   28   26   28
C = 10^3    41   41   42   48   30   24   27   25
C = 10^4    41   41   42   48   34   27   26   28
C = 10^5    41   41   42   48   41   30   25   27

In this case, too, 500 patterns formed the data set used for model selection, and the remaining 268 were used for subsequent validation.

Table 3(a) and the associated graph in Fig. 6 provide the experimental results obtained by processing the diabetes dataset with the k-fold procedure for generalization assessment, as computed in (17). The relatively large values of the estimated error, $\pi_{KF}$, witness the difficulty of this testbed in separating the classes consistently.

The data set was then processed by using the maximal-discrepancy estimation criterion (11), which yielded the results presented in Table 3(b) and Fig. 7. The difficult testbed distribution was confirmed by the fact that a significant portion of the hyperparameter configurations did not allow a meaningful prediction of the generalization error (in several cases, $\pi_{MD} = 100\%$).

In spite of the fact that the theoretical predictions for large values of the $\gamma$ parameter appeared useless, the values in the 'valid' region of Table 3(b) matched empirical evidence quite well. The two methods also agreed substantially in choosing the hyperparameter configurations promising the smallest generalization estimates for the possible values of C (Fig. 8).

Table 3(c) and Fig. 9 give the classification errors on the validation set, and allow one to compare the two methods in terms of predicted generalization error. When


Fig. 6. Diabetes: generalization estimates by 5-fold cross-validation.

Fig. 7. Diabetes: generalization estimates by theoretical prediction.

averaging over the entire hyperparameter space, the k-fold approach seems to perform better than the maximal-discrepancy one; nevertheless, for those configurations supporting model selection, the theoretical estimator better succeeds in rendering the inherent problem complexity and proves much more accurate than the k-fold estimator, which turns out to be overly 'optimistic'.


Fig. 8. Diabetes: comparative results of hyperparameter design.

Fig. 9. Diabetes: generalization errors.

4.3. The NIST handwritten-numerals testbed

The optical character recognition application based on the NIST database of handwritten numerals offered a significant real-world problem to verify the method's consistency. In order to reduce the optimization complexity, 500 patterns were randomly extracted from the original data set (holding 60,000 patterns) to make up the training set; two-class problems were built up by sampling separate pairs of numerals (250 patterns per digit). For the sake of simplicity and brevity, among the several


Table 4
NIST49: percentage generalization error (rows: C; columns: gamma^-1 = 1, 10, 10^2, 10^3, 10^4, 10^5, 10^6, 10^7)

(a) Estimated empirically, pi_KF
C = 1       0.40   0.00   0.90   7.10   8.00   8.00   8.00   8.00
C = 10      0.40   0.00   0.10   0.90   6.90   8.00   8.00   8.00
C = 10^2    0.40   0.00   0.20   0.10   0.90   6.90   8.00   8.00
C = 10^3    0.40   0.00   0.30   0.30   0.10   0.90   6.90   8.00
C = 10^4    0.40   0.00   0.30   0.40   0.30   0.10   0.90   6.90
C = 10^5    0.40   0.00   0.30   0.40   0.40   0.30   0.10   0.90

(b) Estimated theoretically, pi_MD
C = 1       0.98   0.32   0.87   6.64   7.84   7.83   7.84   7.84
C = 10      1.00   0.75   0.45   0.88   6.63   7.84   7.82   7.84
C = 10^2    1.00   1.00   0.41   0.40   0.89   6.63   7.84   7.83
C = 10^3    1.00   1.00   0.74   0.29   0.40   0.89   6.64   7.83
C = 10^4    1.00   1.00   0.98   0.41   0.27   0.42   0.87   6.65
C = 10^5    1.00   1.00   1.00   0.73   0.31   0.30   0.42   0.88

(c) Estimated on the validation set
C = 1       9.82   2.89   4.61   17.58   18.09   18.09   18.09   18.09
C = 10      9.61   2.81   3.54    4.59   17.61   18.09   18.09   18.09
C = 10^2    9.61   2.81   3.62    3.66    4.6    17.61   18.09   18.09
C = 10^3    9.61   2.81   3.82    3.85    3.7     4.59   17.61   18.09
C = 10^4    9.61   2.81   3.82    4.28    3.89    3.7     4.59   17.6
C = 10^5    9.61   2.81   3.82    4.28    4.43    3.89    3.7     4.59

experiments performed with this sampling methodology, only the results on the problem '4' versus '9' are reported here. This case has been picked mainly because it involved the hardest pair of numerals to separate for the SV classifier; all other results were substantially similar to those presented in this section.

Each twin-class data set held 80-dimensional patterns; feature extraction followed the procedure described in [17]. The 5-fold procedure for the 500 extracted patterns gave the generalization estimates (17) shown in Table 4(a) and Fig. 10. The figures indicated a rather surprising result, as the SVM classifier always managed to separate the two classes consistently. In fact, such a peculiar distribution in the NIST database had already been observed previously [16], and mainly depends on the fact that the pattern categories in the overall NIST training set are confined to thick, well-separated clusters that make very small errors possible in both the training and testing phases.

A very interesting result is obtained when comparing the empirical estimates with the theoretical predictions, $\pi_{MD}$ as per (11). Table 4(b) and Fig. 11 give the results of the tests performed in the usual formats. In principle, one should expect theory to diverge significantly from empirical evidence, mainly due to the random, independent settings of pattern classes in the maximal-discrepancy method. Instead, the figures shown are in notable accordance with the values in Table 4(a). From this viewpoint, the


Fig. 10. NIST49: generalization estimates by 5-fold cross-validation.

Fig. 11. NIST49: generalization estimates by theoretical predictions.

agreement of the generalization estimates in this peculiar case ultimately provided important support for the validity of both estimation approaches.

It is worth noting that the actual estimates associated with the best-performing configurations in the two cases were quite close, as the theoretical criterion yielded an


Fig. 12. NIST49: generalization performance on the validation set.

estimated error $\pi_{MD} = 0.27\%$. The most striking difference between the two estimation methods was observed when considering the hyperparameter-design results, which differed in the two cases especially for larger values of the hyperparameter C. The following discussion will point out that this deviation is not as significant as it might seem, when considering the overall performance.

Indeed, the huge sample size of the NIST testbed made an additional measurement feasible to get a reliable estimate of the true generalization error. The database provided a validation set (including 58,646 patterns), which previous research had proved quite difficult to classify. The validation set had been drawn from a distribution apparently quite 'distant' from that used for the training set [16,17]. The validation subset holding numerals '4' and '9' amounted to 11,535 patterns.

Generalization measurements followed a standard procedure: for each hyperparameter configuration tested previously, all of the original 500 patterns were used to train an SVM, whose classification error was subsequently measured on the validation set. The considerable size of the latter set made the 95% confidence interval very narrow ($\varepsilon = 1.14\%$). Table 4(c) and Fig. 12 present the results in the usual formats, allowing comparisons with the empirical and theoretical predictions.

For each setting of C, the row minima in Table 4(c) indicate the best-performing value of $\gamma$. Such results clearly highlight the effectiveness of the 5-fold method, in terms of both consistency in setting the hyperparameters and accuracy in predicting $\pi$ (the 95% confidence interval associated with the null values of $\pi_{KF}$ in Table 4(a) is $\varepsilon = 12.24\%$).

One should also note that the error distribution in Table 4(c) has two 'valleys': one coincides with the minima in Table 4(a) (empirical method) and the other


Table 5
Coins: percentage generalization error estimated theoretically, pi_MD (rows: C; columns: gamma^-1 = 1, 10, 10^2, 10^3, 10^4, 10^5, 10^6, 10^7)

C = 1       4.0   5.0   5.0   6.0   6.0   5.0   3.0   4.0
C = 10      6.0   7.0   5.0   5.0   4.0   5.0   5.0   4.0
C = 10^2    3.0   6.0   5.0   4.0   3.0   6.0   5.0   4.0
C = 10^3    3.0   3.0   6.0   5.0   4.0   4.0   4.0   4.0
C = 10^4    3.0   1.0   2.0   6.0   7.0   5.0   4.0   5.0
C = 10^5    4.0   5.0   4.0   3.0   6.0   6.0   4.0   4.0

with those in Table 4(b) (theoretical method). This very important result clearly indicates that both methods ultimately performed satisfactorily, as they always succeeded in suggesting effective SVM designs.

4.4. The coin-recognition testbed

This pattern-recognition testbed involves the classification of coins in vending machines. The original domain involved five pattern classes, and was converted into a two-class problem by a procedure similar to that used for the NIST database, namely, by considering only two categories. The resulting data set included 20,000 patterns, each described by a five-dimensional feature vector representing electromagnetic measurements of the observed coins. The classes were balanced in the data set, which was split into a training set of 1,000 patterns and a validation set of 19,000 patterns. This partitioning mainly aimed to limit the training-set size for computational purposes and to tighten the confidence interval on the validation results.

A peculiarity of the coin-recognition database lies in the sample distribution, which is fairly easy to classify; indeed, the entire data set was composed of two well-separated clusters with very few errors. In spite of such apparent simplicity, this real-world testbed has practical significance [18], and has been included here because it can help clarify the behavior of the generalization prediction criteria.

The crucial issue about those tests was that the empirical training errors invariably turned out to be null for both the maximal-discrepancy and the 5-fold training sessions, regardless of the specific hyperparameter settings. For the empirical estimator (17), this also implied a null cross-validation term, hence $\pi_{KF} \equiv 0$ for all C and $\gamma$. The tests with flipped targets to maximize the discrepancy (9), instead, resulted in different correction terms for the various hyperparameter configurations. The results of the maximal-discrepancy model-selection tests, pointing out a best classification error of $\pi_{MD} = 1\%$, are shown in Table 5 and Fig. 13.

The actual generalization error over the validation set always turned out to be very small (0.1%) for all configurations, again as a result of the particularly favorable sample distribution. From this viewpoint, the k-fold estimator seemed to perform more accurately than the theoretical one; nevertheless, a noticeable aspect of those


Fig. 13. Coins: generalization estimates by theoretical prediction.

experiments was that the predictions based on the maximal-discrepancy criterion were in close agreement with empirical evidence. The fact that a theoretically derived estimator also exhibits practical validity is not often shared by formal approaches to generalization performance, and denotes a basic merit of the related framework. Such a feature mostly results from the sample-based nature of the correction procedure.

5. Discussion and conclusions

A crucial issue in setting up an effective SVM classifier often concerns the method for the eventual hyperparameter setting; in particular, one generally faces the odd situation in which established theory seldom supports the choices that would instead be suggested by empirical expertise. This is typically due to the fact that theoretical predictions aim to attain the widest generality, also covering peculiar situations that do not usually occur in real-world cases.

This paper has considered a theoretical and an empirical method for predicting a classifier's generalization performance within the context of SVMs, and hence for choosing optimal hyperparameters. The k-fold cross-validation approach has been adopted as a 'reference' paradigm mainly because it is the favorite approach [8] to effectively setting up a classifier on the basis of its generalization behavior; the maximal-discrepancy criterion, which is relatively novel in the context of SVMs, has been chosen because it benefits from both a sound theoretical framework and the practical effectiveness of sample-based methods.

The different operations of the two approaches help clarify the existing gap between theory and practice. Both methods split the available sample into two subsets; the theoretical approach, however, exploits both subsets to derive a bound on the generalization


error, whereas the empirical approach uses one subset for training and the other for estimating $\pi$. As a result, the former approach often overestimates the error but, for example, need not tune the additional parameter k, which governs data partitioning.

From this viewpoint, a first, significant achievement of the comparative research presented in this paper lies in showing that the estimates of $\pi$ obtained by the maximal-discrepancy method and those attained by k-fold CV span quite a narrow interval; as a result, one may choose to adopt the theoretical estimator in order to limit the computational cost of the empirical tests usually required by CV-based techniques. Secondly, experimental results also show that the two approaches often concur in the model-selection outcomes, as the corresponding hyperparameter settings are strongly correlated. Finally, both paradigms seem to provide reliable estimates of the resulting classifier's actual generalization error. As expected, the empirical method based on CV yields better estimates whenever the nature of the problem or the validation set is well correlated with the training data. In difficult problems with complex sample distributions, however, the general approach supported by a theoretical framework holds great promise and often proves more satisfactory.

Therefore, an important conclusion that can be drawn from the presented research is that the maximal-discrepancy framework is of high practical interest for SVM design. The paper has also analyzed the possible advantages of the proposed approach over existing theoretical frameworks.

In spite of the aforesaid apparent convergence, a gap still exists between practice and theory. In this respect, the paper has also outlined possible lines of research to further analyze the obtained results. The idea of connectedness, for example, might be developed and applied to interpret theoretical predictions in order to overcome the overestimates that often affect approaches based only on the concept of growth function or VC-dim.

Acknowledgements

The authors wish to thank the reviewers for their critical reading of the manuscript and for providing valuable suggestions, which helped improve the technical contents and the overall presentation quality.

Appendix

Property 1. Let A and B denote two learning machines having the same VC-dim: $h_A = h_B$. Then there exists some case for which $GF_A(m) \ne GF_B(m)$.

Proof. Let A be a linear perceptron over the d-dimensional space; it is known [23] that, in this case, $h_A = d + 1$ and $GF_A(m) = 2^m$ if $m \le h_A$, while $GF_A(m) = 2\sum_{i=0}^{d} \binom{m-1}{i}$ if $m > h_A$. Let B be a nearest-prototype classifier with $n_h = d + 1$ prototypes; in this


case, theory shows [17] that $h_B = n_h = d + 1$ and $GF_B(m) = 2^{n_h}$ independently of m. Therefore one has $h_A = h_B$ and, $\forall m > d + 1$, $GF_A(m) \ne GF_B(m)$.

Property 2. Let A and B be two learning machines such that $\exists m: GF_A(m) = GF_B(m)$. Then there exists some case for which $E\{\max_A(\nu_2 - \nu_1)\} \ne E\{\max_B(\nu_2 - \nu_1)\}$ over the same sample.

Proof. Consider a discrete distribution, X, including $n_p$ patterns drawn from the set Z of integer numbers in the range $(-\infty, +\infty)$; possible target values are '1' and '0'. Let A be the ray classifier [1] defined as: $f_A^{(T)}(x) =$ '1' if $x \ge T$, and $f_A^{(T)}(x) =$ '0' otherwise. Clearly, the number of pattern configurations spanned over X that A can classify correctly is $GF_A(n_p) = n_p + 1$. Computing $\max_A(\nu_2 - \nu_1)$ follows the procedure described in (13); the classification error, $\bar{\nu}_A$, averaged over all $2^{n_p}$ target configurations, can be obtained as

$$ E\{\bar{\nu}_A\} = \frac{1}{2^{n_p}} \sum \frac{1}{n_p} \min\{\#\text{ of '0' following the leftmost '1'},\ \#\text{ of '1' preceding the rightmost '0'}\} $$

By imposing $n_p = 4$, one has $E\{\bar{\nu}_A\} = 12/64 = 0.1875$ and $E\{\max_A(\nu_2 - \nu_1)\} = 1 - 2E\{\bar{\nu}_A\} = 0.625$.

Let now B be the 'one-point' classifier, defined as: $f_B^{(x_0)}(x_0) =$ '1', $x_0 \in Z$, and $f_B^{(x_0)}(x) =$ '0' $\forall x \ne x_0$, $x \in Z$. One can easily verify that the number of target assignments over the sample X that B can handle correctly is $GF_B(n_p) = n_p + 1$. In order to compute $\bar{\nu}_B$, let k denote, for a given target configuration, the number of patterns in X labeled as class '1'; then one enumerates all possible target configurations and counts classification errors as

$$ E\{\bar{\nu}_B\} = \frac{1}{2^{n_p}} \sum_{k=1}^{n_p} \binom{n_p}{k} \frac{k-1}{n_p} $$

In the sample case, $n_p = 4$ gives $E\{\bar{\nu}_B\} = 0.2656$ and $E\{\max_B(\nu_2 - \nu_1)\} = 1 - 2E\{\bar{\nu}_B\} = 0.4688$. Thus, $GF_A(n_p) = GF_B(n_p)$ $\forall n_p$, but $E\{\max_A(\nu_2 - \nu_1)\} \ne E\{\max_B(\nu_2 - \nu_1)\}$.

When applying the concept of connectedness to the above sample cases, one enumerates the target configurations that are covered by the growth function of each classifier. In the case $n_p = 4$, the 5 configurations accounted for by the GFs can be listed as follows:

A ≡ Ray classifier        B ≡ One-point classifier
'0000'                    '0000'
'0001'                    '0001'
'0011'                    '0010'
'0111'                    '0100'
'1111'                    '1000'


Averaging the Hamming distance, H, over the configuration set associated with each classifier gives the average values $H_A = 2$ and $H_B = 1.6$ for the ray and one-point classifiers, respectively. This result indicates that classifier B is more connected than A. According to the theory illustrated in [21], this justifies the fact that classifier B exhibits a smaller discrepancy value than A, even though the two growth functions coincide.

References

[1] M. Anthony, N. Biggs, Computational Learning Theory, Cambridge University Press, Cambridge, 1992.
[2] P. Bartlett, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Trans. Inform. Theory 44 (2) (1998) 525-536.
[3] P. Bartlett, S. Boucheron, G. Lugosi, Model selection and error estimation, Mach. Learn. 48 (1-3) (2002) 85-113.
[4] C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (2) (1998) 121-167.
[5] A. Cannon, J.M. Ettinger, D. Hush, C. Scovel, Machine learning with data dependent hypothesis classes, J. Mach. Learn. Res. 2 (2002) 335-358.
[6] O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, Choosing multiple parameters for support vector machines, Mach. Learn. 46 (1-2) (2002) 131-159.
[7] C. Cortes, V. Vapnik, Support vector networks, Mach. Learn. 20 (1995) 273-297.
[8] K. Duan, S. Keerthi, A. Poo, Evaluation of simple performance measures for tuning SVM hyperparameters, Technical Report CD-01-11, Department of Mechanical Engineering, National University of Singapore, Singapore, 2001.
[9] B. Efron, R. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, New York, 1993.
[10] R. Fletcher, Practical Methods of Optimization, 2nd Edition, Wiley, New York, 1987.
[11] S. Floyd, M. Warmuth, Sample compression, learnability, and the Vapnik-Chervonenkis dimension, Mach. Learn. 21 (1995) 1-36.
[12] W. Hoeffding, Probability inequalities for sums of bounded random variables, J. Amer. Statist. Assoc. 58 (1963) 13-30.
[13] http://www.kernel-machines.org/software.html
[14] T. Joachims, The maximum-margin approach to learning text classifiers: methods, theory and algorithms, Ph.D. Thesis, Department of Computer Science, University of Dortmund, 2000.
[15] D.G. Luenberger, Introduction to Linear and Nonlinear Programming, Addison-Wesley, Reading, MA, 1973.
[16] S. Ridella, S. Rovetta, R. Zunino, Circular backpropagation networks embed vector quantization, IEEE Trans. Neural Networks 10 (4) (1999) 972-975.
[17] S. Ridella, S. Rovetta, R. Zunino, K-winner machines for pattern classification, IEEE Trans. Neural Networks 12 (2) (2001) 371-385.
[18] S. Ridella, R. Zunino, Using K-winner machines for domain analysis - Part II: applications, Neurocomputing, submitted for publication.
[19] B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, USA, 1998.
[20] X. Shao, V. Cherkassky, W. Li, Measuring the VC-dimension using optimized experimental design, Neural Comput. 12 (2000) 1969-1986.
[21] J. Sill, Monotonicity and connectedness in learning systems, Ph.D. Thesis, California Institute of Technology, 1998.
[22] J.W. Smith, J.E. Everhart, W.C. Dickson, W.C. Knowler, R.S. Johannes, Using the ADAP learning algorithm to forecast the onset of diabetes mellitus, Proceedings of the IEEE Symposium on Computer Applications in Medical Care, IEEE Computer Society Press, Silver Spring, MD, 1988, pp. 261-265.
[23] V. Vapnik, Statistical Learning Theory, Wiley-Interscience, New York, 1998.


[24] V. Vapnik, E. Levin, Y. Le Cun, Measuring the VC-dimension of a learning machine, Neural Comput. 6 (1994) 851-876.
[25] G. Wahba, Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV, in: B. Schölkopf, C. Burges, A. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, MA, 1999.
[26] R.C. Williamson, J. Shawe-Taylor, B. Schölkopf, A.J. Smola, Sample based generalization bounds, NeuroCOLT2 Technical Report Series, NC-TR-1999-055, November 1999.
[27] W.H. Wolberg, O.L. Mangasarian, Multisurface method of pattern separation for medical diagnosis applied to breast cytology, Proc. Natl. Acad. Sci. USA 87 (1990) 9193-9196.

Davide Anguita (M'93) graduated in Electronic Engineering in 1989 and obtained the Ph.D. in Computer Science and Electronic Engineering at the University of Genova, Italy, in 1993. After working as a research associate at the International Computer Science Institute, Berkeley, USA, on special-purpose processors for neurocomputing, he joined the Department of Biophysical and Electronic Engineering at the University of Genova, where he teaches digital electronics. His current research focuses on industrial applications of artificial neural networks and kernel methods and their implementation on digital and analog electronic devices. He is a member of the IEEE and chair of the Smart Adaptive Systems committee of the European Network on Intelligent Technologies (EUNITE).

Sandro Ridella (M'93) received the Laurea degree in electronic engineering from the University of Genova, Italy, in 1966. He is a full Professor in the Department of Biophysical and Electronic Engineering, University of Genova, Italy, where he teaches Inductive Learning. In the last nine years, his scientific activity has been mainly focused on the field of neural networks.

Fabio Rivieccio received the Laurea degree in electronic engineering in 1999 and is currently pursuing his Ph.D. in electronic engineering and computer science. His research interests include applications of support vector machines and model selection criteria.

Rodolfo Zunino (S'90-M'90) received the Laurea degree in electronic engineering from Genova University, Italy, in 1985. From 1986 to 1995, he was a Research Consultant with the Department of Biophysical and Electronic Engineering of Genova University. He is currently with the same department as an Associate Professor in Industrial Electronics and Application-oriented Electronic Systems. His main scientific interests include electronic systems for neural networks, efficient models for data representation and learning, advanced techniques for multimedia data processing, and distributed-control methodologies. Since 2001 he has been contributing as an Associate Editor of the IEEE Transactions on Neural Networks.