
402 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 2, MARCH 2002

Adaptive Resolution Min-Max Classifiers

Antonello Rizzi, Massimo Panella, and Fabio Massimo Frattale Mascioli

Abstract—A high automation degree is one of the most important features of data driven modeling tools, and it should be taken into consideration in classification system design. In this regard, constructive training algorithms are essential to improve the automation degree of a modeling system. Among neuro-fuzzy classifiers, Simpson's min-max networks have the advantage of being trained in a constructive way. The use of the hyperbox, as a frame on which different membership functions can be tailored, makes the min-max model a flexible tool. However, the original training algorithm exhibits some serious drawbacks, together with a low automation degree. In order to overcome these inconveniences, two new learning algorithms for fuzzy min-max neural classifiers are proposed in this paper: the adaptive resolution classifier (ARC) and its pruning version (PARC). ARC/PARC generates a regularized min-max network by a succession of hyperbox cuts. The generalization capability of the ARC/PARC technique mostly depends on the adopted cutting strategy. By using a recursive cutting procedure (R-ARC and R-PARC) it is possible to obtain better results. ARC, PARC, R-ARC, and R-PARC are characterized by a high automation degree and yield networks with a remarkable generalization capability. Their performances are evaluated through a set of toy problems and real data benchmarks. We also propose a suitable index that can be used for the sensitivity analysis of the classification systems under consideration.

Index Terms—Adaptive resolution classifier (ARC), automatic training, classification, min-max model, pruning adaptive resolution classifier (PARC), sensitivity analysis.

I. INTRODUCTION

ALONG with clustering, function approximation, and prediction problems, the automatic resolution of classification problems is one of the most important challenges in the design of intelligent systems. A great number of practical modeling problems can be solved by expressing them as classification problems. In general, each time we have to assign a given object to one class, we perform a classification task. The objects to be classified are described by measuring a set of their properties, so that each object can be represented by a feature vector. The classification problem consists of generating a computational system (a classifier) that is able to associate a class label to any input pattern. The codomain is a label set in which it is not possible to define any distance measure between its elements (or even when a forced distance definition may be misleading). The classifier must be determined by considering only the information contained in a set of

Manuscript received November 29, 2000; revised July 20, 2001. This work was supported by the Italian Ministry of Instruction, University and Research (M.I.U.R.).

The authors are with the INFO-COM Department, University of Rome "La Sapienza," 00184 Rome, Italy (e-mail: [email protected]; [email protected]; [email protected]).

Publisher Item Identifier S 1045-9227(02)01802-7.

input–output pairs (training set), so that a given error measure is minimized over a test set of pairs, disjoint from the training set. A classification system CS is a pair (CM, TA), where TA is the training algorithm, i.e., the set of instructions responsible for generating, starting from a given training set, one particular instance of the classification model (CM). In the following, we will deal with exclusive classification problems, in which each pattern is associated with a unique class label [1]–[4].

Usually, the behavior of a TA will depend on the values of its training parameters. By fixing different values for these parameters, the classification system will yield distinct classifiers, each characterized by a performance on the test set. Thus, given a particular instance of a classification problem CP, i.e., a pair of training and test sets, the performance of CS with respect to CP can be estimated by measuring the accuracy of the classifier over the test set. By fixing an error measure, the performance will depend both on the classifier and on the test set; since the classifier is determined by the TA, the performance of a classification system on CP is a function of the training parameters. It is important to underline that the parameter vector must be fixed in advance by the user; consequently, a low robustness of the TA with respect to its training parameters is a serious drawback for the practical use of a classification system.

Furthermore, some classification systems require additional a priori choices by the user. In fact, the task of a classifier is to fix a specific determination of the model parameters. For example, if we consider an MLP model, the model parameters are the connection weights between its neurons. If the adopted classification model does not have a fixed structure, its complete characterization should also include a determination of the structural parameter set. Usually, the structure must be fixed prior to the optimization performed on the model parameters by the TA. For example, in order to use an MLP model, the user must fix in advance the number of layers and the number of neurons in each layer. These choices are very critical with respect to the final classifier performances, since a wrong structure usually causes overfitting or prevents the optimization procedure from converging on a solution. It is clear that, although the structure parameters are formally model parameters, and hence they should not be mistaken for training parameters, they play the same role as training parameters, since they must be fixed before training. From the user's point of view this means an additional difficulty, since the classifier structure must be fixed in advance. By relying on some optimization criterion (such as MDL, AIC, etc.) [5]–[11], it is possible to design the TA so that it automatically chooses the classifier structure. We say that a TA is constructive if it is able to determine both the structural and the model parameters.

1045-9227/02$17.00 © 2002 IEEE


From the above discussion, it is clear that even if the most important feature of a classification system is its generalization capability, another very important feature is its automation degree. A low automation degree can be a serious drawback for a classification system, since it could prevent its successful employment in practical applications, where the users do not have particular skills in neuro-fuzzy modeling (medical diagnostic aid software, decision tools for banking and insurance applications, etc.) or when the neuro-fuzzy inference engine plays a core role in a more complex modeling system.

Ideally, a classification system should satisfy the following properties.

1) The TA is constructive.
2) There are only a few training parameters to be fixed in advance by the user.
3) The classification system is robust with respect to its training parameters, i.e., the classifier performance must not depend critically on their values.

It is important to underline that if a classification system satisfies only the third property, it cannot be considered automatic, since the structural parameters are very critical. On the other hand, if a set of classification systems satisfies the first two properties, it is possible to compare their automation degrees by relying on the robustness of the learning procedures with respect to the training parameters. In this regard we can adopt a proper measure, such as the one proposed in Section IV.

A powerful constructive tool to solve exclusive classification problems is the fuzzy min-max neural network proposed by Simpson [12]; its performances are clearly illustrated in [13]. It is important to point out the noticeable effort devoted in the technical literature to the problem of improving the generalization capability of the min-max algorithm [14]–[16]. However, when used in batch applications, the original training algorithm is affected by some serious problems. Namely, the training result is excessively dependent on the pattern presentation order and on the main training parameter. Moreover, this parameter imposes the same constraint on coverage resolution in the whole input space. This last drawback is particularly critical, since it reduces the generalization capability of the neural model. In order to avoid these inconveniences, two new training algorithms are proposed: the adaptive resolution classifier (ARC) and pruning ARC (PARC) algorithms [17], [18]. These algorithms allow a fully automatic training of the min-max network, since no critical parameters must be fixed in advance. Furthermore, the adaptive resolution mechanism significantly enhances the generalization capability of trained models.

In Section II, we present a brief overview of the min-max classification system, together with an optimized version of the training algorithm called optimized min-max (OMM). The basic ideas underlying the ARC and PARC algorithms are discussed in Section III. In Section IV, the automation degrees of ARC and PARC are compared to those of the original min-max and of the OMM algorithms, by means of a detailed sensitivity analysis. The good performances of ARC and PARC are further confirmed in Section V, by a set of toy problems and real data benchmarks. A modified learning strategy (R-ARC and R-PARC), based on a recursive procedure [19], is proposed in Section VI. Some encouraging results strengthen the validity of the new approach.

II. OVERVIEW OF MIN-MAX CLASSIFIERS

The min-max classification strategy consists of covering the patterns of the training set with hyperboxes whose boundary hyperplanes are parallel to the main reference system. It is possible to establish the size and position of each hyperbox by two extreme points: the "min" and "max" vertices. The hyperbox can be considered as a crisp frame on which different types of membership functions can be adapted. In the following, we will adopt the original Simpson's membership function [12], in which the slope outside the hyperbox is established by the value of a parameter γ. A min-max classification model is a feedforward three-layer network, where the first layer aims only to supply the input features to each neuron of the second (hidden) layer. Each neuron of the hidden layer corresponds to a hyperbox and computes the membership of the input pattern with respect to it. The third (output) layer has one neuron for each class. Each neuron of the output layer determines the fuzzy membership of the input pattern with respect to the corresponding class. When dealing with exclusive classification problems, the class corresponding to the maximum membership value is selected as the output class label ("winner takes all" strategy). If it is different from the actual output (the class label associated in the training set with the input pattern), we have an error. If more than one class reaches the maximum membership value, we have an indetermination.
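The three-layer forward pass described above can be sketched as follows. This is only a minimal illustration, assuming Simpson's original membership function with slope parameter γ; the names `membership` and `classify`, and the representation of hyperboxes as `(v, w, label)` triples, are ours and not taken from the paper.

```python
def membership(x, v, w, gamma=4.0):
    """Simpson-style fuzzy membership of pattern x in hyperbox [v, w].

    Equals 1 when x lies inside the hyperbox and decays outside it
    with a slope controlled by gamma (an assumed default value here).
    """
    n = len(x)
    s = 0.0
    for j in range(n):
        # penalty for exceeding the max vertex on axis j
        s += max(0.0, 1.0 - max(0.0, gamma * min(1.0, x[j] - w[j])))
        # penalty for falling below the min vertex on axis j
        s += max(0.0, 1.0 - max(0.0, gamma * min(1.0, v[j] - x[j])))
    return s / (2 * n)

def classify(x, hyperboxes, n_classes):
    """Output layer: per-class maximum membership, winner takes all.

    Returns the winning class label, or 0 to signal an indetermination
    (more than one class reaching the maximum membership value).
    """
    out = [0.0] * (n_classes + 1)
    for v, w, label in hyperboxes:
        out[label] = max(out[label], membership(x, v, w))
    best = max(out[1:])
    winners = [c for c in range(1, n_classes + 1) if out[c] == best]
    return winners[0] if len(winners) == 1 else 0
```

A pattern covered by a hyperbox of its own class obtains membership 1 in it, which is what makes a total and pure coverage classify the training set without errors.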

The original training algorithm is characterized by a three-step process [12]:

1) expansion of hyperboxes, for accommodating a new example of the training set;

2) overlap test, for determining overlaps among hyperboxes associated with different classes;

3) contraction, for eliminating overlaps.

Let x_h be the hth pattern of the training set and x_hj its jth component, j = 1, …, n. In order to accommodate x_h, in step 1 it is necessary to identify a hyperbox that can be expanded. The set of candidate hyperboxes is determined by an expansion condition. Namely, a hyperbox B_k can be expanded to include x_h if it satisfies the following constraint:

nθ ≥ Σ_{j=1}^{n} [max(w_kj, x_hj) − min(v_kj, x_hj)]   (1)

where v_kj and w_kj are, respectively, the jth components of the min and max vertices of B_k. Among the set of hyperboxes that meet the expansion criterion, the training algorithm will expand the one which is characterized by the highest degree of membership in x_h. If the set of candidate hyperboxes is empty, then a new hyperbox will be generated and its min and max vertices will be set equal to x_h. This means that the expansion process is mainly controlled by the parameter θ, which establishes the maximum size allowed to each hyperbox. Small θ values will yield min-max networks characterized by a great number of hyperboxes. For this reason, by fixing θ it is possible to adjust the training set coverage resolution. This parameter is very critical, since it directly affects the number and position of the


resulting hyperboxes and, consequently, it influences the network structure and the classification performance. In fact, given a set of samples, each determination of θ will yield a min-max network characterized by a performance on the test set. In order to select the network with the best generalization capability in advance (i.e., without any knowledge of the test set), it is possible to consider a modified training algorithm [20], which refers to the Occam's razor criterion [21]. This well-known principle of learning theory states that, under the same condition of performance on the training set, the net that shows the best generalization capability is the one that is characterized by the lowest structural complexity. According to this principle, we should choose the network that minimizes at the same time both its complexity and its error on the training set. This is a typical multiobjective minimization problem, since usually a more complex network will correspond to a better performance on the training set (overfitting phenomenon). A common way to solve this type of problem consists in defining a convex linear combination of two terms measuring, respectively, the complexity C and the classification error E of the network on the training set, i.e.,

F(λ) = λC + (1 − λ)E.   (2)

By minimizing the objective function (2) it is possible to select a regularized neural network, which ideally will correspond to the one characterized by the best generalization capability. Obviously, the effectiveness of such a heuristic optimization procedure will depend on the available set of candidate models and on the way we define both the complexity measure and the classification error. Regarding this last topic, in expression (2) we set E as the percentage of training set examples not satisfied and C as the percentage of hyperboxes with respect to the training set cardinality. The behavior of the objective function will depend on the weight λ, by which the user can control the training procedure, taking into account that small values of λ will yield more complex nets, and consequently nets more accurate on the training set (the opposite situation arises for large values of λ). Hence, λ = 0.5 is the default compromise value. In the following, we will refer to this modified version as OMM. The advantage of OMM with respect to the original training algorithm is the automatic selection of the regularized network, together with the corresponding θ value, according to the objective function (2). This feature is very important when considering the automation degree of the training procedure.
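The regularization criterion of (2) can be sketched as below, under the paper's choices for E and C (misclassified fraction and hyperbox count over training-set cardinality); the function name `objective` and the candidate values are illustrative.

```python
def objective(n_errors, n_hyperboxes, n_patterns, lam=0.5):
    """Convex combination F(lam) = lam*C + (1 - lam)*E, as in (2).

    E: fraction of training patterns not satisfied (classification error).
    C: number of hyperboxes relative to the training-set cardinality.
    """
    E = n_errors / n_patterns
    C = n_hyperboxes / n_patterns
    return lam * C + (1.0 - lam) * E

# Among candidate networks (errors, hyperboxes), pick the minimizer of F;
# the numbers below are made up for illustration only.
candidates = [(10, 2), (4, 6), (0, 40)]
best = min(candidates, key=lambda ec: objective(ec[0], ec[1], 100))
```

Note how the very accurate but complex candidate (0 errors, 40 hyperboxes) loses to a simpler compromise at λ = 0.5, which is exactly the overfitting trade-off the text describes.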

However, the OMM training algorithm still shares with the original one the following drawbacks:

• an excessive dependence on the presentation order of the training set;

• the same constraint on coverage resolution in the whole input space.

In fact, each hyperbox must satisfy the constraint (1) everywhere in the input space. If we want to reconstruct with high accuracy a complex boundary in the decision region (see the toy problem of Fig. 6, Section V-A), we must use small hyperboxes even in the input space regions far from class boundaries, where small hyperboxes are not needed. For this reason, the use of a fixed constraint like (1) implies setting the same coverage resolution in the whole input space. In order to overcome these inconveniences, two new training algorithms are developed: ARC and PARC.

III. ARC AND PARC TRAINING ALGORITHMS

With respect to the original training algorithm, the ARC/PARC technique uses a different training mechanism [17], [18], still yielding the same type of neuro-fuzzy model (i.e., the classical min-max neural network). In this section, we will describe the basic concepts underlying the ARC/PARC technique. In order to explain the new training algorithm, we need to introduce a specific notation and to state some definitions.

A. Basic Definitions and Adopted Notation

Given a pattern x and a hyperbox B, with vertices v and w, we say that B covers x (and that x is covered by B) if and only if v_j ≤ x_j ≤ w_j, j = 1, …, n. Considering a hyperbox B and a training set, the subset of training patterns covered by B is univocally determined. On the other hand, given a set S of patterns, a hyperbox which covers all the patterns in S always exists, but it is not unique. Among all the possible hyperboxes which cover S, we can consider the minimum size one, B(S), defined by the vertices

v_j = min_{x∈S} x_j,   w_j = max_{x∈S} x_j,   j = 1, …, n.   (3)
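The covering test and the minimum-size hyperbox of (3) can be sketched as follows; patterns are plain tuples and the function names are illustrative.

```python
def covers(v, w, x):
    """True iff v_j <= x_j <= w_j for every component j."""
    return all(vj <= xj <= wj for vj, xj, wj in zip(v, x, w))

def min_size_hyperbox(patterns):
    """Vertices of the smallest hyperbox covering all patterns, as in (3):
    componentwise minimum for v and componentwise maximum for w."""
    n = len(patterns[0])
    v = tuple(min(x[j] for x in patterns) for j in range(n))
    w = tuple(max(x[j] for x in patterns) for j in range(n))
    return v, w
```

By construction, every pattern of the input set is covered by the returned hyperbox, which is the property the cutting procedure later relies on.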

Each pattern of the training set is associated by definition with a class label; given an instance of a classification problem characterized by c classes, we can code without loss of generality each class label with an integer in the range from 1 to c. We will use the notation x → l to state that a pattern x is associated with the class label l. Given a hyperbox B that covers a set of labeled patterns, we will say that B is pure if and only if all the covered patterns are associated with the same class label; otherwise B will be said hybrid. If B is pure, it will be associated with the common class label; if B is hybrid, it will be associated by default with the label "0." We will use the notation B^(l) to represent a hyperbox associated with the label l.
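The pure/hybrid distinction above amounts to a one-line check; the sketch below assumes labeled patterns given as `(pattern, label)` pairs, and the name `box_label` is ours.

```python
def box_label(v, w, labeled_patterns):
    """Return the class label of a pure hyperbox [v, w], or 0 (hybrid)
    when the patterns it covers carry more than one class label."""
    covered = {l for x, l in labeled_patterns
               if all(vj <= xj <= wj for vj, xj, wj in zip(v, x, w))}
    return covered.pop() if len(covered) == 1 else 0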

Given a generic set of labeled patterns, a set of labeled hyperboxes is a coverage of it if only pure hyperboxes associated with the same label can overlap, i.e.,

B^(l_1) ∩ B^(l_2) ≠ ∅ ⇒ l_1 = l_2, with l_1, l_2 ≠ 0.

This property is mandatory when dealing with exclusive classification problems.

If each pattern in the training set is covered by (at least) one hyperbox (pure or hybrid) in the coverage, then the coverage is total. If each hyperbox in it is pure, then it is a pure coverage. If a coverage includes at least one hyperbox per class, then it is said to be complete; a total and pure coverage is always complete.

It is important to underline that, given a coverage of a training set, the corresponding min-max neural network, i.e., the classification model, is univocally determined. The net generation procedure is the operation of building a neuro-fuzzy min-max model from a specific coverage; this procedure consists in associating with each pure hyperbox of the coverage a fuzzy membership function and in defining a neural model with a labeled hidden neuron for each pure hyperbox. Obviously, there will be always c output neurons, one for each class, regardless of the coverage. This is because c is univocally determined once a particular instance of a classification problem has been considered. Since each coverage of the training set defines a specific min-max network, we will indicate with E the classification error of this network over the training set. Evidently, if a pattern x → l is covered by a pure hyperbox B^(l), every min-max network having in its hidden layer a neuron corresponding to B^(l) will classify x correctly. Thus, if the coverage is total and pure, then E = 0.

Given two pure hyperboxes B_1^(l) and B_2^(l) associated with the same class label l, let B_F be the hyperbox of minimum size containing both B_1 and B_2, defined by the vertices

v_Fj = min(v_1j, v_2j),   w_Fj = max(w_1j, w_2j),   j = 1, …, n.   (4)

The hyperbox B_F^(l) is said to be the fusion between the pair of hyperboxes B_1^(l) and B_2^(l); this new hyperbox will be associated with the same class label. Given a coverage, the fusion B_F^(l) between two hyperboxes B_1^(l) and B_2^(l) belonging to it is said admissible, with respect to the coverage, if B_F^(l) does not overlap with any hyperbox in the coverage associated with a label different from l, including the special label "0." In other words, the fusion is admissible if the set of hyperboxes obtained from the coverage by deleting B_1^(l) and B_2^(l), and by adding B_F^(l), is a coverage as well. It is easy to demonstrate that if the original set is a total coverage, then the new one is also a total coverage. A coverage is said reducible if there exists at least one pair of labeled hyperboxes such that their fusion is admissible with respect to it. The whole succession of hyperbox fusions that transforms a reducible coverage into an irreducible one is called the fusion procedure. As a direct consequence of the previous discussion, if the initial coverage is total, then the resulting one is also total.
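The fusion of (4) and the admissibility test can be sketched as follows; hyperboxes are `(v, w)` vertex pairs, a coverage is a list of `(box, label)` pairs, and all function names are illustrative.

```python
def fuse(b1, b2):
    """Minimum-size hyperbox containing b1 and b2, as in (4)."""
    (v1, w1), (v2, w2) = b1, b2
    v = tuple(min(a, b) for a, b in zip(v1, v2))
    w = tuple(max(a, b) for a, b in zip(w1, w2))
    return v, w

def overlap(b1, b2):
    """True iff the two hyperboxes intersect on every coordinate axis."""
    (v1, w1), (v2, w2) = b1, b2
    return all(max(lo1, lo2) <= min(hi1, hi2)
               for lo1, hi1, lo2, hi2 in zip(v1, w1, v2, w2))

def admissible(b1, b2, label, coverage):
    """The fusion of b1 and b2 (both labeled `label`) is admissible iff
    the fused box overlaps no hyperbox carrying a different label,
    the special label 0 included."""
    fused = fuse(b1, b2)
    return all(l == label or not overlap(fused, box) for box, l in coverage)
```

In the example below, two same-class boxes cannot fuse when a box of another class sits between them, which is precisely the situation the admissibility test is meant to reject.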

B. ARC Training Algorithm

The main aim of the ARC procedure is to construct a total and pure coverage of the training set, obtained through a succession of coverages, and hence of min-max models, of increasing complexity. The basic operation to obtain this succession is the hyperbox cut. It consists in cutting a hybrid hyperbox in two parts by a hyperplane, perpendicular to one coordinate axis and in correspondence to a suitable point. Each time a new coverage is constructed, a fusion procedure is performed, the corresponding classifier is generated, and the objective function given by (2) is computed. The regularized net will correspond to the minimum value of the objective function.

The initialization step of the ARC training algorithm consists in considering the total coverage of the training set constituted by the unique minimum size hybrid hyperbox covering all the patterns. Since this coverage does not contain any pure hyperbox, it cannot be used to generate a min-max model. Thus, it is necessary to define a succession of cuts, with the aim to isolate new pure hyperboxes from the hybrid ones. This is the goal of the generic step of ARC, which consists of selecting a hybrid hyperbox and then cutting it.

In order to assure the convergence of the ARC algorithm, we need to define a cutting procedure such that if we cut a hybrid hyperbox of a given coverage of the training set, the resulting set of hyperboxes is a coverage as well. Since we are interested in building classical min-max networks, i.e., classification models where hyperboxes have their boundary hyperplanes parallel to the coordinate axes of the main reference system, the cutting hyperplane must be parallel to an edge of the considered hybrid hyperbox. In fact, this is the only way to assure that the intersection between the two offspring hyperboxes is empty. For this reason, the implicit equation of a cutting hyperplane takes the form x_d = α.

The cutting operation can be considered as an operator (B_1, B_2) = Cut(B_h, d, α) that generates two new hyperboxes as a function of the hybrid hyperbox B_h to be cut, the coordinate axis d, and the coefficient α. Now, we need to compute B_1 and B_2 such that each pattern previously covered by B_h will be covered by only one of the two new hyperboxes. The set S_h of patterns covered by B_h is partitioned into two new sets of patterns. Let S_1 and S_2 be the sets of patterns that will be covered by B_1 and B_2, respectively, with S_1 ∪ S_2 = S_h and S_1 ∩ S_2 = ∅. We can construct S_1 and S_2 as follows: for each pattern x covered by B_h, if x_d ≤ α then x ∈ S_1, else x ∈ S_2. Successively we compute, by using (3), the min and max vertices of the minimum size hyperboxes B_1 and B_2 that cover S_1 and S_2, respectively. After the cutting operation, a new total coverage is obtained from the original one, by deleting B_h and by adding the two new hyperboxes. We can have two distinct cases.

1) Both hyperboxes are hybrid.
2) At least one of the two new hyperboxes is pure.

In the first case, we associate both new hyperboxes with the special label "0." Since the set of pure hyperboxes is unchanged, it is not necessary to perform any fusion procedure or to generate a new min-max network. In the second case, one or both the new hyperboxes are promoted from the hybrid status to the pure status. If a new hyperbox has been promoted, then we will associate with it the correct class label; otherwise we will associate the label "0." In this case, the new coverage contains at least one more pure hyperbox with respect to the original one, and thus a fusion procedure is performed over it, generating a new classification model.
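The cut operator described above can be sketched as follows, working directly on the set of patterns covered by the hybrid hyperbox; the function names and the return convention are illustrative.

```python
def cut(patterns, d, alpha):
    """Cut operator: split the patterns of a hybrid box by the
    hyperplane x_d = alpha and re-cover each side with its
    minimum-size hyperbox, as in (3)."""
    s1 = [x for x in patterns if x[d] <= alpha]
    s2 = [x for x in patterns if x[d] > alpha]

    def min_box(pts):
        n = len(pts[0])
        return (tuple(min(p[j] for p in pts) for j in range(n)),
                tuple(max(p[j] for p in pts) for j in range(n)))

    return (min_box(s1), s1), (min_box(s2), s2)
```

Because each offspring box is the minimum-size cover of its own pattern subset, the two boxes lie on opposite sides of the hyperplane and their intersection is empty, as required.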

What we need to completely define a cutting strategy is a criterion to select the hybrid hyperbox to be cut and the hyperplane coefficients. Several alternatives are possible; among them, the criterion described by the following steps achieved the best performances (see the two-dimensional example in Fig. 1).

1) The hybrid hyperbox B covering the highest number of patterns is selected.

2) Let S_k be the set of patterns of class k covered by B, k = 1, …, c, where c is the number of classes. We consider the two class labels k_1 and k_2 that are the most frequent ones with respect to the whole number of patterns covered by B, i.e.,

k_1 = arg max_k |S_k|,   k_2 = arg max_{k ≠ k_1} |S_k|.   (5)

Let us denote by c_1 and c_2 the centroids of S_{k_1} and S_{k_2}, and by δ_j their distance along the jth coordinate axis,


Fig. 1. ARC cutting procedure.

i.e., δ_j = |c_1j − c_2j|. The edge of B to be cut is the one parallel to the coordinate axis d such that δ_d = max_j δ_j.

3) α is chosen as the average value between the dth coordinates of c_1 and c_2, i.e.,

α = (c_1d + c_2d)/2.   (6)
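Steps 1)–3) above reduce to a few lines; the sketch below assumes labeled patterns as `(pattern, label)` pairs and the name `choose_cut` is ours. It returns the axis d and coefficient α of the cutting hyperplane x_d = α.

```python
from collections import Counter

def choose_cut(labeled_patterns):
    """Cutting criterion: take the two most frequent classes in the
    hybrid box (5), cut along the axis where their centroids are
    farthest apart, at the midpoint of their coordinates (6)."""
    counts = Counter(l for _, l in labeled_patterns)
    (k1, _), (k2, _) = counts.most_common(2)

    def centroid(k):
        pts = [x for x, l in labeled_patterns if l == k]
        return [sum(c) / len(pts) for c in zip(*pts)]

    c1, c2 = centroid(k1), centroid(k2)
    deltas = [abs(a - b) for a, b in zip(c1, c2)]
    d = deltas.index(max(deltas))   # axis of maximum centroid distance
    alpha = (c1[d] + c2[d]) / 2     # midpoint, as in (6)
    return d, alpha
```

Cutting midway between the two dominant class centroids is what gives ARC its adaptive resolution: boxes are split only where classes actually mix.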

A synoptic flow chart of the ARC training procedure is shown in Fig. 2. We can always consider the current coverage as the union of two distinct sets of hyperboxes: the set containing only the hybrid hyperboxes and the set containing only the pure ones. Consequently, the ARC procedure will stop when the set of hybrid hyperboxes is empty.

C. On the Convergence of ARC Procedure

A pair of patterns (x_1, x_2) is said to be incongruent if x_1 = x_2 and the two patterns are associated with different class labels. It is important to underline that a pair of patterns can be incongruent because of some noise in the input and/or output space, or as a consequence of the precision of the adopted number representation. We say that a training set is congruent (with respect to a given number representation) if it does not contain any incongruent pair of patterns. If the training set is incongruent, depending on the nature of the classification problem, we can decide to abort the modeling process or to obtain a congruent training set by deleting a proper subset of patterns. Although the congruence of a training set is a common property in most of the classical benchmarking data sets, it should always be verified when dealing with data coming from real modeling problems.
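The congruence check described above can be sketched in a few lines; the name `is_congruent` is illustrative, and patterns are hashable tuples so duplicates can be detected exactly (mirroring the "given number representation" caveat).

```python
def is_congruent(labeled_patterns):
    """True iff no two identical patterns carry different class labels."""
    seen = {}
    for x, l in labeled_patterns:
        # remember the first label seen for x; a mismatch later means
        # the pair (x, x) is incongruent
        if seen.setdefault(x, l) != l:
            return False
    return True
```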

Proposition: The ARC procedure converges in a finite number of steps to a total and pure coverage of $S_{tr}$ if and only if $S_{tr}$ is a congruent finite set of labeled patterns.

Proof: First of all, we show that there exists at least one total and pure coverage of $S_{tr}$ if and only if $S_{tr}$ is congruent. Let $N$ be the number of patterns in $S_{tr}$, and let us consider the set $C_N$ of pure hyperboxes obtained by covering each pattern of $S_{tr}$ with a degenerate hyperbox coinciding with the pattern itself. Obviously, $C_N$ is a pure coverage of $S_{tr}$ if and only if $S_{tr}$ is congruent; moreover, $C_N$ is a total coverage by construction. Thus, if $S_{tr}$ is incongruent, no total and pure coverage of $S_{tr}$ exists, and ARC will never stop.

Now, given a congruent $S_{tr}$, we want to show that $C_N$ can be obtained through a finite succession of cuts; this is the worst case, since any total and pure coverage reached earlier stops the procedure.

Fig. 2. ARC flow chart.

Given a labeled and congruent set $S$ of patterns, let $n(S)$ be the number of cuts necessary to obtain the coverage $C_N$ of $S$. Let us consider a cut on a hybrid hyperbox such that the sets of patterns covered by the two resulting hyperboxes are both nonempty. Since, by definition of the cutting procedure, the result of a cut is still a coverage, the number of cuts still needed after such a cut is strictly smaller than before it. Therefore, by cutting recursively until $C_N$ is obtained, the procedure terminates after a finite number of cuts.

The inductive step used to demonstrate this proposition works even in the presence of fusion steps, as long as the fusion procedure is designed so that it takes as its argument a coverage and returns a new set of hyperboxes that is a coverage as well. As explained at the end of Section III-A, the fusion procedure was designed exactly with this property. Moreover, the fusion procedure is performed over the set of pure hyperboxes in the current coverage, leaving the set of hybrid hyperboxes unchanged. Since cuts are performed only on hybrid hyperboxes, the number of cuts needed to obtain $C_N$ is unchanged after a fusion procedure.

It is important to remark that if a pair of patterns is congruent, a cut separating them always exists; conversely, if the pair is incongruent, the proposed cutting procedure is unable to determine a coverage. If $N_c$ is the number of cuts actually performed by ARC over $S_{tr}$, then $N_c$ is much less than the worst-case value in most cases, since $N_c$ mainly depends on the topological and metric structure of $S_{tr}$.

D. PARC Training Algorithm

Since the computational cost of the fusion step grows quickly with the number $n_p$ of pure hyperboxes in the coverage, the speed performance of ARC depends mostly on how many times a fusion procedure is performed and on the number of hyperboxes in $L_p$ when the fusion procedure starts. Thus, a way to speed up the ARC technique consists in reducing the total number of nets generated during training and in reducing their average complexity. It is possible to follow this strategy by developing PARC.

The PARC technique performs two subsequent procedures:

• an ARC procedure, without the fusion step, aiming to reach a pure and total coverage of $S_{tr}$; during this step no network is generated, and thus it is not necessary to compute the objective function for each coverage;

Fig. 3. An example of the pruning procedure in PARC: (a) the training set; (b) the total coverage produced by ARC; (c) the network after the first structure reduction; (d) the network after the second structure reduction.

• a pruning procedure, which produces a succession of coverages (and thus of min-max networks) characterized by decreasing complexity.

The latter procedure is an iteration of the following three steps:

1) fusion procedure;
2) network generation and computation of function $F$ defined by (2);
3) structure reduction step, where some negligible hyperboxes are deleted from the actual coverage according to a previously defined criterion.
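The structure reduction criterion described next (rank the pure hyperboxes by the number of own-class patterns they cover, then delete those with the smallest count) can be sketched as follows. The `support` mapping is hypothetical bookkeeping, not the paper's data structures:

```python
def structure_reduction(pure_hyperboxes, support):
    """One structure-reduction substep (sketch).

    pure_hyperboxes: identifiers of the pure hyperboxes in the coverage.
    support: maps each hyperbox to the number n_H of training patterns
             it covers that carry its own class label.
    The hyperboxes sharing the smallest n_H are deleted.
    """
    n_min = min(support[h] for h in pure_hyperboxes)
    kept = [h for h in pure_hyperboxes if support[h] > n_min]
    deleted = [h for h in pure_hyperboxes if support[h] == n_min]
    return kept, deleted
```

In the Fig. 3 example, the first call would delete the three hyperboxes covering a single (noisy) pattern, and a second call the hyperbox covering two patterns.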

The structure reduction criterion is intended to generate a succession of networks characterized by a decreasing complexity of the decision region boundaries. For each pure hyperbox $H$ in $L_p$, we compute the number $n_H$ of patterns of $S_{tr}$ covered by $H$ and associated with its class label; the hyperboxes are ordered by increasing values of $n_H$. The structure reduction substep deletes the set of hyperboxes associated with the smallest $n_H$.

The pruning procedure stops when the deletion of some pure hyperboxes makes the actual coverage not complete. In Fig. 3, an example of the pruning procedure is shown. Let us consider the training set of Fig. 3(a); after the ARC procedure, the total coverage in Fig. 3(b) is constituted by 20 hyperboxes, since noise is present in the center of the four main clusters. During the first structure reduction step, the three hyperboxes covering only one pattern are deleted; the network resulting after the subsequent fusion procedure is shown in Fig. 3(c). The hyperbox covering two patterns is deleted by the second structure reduction step; after the fusion procedure we obtain the network in Fig. 3(d), consisting of only four hyperboxes, which is also the regularized net yielded by the PARC training algorithm. This example demonstrates that the pruning procedure is fundamental to perform an automatic optimization of the network structure and, at the same time, that it imparts noise robustness to the whole training procedure. The PARC algorithm is depicted in Fig. 4.

Fig. 4. PARC flow chart.

It is important to note that each coverage produced by ARC is a total coverage, where each pattern of $S_{tr}$ is covered by a hybrid hyperbox or by (at least) a pure one. On the other hand, among the coverages determined by PARC, only the first one is a total coverage. In fact, after the structure reduction step, there will be a nonempty subset of $S_{tr}$ that is no longer covered by any hyperbox. Moreover, during the PARC fusion procedure it is possible that some patterns of $S_{tr}$ are covered by a hyperbox obtained from the fusion between a pair of hyperboxes associated with the same class label. Thus, even if the two fused hyperboxes are pure (in a strict sense), the resulting hyperbox can be hybrid, although we will associate with it their common class label. In this case we say that it is a pseudopure hyperbox. During training, each pseudopure hyperbox is treated as a pure one. In particular, the network generation procedure will define a labeled hidden-layer neuron for each pure and pseudopure hyperbox in the current coverage.
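The pseudopure situation can be demonstrated with a toy fusion. This sketch (function names and the min/max-corner hyperbox representation are illustrative assumptions) fuses two strictly pure, same-class hyperboxes and checks whether the result now also covers patterns of another class:

```python
def fuse(box_a, box_b):
    """Fuse two hyperboxes, each given as (min corner, max corner)
    tuples, into the minimum-size hyperbox containing both."""
    (la, ua), (lb, ub) = box_a, box_b
    return (tuple(map(min, la, lb)), tuple(map(max, ua, ub)))

def is_pseudopure(box, patterns, labels, box_label):
    """A fused hyperbox is pseudopure when it covers at least one
    pattern of a class other than its associated label."""
    lo, hi = box
    covered = [l for p, l in zip(patterns, labels)
               if all(a <= x <= b for x, a, b in zip(p, lo, hi))]
    return any(l != box_label for l in covered)
```

Fusing two class-0 boxes that flank a class-1 pattern yields a box that covers it, i.e., a pseudopure hyperbox in the sense defined above.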

IV. SENSITIVITY ANALYSIS OF ARC/PARC CLASSIFIERS

A low automation degree can be a serious drawback for a classification system, as previously discussed in Section I. In this section, we propose a way to measure the automation degree of a classification system by relying on its sensitivity to the training parameters.

Let us consider a training parameter $p$ in the range $[p_{min}, p_{max}]$, and let $Q$ be the set of the remaining parameters; by fixing $Q$ to a specific determination $\bar{Q}$, the performance measure becomes a function $P(p)$ of $p$ alone. From the user's point of view, the parameter $p$ is critical if a slight change of $p$ corresponds to a large change in $P(p)$, i.e., if the average value of $|\partial P / \partial p|$ over $[p_{min}, p_{max}]$ is high. Thus it is possible to compare the automation degree of two or more classification systems, with respect to a given instance of a classification problem, by using the following sensitivity measure:

$S_p = \dfrac{1}{p_{max} - p_{min}} \displaystyle\int_{p_{min}}^{p_{max}} \left| \dfrac{\partial P}{\partial p} \right| \, dp \qquad (7)$


The evaluation of expression (7) requires computing the derivative of $P(p)$; since its explicit expression is unknown, we must replace the derivative with some approximate expression based on suitable samples of $P(p)$. Let $P(p_1), P(p_2), \ldots, P(p_n)$ be a uniform sampling of $P(p)$, with $p_1 = p_{min}$ and $p_n = p_{max}$, and let $\Delta p$ be the sampling interval, so that $p_{i+1} = p_i + \Delta p$. Thus we can approximate the sensitivity index (7) as follows:

$S_p \approx \dfrac{1}{n-1} \displaystyle\sum_{i=1}^{n-1} \dfrac{|P(p_{i+1}) - P(p_i)|}{\Delta p} \qquad (8)$

When comparing two or more classification systems, the performance measure and the number of samples $n$ must be fixed in advance. However, the discrete derivative of $P(p)$ still depends on $\Delta p$, i.e., on the difference between $p_{i+1}$ and $p_i$. Thus, it is useful to normalize expression (8) to the maximum theoretical value of the discrete derivative, $(P_{max} - P_{min})/\Delta p$, where $P_{max}$ and $P_{min}$ are, respectively, the maximum and minimum values that the chosen performance measure can assume.

Taking into account this normalization, we define the sensitivity of a classification system with respect to $p$ as

$\bar{S}_p = \dfrac{1}{n-1} \displaystyle\sum_{i=1}^{n-1} \dfrac{|P(p_{i+1}) - P(p_i)|}{P_{max} - P_{min}} \qquad (9)$
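Under the reconstruction above, the normalized sensitivity (9) reduces to averaging the absolute finite differences of the sampled performance and dividing by the performance range. A sketch (the function name and default range are illustrative; the defaults match the $P_{max} = 100$, $P_{min} = 0$ choice used below):

```python
def sensitivity(perf_samples, perf_min=0.0, perf_max=100.0):
    """Normalized sensitivity of a classification system w.r.t. a
    training parameter, as in (9): the average of
    |P(p_{i+1}) - P(p_i)| over a uniform sampling of the parameter
    range, divided by the performance range P_max - P_min."""
    n = len(perf_samples)
    if n < 2:
        raise ValueError("need at least two samples")
    total = sum(abs(perf_samples[i + 1] - perf_samples[i])
                for i in range(n - 1))
    return total / ((n - 1) * (perf_max - perf_min))
```

A performance curve that is flat in the parameter gives sensitivity 0 (high automation degree); one that swings over the full range at every step gives 1.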

By using expression (9), the proposed ARC/PARC technique will be compared with the original Simpson's TA and with the OMM TA. The comparison will be carried out on the well-known Cancer data set, a real data benchmark coming from the Proben1 collection [22]. Each benchmark of [22] is proposed on three different permutations of the same data set; in the following we will consider the first permutation (Cancer1). This is an instance of an exclusive classification problem, consisting of 525 patterns in the training set and 174 in the test set; each pattern has nine components and is labeled by one of two possible classes, i.e., benign or malignant.

In the following, the percentage of errors on the training set will be denoted as $E_{tr}$, while that on the test set will be denoted as $E_{ts}$. Similarly, the percentage of indeterminations will be denoted as $I_{tr}$ and $I_{ts}$ for the training and test sets, respectively. The performance measure $P$ will be chosen as the sum of $E_{ts}$ and $I_{ts}$; thus, $P_{max}$ is equal to 100, while $P_{min}$ is equal to 0. It is interesting to note that the performance $P$ can be considered as the "worst error" that can be realized by the classifier on the test set.

TABLE I. ORIGINAL SIMPSON'S MIN-MAX RESULTS ON CANCER1 DATA SET

TABLE II. OPTIMIZED MIN-MAX RESULTS ON CANCER1 DATA SET

Once the membership function type and the slope parameter have been fixed (we will use the default value), the behavior of the classification systems to be compared is as follows: the original Simpson's min-max (SMM) TA will depend both on the parameter value and on the specific permutation of the training set; the OMM TA will depend both on the pattern presentation order (being a modified version of SMM) and on the parameter value; the ARC and PARC training algorithms will depend only on the parameter value. Consequently, in order to compare the classification systems' sensitivities, two sets of samples will be considered: one for SMM, and one for OMM, ARC, and PARC. In order to build the set of samples for the OMM training algorithm, we considered the interval [0.01, 0.99], with a sampling step equal to 0.01.

In Tables I, II, III, and IV, we report the results obtained by SMM, OMM, ARC, and PARC, respectively. In these tables, the label "HBs" indicates the number of hyperboxes, corresponding to the number of neurons in the hidden layer, while the times reported refer to the training procedures, which were executed on the same hardware platform and software environment. The additional column of Table II reports the parameter value corresponding to the regularized net. It can be seen that both the SMM TA and the OMM TA are characterized by



TABLE III. ARC RESULTS ON CANCER1 DATA SET

TABLE IV. PARC RESULTS ON CANCER1 DATA SET

high percentages of indeterminate patterns. We have examined this phenomenon through a set of two-dimensional toy problems and have ascertained that a high number of indeterminations is due to the contraction mechanism of the original Simpson's learning procedure [12]. In fact, when two hyperboxes associated with different class labels overlap, the contraction mechanism reduces the size of both hyperboxes along the direction which minimizes the amplitude of the contraction. This resizing operation causes the two contracted hyperboxes to have a common boundary hyperplane. Since the original membership function is maximum in the inner part of the hyperbox, including the boundary hyperplanes, a pattern lying on such a hyperplane belongs with the same membership value to both hyperboxes. Obviously, the probability that a pattern lies on a hyperplane depends on the adopted number representation and usually is very low. However, when the number of overlaps and contractions hugely increases during training, these regions become more relevant. This phenomenon is particularly serious in classification problems where the patterns are described by many discrete attributes (i.e., attributes defined over a finite set of values), such as in the Cancer data set. As a consequence, the number of indeterminations tends to increase.
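The indetermination mechanism can be reproduced numerically. The sketch below uses a deliberately simplified crisp view of the membership (1 inside the box, boundary included, 0 outside); the actual membership function decays with a finite slope outside the box, but it is maximal on the boundary hyperplane, which is exactly what produces the tie:

```python
def crisp_membership(x, box):
    """Simplified (slope -> infinity) view of a hyperbox membership:
    1 inside the hyperbox, boundary included, 0 otherwise. Enough to
    show the tie caused by the contraction mechanism."""
    lo, hi = box
    return 1.0 if all(a <= v <= b for v, a, b in zip(x, lo, hi)) else 0.0

# Two contracted hyperboxes of different classes sharing the
# boundary hyperplane x0 = 1.0 (an illustrative configuration):
box_class_a = ((0.0, 0.0), (1.0, 1.0))
box_class_b = ((1.0, 0.0), (2.0, 1.0))

# A pattern on the common boundary gets membership 1 in both boxes,
# so the classifier cannot decide: an indetermination.
pattern_on_boundary = (1.0, 0.5)
```

With discrete attributes many patterns fall exactly on such shared hyperplanes, which is why the indetermination rate grows on data sets like Cancer.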

As expected, training times in ARC and PARC are independent of the parameter value, while this is not the case for the original Simpson's training algorithm and, especially, for the OMM. This is due to the fact that the resulting network complexity depends mostly on the value assigned to the parameter. ARC and PARC yield good classification performances with a high automation degree, as confirmed by the sensitivity analysis in

TABLE V. SENSITIVITY AND AVERAGE PERFORMANCE ON CANCER1 DATA SET

Table V. Moreover, the average complexity of the networks obtained by ARC and PARC is much lower than the average complexity of the networks generated by the SMM and OMM classification systems.

In conclusion, ARC and PARC outperform both SMM and OMM in terms of automation degree and training times. Moreover, they also exhibit a better generalization capability, as will be further discussed in the next section.

V. PERFORMANCE ANALYSIS OF ARC/PARC CLASSIFIERS

Exhaustive simulation tests were carried out to ascertain the performance of the proposed algorithms. In order to show the benefits of the ARC/PARC technique with respect to the original training algorithms, we will consider two-dimensional toy problems. The first will regard the independence of ARC/PARC from the pattern presentation order; the second will evidence the ARC/PARC adaptive resolution mechanism, which is related to the covering accuracy of the data space.

The proposed algorithms will also be investigated by considering well-known real classification problems, which yield a wider evaluation of ARC/PARC performances. In fact, they can be compared with the results obtained on the same data sets by the numerous classification systems proposed in the technical literature.

A. Two-Dimensional Toy Problems

All tests in this section were carried out by fixing the slope parameter of the membership function. The first classification problem is defined by the training set of Fig. 5(a), where the patterns are organized in alternate rows of two different classes. In a first test, the training set is ordered by using the vertical coordinate as the ordering key. Fig. 5(b) shows the hyperboxes obtained by OMM; it yields zero errors on the training set.

In a second test, the presentation order was changed, using the horizontal coordinate as the ordering key. The OMM algorithm gives the hyperboxes of Fig. 5(c). The two previous tests show that the presentation order is critical for the min-max algorithm. Fig. 5(d) shows the hyperboxes obtained by PARC; the resulting net is the same regardless of the pattern presentation order.

Let us consider a second toy problem in which we want to reconstruct with high accuracy the decision region of Fig. 6(a). The training set, shown in Fig. 6(b), consists of 333 points. Fig. 6(c) shows the hyperboxes resulting from OMM; this net covers the training set



Fig. 5. Dependency of min-max networks on the pattern presentation order: (a) the training set; (b) OMM result with vertical ordering; (c) OMM result with horizontal ordering; (d) PARC result.

with 64 hyperboxes. By using PARC, the number of hyperboxes is reduced, as shown in Fig. 6(d); this net yields zero errors on the training set with only 14 hyperboxes. In conclusion, these toy problems confirm that ARC and PARC do not depend on the pattern presentation order, and that they can achieve better performances with less complex networks.

B. Real Data Benchmarks

We will compare ARC/PARC with OMM by using a set of real data benchmarks. Moreover, for a wider evaluation of the performances of the new training algorithms, we need to compare their accuracy with some reference methods. In this regard, we will use a simple probabilistic classifier (SPC) and the classical $k$-NN classifier.

SPC determines the class label of an input pattern by using a random generator based on the class recurrences in the training set. For instance, given a $C$-class classification problem, the recurrence of class $i$ is defined as $f_i = N_i/N$, where $N$ is the overall number of patterns of the training set and $N_i$ is the number of patterns belonging to class $i$. When no penalty terms are assigned to weight the classification errors, the SPC performance can be evaluated by its mean statistical error, determined by

$E_{SPC} = \dfrac{1}{M} \displaystyle\sum_{j=1}^{M} \left(1 - f_{t_j}\right) \qquad (10)$

where $M$ is the overall number of patterns of the test set and $t_j$ is the target class label of its $j$th pattern. It is evident that

SPC results do not depend on the geometrical distribution of the data in the feature space, which is one of the most important pieces of information to be considered in a classification task. However, the SPC error (10) can be considered a partial measure of the difficulty of classifying the data set under consideration, since it quantifies the data spread among the different classes. It seems superfluous to remark that a powerful classifier must hugely outperform SPC.

Fig. 6. Classification accuracy of min-max networks: (a) the decision region; (b) the training set; (c) OMM result; (d) PARC result.
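The SPC error (10) follows directly from the class recurrences: a test pattern of class $t$ is misclassified with probability $1 - f_t$, and (10) averages this over the test set. A sketch (the function name is illustrative):

```python
from collections import Counter

def spc_mean_error(train_labels, test_labels):
    """Mean statistical error of the simple probabilistic classifier.

    The SPC guesses class i with probability f_i, its recurrence in
    the training set, so a test pattern of target class t is
    misclassified with probability 1 - f_t; the result averages this
    probability over the test set, as in (10)."""
    n = len(train_labels)
    freq = {c: k / n for c, k in Counter(train_labels).items()}
    return sum(1.0 - freq.get(t, 0.0) for t in test_labels) / len(test_labels)
```

For two balanced classes the error is 0.5, consistent with the SPC performance close to 50% observed on Ringnorm below.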

The $k$-NN rule is a well-known classification technique [23]. It holds some interesting asymptotic properties; for example, when $k \to \infty$ and the number of patterns $N \to \infty$ (with $k/N \to 0$), the $k$-NN rule is equivalent to classifiers based on Bayesian decision theory [24]. Even if the theoretical performance of $k$-NN should decrease when $N$ is limited, $k$-NN may produce better results than statistical classifiers based upon a mixture model (since $k$-NN does not need any a priori hypothesis) [1]. Therefore, $k$-NN can be considered a simple but valid reference algorithm, in which the data distribution in the feature space is taken into consideration. As for the SPC, it fully contrasts with the neural-network approach, since no classification model (as defined in Section I) is inferred. Moreover, the $k$-NN technique is characterized by a high computational cost, since the classification procedure coincides with the search for the $k$ nearest neighbor patterns in a look-up table. In the following, we will use a simple version of $k$-NN based on the Euclidean norm.
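The simple Euclidean $k$-NN version used here can be sketched in a few lines (the function name and data layout are illustrative; ties in the vote are broken arbitrarily):

```python
from collections import Counter

def knn_classify(train, labels, x, k=1):
    """Plain k-NN with the Euclidean norm: rank the training patterns
    by (squared) distance to x and take a majority vote among the k
    nearest. Classification is a look-up over the whole training set,
    which is the source of the high computational cost noted above."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, x)), l)
        for p, l in zip(train, labels))
    votes = Counter(l for _, l in ranked[:k])
    return votes.most_common(1)[0][0]
```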

Regarding the OMM algorithm, the slope parameter of the membership function and the learning weight are fixed. For the $k$-NN classifiers, all values of $k$ ranging from 1 to 50 were considered, but only the results concerning the one scoring the best classification error are reported. The input patterns of both the training and the test set were normalized in the range from zero to one prior to learning and classification.

1) Ringnorm: The Ringnorm data set [25] is a two-class classification problem consisting of 20-dimensional patterns, the first 5000 of which are used as the test set and the following 300 as the training set. Each class is drawn from a multivariate normal distribution: class 1 has zero mean and covariance four times the identity; class 2 has unit covariance and mean slightly different from zero. The results concerning the Ringnorm data set


TABLE VI. RINGNORM CLASSIFICATION RESULTS

TABLE VII. GLASS1 CLASSIFICATION RESULTS

are reported in Table VI. As expected, due to the quasiconcentric location of the two normal distributions, the best $k$-NN performance is similar to that of the OMM classifier. Moreover, the balanced spread of data between the two classes yields an SPC performance that is close to 50%. By using ARC/PARC there is a massive increment of performance, especially when compared to the classification error of 21.4% obtained in [25] by using CART classifiers. The best results are obtained in this case by the ARC net, which is slightly more complex than the one generated by PARC, but largely simpler than the OMM net.

2) Glass: The Glass data set concerns the classification of glass types and is contained in the aforementioned Proben1 collection [22]. Similarly to the Cancer data set, we used the first permutation (Glass1) of the Glass data set. It consists of 214 patterns, the first 161 in the training set and the following 53 in the test set. Each pattern is composed of nine features (refractive index plus the chemical analysis of eight different elements) measured from one of the six glass types (classes) that must be detected. The results of this benchmark are reported in Table VII; again, only the best $k$-NN performance is reported. SPC and OMM yield inadequate results, the latter having a very high number of indeterminations. ARC and PARC have equivalent performances, which are superior to that of $k$-NN. In this case, the complexity of the ARC/PARC nets is larger than that of OMM, while maintaining a better performance. These results confirm the flexibility of the ARC/PARC technique in the selection of an appropriate model complexity, in order to obtain the best generalization capability. Moreover, the performance of ARC/PARC (around 26%) is always better than the best performance (around 32%) obtained in [22] by using different neural-network approaches.

3) Diabetes: The Diabetes data set is also contained in the Proben1 collection. Once again, we used its first permutation (Diabetes1). It is composed of 768 patterns, the first 576 in the training set and the following 192 in the test set. Each pattern is composed of eight (physiological) features that are used to decide whether the corresponding person is diabetic or not (a two-class problem). The results reported in Table VIII show that the best

TABLE VIII. DIABETES1 CLASSIFICATION RESULTS

TABLE IX. IRIS CLASSIFICATION RESULTS

performing algorithms are ARC and PARC. They have identical classification errors on the test set, while the network obtained by PARC is less complex. The best $k$-NN performance is also better than that of OMM. The performances of ARC/PARC obtained on the Diabetes1 data set are very close to the best ones reported in [22].

4) Iris: The Iris data set consists of 150 sample vectors of iris flowers [26]. Each pattern is composed of four features (sepal length, sepal width, petal length, petal width), and three classes of flowers (Iris Versicolor, Iris Virginica, and Iris Setosa) are present. The first 90 patterns are used in the training set and the next 60 in the test set. The best performance is obtained by ARC and PARC, with a classification error of 1.6667%, corresponding to only one misclassification (see Table IX).

The results obtained from the previous real data benchmarks show that ARC and PARC outperform the OMM, $k$-NN, and SPC classifiers in the most meaningful aspects of classification problems: generalization capability, automation degree, and training time. These results also demonstrate that the ARC/PARC algorithms compare favorably with the best performances obtained by the different approaches reported in the technical literature. Moreover, the nets trained by ARC/PARC are independent of the pattern presentation order, as confirmed by the tests previously carried out on the two-dimensional toy problems.

Therefore, it is reasonable to adopt the ARC/PARC techniqueas the reference training algorithm for fuzzy min-max classifi-cation models. Consequently, we will evaluate any further im-provement in min-max classification systems with respect to it.

VI. IMPROVING ARC/PARC TECHNIQUE BY A RECURSIVE CUTTING STRATEGY

The underlying idea of a possible improvement of the ARC/PARC technique is to isolate pure hyperboxes recursively. In order to explain the new cutting strategy, we need to state some further definitions. Let $H_0$ be the minimum size hybrid hyperbox containing all the training set patterns. Let us then consider, for each class, the minimum size hyperbox that covers all the patterns of the set


Fig. 7. An example of hyperbox overlap groups in the hybrid hyperbox $H_0$.

associated with that class, and the whole set of these class hyperboxes. We define a hyperbox overlap group (HOG) as a subset of this set such that the union of all hyperboxes in the HOG is a connected region of the input space. Depending on the topological structure of the set, we can have more than one HOG for each hybrid hyperbox (see Fig. 7).

Let us consider the set of all nonempty intersections (hyperboxes) between each pair of hyperboxes in a HOG; by construction, the hyperboxes in this set are all hybrid. The set corresponding to the HOG of Fig. 7 is shown in Fig. 8. We define the extended hybrid hyperbox (EHH) as the minimum size hyperbox containing all the hyperboxes in this set. Given a HOG and its associated EHH, the new cutting procedure consists in cutting all the class hyperboxes in the HOG by all the boundary hyperplanes of the EHH (dotted lines in Fig. 8). In the following, we illustrate step by step the whole training procedure, denoted as recursive ARC (R-ARC).
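The two geometric primitives this construction relies on, the pairwise nonempty-intersection test and the grouping of class hyperboxes into HOGs (connected components of the overlap relation), can be sketched as follows; function names and the min/max-corner representation are illustrative assumptions:

```python
def overlap(a, b):
    """Nonempty intersection test for two hyperboxes, each given as
    (min corner, max corner) tuples: the intersection is nonempty
    iff it is nonempty along every coordinate axis."""
    (la, ua), (lb, ub) = a, b
    return all(max(l1, l2) <= min(u1, u2)
               for l1, l2, u1, u2 in zip(la, lb, ua, ub))

def hyperbox_overlap_groups(boxes):
    """Group hyperboxes into HOGs: connected components of the
    overlap relation, so that the union of the boxes in one group
    is a connected region of the input space."""
    groups = []
    for box in boxes:
        touching = [g for g in groups if any(overlap(box, b) for b in g)]
        merged = [box] + [b for g in touching for b in g]
        groups = [g for g in groups if g not in touching] + [merged]
    return groups
```

Two overlapping class hyperboxes and one isolated hyperbox thus yield two HOGs, matching the situation depicted in Fig. 7.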

1. Initialization.
   1.1. Determination of H0;
   1.2. Set Lp as empty, Lh = {H0};
2. Do while Lh is not empty
   2.1. Set hyperbox sets Lp-temp and Lh-temp as empty;
   2.2. For each hybrid hyperbox Hq in Lh do
        2.2.1. Compute the set L(q)_HOG = {HOG(q)_1, HOG(q)_2, ...} of HOGs in Hq;
        2.2.2. For each HOG(q)_k in L(q)_HOG do
               2.2.2.1. Compute EHH(q)_k and perform each cut determined by it;
               2.2.2.2. Add the new pure and hybrid hyperboxes (deriving from the cutting procedures) to Lp-temp and Lh-temp;
        End For
   End For
   2.3. Add Lh-temp to Lh; add Lp-temp to Lp;
   2.4. Perform fusion step on L = Lp U Lh;
   2.5. Generate a new min-max network on the basis of the hyperboxes contained in Lp; compute the objective function F as in (2);
End While
3. The final regularized network is the one which minimizes function F as in (2).

Fig. 8. The EHH related to the HOG of Fig. 7.

TABLE X. R-ARC/R-PARC RESULTS FOR RINGNORM DATA SET

TABLE XI. R-ARC/R-PARC RESULTS FOR GLASS1 DATA SET

The recursive nature of the algorithm derives from step 2.2: the same cutting procedure, accomplished for a parent hybrid hyperbox, will be performed for each hybrid hyperbox in its offspring. It is important to underline that, while the ARC training algorithm generates a min-max network whenever a new pure hyperbox is added to the set $L_p$, R-ARC produces a network only after processing all HOGs of all hybrid hyperboxes included in the set $L_h$ at a given step (that is, only after step 2.2 is completely accomplished). We can also derive from R-ARC another training algorithm: the recursive PARC (R-PARC). R-PARC can be easily obtained from the PARC algorithm by substituting the initial ARC procedure with an R-ARC one.

The validity of the new cutting strategy has been verified by several tests carried out on real data benchmarks, as for the ARC/PARC algorithms. For instance, we report in Tables X and XI the results of R-ARC and R-PARC on the Ringnorm and Glass1 data sets, respectively. The comparison with the results previously obtained by ARC and PARC shows an encouraging increase in performance.

VII. CONCLUSION

In classification systems design, besides generalizationcapability and noise robustness, it is very important to takeinto account the automation degree of the training procedure.In fact, automatic training procedures are essential to bring

Page 12: Adaptive Resolution Min-Max Classifiers

RIZZI et al.: ADAPTIVE RESOLUTION MIN-MAX CLASSIFIERS 413

state-of-the-art modeling techniques into the field of practical applications and to design efficient systems able to work in a self-governing way, without any need for human expertise. In this regard, we consider the constructive approach a fundamental strategy to improve the automation degree of a modeling system, since structural parameters are automatically established during training. As shown, the automation degree can be measured in terms of the sensitivity of the system with respect to the training parameters.

Among neuro-fuzzy networks, a well-known classification system based on a constructive training algorithm is the SMM. The generalization capability of the original min-max classifier depends mostly on the position and size of the hyperboxes generated during training. Moreover, even in its optimized versions, min-max performances depend significantly on the data presentation order, and the shape of the decision region is affected by the constraint on the maximum size allowed for each hyperbox.

An evident advance has been obtained by adopting the adaptive resolution procedure of the ARC/PARC technique. This procedure is based on a particular cutting strategy, which plays a core role in improving the reconstruction accuracy of the decision regions. The adaptive resolution technique outperforms the original training algorithm in terms of generalization capability and training time. Moreover, we have shown that the automation degree of ARC/PARC is much higher than that of SMM and OMM when tested on real data problems. The proposed classification systems are highly automatic, since their training algorithms do not depend on the pattern presentation order and no critical parameter must be fixed in advance by the user. The overall good performances of ARC/PARC are confirmed by the results reported in the paper.

Finally, we have also proposed a possible improvement of the training procedure, by adopting a new cutting strategy that processes hybrid hyperboxes recursively (R-ARC/R-PARC). A further prospect for obtaining better decision regions, at a higher computational cost, consists in adopting "generalized" hyperboxes, by which the constraint that the boundary hyperplanes be parallel to the coordinate axes is removed. By using unconstrained hyperboxes, it is possible to obtain a substantial improvement of the generalization capability [18], [27].

ACKNOWLEDGMENT

The authors wish to thank the anonymous referees and Prof. G. Martinelli (INFO-COM Dept., University of Rome "La Sapienza") for their helpful comments, which contributed to the improvement of the final version of the paper.

REFERENCES

[1] J. C. Bezdek, “A review of probabilistic, fuzzy, and neural models forpattern recognition,”J. Intell. Fuzzy Syst., vol. 1, no. 1, pp. 1–25, 1993.

[2] F. M. F. Mascioli, G. Risi, A. Rizzi, and G. Martinelli, “A nonexclusiveclassification system based on cooperative fuzzy clustering,” inProc.EUSIPCO’98, Rhodes, Greece, 1998, pp. 395–398.

[3] G. Costantini, F. M. F. Mascioli, A. Rizzi, and G. Martinelli, “Recog-nition of musical instruments by a nonexclusive neuro-fuzzy classifier,”presented at the Proc. ECMCS’99, Krakow, Poland, 1999.

[4] G. Costantini, F. M. F. Mascioli, and P. Antici, “Two nonexclusiveneuro-fuzzy classifiers for recognition of musical instruments,” inProc.JIM’99, Issy-Les-Moulineaux, France, 1999, pp. 51–58.

[5] V. N. Vapnik and A. Y. Chervonenkis, “On the uniform convergence ofrelative frequencies of events to their probabilities,”Theory ProbabilityApplicat., vol. 16, no. 2, pp. 264–280, 1971.

[6] J. Rissanen, “Stochastic complexity and modeling,”Ann. Statist., vol.14, pp. 1080–1100, 1986.

[7] E. B. Baum and D. Haussler, “What size net gives valid generalization?,”Neural Comput., vol. 1, pp. 151–160, 1989.

[8] S. Amari, N. Murata, and S. Yoshizawa, “A criterion for determining thenumber of parameters in an artificial neural model,” inProc. ICANN’91,Helsinki, Finland, 1991, pp. 9–14.

[9] B. Amirikian and H. Nishimura, “What size network is good for gener-alization of a specific task of interest?,”Neural Networks, vol. 7, no. 2,pp. 321–329, 1994.

[10] S. B. Holden and M. Niranjan, “On the practical applicability of VCdimension bounds,”Neural Comput., Oct. 1994.

[11] J. C. Bezdek, W. Q. Li, Y. Attikiouzel, and M. Windham, “A geometricapproach to cluster validity for normal mixtures,”Soft Comput., vol. 1,pp. 166–179, 1997.

[12] P. K. Simpson, “Fuzzy min-max neural networks—Part 1: Classifica-tion,” IEEE Trans. Neural Networks, vol. 3, pp. 776–786, 1992.

[13] A. Joshi, N. Ramakrishnan, E. N. Houstis, and J. R. Rice, “On neurobiological, neuro-fuzzy, machine learning, and statistical pattern recognition techniques,” IEEE Trans. Neural Networks, vol. 8, pp. 18–31, 1997.

[14] V. Petridis and V. G. Kaburlasos, “Fuzzy lattice neural network (FLNN): A hybrid model for learning,” IEEE Trans. Neural Networks, vol. 9, 1998.

[15] M. Meneganti, F. S. Saviello, and R. Tagliaferri, “Fuzzy neural networks for classification and detection of anomalies,” IEEE Trans. Neural Networks, vol. 9, 1998.

[16] B. Gabrys and A. Bargiela, “General fuzzy min-max neural network for clustering and classification,” IEEE Trans. Neural Networks, vol. 11, pp. 769–783, 2000.

[17] A. Rizzi, F. M. F. Mascioli, and G. Martinelli, “Adaptive resolution min-max classifier,” in Proc. WCCI/FUZZ-IEEE’98, 1998, pp. 1435–1440.

[18] A. Rizzi, “Automatic training of min-max classifiers,” in Neuro-Fuzzy Pattern Recognition, ser. Series in Machine Perception and Artificial Intelligence, H. Bunke and A. Kandel, Eds. Singapore: World Scientific, Dec. 2000, vol. 41, pp. 101–124.

[19] A. Rizzi, M. Panella, F. M. F. Mascioli, and G. Martinelli, “A recursive algorithm for fuzzy min-max networks,” presented at Proc. IJCNN 2000, Como, Italy, 2000.

[20] F. M. F. Mascioli, G. Martinelli, and A. Rizzi, “A constructive algorithm for fuzzy neural networks,” in Proc. ICASSP’97, vol. 4, Munich, Germany, 1997, pp. 3193–3196.

[21] S. Haykin, Neural Networks, a Comprehensive Foundation, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 1999.

[22] L. Prechelt. (1994, Sept.) PROBEN 1: A Set of Neural Network Benchmark Problems and Benchmarking Rules. Fakultät für Informatik, Univ. Karlsruhe, Germany. [Online]. Available: http://wwwipd.ira.uka.de/~prechelt/NIPS_bench.html

[23] B. Dasarathy, Nearest Neighbor Pattern Classification Techniques. Los Alamitos, CA: IEEE Comput. Soc. Press, 1991.

[24] S. Theodoridis and K. Koutroumbas, Pattern Recognition. New York: Academic, 1999.

[25] L. Breiman. (1996, Apr.) Bias, Variance and Arcing Classifiers. Statist. Dept., Univ. California. [Online]. Available: http://www.cs.toronto.edu/~delve/data/ringnorm/desc.html

[26] E. Anderson, “The irises of the Gaspé Peninsula,” Bull. Amer. Iris Soc., no. 59, pp. 2–5, 1935.

[27] A. Rizzi, F. M. F. Mascioli, and G. Martinelli, “Generalized min-max classifier,” in Proc. FUZZ-IEEE 2000, vol. 1, May 2000, pp. 36–41.


Antonello Rizzi was born in Rome, Italy, in 1965. He received the Dr.Eng. degree in electronic engineering from the University of Rome “La Sapienza” in 1995 and the Ph.D. degree in information and communication engineering from the same university in 2000.

In September 2000, he joined the Information and Communication Department (INFO-COM Dpt.) of the University of Rome “La Sapienza” as an Assistant Professor. His major fields of interest include supervised and unsupervised data driven modeling techniques, neural networks, fuzzy systems, and evolutionary algorithms. His research activity concerns the design of automatic modeling systems, with particular emphasis on classification, clustering, function approximation, and prediction problems. He is interested in classification systems for structured patterns, symbolic classification, and template matching. He is the author of more than 30 publications.

Massimo Panella was born in Rome, Italy, on June 23, 1971. He received the Dr.Eng. degree, with honors, in electronic engineering from the University of Rome “La Sapienza” in 1998. He is currently pursuing the Ph.D. degree in information and communication engineering, expected in 2002.

In 2001, he joined the Department of Information and Communication Science (INFO-COM) of the University of Rome “La Sapienza,” where he is an Assistant Professor (Researcher) and a Lecturer in Circuit Theory and in Circuits and Algorithms for Signal Processing. His research activity is related to the use of neural networks, fuzzy logic, evolutionary computation, and statistical models for the solution of both supervised and unsupervised learning problems. In particular, his major fields of interest include pattern recognition, function approximation, nonlinear prediction, and nonlinear system identification with application to chaotic systems. He also conducts research on RNS circuits for spread spectrum and chaos-based communication systems.

Fabio Massimo Frattale Mascioli was born in Rome, Italy, in 1963. He received the Laurea degree in electronics engineering in 1989 and the Ph.D. degree in 1995, both from the University of Rome “La Sapienza.”

In 1996, he joined the INFO-COM Department of the University of Rome “La Sapienza” as a Researcher. Since 2000, he has been an Associate Professor of Circuit Theory in the same department. His current research interest mainly regards neural networks and neuro-fuzzy systems and their application to clustering, classification, and function approximation problems. He is the author or coauthor of several papers published in international journals or presented at international conferences.