analytica chimica acta 622 (2008) 85–93
available at www.sciencedirect.com
journal homepage: www.elsevier.com/locate/aca

Multivariate range modeling, a new technique for multivariate class modeling
The uncertainty of the estimates of sensitivity and specificity

M. Forina*, P. Oliveri, M. Casale, S. Lanteri
Department of Pharmaceutical and Food Chemistry and Technology, University of Genova, Via Brigata Salerno 13, 16147 Genova, Italy

* Corresponding author. Tel.: +39 0103532630; fax: +39 0103532684. E-mail address: [email protected] (M. Forina).

Article history: received 6 March 2008; received in revised form 12 May 2008; accepted 28 May 2008; published on line 3 June 2008

Keywords: Chemometrics; UNEQ; SIMCA

Abstract

MRM, multivariate range modeling, is based on models built as parallelepipeds in the space of the original variables and/or of discriminant variables, such as those of linear discriminant analysis. The ranges of these variables define the boundary of the model. The ranges are increased by a "tolerance" factor to take into account the uncertainty of their estimate. MRM is compared with UNEQ (the modeling technique based on the hypothesis of multivariate normal distribution) and with SIMCA (based on principal components) by means of the sensitivities and specificities of the models, the estimates of the type I (sensitivity) and type II (specificity) error rates, evaluated both with the final model built with all the available objects and by means of cross validation. UNEQ and SIMCA models were obtained with the usual critical significance value of 5% and with the model forced to accept all the objects of the modeled category. The performance parameters of the class models are critically discussed, focusing on their uncertainty.

© 2008 Elsevier B.V. All rights reserved.
0003-2670/$ – see front matter. doi:10.1016/j.aca.2008.05.065

1. Introduction

Class modeling techniques (CMT) answer the general question: "Can object O, declared of class A, really belong to class A?" This question is typical of many practical problems, e.g. in the traceability of PDO (Protected Denomination of Origin) foods and in multivariate quality control.

On the contrary, the classification techniques assign objects to one of the classes in the problem. The probabilistic classification techniques, such as linear discriminant analysis (LDA), assign an object to the class with the maximum posterior probability, p(c|x), the probability that, x being the vector of variables that describe the object, its class is c. These techniques are not very useful in the control of quality, variety, origin or genuineness of a sample.

However, almost all research papers on the control of foods use classification techniques, and even when a class modeling technique is applied, the attention is focused on its classification performance, not on its modeling characteristics. Only in multivariate quality control are the modeling techniques properly used. In this case, where only one category is studied, the use of a classification technique is clearly a non-sense.

In univariate statistics the answer to the above general question "Can object O, stated of class A, really belong to class


A?" corresponds to a bilateral significance test with the null hypothesis:

"H0: x is not significantly different from m, the estimated mean of the variable for the labeled class".

The univariate test is usually performed with the use of the Gosset Student statistics, where the statistical significance of the value x for the test class c is computed by the integral of the probability density of the Student's t variable:

α = 2 ∫_{t(x)}^{∞} p(t) dt,   where   t(x) = |x − m| / s      (1)

and m and s are the estimated mean and standard deviation of the variable x in the test class.

The null hypothesis is rejected when t is larger than the critical value at a selected critical significance level.
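As an illustrative sketch (not part of the original paper), the two-sided significance of Eq. (1) can be obtained by numerically integrating the Student's t density; the degrees of freedom (here 19, i.e. 20 objects) and the integration cutoff are choices of this example:

```python
import math

def t_density(t, df):
    """Probability density of the Student's t variable with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_alpha(x, m, s, df, upper=60.0, steps=200000):
    """Eq. (1): alpha = 2 * integral of the t density from t(x) to infinity,
    approximated here with the trapezoidal rule on [t(x), upper]."""
    t_x = abs(x - m) / s
    h = (upper - t_x) / steps
    area = 0.5 * (t_density(t_x, df) + t_density(upper, df))
    for i in range(1, steps):
        area += t_density(t_x + i * h, df)
    return 2 * area * h

# An object one standard deviation from the class mean is clearly accepted
print(two_sided_alpha(x=11.0, m=10.0, s=1.0, df=19))  # ≈ 0.33, H0 not rejected at 5%
```

With t(x) equal to the 5% critical value (2.093 for 19 degrees of freedom) the computed significance is, as expected, about 0.05.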

Both the probabilistic classification and the test are based on the estimate of the probability density, but in a very different way, so that, given two classes, namely c and g, a sample can be classified in g but can also be accepted by the test for the class c.

For this reason the chemometric techniques to be used are the so-called "class-modeling" techniques. A class model is characterized by two parameters: the sensitivity, the percent measure of correct type I decisions (complementary to the type I error rate), and the specificity, the percent measure of correct type II decisions (complementary to the type II error rate):

                                          Null hypothesis H0 TRUE     Null hypothesis H0 FALSE
Statistical decision: reject H0           Type I error                Correct type II decision
Statistical decision: do not reject H0    Correct type I decision     Type II error

Therefore, a "perfect" class model has 100% sensitivity and 100% specificity.

2. Class modeling techniques

In principle a class model can be obtained with a training set where only the objects of the studied category are present. However, this statement has a significant meaning only when the variables are well-recognized descriptors of the characteristics of the class.

CMT have two main application types.

The first type (Application type I) is typical of multivariate quality control. The class model is built with the use of a suitable number of samples (surely of the category) described by variables (chemical, physical or sensorial) relevant for the description of the quality. Also, these variables frequently have large discriminant power, because their value can characterize the samples of the category compared with samples of different categories. The samples must represent the variability of the class regarding the quality parameters. In this case, the specificity is not important. The model is used to perform an acceptance test.

The second application (Application type II) is typical of the control of origin of typical foods. The variables (chemical or physical) are not necessarily directly connected with the quality. Their discriminant power is unknown, or discrimination is possible only with the complex models of multivariate analysis, where the discriminant power of a variable depends very much on its synergy with the other variables. The samples used to build the model (training set) of the studied class must represent the origin parameters (e.g. latitude, longitude, soil, variety). The class model must have a very large sensitivity, possibly 100%. Indeed, it seems unreasonable to reject a sample because, for example, the concentration of a given isotope is 1 part per billion when the other samples of the class have concentrations in the range 2–3 parts per billion. A further point is that PDO foods are produced by the members of a consortium. Sensitivities less than 100% mean that some of the producers are refused by the model, and a PDO consortium generally considers such a model unsatisfactory. Specificity is also important. It is evaluated by means of a set of samples of other classes (specificity set). These classes must represent the possible adulterants/imitations. The real value of the estimate of the specificity depends very much on the representativity of the specificity set.

Some modern analytical techniques produce, for each sample, a large amount of chemical and physical information, frequently at low cost, and many studies have demonstrated that this large rough information contains the relevant information necessary to achieve reliable decisions. In the field of origin control, spectroscopy (especially near-infrared, but also visible, medium-infrared, Raman and NMR spectra), electronic noses and tongues, and isotopic composition from mass spectra are applied or can be applied. However, the rough information also contains a lot of non-relevant, useless information. The presence of useless information, of noise, is a frequent characteristic of variables used in Application type II. Instead, in Application type I all the variables contain relevant information. So, a selection procedure is usually applied to retain the more discriminant variables.

The most important multivariate CMT are:

(1) SIMCA [1] (soft independent models of class analogy): the model is a parallelepiped in the space of the significant components, delimited by the range of the scores. The range can be expanded or reduced. Because of the flexibility in the number of components and in the range, many SIMCA models can be computed.

(2) UNEQ [2] (UNEQual dispersed classes) works under the hypothesis of multivariate normal distribution of the variables. The origin of this technique is in the work of Hotelling [3] on multivariate quality control. It is based on the Hotelling T2 statistics, the multivariate generalization of the Student statistics. In the case where the number of objects in the category is small (compared with the number of variables), UNEQ is used on the significant principal components.

(3) PF [4–6] (potential functions techniques) are based on the estimate of the multivariate probability distribution by means of the contribution of the objects used to develop the class model. PF were not considered in this study, because they are useful only when the objects have a very complex distribution, a very rare case with real data.
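To make the distance measure underlying UNEQ concrete, a minimal sketch of the Hotelling T2 (squared Mahalanobis) distance of an object from a class, using the sample mean and covariance; the data are invented for the example:

```python
import numpy as np

def hotelling_t2(x, X):
    """Squared Mahalanobis distance of object x from the class described by
    training matrix X (rows = objects, columns = variables); UNEQ compares
    this value with a critical T2 value at the chosen significance."""
    m = X.mean(axis=0)                    # class centroid
    S = np.cov(X, rowvar=False)           # estimated covariance matrix
    d = x - m
    return float(d @ np.linalg.inv(np.atleast_2d(S)) @ d)

# Invented two-variable class: the centroid itself is at distance 0
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 2.5], [2.5, 1.5], [1.5, 2.2]])
print(hotelling_t2(X.mean(axis=0), X))  # 0.0
```

An object far from the centroid gets a much larger T2 than one close to it, which is the acceptance criterion UNEQ applies.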

CMT compute a "class model" (CM) and, for each object, a "distance from the model". Statistical or empirical rules are used to define a critical distance or maximum permitted distance (MPD) from the model. The MPD fixes the boundaries of the "class space". An object is accepted by the class model (i.e. it can belong to the class) when its distance from the class model is less than the MPD.

The statistical or empirical rules used to define the model MPD can increase it, to increase the sensitivity, but generally the increased MPD causes a decrease of the specificity. The statistical rule defines the class MPD by means of a confidence level. The sensitivity is the experimental measure of the confidence level. When the hypotheses on which the statistical rule is based are not verified, the difference between the confidence level and the sensitivity can be very large.

Univariate range modeling is based on the allowed range of the variables. The range can be estimated from the training set, and sometimes expanded to take into account the possibility of underestimation. When this procedure is applied to many variables, an object is accepted by the model when all its variables are within the corresponding allowed ranges. Fig. 1A shows that the model, a rectangle in the space of two variables, does not use the information given by the correlation between the variables. We call this kind of model, based on the ranges of the variables, "univariate range modeling" (URM), also in the case of many variables, because each variable is considered separately, without attention to the relationships among variables that multivariate class modeling techniques consider. In principle URM is a wrong technique. By definition, the model has 100% sensitivity.
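A minimal sketch of URM as just described (invented data; the per-variable ranges are taken directly from the training set, with no expansion):

```python
import numpy as np

def urm_fit(X):
    """Univariate range model: the allowed range of each variable in the training set."""
    return X.min(axis=0), X.max(axis=0)

def urm_accept(x, lo, hi):
    """An object is accepted when ALL its variables fall inside the corresponding ranges."""
    return bool(np.all((x >= lo) & (x <= hi)))

X = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]])  # invented training objects
lo, hi = urm_fit(X)
print(urm_accept(np.array([2.5, 11.5]), lo, hi))  # True: inside both ranges
print(urm_accept(np.array([2.5, 15.0]), lo, hi))  # False: second variable out of range
```

Note that every training object is accepted by construction, which is exactly the 100% sensitivity property mentioned above.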

Both UNEQ and SIMCA were used here with the usual procedure, where the class boundary is determined by a selected confidence level, here 95%. The sensitivity is the experimental measure of this confidence level. To compare the results of UNEQ and SIMCA with those of URM and MRM (multivariate range modeling), the UNEQ and SIMCA class models were expanded, forced to have 100% sensitivity. This means that the MPD is the distance of the farthest object of the modeled class.

Fig. 1 – A: univariate range modeling; B: multivariate range modeling; C: multivariate range modeling with tolerance.

Modeling techniques are sensitive to outliers, especially in the case of forced models. Both univariate and multivariate outliers can be detected in the first step of data analysis by means of box-and-whisker plots, histograms, principal components or projection pursuit plots. Alternatively, it is possible


to use robust techniques. We prefer the first strategy, followed by a careful investigation of the possible origin of the outlying data. Frequently the outlier is the consequence of trivial errors in the laboratory (deterioration of the sample, labeling error, . . .). In other cases the outlying data seem really anomalous, which means that very probably they do not come from the same population as the other data. So, to find outliers it is necessary to define the distribution of the data. Generally a normal distribution is assumed. In the case of food data, the major chemical components have a normal distribution. Instead, trace components can have a very asymmetrical distribution. A food class is defined by rules, official rules of a national or international organization (e.g. the International Olive Oil Council). A sample satisfying the rules cannot be considered an outlier. The anomaly can be the consequence of rare conditions or treatments, i.e. of deterministic, not casual, factors. Class models that reject such a sample are not very useful in practice.

3. Multivariate range modeling

Taking into account:

(a) the important quality of URM of having 100% sensitivity, by definition, and

(b) the need to have class models that experts of typical foods can easily understand (experts usually have a limited knowledge of chemometrics and some difficulty in correlating their experience, based on the measured variables, with "abstract" factors such as principal components, multivariate probability density or canonical variables),

we developed a new class modeling technique, multivariate range modeling.

The characteristics of MRM are described below:

(1) Some discriminant variables are added to the original ones. Discriminant variables can be the canonical variables of linear discriminant analysis (Fig. 1B), computed for the pairs formed by the studied class and one of the other classes in the data set, or by the studied class and single objects or groups of objects. For each pair of categories only one canonical variable is obtained. It is the direction with the largest value of the ratio between the between-category and the pooled within-category variance.

In the case of a large number of variables the LDA canonical variables cannot be computed. In this case we selected the most discriminant variables by means of stepwise LDA [7]. The selection was performed so that in each selection cycle the entered variable is the one that most increases the Mahalanobis distance between the modeled category and the closest one. The number of selected variables was limited to ten.

The canonical variables are obtained in the form:

y = b0 + b1 x1 + ... + bv xv + ... + bV xV      (2)

Page 4: Multivariate range modeling, a new technique for multivariate class modeling

a c t

88 a n a l y t i c a c h i m i c a

The coefficients (loadings) bv are multiplied by the standard deviation sv of the predictor:

bv(standardized) = bv sv      (3)

because the contribution to the canonical variable of a predictor with a very small range can be very small even when the original coefficient b is large. The predictors are sorted according to the value of the standardized coefficient, and the predictors with small standardized coefficients are eliminated stepwise. The elimination is monitored by means of the value of the discriminant power, and it is stopped when the discriminant power decreases too much. Frequently, the canonical variable is simplified to a function of 3–4 predictors. The use of canonical variables for single objects or for groups of few objects is justified only when they are important in the representation of the real problem.
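For the two-category case, the canonical variable described above can be sketched as the classical LDA direction w = Sw⁻¹(m1 − m2), the direction maximizing the between-category to pooled within-category variance ratio (a standard construction used for illustration; the data are invented):

```python
import numpy as np

def canonical_direction(X1, X2):
    """LDA canonical direction for a pair of categories: maximizes the ratio of
    between-category to pooled within-category variance."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = np.cov(X1, rowvar=False)
    S2 = np.cov(X2, rowvar=False)
    n1, n2 = len(X1), len(X2)
    Sw = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)  # pooled within-category covariance
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(100, 2))   # category 1: N(0,1) in both variables
X2 = rng.normal(2.0, 1.0, size=(100, 2))   # category 2: shifted by 2 in both variables
w = canonical_direction(X1, X2)
print(X1 @ w, X2 @ w)                      # projections of the two categories on w
```

The scores X @ w play the role of the single canonical variable obtained for each pair of categories.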

(2) The range of all the variables is computed. The range can be expanded (Fig. 1C), taking into account the possibility that the range computed from the training set underestimates the true range, especially when the number of objects is small. This expansion factor, the MRM tolerance, is expressed as a percent of the range.

The set of ranges defines the MRM model.

For each object, the distance from the model is the sum, Di, of the distances for each variable, div (where i is the index of the object and v that of the variable). When the value of the variable xiv is within the range (in case expanded by the tolerance), div is 0; when the value of the variable xiv is outside the range, div is expressed as the ratio between the distance from the closest limit of the range and the range itself:

div = 100 × (xiv − maxv(model)) / (maxv(model) − minv(model))   or   100 × (minv(model) − xiv) / (maxv(model) − minv(model))      (4)

A further distance, used only for classification when Di is zero, is the distance from the centroid, Dci, defined as the sum of the

dciv = 100 × |xiv − x̄v| / (maxv(model) − minv(model))      (5)
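A minimal sketch of the MRM distances of Eqs. (4) and (5) (invented data; the tolerance is applied on both sides as a percent of the range, and the expanded range is used in the denominator, which is one reading of the text above):

```python
import numpy as np

def mrm_fit(X, tolerance=0.0):
    """MRM model: per-variable ranges, expanded on both sides by tolerance % of the range."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    pad = (tolerance / 100.0) * (hi - lo)
    return lo - pad, hi + pad

def mrm_distance(x, lo, hi):
    """Eq. (4): Di = sum over variables of 100 * (distance from the closest limit) / range,
    with a zero contribution from variables inside the range."""
    rng_ = hi - lo
    d = np.where(x > hi, 100 * (x - hi) / rng_,
        np.where(x < lo, 100 * (lo - x) / rng_, 0.0))
    return float(d.sum())

def mrm_centroid_distance(x, X, lo, hi):
    """Eq. (5): Dci = sum over variables of 100 * |xiv - mean_v| / range."""
    return float((100 * np.abs(x - X.mean(axis=0)) / (hi - lo)).sum())

X = np.array([[0.0, 0.0], [1.0, 2.0], [0.5, 1.0]])   # invented training objects
lo, hi = mrm_fit(X, tolerance=5.0)                    # ranges [-0.05, 1.05] and [-0.1, 2.1]
print(mrm_distance(np.array([0.5, 1.0]), lo, hi))     # 0.0: inside the expanded ranges
print(mrm_distance(np.array([2.15, 1.0]), lo, hi))    # ≈ 100: one full range beyond the limit
```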

4. Data

Five real data sets have been used to compare the performances of UNEQ, SIMCA and MRM.

1. White wines [8]: this data set was used only to demonstrate some properties of the UNEQ and MRM models. Two variables, 3-methylbutan-1-ol and ethanolamine, were measured on 94 white wine samples. On the basis of previous studies, a logarithmic transform was applied to obtain an almost normal distribution. However, two samples show a very low value for one of the two variables. There are no reasons to consider these two samples as outliers.

2. Processionary [9]: a data set where eleven non-chemical variables are used to describe processionaries in two categories (regions). The number of objects is 58. These data are used as an example of categories with large overlap.

3. Olive oil [10,11]: five hundred and seventy-two samples of olive oil are described by seven fatty acids. The number of categories (regions of production) is nine. This data set has been very often used to check the performances of classification techniques.

4. Olive varieties [12]: two hundred and twenty-four samples of monovarietal olive oil are described by means of 15 chemical variables. There are three varieties of olives.

5. Parmesan [13]: eighty-eight samples of Parmesan cheese are described by 21 amino acids. The first category is that of ripened cheese, the second is of cheese too young to be PDO cheese.

Artificial data sets were used to evaluate the uncertainty of the estimate of the modeling parameters. They have 2 categories, with the same number of objects (N from 20 to 1000) and V variables (from 1 to 10). For each variable the value for the first category was extracted from N(0,1), for the second category from N(d,1). The distance between the centroids of the two populations was consequently D = d√V. It was varied from 0 to 6 with step 0.2 or 0.25. All the variables have the same discriminant power. The LDA discriminant variable for the populations is the straight line joining the two centroids. Fig. 2 shows the populations with two variables and D = 2 (Fig. 2A) and D = 6 (Fig. 2B).

For each value of V, N and D, one thousand data sets have been extracted from the population (one example is shown in Fig. 3). UNEQ and MRM were applied. The 95% model of UNEQ was computed by means of the T2 statistics. The forced model (green ellipse in Fig. 3A) corresponds to a Mahalanobis distance equal to the distance of the object farthest from the centroid. In the case of MRM the model was obtained using only the discriminant direction (Fig. 3B). The validation parameters and their uncertainty were obtained from their mean value on the one thousand data sets and from the limits of the 95% confidence interval.
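The artificial data generation described above can be sketched as follows (D = d√V is the reconstructed relation between the per-variable shift and the centroid distance; seed and sizes are arbitrary choices for the example):

```python
import numpy as np

def make_artificial(N, V, D, rng):
    """Two categories with N objects and V variables each: category 1 ~ N(0,1) and
    category 2 ~ N(d,1) per variable, with d chosen so the centroid distance is D = d*sqrt(V)."""
    d = D / np.sqrt(V)
    X1 = rng.normal(0.0, 1.0, size=(N, V))
    X2 = rng.normal(d, 1.0, size=(N, V))
    return X1, X2

rng = np.random.default_rng(42)
X1, X2 = make_artificial(N=1000, V=2, D=2.0, rng=rng)
# the empirical centroid distance approaches the nominal D
print(np.linalg.norm(X2.mean(axis=0) - X1.mean(axis=0)))  # ≈ 2.0
```

Repeating the extraction (the paper uses one thousand data sets per setting) gives the spread of the validation parameters.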

5. Validation

As with the other class modeling techniques, sensitivity and specificity can be measured with an evaluation procedure, such as the cross validation (CV) used here.

Cross validation divides the objects into a number G of cancellation groups. The model is developed G times (CV models). Each time, the objects in one of the cancellation groups constitute the evaluation set, and the other objects constitute the training set. All the objects are in the evaluation set exactly once. Finally, the model is developed with all the objects (final model).
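The CV estimates of sensitivity and specificity can be sketched with a simple range model standing in for the class modeling technique (invented data; G cancellation groups assigned round-robin):

```python
import numpy as np

def cv_sensitivity_specificity(X_class, X_other, G=5):
    """CV sensitivity: % of held-out class objects accepted by the model built on the
    remaining objects. CV specificity: % of other-category objects rejected by the
    CV models. A plain range model (accept if all variables are inside the training
    ranges) stands in here for the class modeling technique."""
    n = len(X_class)
    groups = np.arange(n) % G                    # round-robin cancellation groups
    accepted, rejected_other = 0, 0
    for g in range(G):
        train = X_class[groups != g]
        lo, hi = train.min(axis=0), train.max(axis=0)
        def inside(A):
            return np.all((A >= lo) & (A <= hi), axis=1)
        accepted += int(inside(X_class[groups == g]).sum())
        rejected_other += int((~inside(X_other)).sum())
    sensitivity = 100.0 * accepted / n
    specificity = 100.0 * rejected_other / (G * len(X_other))
    return sensitivity, specificity

rng = np.random.default_rng(1)
X_class = rng.normal(0.0, 1.0, size=(50, 2))    # modeled category
X_other = rng.normal(6.0, 1.0, size=(50, 2))    # well-separated other category
sens, spec = cv_sensitivity_specificity(X_class, X_other)
print(sens, spec)  # sensitivity < 100 (CV range models shrink); specificity near 100 here
```

The sketch makes visible the point discussed below: CV range models are smaller than the final model, so CV sensitivity is necessarily below 100% even though the final range model accepts all training objects.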

The use of validation for the evaluation of the predictive ability of classification techniques is very common. On the contrary, the use of CV sensitivity and CV specificity seems very rare.


Fig. 2 – Artificial data: populations represented with 2000 objects in both categories. Circles indicate the 95% confidence interval. Arrows indicate the two original variables. D is the distance between the two centroids.

Fig. 3 – A statistical sample (2 variables, 500 objects in each category). A: UNEQ 95% model for the red class (red) and model forced to accept all the red objects (green). B: discriminant direction used for MRM modeling. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

Table 1 – 95% critical values of the T2 distribution

Objects   1 variable   2 variables   5 variables   10 variables
20        4.38070      7.50401       16.26522      56.58704
25        4.25973      7.14174       15.09859      40.69912
30        4.18295      6.91944       14.35686      34.04447
35        4.12997      6.76883       13.84624      30.41564
40        4.09127      6.66045       13.47201      28.13987
45        4.06173      6.57839       13.18706      26.57993
50        4.03842      6.51447       12.96208      25.44649
55        4.01949      6.46287       12.78083      24.58513
60        4.00399      6.42077       12.63083      23.90818
65        3.99093      6.38534       12.50531      23.36357
70        3.97982      6.35546       12.39804      22.91500
75        3.97018      6.32985       12.30611      22.53916
80        3.96188      6.30731       12.22583      22.22006
85        3.95455      6.28792       12.15547      21.94603
90        3.94808      6.27058       12.09293      21.70740
95        3.94235      6.25519       12.03754      21.49795
100       3.93710      6.24144       16.26522      21.31310

Here, we used CV with five cancellation groups, and:

- CV sensitivity is defined as the percent of the objects in the evaluation sets accepted by the models developed with the objects in the training set; the α% CV sensitivity, in the case of UNEQ and SIMCA, is that obtained with the models whose class boundary is determined with the α% (95%) confidence level;
- CV specificity (and α% CV specificity) is defined as the percent of the objects of the other categories (both of the training and of the evaluation sets) rejected by the models developed with the objects in the training set;
- α% sensitivity (that of the final model, with all the objects in the training set and the class boundary determined with the α% confidence level);
- α% specificity;
- forced model specificity (final model, with all the objects in the training set and sensitivity forced to 100%);
- efficiency, defined as the mean (here the geometric mean) of sensitivity and specificity.

Sensitivity of URM and MRM is always 100%, by definition. So, MRM specificity corresponds to the forced model specificity of UNEQ and SIMCA.

SIMCA and MRM CV models are generally smaller than the model built with all the objects. So, their specificity can be larger than that of the final model. Therefore, specificity is the only parameter that CV can overestimate. On the contrary, CV sensitivity is generally worse than the final sensitivity.

In the case of UNEQ, CV models are about as large as the final model, because of the characteristics of the T2 statistics (Table 1). Fig. 4A shows that the 95% confidence ellipses in the five CV groups and the final model (red) have almost the same size, also for the two CV groups with the two anomalous samples outside. The size of the 95% models is rather small, so that a high specificity can be expected. Fig. 4B shows the size of the

Fig. 4 – Data set white wines. CV models of: A: UNEQ 95%; B: UNEQ forced models; C: MRM tolerance 0.

models forced to accept all the samples is very large, mainly because the hypothesis of multivariate normality obliges the model to become larger also in the direction opposite to that of the object to be accepted. Fig. 4C shows that the MRM models have a small size compared with those in Fig. 4B, so that a higher specificity can be expected.

6. Results and discussion

All the reported results were obtained with modules of the software PARVUS [14].

6.1. Artificial data

Some results are reported in Figs. 5 and 6. MRM models always have a specificity larger than that of UNEQ forced models, and frequently also larger than that of UNEQ 95% models. The difference in performance is very large when the ratio between the number of objects and the number of variables is small. When this ratio increases (Fig. 5B) the UNEQ 95% model has the largest specificity and the UNEQ forced model specificity approaches that of the MRM model.

These results are the consequence of two factors: (a) MRM works on the discriminant variable, the direction with the maximum separation between the categories; (b) UNEQ forced models frequently broaden in the direction of the second category even when the farthest objects are in the opposite direction, as in the example of Fig. 3.

Sensitivity and specificity are experimental quantities, and

their confidence interval should be reported as for all theexperimental quantities. Never in the literature the uncer-tainty of these modeling parameter are reported. Also for theclassification and prediction performances the uncertainty

Fig. 5 – Specificity of the UNEQ and MRM models functionof the distance D between the categories, 1 variable. A: 20objects, B: 200 objects.

Page 7: Multivariate range modeling, a new technique for multivariate class modeling

a n a l y t i c a c h i m i c a a c t a 6 2 2 ( 2 0 0 8 ) 85–93 91

Fig. 6 – Specificity of the UNEQ and MRM models functionoo

ittfadTrum

trtda

6

TTrop

Table 2 – Results of modeling techniques, data setprocessionary

UNEQ SIMCA MRM MRM5

Mean % classificationrate

87.93 84.48 86.21 82.76

Mean % prediction rate 75.86 77.59 65.52 72.41Mean CV % sensitivity 91.38 82.76 53.45 63.79Mean % sensitivity 93.10 96.55Mean % CV specificity 26.55 50.00 78.62 69.31Mean % specificity 27.59 46.55Mean % specificity

forced model22.41 41.38 72.41 62.07

% Efficiency 50.68 67.04% CV efficiency 49.26 64.33 64.82 66.49% Efficiency forced

model47.34 64.33 85.10 78.78

MRM5 indicates the results obtained with tolerance 5.

Table 3 – Results of modeling techniques, data set Oliveoil

UNEQ SIMCA MRM MRM10

Mean % classificationrate

98.08 88.11 96.33 93.18

Mean % prediction rate 96.15 86.01 89.69 89.69Mean CV % sensitivity 90.21 85.31 69.76 84.62Mean % sensitivity 89.69 90.21Mean % CV specificity 96.13 96.64 99.23 98.28Mean % specificity 96.00 95.98Mean % specificity

forced model89.90 88.88 99.13 97.92

% Efficiency 92.79 93.05% CV efficiency 93.12 90.80 83.20 91.19% Efficiency forced

model94.82 94.27 99.56 98.96

MRM10 indicates the results obtained with tolerance 10.

Table 4 – Results of modeling techniques, data set Olivevarieties

UNEQ SIMCA MRM MRM6

Mean % classificationrate

96.88 93.30 98.66 94.20

Mean % prediction rate 85.78 85.71 84.82 81.25Mean CV % sensitivity 79.02 83.93 68.30 78.13

81.25 87.05Mean % CV specificity 84.24 83.53 91.70 88.26

81.70 79.24Mean % specificityforced model

45.54 34.60 89.51 85.04

81.47 83.06% CV efficiency 81.59 83.73 79.14 83.04

f the distance D between the categories, 10 variables. A: 20bjects, B: 200 objects.

s never reported, in spite there are statistics to computehem [15,16]. Figs. 7 and 8 show the 95% half interval forhe uncertainty of the specificity, for 95% UNEQ models andor MRM models, in the case of 1 and 10 variables. Easily anpproximate estimate of the uncertainty can be obtained forifferent number of variables by interpolation–extrapolation.he uncertainty is very large, especially for specificities in theange 20–80%. It depends very much on the number of objectssed in validation, as the uncertainty on classification perfor-ances.The results in Figs. 7 and 8 are valid on the hypothesis that

he statistical sample (the objects used to develop the model) isandomly extracted from a normal distribution. Surely, whenhe samples are selected, e.g. by means of a Kennard Stoneesign [17], among a lot of available samples to obtain anlmost uniform distribution, the uncertainty is lower.

6.2. Real data

Tables 2–5 show the results obtained with the real data sets. Though class modeling is the objective of CMT, the results of classification are also reported. UNEQ was used with the original variables, when possible, or with the principal components (a minimum ratio of 3 between the number of objects

Table 4 (continued)

% Efficiency, forced model          67.48   58.82   94.61   92.22

MRM6 indicates the results obtained with tolerance 6.
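The efficiency rows in Tables 3–5 are consistent with efficiency computed as the geometric mean of sensitivity and specificity. This is an inference from the reported numbers, not a definition given in this section; a quick check against the UNEQ column of Table 4:

```python
import math

def efficiency(sensitivity_pct: float, specificity_pct: float) -> float:
    # geometric mean of sensitivity and specificity, in percent
    return math.sqrt(sensitivity_pct * specificity_pct)

# UNEQ, data set Olivevarieties (Table 4):
print(round(efficiency(81.25, 81.70), 2))   # → 81.47, matches "% Efficiency"
print(round(efficiency(100.00, 45.54), 2))  # → 67.48, matches the forced model
```

The forced-model check also explains why forcing 100% sensitivity can raise the efficiency even while specificity drops.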


Fig. 7 – Error (95% confidence half interval) on specificity, QDA, 95%. A: 1 variable, B: 10 variables.

and the number of variables in each category was always respected). SIMCA was always used with two components for the inner space. Because SIMCA is a very flexible technique (number of components, expansion/contraction of the ranges), it should probably be possible to develop SIMCA models with better performances. MRM was used with the original variables and with the discriminant LDA variables, with tolerance 0 and with increasing values of the tolerance, to obtain models with large CV sensitivity without a significant loss of efficiency.
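The minimum ratio of 3 between objects and variables used for UNEQ amounts to a simple dimensionality cap. A minimal sketch of that rule, with a hypothetical helper name (not taken from the paper's software, PARVUS [14]):

```python
def uneq_dimensionality(n_objects: int, n_variables: int, min_ratio: int = 3) -> int:
    """Dimensions usable for a UNEQ class model: the original variables
    when the objects/variables ratio is at least min_ratio, otherwise
    the largest number of principal components the rule allows."""
    if n_objects >= min_ratio * n_variables:
        return n_variables
    return max(1, n_objects // min_ratio)

print(uneq_dimensionality(60, 10))  # → 10: enough objects, keep all variables
print(uneq_dimensionality(20, 12))  # → 6: too few objects, use at most 6 PCs
```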

MRM final models are always more efficient than the UNEQ and SIMCA models. CV sensitivity is not large, owing to the inherent weakness of models based on the range when the distribution of the objects is not uniform. Generally, CV sensitivity decreases with an increasing number of variables, because of the increased probability that objects of the evaluation set fall outside the range evaluated with the training set. A suitable tolerance value improves CV sensitivity, but a too large tolerance obviously decreases specificity. Table 5 shows that, in the case of the Parmesan data, CV sensitivity increases from 60.23 to 95.45 with tolerance 35. When MRM is applied only with the discriminant LDA variable, CV sensitivity is 93.18, and it increases up to 98.96 with tolerance 30.

Fig. 8 – Error (95% confidence half interval) on specificity, MRM. A: 1 variable, B: 10 variables.
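The interplay of ranges and tolerance described above can be sketched in a few lines. This is a minimal illustration of the range-model idea, interpreting the tolerance as a percentage of each variable's range width (an assumption on our part); it is not the PARVUS implementation [14], and all names are ours:

```python
def fit_ranges(X, tolerance_pct=0.0):
    """Ranges (min, max) of each variable over the training objects of a
    class, each expanded by tolerance_pct percent of its width."""
    ranges = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        lo, hi = min(col), max(col)
        slack = (hi - lo) * tolerance_pct / 100.0
        ranges.append((lo - slack, hi + slack))
    return ranges

def accepted(obj, ranges):
    # an object belongs to the model only if it falls inside every range
    return all(lo <= x <= hi for x, (lo, hi) in zip(obj, ranges))

train = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]]
model0 = fit_ranges(train)          # tolerance 0: the parallelepiped of the data
model10 = fit_ranges(train, 10.0)   # tolerance 10: each range widened by 10%
print(accepted([3.1, 11.0], model0))   # → False: just outside the raw range
print(accepted([3.1, 11.0], model10))  # → True: the tolerance widens the box
```

This also shows why a too large tolerance must lower specificity: the widened box accepts more objects of the other categories.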

MRM was born as a modeling technique. However, it can also be used as a classification technique. The distance from the centroid is used to avoid ambiguity in cases where the distance from the model D is zero for more than one category.

Generally, though not always, the classification performances of MRM (especially the prediction ability) are worse than those of UNEQ. This result was confirmed with many other data sets. In the


Table 5 – Results of modeling techniques, data set Parmesan

                                    UNEQ    SIMCA   MRM      MRM35    MRM*     MRM30*
Mean % classification rate          98.86   98.86   100.00   100.00   100.00   100.00
Mean % prediction rate              96.55   96.59   100.00    98.86    98.86    98.86
Mean CV % sensitivity               90.91   79.55    60.23    95.45    93.18    98.86
Mean % sensitivity                  92.05   88.64      –        –        –        –
Mean % CV specificity               87.73   99.77   100.00   100.00   100.00   100.00
Mean % specificity                  95.45  100.00      –        –        –        –
Mean % specificity, forced model    71.59   95.45   100.00   100.00   100.00   100.00
% Efficiency                        93.73   94.15      –        –        –        –
% CV efficiency                     89.30   89.09    77.61    97.70    96.53    99.43
% Efficiency, forced model          84.61   97.70   100.00   100.00   100.00   100.00

MRM35 indicates the results obtained with tolerance 35. MRM* and MRM30* indicate that only the canonical variable was used as predictor, with tolerance 0 or 30. Dashes mark values not reported in the original table.

case of SIMCA, the two methods seem to be more or less equivalent.

7. Conclusions

MRM can be considered a powerful technique for class modeling problems and, to a lesser extent, for classification problems.

The main favorable characteristics are:

(a) a distribution-free technique;
(b) 100% sensitivity, very important in many real problems, with the associated large efficiency; and
(c) mathematical simplicity, such that it can be easily understood also by people with a limited background in statistics.

The performances of MRM are mainly due to the use of discriminant variables. Obviously, UNEQ and SIMCA can also be modified to work with discriminant variables to increase their specificity. The improvement of some details of MRM and the modification of UNEQ and SIMCA will be studied.

Acknowledgement

Study developed with funds from PRIN 2006 (National Ministry of University and Research, University of Genova).

References

[1] S. Wold, M. Sjostrom, in: B.R. Kowalski (Ed.), Chemometrics: Theory and Applications, ACS Symposium Series, vol. 52, American Chemical Society, 1977, pp. 243–282.
[2] M.P. Derde, D.L. Massart, Anal. Chim. Acta 184 (1986) 33–51.
[3] H. Hotelling, in: C. Eisenhart, M.W. Hastay, W.A. Wallis (Eds.), Techniques of Statistical Analysis, McGraw-Hill, NY, 1947, pp. 111–184.
[4] M. Rosenblatt, Ann. Math. Stat. 27 (1956) 832–837.
[5] D. Coomans, I. Broeckaert, Potential Pattern Recognition in Chemical and Medical Decision Making, Research Studies Press, Letchworth, 1986.
[6] M. Forina, C. Armanino, R. Leardi, G. Drava, J. Chemom. 5 (1991) 435–453.
[7] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. De Jong, P.J. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics, Elsevier Science Publications, Amsterdam, 1998, part A, p. 280.
[8] Data from the WinesDB EU Project, R. Wittkowski, P. Brereton, E. Jamin, X. Capron, C. Guillou, M. Forina, U. Roemisch, V. Cotea, E. Kocsi, R. Schoula, F. van Jaarsveld.
[9] M. Tenenhaus, La regression PLS, Technip, Paris, 1998, p. 158.
[10] M. Forina, E. Tiscornia, Annali Chim. (Rome) 72 (1982) 143–155.
[11] J. Zupan, M. Novic, X. Li, J. Gasteiger, Anal. Chim. Acta 292 (1994) 219–234.
[12] S. Lanteri, C. Armanino, E. Perri, A. Palopoli, Food Chem. 76 (2002) 501–507.
[13] E. Resmini, L. Pellegrino, M. Bertuccioli, Riv. Soc. Ital. Sci. Alim. 15 (1986) 315–326.
[14] M. Forina, S. Lanteri, C. Armanino, C. Casolino, M. Casale, P. Oliveri, PARVUS Release 2008, Dip. Chimica e Tecnologie Farmaceutiche, University of Genova, available (free, with manual and examples) from the authors or at http://www.parvus.unige.it.
[15] J.K. Martin, D.S. Hirschberg, Department of Information and Computer Science, University of California, Irvine, Technical Report No. 96-22, 1996, pp. 1–3.
[16] M. Forina, S. Lanteri, S. Rosso, Chemom. Intell. Lab. Syst. 57 (2001) 121–132.
[17] R.W. Kennard, L.A. Stone, Technometrics 11 (1969) 137–148.