Cross-validation of Nearest-Neighbour Discriminant Analysis

A.P. White¹
Computer Centre, Elms Road, University of Birmingham, Edgbaston, Birmingham B15 2TT, United Kingdom

Abstract

The SAS statistical package contains a general-purpose discriminant procedure, DISCRIM. Among the options available for this procedure are ones for performing nearest-neighbour discriminant analysis and cross-validation. Each of these works well enough when used separately but, when the two options are used together, an optimistic bias in cross-validated performance emerges. For certain parameter values, this bias can be dramatically large. The cause of the problem is analyzed mathematically for the two-class case with uniformly distributed data and demonstrated by simulation for normal data. The corresponding misbehaviour for multiple classes is also demonstrated by Monte Carlo simulation. A modification to the procedure, which would remove the bias, is proposed.

Key Words: Cross-validation; nearest neighbour discriminant analysis; SAS; optimistic bias.

¹ A.P. White is also an Associate Member of the School of Mathematics and Statistics at the University of Birmingham.
Hand (1981) shows how posterior probabilities can be estimated in such
a situation. The essence of his argument is as follows. The class-specific
probability density for class w_i at x is estimated by:

f̂_i(x) = k_i / (n_i V)    (3)

Similarly, the unconditional probability density at x is estimated by:

f̂(x) = k / (n V)    (4)
If the sample sizes, n_i, are proportional to the prior probabilities, p(w_i), then the priors can also be estimated by:

p̂(w_i) = n_i / n    (5)
Substituting from Equations 3, 4 and 5 into 1 gives estimates for the posterior
probabilities:
p̂(w_i|x) = k_i / k    (6)
On the other hand, if the sample sizes are not proportional to the priors, then
an adjustment is required. In this case, let:

p(w_i) = r_i n_i / n    (7)

where the various r_i are adjustment factors for the lack of proportionality.
The estimates for the posterior probabilities now become:

p̂(w_i|x) = r_i k_i / Σ_j r_j k_j    (8)
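To make this concrete, the posterior estimates in Equations 6 and 8 are simple functions of the kernel counts. The sketch below (a Python illustration with a hypothetical helper name, `knn_posteriors`; it takes the adjustment factors as r_i = p(w_i) n / n_i, which reduces to Equation 6 under proportional sampling) computes them:

```python
from fractions import Fraction

def knn_posteriors(counts, sample_sizes, priors=None):
    """Posterior estimates from k-NN kernel counts.

    With proportional sampling (priors=None), Equation 6 gives
    p(w_i|x) = k_i / k.  Otherwise each class gets an adjustment
    factor r_i = p(w_i) * n / n_i (Equation 7), and Equation 8 gives
    p(w_i|x) = r_i k_i / sum_j r_j k_j.
    """
    n = sum(sample_sizes)
    if priors is None:
        k = sum(counts)
        return [Fraction(ki, k) for ki in counts]
    r = [Fraction(p) * n / ni for p, ni in zip(priors, sample_sizes)]
    total = sum(ri * ki for ri, ki in zip(r, counts))
    return [ri * ki / total for ri, ki in zip(r, counts)]
```

With equal priors but unequal sample sizes, the adjustment up-weights the under-sampled class, exactly as the r_i are intended to do.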
2 Cross-validation Anomaly in SAS
The statistical package SAS contains a multi-purpose discriminant procedure,
called DISCRIM. This procedure has options for k-NN discriminant analysis and for cross-validation. These options may be used in combination.
However, the way in which this is implemented in SAS is responsible for a
rather strange difficulty which arises under cross-validation, in the form of a
parameter-dependent bias in the cross-validated error rate estimate. In certain
circumstances, this bias can be dramatically large. This anomaly is, perhaps,
best introduced by means of an example, leaving a more general treatment of
the problem until later in this paper.
Suppose that cross-validation is being performed on a data set in which
the measurement space consists of a single uniformly distributed variable, x,
and that observations belong to one of two equiprobable classes. Suppose that
x contains no information at all about class membership and that the sample
sizes nl and n2 are equal. Now, consider the behaviour of an algorithm,
operating as previously specified, with a parameter setting of k = 2. The
distribution of kernel membership over the cross-validation procedure will be
very nearly binomial (k, p), where p = n_1/n. In this case, p = 1/2. (The distribution is not exactly binomial because p changes very slightly, according to the actual class membership of the observation being classified, but this small detail is unimportant here.)
The focus of interest is the consequences which follow from tied class
membership in the kernels. In this example, approximately half the kernels
would be expected to have one neighbour belonging to each class. Without
loss of generality, consider what happens when a member of class 1 is subjected
to cross-validatory classification in this situation, in order to estimate e_cv (the
cross-validated error rate). From Equation 3, it can be seen that:

f̂_1(x) = 1 / (V(n_1 − 1))    (9)

and also that:

f̂_2(x) = 1 / (V n_2)    (10)
As the prior probabilities are equal, it follows that p̂(w_1|x) > p̂(w_2|x) and
the case will be classified correctly. Kernels with pure class membership will
obviously produce classification in the expected direction. This leads to the
expected value for the cross-validated error rate, E(e_cv), being only 1/4, rather than 1/2 as expected under a random assignment of observations to predicted classes. With these parameter values, it is easy to see that when k is changed, the parity of k has a marked effect on the estimated error rate under cross-validation, because of the effect of ties in the kernel membership when k is
even but not when it is odd. Thus, for odd values of k, E(e_cv) = 1/2 but for even values of k, an optimistic bias in E(e_cv) is clearly evident:

E(e_cv) = (1/2)[1 − C(k, k/2) 2^(−k)]    (for k even)    (11)

where C(k, k/2) denotes the binomial coefficient, i.e. the probability of a tied kernel is deducted, since tied kernels are always resolved in favour of the true class.
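For two equiprobable, uninformative classes with equal sample sizes and even k, a tied kernel (k/2 neighbours from each class) occurs with probability C(k, k/2)2^(−k) and is always resolved in favour of the true class, while untied kernels err half the time, so the expected cross-validated error rate is (1/2)[1 − C(k, k/2)2^(−k)]. A quick tabulation of this expression (a sketch under exactly these assumptions):

```python
from math import comb

def expected_cv_error(k):
    """Expected cross-validated error rate for two equiprobable,
    uninformative classes with equal sample sizes (SAS-style k-NN).

    Odd k: no ties are possible, so the rate is 1/2.
    Even k: a tied kernel (probability C(k, k/2) / 2**k) is always
    classified correctly, pulling the rate below 1/2.
    """
    if k % 2 == 1:
        return 0.5
    return 0.5 * (1 - comb(k, k // 2) / 2**k)

for k in range(1, 7):
    # k = 2 gives 0.25; k = 4 gives 0.3125; odd k gives 0.5
    print(k, expected_cv_error(k))
```

The bias is largest at k = 2 and shrinks as k grows, since the tie probability C(k, k/2)2^(−k) decays with k.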
Another disturbing feature of this approach is the relationship that emerges between e_rs (the resubstitution error estimate) and e_cv. Under resubstitution, for the same parameter values, the effect of ties in kernel membership is different. In these circumstances, the class-specific probability densities will be exactly equal, leading to a tie in the posterior probabilities. In SAS these ties are evaluated conservatively (i.e. as classification errors). Consequently, for k > 1, the following relationship holds:

e_rs(k+1) = e_cv(k)    (12)
(Of course, for k = 1, e_rs is zero, because each observation is its own nearest neighbour.) In fact, this relationship is quite general and can be shown to hold for any number of equiprobable classes. Moving from cross-validation with k nearest neighbours to resubstitution with k + 1 increases the number of neighbours of the same class as the test case by one. Because of the different consequences of having ties for the majority in kernel membership under resubstitution and cross-validation, this means that the judgement of majority membership will not differ between the two schemes.
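Both the even-k bias and the resubstitution relationship are easy to reproduce in a small simulation. The sketch below (Python, not the SAS code, but following the same density formulas: the deleted case's own class uses k_i/(n_i − 1), other classes use k_j/n_j, and tied posteriors are scored conservatively as errors) draws one uninformative uniform variable for two equal classes:

```python
import random

random.seed(1)

def simulate(n_per_class=300):
    """One uninformative uniform variable, two equal classes."""
    xs = [random.random() for _ in range(2 * n_per_class)]
    ys = [i % 2 for i in range(2 * n_per_class)]   # labels carry no information
    return xs, ys

def knn_counts(xs, i, k, ys, include_self):
    """Class counts among the k nearest neighbours of point i."""
    idx = sorted((j for j in range(len(xs)) if include_self or j != i),
                 key=lambda j: abs(xs[j] - xs[i]))[:k]
    return [sum(1 for j in idx if ys[j] == c) for c in (0, 1)]

def cv_error(xs, ys, k):
    """Leave-one-out error with SAS-style densities: the deleted case's
    own class uses k_i/(n_i - 1), the other class k_j/n_j (V cancels)."""
    n_c = [ys.count(0), ys.count(1)]
    err = 0
    for i in range(len(xs)):
        counts = knn_counts(xs, i, k, ys, include_self=False)
        t = ys[i]
        f_true = counts[t] / (n_c[t] - 1)
        f_other = counts[1 - t] / n_c[1 - t]
        err += f_true <= f_other           # ties scored as errors
    return err / len(xs)

def rs_error(xs, ys, k):
    """Resubstitution error; each case is its own nearest neighbour."""
    n_c = [ys.count(0), ys.count(1)]
    err = 0
    for i in range(len(xs)):
        counts = knn_counts(xs, i, k, ys, include_self=True)
        t = ys[i]
        err += counts[t] / n_c[t] <= counts[1 - t] / n_c[1 - t]
    return err / len(xs)

xs, ys = simulate()
print("e_cv(k=2):", cv_error(xs, ys, 2))   # near 1/4, not 1/2
print("e_cv(k=3):", cv_error(xs, ys, 3))   # near 1/2
print("e_rs(3) == e_cv(2):", rs_error(xs, ys, 3) == cv_error(xs, ys, 2))
```

With equal sample sizes the per-case decision under resubstitution with k + 1 is identical to that under cross-validation with k, so the two error rates agree exactly, not merely in expectation.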
Now, all these strange properties follow from the fact that, for tied kernel
membership, the density estimates for the two classes are not equal (as might
be naively expected) but are biased in favour of the class to which the deleted
observation belongs. In the parametric situation, by contrast, this does not
happen. Consider a particular observation being classified using Fisher's linear
discriminant analysis. Suppose that, under resubstitution, the observation
lies at a point exactly mid-way between the two group means (and hence the
group-specific densities are equal). Under cross-validation, the group mean of
the class to which the deleted case belongs will have moved slightly farther
away from the observation itself (because this observation no longer makes
a contribution to the computation of the mean) and hence the class-specific
density estimate will be somewhat lower for the true class, than for the other
class.
When the sample sizes are not equal, the effect of sample size on the
estimates of group-specific density and hence on the estimated posterior probabilities is easily calculated, for ties in kernel membership. The results are
summarised in Table 1. It can easily be seen that, for |n_1 − n_2| ≤ 1, there is an optimistic bias in the classification behaviour under cross-validation. Outside these limits, the mean error rate has the appropriate theoretical value but the performance is markedly different for observations from the two different classes.
                   Actual Class
  n_1 − n_2        1          2
    > 1          wrong      correct
      1          tie        correct
      0          correct    correct
     −1          correct    tie
   < −1          correct    wrong
Table 1: Classification behaviour under cross-validation, for ties in kernel membership, as a function of differences in sample size. See text for further explanation.
3 Differences in Class Location
For uniformly distributed data, it is a simple matter to generalise the argument
just presented to the situation where there are differences in location between
two equiprobable classes. Let one class have uniformly distributed data lying
in the range (0, 1) and the other have uniform data in the range (s, 1 + s). Thus
s is the separation distance between the class means and the classes overlap
for a distance 1 - s on the data line. A Bayes' decision rule will give errors
only where the classes overlap in the data space. Thus, the Bayes' error rate is (1 − s)/2.
Table 2: Cross-validation estimates of error rates as a function of k and experimental condition, obtained in Monte Carlo simulation. The parametric cross-validation estimates for Fisher's linear discriminant analysis are also given, in addition to the Bayes error rate. See text for further explanation.
The resulting error rates are shown in Table 2. For each experimental
condition, there is the same type of optimistic bias in E(e_cv), for even values of k, as was deduced analytically for the uniform distribution. It is obvious
that these estimates are impossibly optimistic because they are smaller than
the Bayes' rate, most noticeably for k = 2. These results are hardly surprising,
because the essence of the problem lies in the high frequency of kernel ties and
the way that they are evaluated in SAS, and is not dependent on the particular within-class distribution.
6 Extension to Multiple Classes by Monte Carlo Simulation
If more than two classes are considered, the position becomes more complex
because many different possibilities for ties emerge for majority kernel membership. For example, if c = 3 and k = 6, a three-way tie is possible, as
well as three possible two-way ties. Also, ties emerge even when k is not a
multiple of c. For example, with k = 6 and four classes, a kernel might have
two observations from each of two classes and a single observation from each
of the other two classes.
Thus, an exact analytic approach to examining the effect of ties for more
than two classes is extremely tedious and not worth the trouble. For this
reason, a simple Monte Carlo simulation was performed, as follows. Twelve
thousand observations were drawn from a uniform distribution and class membership indicator variables for 2, 3, 4, 5 and 6 classes were simulated (independently of the continuous variable), so as to have exactly equal numbers of observations in each class. Thus five data sets were generated, all sharing the same independent variable, which conveyed no information about class membership. The classification performance of k-NN discriminant analysis was
estimated using the DISCRIM procedure in SAS, with the cross-validation
option. Each data set was tested using values of k from 1 to 12. The resulting
error rates are shown in Table 3. The following points should be noted.
1. The error rates for c = 2 are as expected from Equation 11.
2. In all cases where k = 1, the error rates approximate closely to the
theoretical expected values.
3. In all situations where c > 2 and k > 1, the results show clearly that
the estimated error rates are substantially lower than the corresponding
theoretical expected values.
4. Resubstitution estimates of the error rates were also recorded. For k > 1, they confirmed exactly the relationship with the cross-validation estimates given in Equation 12.
Table 3: Cross-validation estimates of error rates as a function of k and number of classes, obtained in Monte Carlo simulation. The parametric cross-validation estimates for Fisher's linear discriminant analysis are also given, in addition to the Bayes error rate. See text for further explanation.
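A reduced version of the experiment just described (a Python sketch, not the SAS run: a much smaller sample than the twelve thousand observations used above, and only the c = 3 case) reproduces the qualitative finding, namely that for k = 1 the error rate is close to the theoretical (c − 1)/c, while for k = 2 it falls well below it:

```python
import random

random.seed(2)

def multiclass_cv_error(n_per_class, c, k):
    """Leave-one-out k-NN error for c equiprobable classes on an
    uninformative uniform variable, with SAS-style densities
    (k_i/(n_i - 1) for the deleted case's own class, k_j/n_j otherwise)
    and ties for the maximum scored conservatively as errors."""
    n = n_per_class * c
    xs = [random.random() for _ in range(n)]
    ys = [i % c for i in range(n)]          # labels carry no information
    errors = 0
    for i in range(n):
        idx = sorted((j for j in range(n) if j != i),
                     key=lambda j: abs(xs[j] - xs[i]))[:k]
        counts = [0] * c
        for j in idx:
            counts[ys[j]] += 1
        t = ys[i]
        # deleted case reduces its own class size by one
        dens = [counts[cl] / (n_per_class - (cl == t)) for cl in range(c)]
        best = max(dens)
        # correct only if the true class is the unique maximum
        errors += not (dens[t] == best and dens.count(best) == 1)
    return errors / n

print(multiclass_cv_error(300, 3, 1))  # near 2/3, as theory predicts
print(multiclass_cv_error(300, 3, 2))  # well below 2/3: optimistic bias
```

For c = 3 and k = 2 the leave-one-out boost means the case is misclassified only when neither neighbour belongs to its own class, giving an expected rate near 4/9 rather than the theoretical 2/3.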
7 A Possible Remedy
The basis of the problem lies in peculiarities of the density estimation procedure in the k-NN algorithm under cross-validation, compounded by the high
frequency of kernel ties. However, it is possible to compensate for this by
making adjustments to the estimates of the prior probabilities. Hence, one solution to the problem is to estimate the prior probabilities from the data after case deletion, rather than fix them from the outset as is done conventionally.² If this course of action is taken then, under cross-validation, if the deleted case belongs to class i, the prior probability for membership of class i is then estimated by:

p̂(w_i) = (n_i − 1) / (n − 1)    (25)
² Note that this adaptation is proposed for the nonparametric algorithm only.
However, the prior probability for membership of any of the other classes, j,
is given by:

p̂(w_j) = n_j / (n − 1)    (26)
The corresponding class-specific densities at x are estimated as:
f̂_i(x) = k_i / (V(n_i − 1))    (27)

and

f̂_j(x) = k_j / (V n_j)    (28)

and the unconditional density is estimated by:

f̂(x) = k / ((n − 1)V)    (29)

Thus the corresponding estimated posterior probabilities become:

p̂(w_i|x) = k_i / k    (30)

and

p̂(w_j|x) = k_j / k    (31)
In these circumstances, if there is a tie for majority kernel membership involving the class to which the deleted case belongs, then there will also be
a tie in the estimated posterior probabilities. It is proposed that a random
classification choice is made between the tied classes in these cases. If this is
done, then the resulting error rate will be essentially unbiased.
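A sketch of the proposed remedy (Python, with an illustrative function name, not SAS code): priors are estimated after deletion as in Equations 25 and 26, so the posteriors reduce to k_i/k, tied kernel counts now yield genuinely tied posteriors, and ties are broken by random choice. On uninformative two-class data the cross-validated error rate then returns to the unbiased value of 1/2, even for k = 2:

```python
import random

random.seed(3)

def remedied_cv_error(n_per_class, k):
    """Leave-one-out k-NN error with priors estimated after deletion:
    p(w_i) = (n_i - 1)/(n - 1) for the deleted case's class and
    n_j/(n - 1) otherwise, so posteriors reduce to k_i/k and genuine
    ties appear.  Ties for the maximum are broken by random choice."""
    n = 2 * n_per_class
    xs = [random.random() for _ in range(n)]
    ys = [i % 2 for i in range(n)]          # two uninformative classes
    errors = 0
    for i in range(n):
        idx = sorted((j for j in range(n) if j != i),
                     key=lambda j: abs(xs[j] - xs[i]))[:k]
        counts = [0, 0]
        for j in idx:
            counts[ys[j]] += 1
        # posteriors are counts[c]/k; pick the max, at random among ties
        best = max(counts)
        pred = random.choice([c for c in (0, 1) if counts[c] == best])
        errors += pred != ys[i]
    return errors / n

print(remedied_cv_error(400, 2))  # near 1/2: the bias is gone
```

For k = 2 the tied kernels (probability about 1/2) are now resolved correctly only half the time, which restores the expected error rate of 1/4 + (1/2)(1/2) = 1/2.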
Of course, the approach just described is appropriate only when the samples have been drawn so as to be representative of the populations that they are intended to represent. If the priors are non-proportional, then this approach needs to be modified. In this situation, the priors must be specified
initially by the user. To begin with, this additional information is ignored and
the computation is performed as previously specified, up to the point where
the posterior probabilities are estimated. However, the posterior probabilities
then need to be adjusted for the lack of proportionality before the assignment
to classes is made. Thus, if π_i is the user-specified prior for class w_i, then the adjustment factor required is:

r_i = π_i / p̂(w_i)    (32)
The appropriately adjusted estimate of the posterior probability is then given
by Equation 8.
8 Conclusion
This is an interesting example of a problem occurring in statistical software
which is caused, not by a computing error, but by a mathematical one. The misbehaviour of the k-NN algorithm under cross-validation is entirely deducible from the mathematics given in the SAS manual (SAS Institute, 1989). In this respect, the nature of the problem is similar to the one reported by White & Liu (1993), in which a stepwise discriminant algorithm is improperly cross-validated. In both cases, the respective problems arose because of lack of
consideration of the effects of combining techniques. In the case of SAS, the
k-NN algorithm works well enough when considered in isolation and so does
the cross-validation technique. The difficulty arises when the two techniques
are used in combination. Apart from SAS, nonparametric discriminant techniques are not available in the commonly used statistical software with which the author is familiar. Hence, problems arising from combining these two techniques do not seem to have been reported elsewhere.
The solution offered in this paper keeps as close as possible to the original philosophies of both cross-validation and k-NN discriminant analysis. It involves estimating the prior probabilities from the data after the case deletion which forms part of the cross-validation procedure. The only possibly contentious aspect is the proposed use of random choice between predicted classes in the case of ties in posterior probabilities. One feature of this approach is that the procedure is non-repeatable. However, there are precedents
for this type of procedure. Tocher (1950) proposed a modification to Fisher's
exact probability test which utilised random choice in order to achieve specified α values for significance testing purposes. Also, the use of approximate
randomization techniques for conducting significance tests has been described
by Edgington (1980), Still & White (1981) and White & Still (1984, 1987).
The proposal here is to make use of random choice to achieve an unbiased
estimate of classification performance when kernel ties are encountered.
Acknowledgements
The author would like to thank Prof. J.B. Copas (from the Department of
Statistics at the University of Warwick) and Prof. A.J. Lawrance and Dr. P.
Davies (both from the School of Mathematics and Statistics, at the University
of Birmingham) for their helpful comments.
References
Cover, T.M. and Hart, P.E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13, 21-27.
Edgington, E.S. (1980). Randomization Tests, New York: Marcel Dekker.
Hand, D.J. (1981). Discrimination and Classification, New York: John Wiley & Sons Ltd.
SAS Institute Inc. (1989). SAS/STAT User's Guide, Version 6, Fourth Edition, Volume 1, Cary, NC, USA: SAS Institute Inc.
Still, A.W. and White, A.P. (1981). The approximate randomization test as an alternative to the F test in analysis of variance. British Journal of Mathematical and Statistical Psychology, 34, 243-252.
Tocher, K.D. (1950). Extension of the Neyman-Pearson theory of tests to discontinuous variates. Biometrika, 37, 130-144.
White, A.P. and Liu, W.Z. (1993). The jackknife with a stepwise discriminant algorithm - a warning to BMDP users. Journal of Applied Statistics, 20, (1), 187-190.
White, A.P. and Still, A.W. (1984). Monte Carlo analysis of variance. In Proceedings of the Sixth Symposium in Computational Statistics (Prague). Vienna: Physica-Verlag.
White, A.P. and Still, A.W. (1987). Monte Carlo randomization tests: A reply to Bradbury. British Journal of Mathematical and Statistical Psychology, 40, 188-191.