Cross-validation of Nearest-Neighbour Discriminant Analysis

A.P. White¹
Computer Centre, Elms Road, University of Birmingham, Edgbaston, Birmingham B15 2TT, United Kingdom

Abstract

The SAS statistical package contains a general-purpose discriminant procedure, DISCRIM. Among the options available for this procedure are ones for performing nearest-neighbour discriminant analysis and cross-validation. Each of these works well enough when used separately but, when the two options are used together, an optimistic bias in cross-validated performance emerges. For certain parameter values, this bias can be dramatically large. The cause of the problem is analyzed mathematically for the two-class case with uniformly distributed data and demonstrated by simulation for normal data. The corresponding misbehaviour for multiple classes is also demonstrated by Monte Carlo simulation. A modification to the procedure, which would remove the bias, is proposed.

Key Words: Cross-validation; nearest neighbour discriminant analysis; SAS; optimistic bias.

¹ A.P. White is also an Associate Member of the School of Mathematics and Statistics at the University of Birmingham.
Hand (1981) shows how posterior probabilities can be estimated in such
a situation. The essence of his argument is as follows. The class-specific
probability density for class w_i at x is estimated by:

f̂_i(x) = k_i / (n_i V)    (3)

Similarly, the unconditional probability density at x is estimated by:

f̂(x) = k / (n V)    (4)
If the sample sizes, n_i, are proportional to the prior probabilities, p(w_i), then the priors can also be estimated by:

p̂(w_i) = n_i / n    (5)
Substituting from Equations 3, 4 and 5 into 1 gives estimates for the posterior
probabilities:
p̂(w_i|x) = k_i / k    (6)
On the other hand, if the sample sizes are not proportional to the priors, then
an adjustment is required. In this case, let:

p(w_i) = r_i n_i / n    (7)

where the various r_i are adjustment factors for the lack of proportionality.
The estimates for the posterior probabilities now become:

p̂(w_i|x) = r_i k_i / Σ_j r_j k_j    (8)
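To make this concrete, the posterior estimates in Equations 6 and 8 are simple functions of the kernel counts. The sketch below (a Python illustration with a hypothetical helper name, `knn_posteriors`; it takes the adjustment factors as r_i = p(w_i) n / n_i, which reduces to Equation 6 under proportional sampling) computes them:

```python
from fractions import Fraction

def knn_posteriors(counts, sample_sizes, priors=None):
    """Posterior estimates from k-NN kernel counts.

    With proportional sampling (priors=None), Equation 6 gives
    p(w_i|x) = k_i / k.  Otherwise each class gets an adjustment
    factor r_i = p(w_i) * n / n_i (Equation 7), and Equation 8 gives
    p(w_i|x) = r_i k_i / sum_j r_j k_j.
    """
    n = sum(sample_sizes)
    if priors is None:
        k = sum(counts)
        return [Fraction(ki, k) for ki in counts]
    r = [Fraction(p) * n / ni for p, ni in zip(priors, sample_sizes)]
    total = sum(ri * ki for ri, ki in zip(r, counts))
    return [ri * ki / total for ri, ki in zip(r, counts)]
```

With equal priors but unequal sample sizes, the adjustment up-weights the under-sampled class, exactly as the r_i are intended to do.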
2 Cross-validation Anomaly in SAS
The statistical package SAS contains a multi-purpose discriminant procedure,
called DISCRIM. This procedure has options for k-NN discriminant analysis and for cross-validation. These options may be used in combination.
However, the way in which this is implemented in SAS is responsible for a
rather strange difficulty which arises under cross-validation, in the form of a
parameter-dependent bias in the cross-validated error rate estimate. In certain
circumstances, this bias can be dramatically large. This anomaly is, perhaps,
best introduced by means of an example, leaving a more general treatment of
the problem until later in this paper.
Suppose that cross-validation is being performed on a data set in which
the measurement space consists of a single uniformly distributed variable, x,
and that observations belong to one of two equiprobable classes. Suppose that
x contains no information at all about class membership and that the sample
sizes nl and n2 are equal. Now, consider the behaviour of an algorithm,
operating as previously specified, with a parameter setting of k = 2. The
distribution of kernel membership over the cross-validation procedure will be
very nearly binomial (k, p), where p = n_1/n. In this case, p = 1/2. (The distribution is not exactly binomial because p changes very slightly, according to the actual class membership of the observation being classified, but this small detail is unimportant here.)
The focus of interest is the consequences which follow from tied class
membership in the kernels. In this example, approximately half the kernels
would be expected to have one neighbour belonging to each class. Without
loss of generality, consider what happens when a member of class 1 is subjected
to cross-validatory classification in this situation, in order to estimate e_cv (the
cross-validated error rate). From Equation 3, it can be seen that:

f̂_1(x) = 1 / (V(n_1 − 1))    (9)

and also that:

f̂_2(x) = 1 / (V n_2)    (10)
As the prior probabilities are equal, it follows that p̂(w_1|x) > p̂(w_2|x) and
the case will be classified correctly. Kernels with pure class membership will
obviously produce classification in the expected direction. This leads to the
expected value for the cross-validated error rate, E(e_cv), being only 1/4, rather than 1/2 as expected under a random assignment of observations to predicted classes. With these parameter values, it is easy to see that when k is changed, the parity of k has a marked effect on the estimated error rate under cross-validation, because of the effect of ties in the kernel membership when k is
even but not when it is odd. Thus, for odd values of k, E(e_cv) = 1/2 but for even values of k, an optimistic bias in E(e_cv) is clearly evident:

E(e_cv) = (1/2)[1 − C(k, k/2) 2^(−k)]    (for k even)    (11)

where C(k, k/2) denotes the binomial coefficient, i.e. the probability of a tied kernel is deducted, since tied kernels are always resolved in favour of the true class.
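For two equiprobable, uninformative classes with equal sample sizes and even k, a tied kernel (k/2 neighbours from each class) occurs with probability C(k, k/2)2^(−k) and is always resolved in favour of the true class, while untied kernels err half the time, so the expected cross-validated error rate is (1/2)[1 − C(k, k/2)2^(−k)]. A quick tabulation of this expression (a sketch under exactly these assumptions):

```python
from math import comb

def expected_cv_error(k):
    """Expected cross-validated error rate for two equiprobable,
    uninformative classes with equal sample sizes (SAS-style k-NN).

    Odd k: no ties are possible, so the rate is 1/2.
    Even k: a tied kernel (probability C(k, k/2) / 2**k) is always
    classified correctly, pulling the rate below 1/2.
    """
    if k % 2 == 1:
        return 0.5
    return 0.5 * (1 - comb(k, k // 2) / 2**k)

for k in range(1, 7):
    # k = 2 gives 0.25; k = 4 gives 0.3125; odd k gives 0.5
    print(k, expected_cv_error(k))
```

The bias is largest at k = 2 and shrinks as k grows, since the tie probability C(k, k/2)2^(−k) decays with k.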
Another disturbing feature of this approach is the relationship that emerges between e_rs (the resubstitution error estimate) and e_cv. Under resubstitution, for the same parameter values, the effect of ties in kernel membership is different. In these circumstances, the class-specific probability densities will be exactly equal, leading to a tie in the posterior probabilities. In SAS these ties are evaluated conservatively (i.e. as classification errors). Consequently, for k > 1, the following relationship holds:

e_rs(k+1) = e_cv(k)    (12)
(Of course, for k = 1, e_rs is zero, because each observation is its own nearest neighbour.) In fact, this relationship is quite general and can be shown to hold for any number of equiprobable classes. Moving from cross-validation with k nearest neighbours to resubstitution with k + 1 increases the number of neighbours of the same class as the test case by one. Because of the different consequences of having ties for the majority in kernel membership under resubstitution and cross-validation, this means that the judgement of majority membership will not differ between the two schemes.
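Both the even-k bias and the resubstitution relationship are easy to reproduce in a small simulation. The sketch below (Python, not the SAS code, but following the same density formulas: the deleted case's own class uses k_i/(n_i − 1), other classes use k_j/n_j, and tied posteriors are scored conservatively as errors) draws one uninformative uniform variable for two equal classes:

```python
import random

random.seed(1)

def simulate(n_per_class=300):
    """One uninformative uniform variable, two equal classes."""
    xs = [random.random() for _ in range(2 * n_per_class)]
    ys = [i % 2 for i in range(2 * n_per_class)]   # labels carry no information
    return xs, ys

def knn_counts(xs, i, k, ys, include_self):
    """Class counts among the k nearest neighbours of point i."""
    idx = sorted((j for j in range(len(xs)) if include_self or j != i),
                 key=lambda j: abs(xs[j] - xs[i]))[:k]
    return [sum(1 for j in idx if ys[j] == c) for c in (0, 1)]

def cv_error(xs, ys, k):
    """Leave-one-out error with SAS-style densities: the deleted case's
    own class uses k_i/(n_i - 1), the other class k_j/n_j (V cancels)."""
    n_c = [ys.count(0), ys.count(1)]
    err = 0
    for i in range(len(xs)):
        counts = knn_counts(xs, i, k, ys, include_self=False)
        t = ys[i]
        f_true = counts[t] / (n_c[t] - 1)
        f_other = counts[1 - t] / n_c[1 - t]
        err += f_true <= f_other           # ties scored as errors
    return err / len(xs)

def rs_error(xs, ys, k):
    """Resubstitution error; each case is its own nearest neighbour."""
    n_c = [ys.count(0), ys.count(1)]
    err = 0
    for i in range(len(xs)):
        counts = knn_counts(xs, i, k, ys, include_self=True)
        t = ys[i]
        err += counts[t] / n_c[t] <= counts[1 - t] / n_c[1 - t]
    return err / len(xs)

xs, ys = simulate()
print("e_cv(k=2):", cv_error(xs, ys, 2))   # near 1/4, not 1/2
print("e_cv(k=3):", cv_error(xs, ys, 3))   # near 1/2
print("e_rs(3) == e_cv(2):", rs_error(xs, ys, 3) == cv_error(xs, ys, 2))
```

With equal sample sizes the per-case decision under resubstitution with k + 1 is identical to that under cross-validation with k, so the two error rates agree exactly, not merely in expectation.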
Now, all these strange properties follow from the fact that, for tied kernel
membership, the density estimates for the two classes are not equal (as might
be naively expected) but are biased in favour of the class to which the deleted
observation belongs. In the parametric situation, by contrast, this does not
happen. Consider a particular observation being classified using Fisher's linear
discriminant analysis. Suppose that, under resubstitution, the observation
lies at a point exactly mid-way between the two group means (and hence the
group-specific densities are equal). Under cross-validation, the group mean of
the class to which the deleted case belongs will have moved slightly farther
away from the observation itself (because this observation no longer makes
a contribution to the computation of the mean) and hence the class-specific
density estimate will be somewhat lower for the true class, than for the other
class.
When the sample sizes are not equal, the effect of sample size on the
estimates of group-specific density and hence on the estimated posterior probabilities is easily calculated, for ties in kernel membership. The results are
summarised in Table 1. It can easily be seen that, for |n_1 − n_2| ≤ 1, there is an optimistic bias in the classification behaviour under cross-validation. Outside these limits, the mean error rate has the appropriate theoretical value but the performance is markedly different for observations from the two different classes.
                   Actual Class
  n_1 − n_2        1          2
    > 1          wrong      correct
      1          tie        correct
      0          correct    correct
     −1          correct    tie
   < −1          correct    wrong
Table 1: Classification behaviour under cross-validation, for ties in kernel membership, as a function of differences in sample size. See text for further explanation.
3 Differences in Class Location
For uniformly distributed data, it is a simple matter to generalise the argument
just presented to the situation where there are differences in location between
two equiprobable classes. Let one class have uniformly distributed data lying
in the range (0, 1) and the other have uniform data in the range (s, 1 + s). Thus
s is the separation distance between the class means and the classes overlap
for a distance 1 - s on the data line. A Bayes' decision rule will give errors
only where the classes overlap in the data space. Thus, the Bayes' error rate is (1 − s)/2.
Table 2: Cross-validation estimates of error rates as a function of k and experimental condition, obtained in Monte Carlo simulation. The parametric cross-validation estimates for Fisher's linear discriminant analysis are also given, in addition to the Bayes error rate. See text for further explanation.
The resulting error rates are shown in Table 2. For each experimental
condition, there is the same type of optimistic bias in E(e_cv), for even values of k, as was deduced analytically for the uniform distribution. It is obvious
that these estimates are impossibly optimistic because they are smaller than
the Bayes' rate, most noticeably for k = 2. These results are hardly surprising,
because the essence of the problem lies in the high frequency of kernel ties and
the way that they are evaluated in SAS, and is not dependent on the particular within-class distribution.
6 Extension to Multiple Classes by Monte Carlo Simulation
If more than two classes are considered, the position becomes more complex
because many different possibilities for ties emerge for majority kernel membership. For example, if c = 3 and k = 6, a three-way tie is possible, as
well as three possible two-way ties. Also, ties emerge even when k is not a
multiple of c. For example, with k = 6 and four classes, a kernel might have
two observations from each of two classes and a single observation from each
of the other two classes.
Thus, an exact analytic approach to examining the effect of ties for more
than two classes is extremely tedious and not worth the trouble. For this
reason, a simple Monte Carlo simulation was performed, as follows. Twelve
thousand observations were drawn from a uniform distribution and class membership indicator variables for 2, 3, 4, 5 and 6 classes were simulated (independently of the continuous variable), so as to have exactly equal numbers of observations in each class. Thus five data sets were generated, all sharing the same independent variable, which conveyed no information about class membership. The classification performance of k-NN discriminant analysis was
estimated using the DISCRIM procedure in SAS, with the cross-validation
option. Each data set was tested using values of k from 1 to 12. The resulting
error rates are shown in Table 3. The following points should be noted.
1. The error rates for c = 2 are as expected from Equation 11.
2. In all cases where k = 1, the error rates approximate closely to the
theoretical expected values.
3. In all situations where c > 2 and k > 1, the results show clearly that
the estimated error rates are substantially lower than the corresponding
theoretical expected values.
4. Resubstitution estimates of the error rates were also recorded. For k > 1, they confirmed exactly the relationship with the cross-validation estimates given in Equation 12.
Table 3: Cross-validation estimates of error rates as a function of k and number of classes, obtained in Monte Carlo simulation. The parametric cross-validation estimates for Fisher's linear discriminant analysis are also given, in addition to the Bayes error rate. See text for further explanation.
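A reduced version of the experiment just described (a Python sketch, not the SAS run: a much smaller sample than the twelve thousand observations used above, and only the c = 3 case) reproduces the qualitative finding, namely that for k = 1 the error rate is close to the theoretical (c − 1)/c, while for k = 2 it falls well below it:

```python
import random

random.seed(2)

def multiclass_cv_error(n_per_class, c, k):
    """Leave-one-out k-NN error for c equiprobable classes on an
    uninformative uniform variable, with SAS-style densities
    (k_i/(n_i - 1) for the deleted case's own class, k_j/n_j otherwise)
    and ties for the maximum scored conservatively as errors."""
    n = n_per_class * c
    xs = [random.random() for _ in range(n)]
    ys = [i % c for i in range(n)]          # labels carry no information
    errors = 0
    for i in range(n):
        idx = sorted((j for j in range(n) if j != i),
                     key=lambda j: abs(xs[j] - xs[i]))[:k]
        counts = [0] * c
        for j in idx:
            counts[ys[j]] += 1
        t = ys[i]
        # deleted case reduces its own class size by one
        dens = [counts[cl] / (n_per_class - (cl == t)) for cl in range(c)]
        best = max(dens)
        # correct only if the true class is the unique maximum
        errors += not (dens[t] == best and dens.count(best) == 1)
    return errors / n

print(multiclass_cv_error(300, 3, 1))  # near 2/3, as theory predicts
print(multiclass_cv_error(300, 3, 2))  # well below 2/3: optimistic bias
```

For c = 3 and k = 2 the leave-one-out boost means the case is misclassified only when neither neighbour belongs to its own class, giving an expected rate near 4/9 rather than the theoretical 2/3.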
7 A Possible Remedy
The basis of the problem lies in peculiarities of the density estimation procedure in the k-NN algorithm under cross-validation, compounded by the high
frequency of kernel ties. However, it is possible to compensate for this by
making adjustments to the estimates of the prior probabilities. Hence, one solution to the problem is to estimate the prior probabilities from the data after case deletion, rather than fix them from the outset as is done conventionally.² If this course of action is taken then, under cross-validation, if the deleted case belongs to class i, the prior probability for membership of class i is then estimated by:

p̂(w_i) = (n_i − 1) / (n − 1)    (25)
² Note that this adaptation is proposed for the nonparametric algorithm only.
However, the prior probability for membership of any of the other classes, j,
is given by:

p̂(w_j) = n_j / (n − 1)    (26)
The corresponding class-specific densities at x are estimated as:
f̂_i(x) = k_i / (V(n_i − 1))    (27)

and

f̂_j(x) = k_j / (V n_j)    (28)

and the unconditional density is estimated by:

f̂(x) = k / ((n − 1)V)    (29)

Thus the corresponding estimated posterior probabilities become:

p̂(w_i|x) = k_i / k    (30)

and

p̂(w_j|x) = k_j / k    (31)
In these circumstances, if there is a tie for majority kernel membership involving the class to which the deleted case belongs, then there will also be
a tie in the estimated posterior probabilities. It is proposed that a random
classification choice is made between the tied classes in these cases. If this is
done, then the resulting error rate will be essentially unbiased.
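A sketch of the proposed remedy (Python, with an illustrative function name, not SAS code): priors are estimated after deletion as in Equations 25 and 26, so the posteriors reduce to k_i/k, tied kernel counts now yield genuinely tied posteriors, and ties are broken by random choice. On uninformative two-class data the cross-validated error rate then returns to the unbiased value of 1/2, even for k = 2:

```python
import random

random.seed(3)

def remedied_cv_error(n_per_class, k):
    """Leave-one-out k-NN error with priors estimated after deletion:
    p(w_i) = (n_i - 1)/(n - 1) for the deleted case's class and
    n_j/(n - 1) otherwise, so posteriors reduce to k_i/k and genuine
    ties appear.  Ties for the maximum are broken by random choice."""
    n = 2 * n_per_class
    xs = [random.random() for _ in range(n)]
    ys = [i % 2 for i in range(n)]          # two uninformative classes
    errors = 0
    for i in range(n):
        idx = sorted((j for j in range(n) if j != i),
                     key=lambda j: abs(xs[j] - xs[i]))[:k]
        counts = [0, 0]
        for j in idx:
            counts[ys[j]] += 1
        # posteriors are counts[c]/k; pick the max, at random among ties
        best = max(counts)
        pred = random.choice([c for c in (0, 1) if counts[c] == best])
        errors += pred != ys[i]
    return errors / n

print(remedied_cv_error(400, 2))  # near 1/2: the bias is gone
```

For k = 2 the tied kernels (probability about 1/2) are now resolved correctly only half the time, which restores the expected error rate of 1/4 + (1/2)(1/2) = 1/2.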
Of course, the approach just described is appropriate only when the samples have been drawn so as to be representative of the populations that they are intended to represent. If the priors are non-proportional, then this approach needs to be modified. In this situation, the priors must be specified
initially by the user. To begin with, this additional information is ignored and
the computation is performed as previously specified, up to the point where
the posterior probabilities are estimated. However, the posterior probabilities
then need to be adjusted for the lack of proportionality before the assignment
to classes is made. Thus, if π_i is the user-specified prior for class w_i, then the adjustment factor required is:

r_i = π_i / p̂(w_i)    (32)
The appropriately adjusted estimate of the posterior probability is then given
by Equation 8.
8 Conclusion
This is an interesting example of a problem occurring in statistical software
which is caused, not by a computing error, but by a mathematical one. The misbehaviour of the k-NN algorithm under cross-validation is entirely deducible from the mathematics given in the SAS manual (SAS Institute, 1989). In this respect, the nature of the problem is similar to the one reported by White & Liu (1993), in which a stepwise discriminant algorithm is improperly cross-validated. In both cases, the respective problems arose because of lack of
consideration of the effects of combining techniques. In the case of SAS, the
k-NN algorithm works well enough when considered in isolation and so does
the cross-validation technique. The difficulty arises when the two techniques
are used in combination. Apart from SAS, nonparametric discriminant techniques are not available in the commonly used statistical software with which the author is familiar. Hence, problems arising from combining these two techniques do not seem to have been reported elsewhere.
The solution offered in this paper keeps as close as possible to the original philosophies of both cross-validation and k-NN discriminant analysis. It involves estimating the prior probabilities from the data after the case deletion which forms part of the cross-validation procedure. The only possibly contentious aspect is the proposed use of random choice between predicted classes in the case of ties in posterior probabilities. One feature of this approach is that the procedure is non-repeatable. However, there are precedents
for this type of procedure. Tocher (1950) proposed a modification to Fisher's
exact probability test which utilised random choice in order to achieve specified α values for significance testing purposes. Also, the use of approximate
randomization techniques for conducting significance tests has been described
by Edgington (1980), Still & White (1981) and White & Still (1984, 1987).
The proposal here is to make use of random choice to achieve an unbiased
estimate of classification performance when kernel ties are encountered.
Acknowledgements
The author would like to thank Prof. J.B. Copas (from the Department of
Statistics at the University of Warwick) and Prof. A.J. Lawrance and Dr. P.
Davies (both from the School of Mathematics and Statistics, at the University
of Birmingham) for their helpful comments.
References
Cover, T.M. and Hart, P.E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13, 21-27.
Edgington, E.S. (1980). Randomization Tests, New York: Marcel Dekker.
Hand, D.J. (1981). Discrimination and Classification, New York: John Wiley & Sons Ltd.
SAS Institute Inc. (1989). SAS/STAT User's Guide, Version 6, Fourth Edition, Volume 1, Cary, NC, USA: SAS Institute Inc.
Still, A.W. and White, A.P. (1981). The approximate randomization test as an alternative to the F test in analysis of variance. British Journal of Mathematical and Statistical Psychology, 34, 243-252.
Tocher, K.D. (1950). Extension of the Neyman-Pearson theory of tests to discontinuous variates. Biometrika, 37, 130-144.
White, A.P. and Liu, W.Z. (1993). The jackknife with a stepwise discriminant algorithm - a warning to BMDP users. Journal of Applied Statistics, 20, (1), 187-190.
White, A.P. and Still, A.W. (1984). Monte Carlo analysis of variance. In Proceedings of the Sixth Symposium in Computational Statistics (Prague). Vienna: Physica-Verlag.
White, A.P. and Still, A.W. (1987). Monte Carlo randomization tests: A reply to Bradbury. British Journal of Mathematical and Statistical Psychology, 40, 188-191.