Relating ROC and CMC Curves via the Biometric Menagerie
Brian DeCann #1 and Arun Ross ∗2
# Lane Department of Computer Science and Electrical Engineering, West Virginia University
∗ Department of Computer Science and Engineering, Michigan State University
Abstract
In the academic literature, the matching accuracy of a
biometric system is typically quantified through measures
such as the Receiver Operating Characteristic (ROC) curve
and Cumulative Match Characteristic (CMC) curve. The
ROC curve, measuring verification performance, is based
on aggregate statistics of match scores corresponding to
all biometric samples, while the CMC curve, measuring
identification performance, is based on the relative order-
ing of match scores corresponding to each biometric sam-
ple (in closed-set identification). In this study, we determine
whether a set of genuine and impostor match scores gener-
ated from biometric data can be reassigned to virtual iden-
tities, such that the same ROC curve can be accompanied by
multiple CMC curves. The reassignment is accomplished by
modeling the intra- and inter-class relationships between
identities based on the “Doddington Zoo” or “Biometric
Menagerie” phenomenon. The outcome of the study sug-
gests that a single ROC curve can be mapped to multiple
CMC curves in closed-set identification, and that presen-
tation of a CMC curve should be accompanied by a ROC
curve when reporting biometric system performance, in or-
der to better understand the performance of the matcher.
1. Introduction
Biometrics is the science of recognizing humans based
on the physical or behavioral traits of an individual. Ex-
amples of these traits include face, fingerprint, iris, hand
geometry, voice, and gait [11, 12]. A biometric system typ-
ically operates in either verification mode or identification
mode [12]. In verification, the probe biometric data is sub-
mitted along with a claimed identity. To validate the identity
claim, the system compares the probe data strictly with sim-
ilarly labeled identities stored in a reference database. The
output of a verification operation is a match or non-match.
This sort of matching is also referred to as 1:1 matching, as
the probe is compared against a single (or relatively small)
number of reference entities.
In identification, the probe biometric data is not labeled
with any identity. Therefore, in order to determine the iden-
tity of the probe, the system compares the probe against ev-
ery reference identity. The output of an identification op-
eration is a sorted list of identities, ordered from the best
match to the worst match. This type of matching operation
is also referred to as 1:N matching, with N being the size of
the reference database. The identification operation can be
either closed-set or open-set. In closed-set identification,
the identity of the input probe is known to be present in the
reference database. However, in open-set identification, the
identity corresponding to the probe may or may not be in
the reference database.
1.1. Measuring Biometric System Performance
The performance of a biometric matcher, operating in the
verification or identification mode, can be evaluated based
on the match scores generated from test biometric data. In
a set of test data, let $N$ be the number of identities and
$N_G$ be the number of biometric samples (e.g., face images) per identity. The total number of samples is $N_T$ (i.e.,
$N_T = N \cdot N_G$). By comparing each of the $N_T$ samples against
the remaining $N_T - 1$ samples and assuming a symmetric
matcher, a total of $\frac{1}{2} N_T (N_T - 1)$ similarity match scores
can be computed. Define this procedure as an “all-to-all”
match test. In computing the match scores for an “all-to-
all” match test, two classes of match scores are generated:
genuine match scores and impostor match scores. Genuine
match scores denote the scores generated when comparing
two biometric samples belonging to the same individual.
Impostor scores denote the scores generated when matching
two biometric samples belonging to different individuals.
The total numbers of genuine and impostor scores that can
be computed are $N\binom{N_G}{2}$ and $N_G^2\binom{N}{2}$, respectively. Using the generated match scores, a pair of probability density
functions regarding the likelihood of observing a genuine or
impostor score with a certain value can be estimated. Denote the genuine and impostor score distributions as $f_G(s)$ and $f_I(s)$, respectively.
Appeared in Proc. of 6th IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS), (Washington DC, USA), September 2013
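The score counts for an “all-to-all” match test can be checked by brute-force enumeration. The following is a minimal sketch and not part of the original study; the function name `count_scores` and the example sizes are illustrative assumptions. It confirms that the genuine and impostor counts sum to the total number of pairwise comparisons under a symmetric matcher.

```python
from itertools import combinations
from math import comb
from typing import Tuple


def count_scores(n_ids: int, n_per_id: int) -> Tuple[int, int]:
    """Count genuine and impostor comparisons by enumerating sample pairs."""
    # Each sample is tagged with the identity it belongs to.
    samples = [(identity, k) for identity in range(n_ids) for k in range(n_per_id)]
    genuine = impostor = 0
    for (id_a, _), (id_b, _) in combinations(samples, 2):
        if id_a == id_b:
            genuine += 1
        else:
            impostor += 1
    return genuine, impostor


# Hypothetical sizes: N = 5 identities, N_G = 3 samples per identity.
n_ids, n_per_id = 5, 3
n_total = n_ids * n_per_id
genuine, impostor = count_scores(n_ids, n_per_id)
print(genuine == n_ids * comb(n_per_id, 2))                  # N * C(N_G, 2)
print(impostor == n_per_id ** 2 * comb(n_ids, 2))            # N_G^2 * C(N, 2)
print(genuine + impostor == n_total * (n_total - 1) // 2)    # N_T (N_T - 1) / 2
```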
Verification performance is typically evaluated by as-
sessing the false match rate (FMR) and the false non-match
rate (FNMR). The FMR denotes the percentage of impos-
tor scores that exceed a numerical threshold t and are incor-
rectly classified as matches. The FNMR denotes the per-
centage of genuine scores that are below a threshold t and
are incorrectly classified as non-matches. Graphically, the
FMR and FNMR are often expressed by a Receiver Oper-
ating Characteristic (ROC) curve. The ROC curve plots 1-
FNMR versus FMR by varying the threshold t. As such,
we refer to FMR, FNMR, and the ROC curve as aggregate-
based metrics.
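As a hedged illustration of these aggregate-based metrics, the sketch below computes the FMR and FNMR at a threshold and traces ROC points from pooled score lists. The function names, the example scores, and the tie convention (a score counts as a match only when it strictly exceeds $t$) are assumptions made for this example. Note that only the pooled scores enter the computation; which identity produced a score plays no role.

```python
from typing import List, Sequence, Tuple


def error_rates(genuine: Sequence[float], impostor: Sequence[float],
                t: float) -> Tuple[float, float]:
    """Return (FMR, FNMR) at threshold t; a score is a match if it exceeds t."""
    fmr = sum(s > t for s in impostor) / len(impostor)
    fnmr = sum(s <= t for s in genuine) / len(genuine)
    return fmr, fnmr


def roc_points(genuine: Sequence[float],
               impostor: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (FMR, 1 - FNMR) pairs, one per candidate threshold."""
    # A threshold below every score yields the (1, 1) corner of the curve.
    thresholds = [float("-inf")] + sorted(set(genuine) | set(impostor))
    points = []
    for t in thresholds:
        fmr, fnmr = error_rates(genuine, impostor, t)
        points.append((fmr, 1.0 - fnmr))
    return points


# Hypothetical similarity scores in [0, 1].
genuine = [0.9, 0.8, 0.6, 0.4]
impostor = [0.7, 0.3, 0.2, 0.1]
print(error_rates(genuine, impostor, 0.5))
print(roc_points(genuine, impostor))
```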
When evaluating identification performance, a set of
$N_{probe}$ probe samples is compared against a set of $N_{ref}$
reference samples, resulting in $N_{probe}$ sets of match scores,
with each set containing $N_{ref}$ match scores. The match
scores in each set are ordered from highest to lowest. In
open-set identification, these sets are used to assess the false
positive identification rate (FPIR) and true positive identification rate (TPIR) [8]. The FPIR is defined as the proportion of
probes without a corresponding reference identity (i.e., no genuine scores were generated) that generate an impostor score
exceeding a threshold, $t$. The TPIR is defined as the proportion of
probes with a corresponding reference identity
(i.e., genuine scores were generated) for which the correct identity is
observed within the top $K$ ($K \le N$) ranks (i.e., a genuine
score occurs within the top $K$ sorted scores in the set) and
the corresponding match score exceeds $t$.
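The open-set rates can be sketched as follows. This is an illustrative example under simplifying assumptions that are not taken from the source: each mated probe is represented by a single genuine score and a list of impostor scores, each impostor score is assumed to come from a distinct reference identity, and ties are resolved against the genuine score. The function names `fpir` and `tpir` are hypothetical.

```python
from typing import Sequence, Tuple


def fpir(nonmated: Sequence[Sequence[float]], t: float) -> float:
    """Fraction of non-mated probes whose best impostor score exceeds t."""
    return sum(max(scores) > t for scores in nonmated) / len(nonmated)


def tpir(mated: Sequence[Tuple[float, Sequence[float]]], t: float, k: int) -> float:
    """Fraction of mated probes whose genuine score ranks in the top k and exceeds t."""
    hits = 0
    for genuine, impostors in mated:
        # Rank 1 means no impostor score reaches the genuine score.
        rank = 1 + sum(s >= genuine for s in impostors)
        if rank <= k and genuine > t:
            hits += 1
    return hits / len(mated)


# Hypothetical probes: (genuine score, impostor scores) for mated probes,
# and impostor scores only for non-mated probes.
mated = [(0.9, [0.2, 0.4]), (0.5, [0.7, 0.1]), (0.3, [0.6, 0.8])]
nonmated = [[0.2, 0.6], [0.1, 0.3]]
print(fpir(nonmated, 0.5))
print(tpir(mated, 0.4, 2))
```

Setting `t = 0.0` and sweeping `k` in `tpir` recovers the closed-set rank-based accuracy for these positive scores.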
In closed-set identification, the ordered score sets from
the $N_{probe}$ probes are used to estimate the probability that
the correct matching identity pertaining to a probe is observed within the top $K$ ($K \le N$) ranks (i.e., compute the
TPIR with $t = 0$). These probabilities are typically expressed visually through the Cumulative Match Characteristic (CMC) curve [13]. Unlike the ROC curve, which is
generated by looking at genuine and impostor scores all at
once, the data in the CMC curve is obtained based on the
explicit ordering of the $N_G - 1$ genuine and $N_T - N_G$ impostor scores for each biometric probe. As
such, we refer to the CMC curve as a rank-based metric.
An example of both a ROC and CMC curve is presented in
Figure 1.
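The rank-based computation behind a CMC curve can be sketched in a few lines. This is a minimal example rather than the evaluation code used in the study: it assumes one genuine score per probe and one impostor score per competing identity, and it resolves ties pessimistically. The names `rank_of_probe` and `cmc_curve` are illustrative.

```python
from typing import List, Sequence, Tuple


def rank_of_probe(genuine: float, impostors: Sequence[float]) -> int:
    """Rank of the correct identity; ties count against the genuine score."""
    return 1 + sum(s >= genuine for s in impostors)


def cmc_curve(probes: Sequence[Tuple[float, Sequence[float]]],
              max_rank: int) -> List[float]:
    """Cumulative rank-K identification accuracy for K = 1..max_rank."""
    ranks = [rank_of_probe(g, imps) for g, imps in probes]
    return [sum(r <= k for r in ranks) / len(ranks) for k in range(1, max_rank + 1)]


# Hypothetical probes: (genuine score, impostor scores against other identities).
probes = [(0.9, [0.2, 0.4]), (0.5, [0.7, 0.1]), (0.3, [0.6, 0.8])]
print(cmc_curve(probes, 3))
```

Unlike the ROC computation, each probe's scores are ordered separately, so the result depends on how scores are grouped by probe.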
1.2. Closed-set Identification
In general, most biometric identification systems in real-
world applications operate in the open-set mode [8]. How-
ever, in the literature, most performance evaluations are
conducted in the closed-set mode [10, 17, 14]. For the
purposes of this study, we therefore focus on the closed-set
problem, with the intent of pointing out that reporting only
identification accuracy in closed-set evaluations may not be
appropriate.
[Figure 1 plots: “ROC Curve” (Genuine Accept Rate vs. False Match Rate) and “CMC Curve” (Identification Accuracy vs. Rank).]
Figure 1. Example of an ROC curve (top) and CMC curve (bottom).
1.3. Relationship Between the ROC and CMC
If the ROC (aggregate-based) and CMC (rank-based)
curves are estimated from the same set of match scores, it
is not unreasonable to expect some degree of “correlation”
between the two curves. This topic has received some at-
tention in the literature, yielding mixed conclusions.
Phillips et al. [13] first developed a measure for estimating the CMC curve directly from the ROC curve.¹ The
measure was found to consistently underestimate the values
of an experimentally derived CMC [9]. Later, Bolle et al.
[1] argued that the CMC is directly related to the ROC and
can be used to deduce the performance of a 1:1 verification
system. Additionally, Bolle et al. developed a mathematical model for estimating the CMC based on the ROC when
$N_G = 2$. Similarly, Hube [9] also argued in favor of a direct relationship between the ROC and CMC, developing a
different model for estimating the CMC from the ROC.
In the recent past, however, the notion that the ROC
and CMC are directly related has been challenged. Gorod-
nichy first presented an argument stating that aggregate-
based metrics such as the FMR, FNMR, and ROC fail to
appropriately evaluate operational systems characterized by
large sample size and non-static populations, or systems
performing identification at a distance (e.g., systems with-
out a controlled biometric acquisition protocol) [6, 7]. Fur-
ther, Gorodnichy argues that verification systems should be
evaluated (and developed) as 1:N identification systems [7],
stating that measures for identification (i.e., ranked statis-
tics) reveal more information regarding the relationships be-
tween users involved in a biometric system. DeCann and
Ross present a case arguing that it is theoretically possible
to observe a “poor” ROC curve and a “good” CMC curve
(and vice-versa) from the same set of match scores [4].
Based on the conclusions drawn from Bolle et al. [1],
Hube [9], Gorodnichy [6, 7], and DeCann and Ross [4], it
is clear that support in the literature for a direct relationship between the ROC and CMC curves is mixed. In Figure 2, the CMC prediction models of Bolle et al. [1] and
Hube [9] are compared on two different sets of match scores
generated by two different matching algorithms. The first
set of match scores represents gait scores generated using
a gait recognition algorithm [3] on the CASIA B dataset
[19]. Here, $N = 124$ and $N_G = 2$. The second set
of match scores consists of fingerprint (left-index) scores from the
WVU Multimodal Dataset [2]. These scores were generated
using Verifinger,² a commercial fingerprint matcher. Here,
$N = 240$ and $N_G = 2$. Note that the intent of Figure 2 is
not to show the performance of the matchers, but rather to
analyze the ability of the two models to predict the empirically obtained CMC curve. The data in Figure 2 suggests that
the prediction models of Bolle et al. and Hube do not accurately estimate the CMC curve in all cases.
[Figure 2 plots: “Empirical and Predicted CMC Curves (Fingerprint Scores)” and “Empirical and Predicted CMC Curves (Gait Scores)”; axes: Identification Accuracy vs. Rank; legend: Theoretical − Bolle, Theoretical − Hube, $N_G = 2$.]
Figure 2. Output of the CMC prediction models (from ROC
curves) by Bolle et al. [1] and Hube [9] on match scores obtained
from a fingerprint matcher (top), and a gait matcher [3] (bottom).
Note that neither model perfectly predicts the CMC curve for both
sets of match scores.
¹ In this article, the terms “CMC curve” and “ROC curve” will be interchangeably used with the terms “CMC” and “ROC”, respectively.
Although the data in Figure 2 demonstrates that there
may be some degree of “correlation” between the ROC
curve and CMC curve, it is clear that neither model com-
pletely predicted the empirical CMC curve based solely
on the ROC data. One reason this might be the case is
that aggregate-based statistics do not account for the unique
manner in which different individuals contribute towards
the overall performance of a biometric system. In other
words, the genuine and impostor score distributions pertaining to two different individuals can be significantly different. Such differences cannot be captured in aggregate statistics. Visually, this is depicted in Figure 3, where a subset
of three individual genuine and impostor score distributions
are shown using the left-index (L1) match scores from the
WVU Multimodal Dataset [2]. Note that each of the three
genuine and impostor distributions is different from the
others, and that the accumulation of these subsets results in
the aggregate distributions, $f_G(s)$ and $f_I(s)$.
Figure 3. Visual example depicting the contribution of individual
identities towards the overall genuine and impostor match score
distributions, $f_G(s)$ and $f_I(s)$. Note that genuine and impostor
score distributions corresponding to an identity may be distinct
(above) and the aggregation of these individual distributions comprises the global genuine and impostor match score distributions
(below). Here, the individual match score distributions are based
on fingerprint scores computed on the WVU Multimodal Dataset
[2].
² http://www.neurotechnology.com/verifinger.html
Doddington et al. [5] first discussed the notion that different identities contribute differently towards overall biometric system performance by introducing a scheme to
classify identities based on their propensity to generate a
false match or false non-match error in speaker recogni-
tion [5]. This observation is referred to as the Biometric
Menagerie in the literature [18]. If each identity contributes
to the performance of a biometric system differently, it may
be possible that for a single pair of genuine and impos-
tor match score distributions, multiple rank-based statistics
(e.g., CMC curves) can be generated. Further, these differ-
ences in rank-based statistics may result in multiple CMC
curves with large differences in cumulative rank-K accu-
racy.
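A toy example makes this possibility concrete. The sketch below is an illustration only and is not the reassignment method of this study: it takes one fixed pool of genuine and impostor scores, groups them by probe in two different ways, and compares the rank-1 accuracy. The scores, names, and the one-genuine-score-per-probe layout are assumptions, and the symmetry constraints of a real matcher are ignored.

```python
from typing import List, Sequence, Tuple

Probe = Tuple[float, Sequence[float]]  # (genuine score, impostor scores)


def pooled_scores(probes: Sequence[Probe]) -> Tuple[List[float], List[float]]:
    """Pooled genuine and impostor scores; these alone determine the ROC curve."""
    genuine = sorted(g for g, _ in probes)
    impostor = sorted(s for _, imps in probes for s in imps)
    return genuine, impostor


def rank1_accuracy(probes: Sequence[Probe]) -> float:
    """Fraction of probes whose genuine score beats all of their impostor scores."""
    return sum(all(g > s for s in imps) for g, imps in probes) / len(probes)


# The same four scores, grouped by probe in two different ways.
assignment_a = [(0.9, [0.7]), (0.5, [0.3])]
assignment_b = [(0.9, [0.3]), (0.5, [0.7])]

print(pooled_scores(assignment_a) == pooled_scores(assignment_b))  # same aggregate scores
print(rank1_accuracy(assignment_a), rank1_accuracy(assignment_b))  # different rank-1 accuracy
```

Both assignments share the same pooled score sets, and therefore the same ROC curve, yet the second assignment misidentifies one of its two probes at rank 1.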
In an earlier study, DeCann and Ross [4] demonstrated
that a “poor” ROC curve can produce a “good” CMC curve;
however, their analysis did not account for inter- and intra-
class relationships (as manifested through the match scores)
and did not demonstrate the possibility of associating mul-
tiple CMC curves with a single ROC curve. In this study,
our aim is to demonstrate this while accounting for such
relationships (the role of the Biometric Menagerie). By
modeling the inter- and intra-class relationships, it is pos-
sible to demonstrate that a fixed set of match scores can be
reassigned differently among N identities. This reassign-
ment of existing match scores to virtual identities is accom-
plished by utilizing the “Doddington Zoo” user classifica-
tion scheme.
Thus, the contributions of this study are as follows:
• Given a set of real match scores pertaining to multiple
identities, we describe a method by which the scores
can be reassigned to virtual identities such that they
describe different types of intra-class and inter-class
statistics based on the Doddington Zoo phenomenon.
• Based on this reassignment process, we demonstrate
that match scores sharing common aggregate statistics
(ROC) can have multiple ranked statistics (CMCs).
2. Match Score Relationships in a Biometric System
The model for characterizing inter- and intra-class relationships operates by assigning real match scores to virtual
identities. Here, a virtual identity is defined as an identity
whose individual genuine and impostor match score distributions, $f_G^n(s)$ and $f_I^n(s)$ ($n = 1, 2, \ldots, N$), have been
sampled (without replacement) from $s_{Gen}$ (with mean $\mu_{Gen}$
and variance $\sigma^2_{Gen}$) and $s_{Imp}$ (with mean $\mu_{Imp}$ and variance $\sigma^2_{Imp}$). Note that $s_{Gen}$ and $s_{Imp}$ denote sets of genuine and impostor scores generated by a biometric matcher
on a dataset of $N$ identities. For example, $s_{Gen}$ and $s_{Imp}$
may be the fingerprint match scores illustrated in the bottom
of Figure 3.
In defining each virtual identity, an assumption is made
that the range of genuine and impostor scores for each virtual identity is smaller than the range of the overall distributions, $f_G(s)$ and $f_I(s)$. The “tightness” of these ranges
can be defined by the variance in match scores on a per-identity basis. Define these per-identity variances as $\sigma^2_{n-n}$
and $\sigma^2_{n-m}$, where $\sigma^2_{n-n}$ denotes the average variance in
genuine scores for each identity and $\sigma^2_{n-m}$ denotes the average variance in impostor scores for each pair of identities. Here, we remark that the intent of this assumption is
to ensure that created virtual identities do not share the same individual genuine and impostor match score distributions as
the aggregate genuine and impostor score distributions. The
output following the creation of each virtual identity is $S$, a
matrix of size $N_T \times N_T$, wherein each column (or row) of
$S$ contains match score information for one “virtual” biometric sample, matched against $N_G - 1$ samples from the
same “virtual” identity and $N_T - N_G$ samples from the remaining $N - 1$ “virtual” identities. Note that this exercise
preserves the aggregate score statistics; what changes is the
set of match scores pertaining to every identity.
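The structure of $S$ can be visualized with a small layout sketch. This example is an assumption-laden illustration, not the authors' code: samples are assumed to be ordered so that consecutive blocks of $N_G$ samples belong to the same virtual identity, and the diagonal (a sample compared with itself) is left empty. The function name `score_layout` is hypothetical.

```python
from typing import List


def score_layout(n_ids: int, n_per_id: int) -> List[List[str]]:
    """Mark each cell of the N_T x N_T matrix as genuine 'G', impostor 'I', or unused '-'."""
    n_total = n_ids * n_per_id
    layout = []
    for i in range(n_total):
        row = []
        for j in range(n_total):
            if i == j:
                row.append("-")  # a sample is not matched against itself
            elif i // n_per_id == j // n_per_id:
                row.append("G")  # same virtual identity
            else:
                row.append("I")  # different virtual identities
        layout.append(row)
    return layout


# Hypothetical sizes: N = 3 virtual identities, N_G = 2 samples each.
for row in score_layout(3, 2):
    print(" ".join(row))
```

Each row then holds $N_G - 1$ genuine cells and $N_T - N_G$ impostor cells, which is the per-sample score budget described above.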
[Figure 4 plot: “Match Score Distributions and Doddington’s Zoo”; axes: Pr(Score) vs. Score; legend: Genuine Scores, Impostor Scores, Sheep (Genuine Scores), Sheep (Impostor Scores), Lambs (Impostor Scores), Lambs (Genuine Scores), Goats (Impostor Scores), Goats (Genuine Scores).]
Figure 4. Visual illustrating the general concept of the proposed
model for defining inter- and intra-class relationships in match
scores, which creates virtual identities based on the “Doddington’s
Zoo” framework [5].
2.1. Modeling Inter- and Intra-class Variations
Our model for reassigning match scores to virtual
identities is inspired by the “Doddington’s Zoo” user-
classification scheme, which characterizes identities based
on their contribution towards the FMR and FNMR [5]. The
Doddington’s Zoo classification scheme consists of four
classes: Sheep, Goats, Lambs, and Wolves. Sheep are de-
fined as “well behaved” individuals who are easily recog-
nized and do not incorrectly match with others. Goats are
individuals who are intrinsically difficult to recognize and
contribute to false non-match errors. Lambs are individu-
als whose biometric data can often be confused with other
identities, resulting in false match errors. Finally, wolves
are defined as individuals who willfully and successfully
spoof the biometric data of other individuals, increasing the
rate of false match errors.
In terms of match scores, sheep can be loosely char-
acterized as having “high” genuine scores and “low” im-
postor scores. Meanwhile, goats can be loosely character-
ized as having “low” genuine scores. Finally, lambs (and
wolves) can be loosely characterized as having “high” impostor scores. These simple characterizations form the
basis of our model for reassigning scores to virtual identities, which is visually depicted in Figure 4.
The score reassignment model consists of two stages:
initialization and sampling. During initialization, each of the $N$
virtual identities is assigned a label, $\chi_n$ ($n = 1, 2, \ldots, N$),
$\chi_n \in \{\text{Sheep}, \text{Goat}, \text{Lamb}\}$. The number of virtual identities corresponding to each label is pre-specified (see Section 3). Next, each identity is assigned match scores (from
the original score set) based on the properties of a “Sheep”,
“Goat”, or “Lamb”. Sampled match scores are drawn (without replacement) from the original scores $s_{Gen}$ and $s_{Imp}$,
and stored in $S^n_{Gen}$ and $S^n_{Imp}$, which are the reassigned
genuine and impostor scores for the $n$th virtual identity. Finally, a matrix of match scores of size $N_T \times N_T$ is created
(denoted by $S$). Each row in $S$ stores the $N_G - 1$ assigned
genuine scores and $N_T - N_G$ assigned impostor scores for
each sample of a given virtual identity.
Algorithm 1: Reassigning Genuine Scores
Input: Vector $s_{Gen}$, containing the genuine scores.
Vector $\chi$, a set containing the labels of each identity
(e.g., “Sheep”, “Goat”, “Lamb”).
Define: $\delta$, $\epsilon_{Gen}$: Scaling parameters.
Output: Matrix $S$ populated with genuine scores.
\\ begin algorithm
Step 1: For each identity, note the assigned label.
Step 2a: Draw a genuine score (without replacement), $\phi \in s_{Gen}$, from within subset $s_{rng}$, where
$s_{rng} = (\mu_{Gen} + \sigma_{Gen}, 1)$, if $\chi_n$ = Sheep.
$s_{rng} = (0, \mu_{Gen} - \sigma_{Gen})$, if $\chi_n$ = Goat.
$s_{rng} = (0, \mu_{Gen} + \sigma_{Gen})$, if $\chi_n$ = Lamb.
Step 2b: If $s_{rng}$ is a null set, and $s_{rng} = (a, b)$, set $a = \delta \cdot a$, $b = b / \delta$ and repeat Step 2a.
Step 3a: Draw $\binom{N_G}{2} - 1$ scores (without replacement)
from $s_{Gen}$ within $\phi \pm \epsilon_{Gen}$.
Step 3b: If fewer than $\binom{N_G}{2} - 1$ scores can be drawn,
set $\epsilon_{Gen} = \epsilon_{Gen} / \delta$ and repeat Step 3a.
Step 4: Store the sampled genuine scores in $S$.
return $S$
\\ end algorithm
Assignment of genuine scores to each virtual identity is a
relatively straightforward process. For each virtual identity,
$\binom{N_G}{2}$ genuine scores are drawn without replacement³ from
$s_{Gen}$ and stored in $S$. Depending on the label of the virtual identity, a target range from which scores will be sampled is first defined. This range is assumed to be
$(\mu_{Gen} + \sigma_{Gen}, 1)$, $(0, \mu_{Gen} - \sigma_{Gen})$, and $(0, \mu_{Gen} + \sigma_{Gen})$ for “Sheep”, “Goats”, and “Lambs”, respectively. Denote
the subset of genuine scores within this range as $s_{rng}$. If
$s_{rng}$ is a null set, the target range is opened (i.e., increased)
by multiplying (dividing) the lower (upper) bound of $s_{rng}$ by a factor of $\delta$ ($0 < \delta < 1.0$) until $s_{rng}$ contains at least
one element. Next, one element (i.e., score) from $s_{rng}$ is
sampled and stored in $S$. Denote the value of this score as $\phi$.
The remaining $\binom{N_G}{2} - 1$ scores are sampled from the range
$\phi \pm \epsilon_{Gen}$, where $\epsilon_{Gen}$ is a tolerance parameter. As with the
range used to sample $\phi$, if no match scores are found within
$\phi \pm \epsilon_{Gen}$, the range is opened by dividing $\epsilon_{Gen}$ by $\delta$. This
process for sampling genuine scores is summarized in Alg.
1. Note that this sampling method ensures that (a) sampled
genuine scores for each identity are consistent, and (b) the
genuine scores for a “Sheep” are distinct from those of a
“Goat” and a “Lamb” (when possible).
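The genuine-score procedure of Alg. 1 can be sketched in Python as follows. This is a hedged reading of the steps above and not the authors' implementation. Several details are assumptions made for this example: the function and parameter names, the default values of `delta` and `eps_gen`, the use of the population standard deviation for $\sigma_{Gen}$, resetting the tolerance for each identity, the `max_widen` guard against a range that never captures a score, and returning per-identity lists instead of filling the matrix $S$.

```python
import random
from math import comb
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def target_range(label: str, mu: float, sigma: float) -> Tuple[float, float]:
    """Initial sampling range for the first score of an identity (Step 2a)."""
    ranges = {
        "Sheep": (mu + sigma, 1.0),
        "Goat": (0.0, mu - sigma),
        "Lamb": (0.0, mu + sigma),
    }
    return ranges[label]


def reassign_genuine(s_gen: Sequence[float], labels: Sequence[str], n_per_id: int,
                     delta: float = 0.9, eps_gen: float = 0.05,
                     seed: int = 0, max_widen: int = 1000) -> List[List[float]]:
    """Assign C(N_G, 2) genuine scores to each labeled virtual identity."""
    rng = random.Random(seed)
    pool = list(s_gen)                      # scores still available
    mu, sigma = mean(pool), pstdev(pool)    # aggregate statistics of s_gen
    per_identity = comb(n_per_id, 2)
    if len(pool) < per_identity * len(labels):
        raise ValueError("not enough genuine scores for the requested identities")
    assigned = []
    for label in labels:
        # Steps 2a/2b: find candidates in the target range, widening if empty.
        low, high = target_range(label, mu, sigma)
        candidates = [s for s in pool if low <= s <= high]
        for _ in range(max_widen):
            if candidates:
                break
            low, high = delta * low, high / delta
            candidates = [s for s in pool if low <= s <= high]
        if not candidates:
            raise RuntimeError("no score fell inside the widened range")
        phi = rng.choice(candidates)
        pool.remove(phi)
        # Steps 3a/3b: draw the remaining scores near phi, widening the tolerance.
        eps = eps_gen
        near = [s for s in pool if abs(s - phi) <= eps]
        while len(near) < per_identity - 1:
            eps /= delta
            near = [s for s in pool if abs(s - phi) <= eps]
        chosen = rng.sample(near, per_identity - 1)
        for s in chosen:
            pool.remove(s)
        # Step 4: keep the scores of this identity together.
        assigned.append([phi] + chosen)
    return assigned


# Hypothetical input: 18 genuine scores for 6 identities with N_G = 3.
s_gen = [i / 20 for i in range(1, 19)]
labels = ["Sheep", "Sheep", "Goat", "Goat", "Lamb", "Lamb"]
for label, scores in zip(labels, reassign_genuine(s_gen, labels, n_per_id=3)):
    print(label, sorted(scores))
```

Because every score is drawn without replacement from the original pool, the pooled genuine scores, and hence their contribution to the ROC curve, are unchanged when the pool is fully consumed.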
Assignment of impostor scores to each virtual identity
captures the inter-class relationships between identities. As
such, assignment of impostor scores is viewed as being be-
tween pairs of identities (and therefore labels), rather than
for a single identity. This results in six possible scenarios,