• Discriminant functions are trained on a finite set of data
• How much fitting should we do?
• What should the model’s dimension be?
• The model must be used to identify a piece of evidence (data) it was not trained with.
• Accurate estimates of a decision model’s error rates are critical in forensic science applications.
• The simplest is the apparent error rate: the error rate on the training set
• A lousy estimate, but better than nothing
Decision Model Validation
• Cross-validation: systematically hold out chunks of the data set for testing
• Most common: hold-one-out (HOO)
1. Omit a data vector from X
2. Train the model on the remaining data
3. Classify the held-out observation
4. Repeat for all data vectors
• Simple, but gives a good estimate
• Lots of literature to back up its efficacy
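A minimal hold-one-out cross-validation sketch in R (an illustration only, not the course scripts; LDA from MASS on the built-in iris data stands in for whatever discriminant model and data are actually in play):

```r
# Hold-one-out cross-validation sketch: LDA on iris as a stand-in decision model
library(MASS)

X   <- iris[, 1:4]
grp <- iris$Species

pred <- factor(rep(NA, nrow(X)), levels = levels(grp))
for (i in seq_len(nrow(X))) {
  fit     <- lda(X[-i, ], grouping = grp[-i])            # train without observation i
  pred[i] <- predict(fit, X[i, , drop = FALSE])$class    # classify the held-out observation
}

fit_all  <- lda(X, grouping = grp)
apparent <- mean(predict(fit_all, X)$class != grp)       # apparent (training-set) error rate
hoo      <- mean(pred != grp)                            # hold-one-out error estimate
c(apparent = apparent, hold_one_out = hoo)
```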
Decision Model Validation
• C-fold cross-validation: hold out data chunks of size c
• Can become time-consuming
• Typically performance not much better than simple HOO-CV
• Caution! If the decision model is sensitive to group sizes (e.g., CVA), cross-validation may not work well
• Should have, at the very least, 5 replicates per group
Decision Model Validation
DON’T ARGUE WITH ME!!!!!!!!!!
• Bootstrap: make up data sets from randomly selected observation vectors (with replacement)
• A bootstrap sample is the same size as X
• You’ll get repeats
1. Train a decision model with the bootstrapped set
• The model should not be sensitive to repeated observations!
• CVA is out!!!!
2. Test the model with the original data set X and compute the error rate
Decision Model Validation
Decision rules built with the bootstrapped data set
3. Test the model with the bootstrapped data set X* and compute the error rate
4. Repeat steps 1-3, B times
• B should be at least 200
5. Compute the average “optimism”: the mean, over the B replicates, of (error on the original X) minus (error on the bootstrapped X*)
6. Compute the “refined” bootstrap error rate: the apparent error rate plus the average optimism (see the sketch below)
Decision Model Validation
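A hedged sketch of the refined (optimism-corrected) bootstrap error rate following steps 1-6 above; an e1071 SVM on the built-in iris data is a stand-in for the actual decision model and data (a model tolerant of repeated observations, since CVA is out):

```r
# "Refined" bootstrap error rate sketch (illustrative stand-in model and data)
library(e1071)

X   <- iris[, 1:4]
grp <- iris$Species
n   <- nrow(X)
B   <- 200                                    # slides recommend B of at least 200

fit_all  <- svm(X, grp)
apparent <- mean(predict(fit_all, X) != grp)  # apparent error rate

optimism <- numeric(B)
for (b in seq_len(B)) {
  idx   <- sample(n, replace = TRUE)          # bootstrap sample, same size as X
  fit_b <- svm(X[idx, ], grp[idx])                          # step 1: train on X*
  err_X     <- mean(predict(fit_b, X) != grp)               # step 2: error on original X
  err_Xstar <- mean(predict(fit_b, X[idx, ]) != grp[idx])   # step 3: error on X*
  optimism[b] <- err_X - err_Xstar            # steps 4-5: per-replicate "optimism"
}

refined <- apparent + mean(optimism)          # step 6: refined bootstrap error rate
refined
```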
*Now Exercise: Explore some data sets with:
boostrap.R cv_boot_testset.R
• t is a test for association between:
• x_unk, data from an unknown
• Could be from a crime scene
• Could be from a suspect
• A group of data from a source
• Could be from a suspect
• Could be from a crime scene
• ANY decision rule output by a pattern recognition program can be considered as a test for association
Probabilities
• Codes:
• t+/-: test indicates inclusion/exclusion
• S+/- : Evidence is/is not associated with a source
• Four probabilities are of interest:
• Probability that a test yields a positive association given that there is truly an association between evidence and a source:
• TPR is very important for forensic applications!
Probabilities
P(t+ | S+) = probability of a true positive (TP) = true positive rate (TPR) = probability of a true inclusion = sensitivity
• Probability that a test yields a positive association given that there is truly no association between evidence and a source:
• FPR is very important for forensic applications!
• In traditional hypothesis testing, the FPR is sometimes called the Type I error rate (α, the significance level)
• 1-FPR = specificity (TNR): rate at which true exclusions are correctly excluded
Probabilities
P(t+ | S-) = probability of a false positive (FP) = false positive rate (FPR) = probability of a false inclusion
• Probability that a test yields a negative association given that there is truly no association between evidence and a source:
• TNR estimates may be the most useful (and trustworthy) numbers that come out of applying probability to physical evidence...
Probabilities
P(t- | S-) = probability of a true negative (TN) = true negative rate (TNR) = probability of a true exclusion = specificity
• Probability that a test yields a negative association given that there is truly an association between evidence and a source:
• In traditional hypothesis testing, the FNR is sometimes called the Type II error rate (β)
• 1-FNR = sensitivity (TPR): rate at which true inclusions are correctly included
Probabilities
P(t- | S+) = probability of a false negative (FN) = false negative rate (FNR) = probability of a false exclusion
• Summary:
• 1 - β (= 1 - FNR, the TPR) is called the test’s power
• Remember, these are all only ESTIMATES!
                                     An association              An association truly
                                     truly exists, S+            does not exist, S-
Test indicates an inclusion, t+      True Positive Rate          False Positive Rate (Type I error)
Test indicates an exclusion, t-      False Negative Rate         True Negative Rate
                                     (Type II error)
Probabilities
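As a quick illustration, the four rates can be computed from a 2x2 table of test outcomes against ground truth; the counts below are made-up placeholders, not results from any study:

```r
# Four rates from a hypothetical 2x2 confusion table (placeholder counts)
TP <- 45; FN <- 5     # truly associated (S+) cases
FP <- 3;  TN <- 47    # truly not associated (S-) cases

TPR <- TP / (TP + FN)   # sensitivity, P(t+ | S+)
FNR <- FN / (TP + FN)   # P(t- | S+), Type II error rate
FPR <- FP / (FP + TN)   # P(t+ | S-), Type I error rate
TNR <- TN / (FP + TN)   # specificity, P(t- | S-)

c(TPR = TPR, FPR = FPR, TNR = TNR, FNR = FNR)
```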
Probabilities
• Much more difficult to objectively estimate, but of more interest in legal applications:
• Probability that an association exists given that a test indicates an association: P(S+ | t+)
• Probability that no association exists given that a test indicates an association: P(S- | t+)
Bayes’ Rule Again
P(S+ | t+) = P(t+ | S+) P(S+) / [ P(t+ | S+) P(S+) + P(t+ | S-) P(S-) ]
• P(S+) is the prior probability that there is an association between evidence and a source…
• P(S+ | t+) is also called the positive predictive value (PV+)
Probabilities
• Dividing these two posterior probabilities gives the “famous” (positive) likelihood ratio LR+
• LR+ can be expressed as: LR+ = P(t+ | S+) / P(t+ | S-) = TPR / FPR
• Odds form of Bayes’ Rule: P(S+ | t+) / P(S- | t+) = LR+ × [ P(S+) / P(S-) ]
• That is: (posterior odds in favor of association given the test indicates inclusion) = (likelihood ratio) × (prior odds in favor of association)
Probabilities
• LR+ interpretations:
• Ratio of the probability that the test indicates inclusion given a true association vs. the probability that the test indicates inclusion given a true exclusion
• LR+ serves as a multiplier for the prior odds in favor of an association
• LR+ gives the relative change in the odds of same-source origin produced by a positive test result
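A small numerical sketch of LR+ and the odds form of Bayes’ Rule; the TPR, FPR, and prior odds below are illustrative assumptions only:

```r
# LR+ and the odds form of Bayes' Rule (illustrative numbers, not casework values)
TPR <- 0.90           # assumed P(t+ | S+)
FPR <- 0.02           # assumed P(t+ | S-)
LR_plus <- TPR / FPR  # positive likelihood ratio

prior_odds     <- 0.10 / 0.90               # assumed prior odds of association, P(S+)/P(S-)
posterior_odds <- LR_plus * prior_odds      # P(S+ | t+) / P(S- | t+)
PPV <- posterior_odds / (1 + posterior_odds)  # positive predictive value, P(S+ | t+)
c(LR_plus = LR_plus, posterior_odds = posterior_odds, PPV = PPV)
```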
Probabilities
• Note: in building a decision model, TPR, TNR, FPR, FNR, and LR+ are computed on a per-group basis
• There is no overall TPR, TNR, FPR, FNR, or LR+!
• Value comes into forensic science if one of the groups is a known suspect or crime-scene group, AND:
• Unknowns are tested against the suspect/crime-scene group
• Confidence measures in the results are the TPR, FPR, and LR+ computed on the suspect/crime-scene group
Probabilities
• How can these be used/stated in court?
Striation pattern found at a crime scene (CS):
• Same class characteristics as the CS
• Subclass characteristics eliminated from the data
Many striation patterns generated by a tool associated with a suspect (SP):
• Include the SP set in a database (DB) and compute/test a discrimination model
• Get TP, FP, and LR+ for SP with respect to the DB
• I.D. the CS pattern with the discrimination model
• Result is an inclusion or exclusion
• TP, FP, and LR+ for SP apply to the result
• State in court along with the size of the DB
Receiver Operating Characteristic
• In general, a classification rule t applied to a data point x yields a score, t(x)
• For two groups, consider the score distributions
• The two groups can be right vs. wrong, positive vs. negative, association vs. no association, one vs. rest, one vs. one, etc.
[Figure: overlapping score distributions for the two groups, separated by an adjustable cut-off score]
Receiver Operating Characteristic
• The cut-off score is adjustable
• Different choices give different TPR and FPR
• The cut-off is related to the prior
• Changing the cut-off traces out a curve on a graph of TPR vs. FPR: the ROC curve
[Figure: ROC curve, TPR (0 to 1) vs. FPR (0 to 1), with the “chance” diagonal; AUC = Mann-Whitney U]
*Now Exercise: Source roc_utilities.R
play with roc.R for PLS-DA
Receiver Operating Characteristic
• “Chance” diagonal: if your ROC curve looks like this (lies along the diagonal)
• The score distributions for the two groups are right on top of each other
• There is a 50/50 chance of assigning an unknown to the correct group
• Area under the curve (AUC): probability that a randomly chosen member of one group scores higher than a randomly chosen member of the other (the Mann-Whitney interpretation)
• AUC range = 0 to 1 (really 0.5 to 1)*
• Gini coefficient: degree of departure of the ROC curve from the chance diagonal = 2·AUC - 1
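A hedged R sketch that traces an ROC curve by sweeping the cut-off score and computes AUC via its Mann-Whitney interpretation; the two score samples are simulated placeholders:

```r
# ROC curve and AUC from simulated placeholder scores
set.seed(1)
score_pos <- rnorm(100, mean = 1)   # scores for the "association" group
score_neg <- rnorm(100, mean = 0)   # scores for the "no association" group

thr <- sort(unique(c(-Inf, score_pos, score_neg, Inf)), decreasing = TRUE)
TPR <- sapply(thr, function(th) mean(score_pos >= th))
FPR <- sapply(thr, function(th) mean(score_neg >= th))
plot(FPR, TPR, type = "l", main = "ROC curve")
abline(0, 1, lty = 2)                # "chance" diagonal

# AUC = P(random positive scores higher than random negative) = Mann-Whitney U / (n1*n2)
AUC  <- mean(outer(score_pos, score_neg, ">") + 0.5 * outer(score_pos, score_neg, "=="))
gini <- 2 * AUC - 1                  # Gini coefficient from the slide
c(AUC = AUC, Gini = gini)
```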
How good of a “match” is it? Conformal Prediction [Vovk]
• Data should be IID, but that’s it
[Figure: cumulative number of errors vs. sequence of unknown observation vectors; reference lines of slope 0.2 (80% confidence, 20% error), slope 0.05 (95% confidence, 5% error), and slope 0.01 (99% confidence, 1% error)]
• Can give a judge or jury an easy-to-understand measure of the reliability of a classification result
• This is an orthodox “frequentist” approach
• Roots in Algorithmic Information Theory
• Confidence on a scale of 0%-100%
• Testable claim: Long run I.D. error-rate should be the chosen significance level
How Conformal Prediction works for us
• Given a “bag” of observations with known identities and one observation of unknown identity [Vovk]
• Estimate how “wrong” labelings are for each observation with a non-conformity score (“wrong-iness”)
• Looking at the “wrong-iness” of known observations in the bag:
• Does labeling i for the unknown have an unusual amount of “wrong-iness”?
• For a 95%-CPT (PCA-SVM), confidence intervals will not contain the correct I.D. 5% of the time in the long run
• Straightforward validation/explanation picture for court
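A toy transductive conformal prediction sketch in R; the nonconformity score here (distance to the candidate group’s mean) is a simple stand-in for a real PCA-SVM-based score, and iris is placeholder data:

```r
# Conformal p-value for one candidate label (toy nonconformity score)
conformal_pvalue <- function(X_known, y_known, x_new, label) {
  X_aug <- rbind(X_known, x_new)            # add the unknown to the "bag"
  y_aug <- c(as.character(y_known), label)  # tentatively give it this label

  # Nonconformity ("wrong-iness"): distance to the mean of the assigned group
  alpha <- sapply(seq_len(nrow(X_aug)), function(i) {
    mu <- colMeans(X_aug[y_aug == y_aug[i], , drop = FALSE])
    sqrt(sum((X_aug[i, ] - mu)^2))
  })

  # p-value: how unusual is the unknown's nonconformity among all of them?
  mean(alpha >= alpha[length(alpha)])
}

# Usage: include a label in the 95% prediction set if its p-value exceeds 0.05
X <- as.matrix(iris[, 1:4]); y <- iris$Species
p_vals <- sapply(levels(y), function(lab) conformal_pvalue(X[-1, ], y[-1], X[1, ], lab))
p_vals
names(p_vals)[p_vals > 0.05]   # the 95%-confidence I.D. interval (set of labels)
```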
Conformal Prediction Drawbacks
• CPT is an interval method
• Can (and does) produce multi-label I.D. intervals
• A “correct” I.D. is an interval with all labels
• Doesn’t happen often in practice…
• Empty intervals count as “errors”
• Well…, what if the “correct” answer isn’t in the database?
• An “open-set” problem, which Champod, Gantz, and Saunders have pointed out
• Must be run in “on-line” mode for LRG
• In practice, we have noticed it can be run in “off-line” mode after 500+ I.D. attempts
• An I.D. is output for each questioned toolmark
• This is a computer “match”
• What’s the probability it is truly not a “match”?
• Similar problem in genomics for detecting disease from microarray data
• They use data and Bayes’ theorem to get an estimate
• “No disease” in genomics corresponds to “not a true match” for toolmarks
How good of a “match” is it? Empirical Bayes’ [Efron]
Random Match Probability
[Figure: distribution of nDs from fragments at the crime scene vs. distribution of nDs from fragments in the population; the interval containing 99% of the crime-scene nDs defines the RMP “window”, and the shaded area is the probability that a random fragment from the population would be I.D.’d as a crime-scene fragment]
Random Match Probability
• Example: RMP ≈ (0.26 + 0.14 + 0.06) × 100% = 46%
[Figure: distribution of nDs from glass fragments at the crime scene vs. distribution of nDs from glass fragments in the population]
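A hedged sketch of the RMP “window” idea with simulated one-dimensional scores standing in for the nD values; the distributions and the 99% window are placeholders:

```r
# RMP "window" sketch with simulated placeholder scores
set.seed(2)
nD_cs  <- rnorm(500,  mean = 1.0, sd = 0.3)  # placeholder crime-scene fragment scores
nD_pop <- rnorm(5000, mean = 0.2, sd = 0.5)  # placeholder population fragment scores

window <- quantile(nD_cs, c(0.005, 0.995))   # interval holding 99% of crime-scene scores
RMP <- mean(nD_pop >= window[1] & nD_pop <= window[2])
RMP  # estimated probability a random population fragment would be ID'd as a CS fragment
```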
Random Match Probability
• Problems with Random Match Probability computations:
• To get reliable probabilities, we need accurate probability density functions (pdfs)
• Higher-dimensional pdfs require exponential amounts of data to fit accurately (curse of dimensionality)
• Overlap in higher dimensions??
• How wide should RMP “windows” be?
• Use distributions for univariate “similarity” measures?
• Different measures correspond to different RMPs!
• No natural choice!
Empirical Bayes’
• We use Efron’s machinery for the “empirical Bayes’ two-groups model” [Efron]
• Surprisingly simple!
• Use binned data to do a Poisson regression
• Some notation:
• S-, truly no association, Null hypothesis
• S+, truly an association, Non-null hypothesis
• z, a score derived from a machine learning task to I.D. an unknown pattern with a group
• z is a Gaussian random variate under the Null
Empirical Bayes’
• From Bayes’ Theorem we can get [Efron]:
Pr(S- | z) = Pr(S-) f0(z) / f(z)
• This is the estimated probability of not a true “match”, given the algorithm’s output z-score associated with its “match”
• Names: posterior error probability (PEP) [Kall]; local false discovery rate (lfdr) [Efron]
• Suggested interpretation for casework:
• Pr(S+ | z) = 1 - Pr(S- | z) = estimated “believability” of a machine-made association
• We agree with Gelman and Shalizi [Gelman]:
“…posterior model probabilities …[are]… useful as tools for prediction and for understanding structure in data, as long as these probabilities are not taken too seriously.”
Empirical Bayes’
• Bootstrap procedure to get an estimate of the KNM distribution of “Platt scores” [Platt, e1071]
• Use a “Training” set
• Use this to get p-values/z-values on a “Validation” set
• Inspired by Storey and Tibshirani’s Null estimation method [Storey]
[Figure: histogram of z-scores fit by Efron’s method, yielding the “mixture” density f(z), the z-density given KNM (the Null, which should be Gaussian), and an estimate of the prior for KNM. What’s the point?? We can test the fits to the Null density and the Null prior!]
• Use SVM to get KM and KNM “Platt-score” distributions
Bootstrap algorithm to estimate the KNM distribution (the Null):
1. Draw a bootstrap sample
2. Train an SVM on it
3. Get Platt scores on the whole set
4. Toss the KM Platt scores
5. Toss the observations in the bootstrap sample
6. Randomly select a KNM score from each remaining observation and collect them
7. Repeat
[Figure: estimate of the log KNM Platt-score distribution]
• Fitting log(KNM) scores to a parametric form helps us avoid a plethora of 0 p-values for the KM validation set
• “Problem” p-values now
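A rough R sketch of one reading of the bootstrap procedure above for building a KNM Platt-score distribution, using e1071’s probability (Platt-scaled) SVM outputs and iris as placeholder data; the real toolmark pipeline and groups will differ:

```r
# Bootstrap estimate of the KNM (known non-match) Platt-score distribution (sketch)
library(e1071)

X <- as.matrix(iris[, 1:4]); y <- iris$Species   # placeholder data and "groups"
n <- nrow(X)
knm_scores <- numeric(0)

for (b in 1:200) {                                     # Repeat
  idx <- sample(n, replace = TRUE)                     # bootstrap sample
  fit <- svm(X[idx, ], y[idx], probability = TRUE)     # train SVM on the bootstrap sample
  pr  <- attr(predict(fit, X, probability = TRUE), "probabilities")  # Platt scores, whole set
  oob <- setdiff(seq_len(n), idx)                      # toss obs. in the bootstrap sample
  for (i in oob) {
    knm <- pr[i, colnames(pr) != as.character(y[i])]   # toss the KM score, keep KNM scores
    knm_scores <- c(knm_scores, sample(knm, 1))        # randomly select one KNM score
  }
}

hist(log(knm_scores), breaks = 60, main = "log KNM Platt-score distribution")
```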
Validation Set
• Sample to get a set of IID simulated log(KNM) scores (“reusing the data” less, too…??)
• Lump these together as the “validation set”
• Compute p-values for the validation set from the fitted Null
• Check assumptions on the Null: Null p-values should be roughly uniform, and Null z-values close to N(0,1)
• Use locfdr [locfdr]: fit the classic Poisson regression for f(z)
• Use modified locfdr with JAGS [JAGS, Plummer] or Stan [Stan]: fit Bayesian hierarchical Poisson regressions
[Figure: fitted local-fdr models vs. z]
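A minimal sketch of fitting the two-groups model with Efron’s locfdr package on placeholder z-values (mostly null, with a small non-null bump); locfdr performs the Poisson-regression fit of f(z) and returns the lfdr estimates:

```r
# Local false discovery rate fit with locfdr on simulated placeholder z-values
library(locfdr)

set.seed(3)
z <- c(rnorm(950), rnorm(50, mean = 3))   # mostly null (KNM-like) z-values plus a non-null bump

fit <- locfdr(z, nulltype = 1)   # fit f(z) and the empirical null
head(fit$fdr)                    # lfdr(z) = estimated Pr(S- | z) for each input z
fit$fp0                          # null parameter estimates (delta, sigma, p0)
```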
Posterior Association Probability: Believability Curve
[Figure: 12D PCA-SVM locfdr fit for Glock primer shear patterns, shown with ±2 standard errors; fits compared: Poisson (Efron) on test set, Bayesian Poisson on test set, Bayesian Poisson with intercept on test set, and Bayesian over-dispersed Poisson with intercept on test set]
Bayes Factors/Likelihood Ratios
• In the “Forensic Bayesian Framework”, the Likelihood Ratio is the measure of the weight of evidence
• LRs are called Bayes Factors by most statisticians
• LRs give the measure of support the “evidence” lends to the “prosecution hypothesis” vs. the “defense hypothesis”
• From Bayes’ Theorem, the posterior odds equal the LR times the prior odds: Pr(Hp | E) / Pr(Hd | E) = [ Pr(E | Hp) / Pr(E | Hd) ] × [ Pr(Hp) / Pr(Hd) ]
Bayes Factors/Likelihood Ratios
• Once the “fits” for the Empirical Bayes method are obtained, it is easy to compute the corresponding likelihood ratios
o Using the identity Pr(S+ | z) = 1 - Pr(S- | z), the likelihood ratio can be computed as:
LR(z) = [ Pr(S+ | z) / Pr(S- | z) ] × [ Pr(S-) / Pr(S+) ] = [ (1 - lfdr(z)) / lfdr(z) ] × [ Pr(S-) / Pr(S+) ]
Bayes Factors/Likelihood Ratios
• Using the fitted posteriors and priors we can obtain the likelihood ratios [Tippett, Ramos]
[Figure: distributions of known-match and known non-match LR values]
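A one-line sketch of the identity above: converting a fitted lfdr value and an assumed prior Pr(S-) into a likelihood ratio (the numbers are illustrative only):

```r
# LR from a fitted lfdr and an assumed prior Pr(S-)
lr_from_lfdr <- function(lfdr, p0) {
  ((1 - lfdr) / lfdr) * (p0 / (1 - p0))   # posterior odds divided by prior odds
}

lr_from_lfdr(lfdr = 0.05, p0 = 0.9)   # illustrative inputs, returns LR = 171
```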
Empirical Bayes’: Some Things That Bother Me
• Need a lot of z-scores
• Big data sets in forensic science largely don’t exist
• z-scores should be fairly independent
• Especially necessary for interval estimates around the lfdr [Efron]
• Requires “binning” into an arbitrary number of intervals
• Also suffers from the “open-set” problem
• Interpretation of the prior probability for this application
• Should Pr(S-) be 1 or very close to it? How close?