Confidence Metrics for Classification by Deep Neural Networks
March 15, 2019
Adam Oberman, with Chris Finlay
Math and Stats, McGill
Challenges for deep learning
“It is not clear that the existing AI paradigm is immediately amenable to any sort of software engineering validation and verification. This is a serious issue, and is a potential roadblock to DoD’s use of these modern AI systems, especially when considering the liability and accountability of using AI.”
— JASON report
Fact: the output “probabilities” of neural networks for image classification are not the probabilities that the classification is correct.
Misinterpretation: the output probabilities are not meaningful predictors of classification error.
The fact is correct: unlike other classifiers, e.g. Naive Bayes, there is no interpretation of the network's output as a probability.
In fact, we can still extract useful information from the output, combined with the statistics of the loss on the test set.
Fact: if a neural network generalizes well, and gives correct classifications 95% of the time (say), then we can estimate the probability that a prediction is correct based on p_max.
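As a concrete sketch of how p_max can be turned into an estimated probability correct (the quantile binning scheme and bin count below are illustrative assumptions, not the talk's exact procedure):

```python
import numpy as np

def accuracy_by_pmax(p_max, correct, n_bins=5):
    """Bin test-set predictions by p_max (the largest softmax output)
    and record the empirical accuracy in each bin.  For a network that
    generalizes well, a new prediction's p_max then gives an estimate
    of the probability it is correct."""
    edges = np.quantile(p_max, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, p_max, side="right") - 1,
                  0, n_bins - 1)
    return np.array([correct[idx == b].mean() for b in range(n_bins)])
```

Quantile bins (rather than equal-width bins) keep every bin populated, so each empirical accuracy is estimated from roughly the same number of test points.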
How?
Suppose, for the sake of argument, that we are given, along with a prediction, the value of the loss (but not the correct label).
Then we would have an imperfect, but much better, idea of the probability that the prediction is correct:
• For small values of the loss (< 0.8), always correct.
• For large values (> 3), always incorrect.
• For intermediate values, make a histogram, with the probability correct in each bin.
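The rule above can be sketched as a small lookup function (the thresholds 0.8 and 3 are the ones quoted on the slide; the number of intermediate bins and the fallback value for an empty bin are assumptions):

```python
import numpy as np

def prob_correct_from_loss(loss, correct, low=0.8, high=3.0, n_bins=10):
    """Estimate P(prediction correct | loss) from test-set statistics:
    below `low` predictions were always correct, above `high` always
    incorrect, and in between we take the empirical accuracy of the
    histogram bin the loss falls into."""
    edges = np.linspace(low, high, n_bins + 1)

    def predict(v):
        if v < low:
            return 1.0
        if v > high:
            return 0.0
        b = min(int((v - low) / (high - low) * n_bins), n_bins - 1)
        in_bin = (loss >= edges[b]) & (loss < edges[b + 1])
        # fallback of 0.5 for an empty bin is an arbitrary assumption
        return float(correct[in_bin].mean()) if in_bin.any() else 0.5

    return predict
```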
[Figure: histogram over the value of the loss]
We have easy-to-compute metrics which are almost as good as the loss.
How to measure predictive value?
Before presenting the results, we need to explain a way to measure the quality of a metric. This will allow us to compare the predictive value of different metrics on different data sets.
We will show that we can easily compute metrics which give greater than 10X improved confidence.
Moreover, we can define simplified “green, yellow, red” zones, where the confidence is very high, moderate, and very low, respectively.
In the last case, the value is increased confidence that the prediction is an error.
Measuring the odds
Bayes Factor measures the value of information
Whether to bet for or against depends on the new odds.
The Bayes Factor tells us how much our expected winnings increase, if we know the value of the test (and bet correctly).
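The odds arithmetic behind the Bayes factor can be made concrete (the numbers in the comment are illustrative, not from the talk):

```python
def bayes_factor(prior_p, posterior_p):
    """Ratio of posterior odds to prior odds: how much a piece of
    information multiplies the odds of the event."""
    prior_odds = prior_p / (1 - prior_p)
    posterior_odds = posterior_p / (1 - posterior_p)
    return posterior_odds / prior_odds

# Example: a classifier correct 95% of the time has prior odds 19:1.
# If, in some metric bin, the empirical accuracy is 99.8%, the
# posterior odds are 499:1 -- a Bayes factor of roughly 26.
```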
Predicting preference for Voice vs Text
Histogram for age:
Value of age information
What is the expected value of knowing the age (without knowing in advance the range)? Expected Bayes Ratio.
Less valuable information: where they live
So compared to the expected value of knowing the age (22), the expected value of knowing the location is low (1.5).
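One plausible reading of the "expected value" of a feature is the expectation, over the feature's bins, of the Bayes factor each bin produces; the talk's precise definition may differ. A sketch under that assumption:

```python
import numpy as np

def expected_bayes_ratio(bin_probs, bin_posteriors, prior_p):
    """Average Bayes factor over the bins of a feature, weighted by
    how often each bin occurs.  `bin_probs` are P(bin), and
    `bin_posteriors` are P(event | bin)."""
    prior_odds = prior_p / (1 - prior_p)
    factors = [(q / (1 - q)) / prior_odds for q in bin_posteriors]
    return float(np.dot(bin_probs, factors))
```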
Value of Model Entropy
We can still estimate when the prediction is correct:
• For small values of the function (< .001), always correct.
• For large values (> 1), correct less than 20% of the time.
• For intermediate values, make a histogram, with the probability correct in each bin.
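Model entropy of the softmax output is one such cheap metric; a minimal numerically stable implementation:

```python
import numpy as np

def softmax_entropy(logits):
    """Entropy of the softmax distribution over classes: near 0 for a
    confident prediction, log(n_classes) for a maximally uncertain
    one."""
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilize exp
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1)
```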
Bayes Ratio for Loss, Entropy, U1, U2
Bayes ratio over 20 equal quantile bins for: loss, entropy, U1, U5. The Bayes ratio is very large in the first 10 and last 3 bins. On the other hand, bin 15 for U5 provides very little value.
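The per-bin computation can be sketched as follows (20 quantile bins as on the slide; the clipping constant is an assumption to keep the odds finite in bins that are always right or always wrong):

```python
import numpy as np

def bayes_ratio_per_bin(metric, correct, n_bins=20):
    """Split a confidence metric into equal-quantile bins and return
    the Bayes ratio (posterior odds / prior odds of being correct)
    contributed by each bin."""
    prior = correct.mean()
    prior_odds = prior / (1 - prior)
    edges = np.quantile(metric, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, metric, side="right") - 1,
                  0, n_bins - 1)
    ratios = []
    for b in range(n_bins):
        q = correct[idx == b].mean()
        q = np.clip(q, 1e-6, 1 - 1e-6)   # avoid infinite odds
        ratios.append((q / (1 - q)) / prior_odds)
    return np.array(ratios)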
Expected Bayes Ratio Tables
High Cost
Comparison with other works
• Another confidence metric comes from Bayesian Dropout (Gal and Ghahramani). In this case, run inference 30 times with different random dropout masks; confidence is based on the model variance.
• Less accurate.
• Very costly (30X the inference cost).
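For comparison, the Bayesian-dropout baseline can be sketched as follows (the `model(x, training=True)` signature, which applies a fresh random dropout mask on each call, is an assumption about the interface; 30 passes as on the slide):

```python
import numpy as np

def mc_dropout_confidence(model, x, n_samples=30):
    """Monte-Carlo dropout: run several stochastic forward passes and
    use the variance of the predictions as an (inverse) confidence
    measure, alongside the mean prediction."""
    preds = np.stack([model(x, training=True) for _ in range(n_samples)])
    return preds.mean(axis=0), preds.var(axis=0)
```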
• One can also train an auxiliary neural network to predict whether another network is correct or wrong.
• We have not compared accuracy for this method.
• But it is costly (1X extra inference).
• Compared to these methods, our method is less costly, and we can also provide theoretical guarantees that it works (for small values of U, under the assumptions of generalization).
Conclusions so far
• A confidence metric which is easy to compute (basically free) and which gives increased confidence in the probability that a prediction is correct.
• This measure can be adapted to Top 5, and can be used to detect increased probability of errors as well.
• On the larger dataset, ImageNet, the metric performs better than on CIFAR-10.
Detecting Incorrect Labels
• When we evaluate the uncertainty metric, we find some outliers.
• These turn out to be ambiguous images in the test set.
What about off-manifold data?
Conclusions
• This tool for giving confidence (or uncertainty) to the classifications of neural networks has immediate applications to fields where confidence is valuable.
• Can also be used for:
• detecting errors in labels
• detecting off-manifold data, or adversarially perturbed data