Lecture 3: Bayesian Decision Theory
Dr. Chengjiang Long, Computer Vision Researcher at Kitware Inc.
Adjunct Professor at RPI. Email: [email protected]
Recap of the Previous Lecture
Outline
• What's Bayesian Decision Theory?
• A More General Theory
• Discriminant Function and Decision Boundary
• Multivariate Gaussian Density
Bayesian Decision Theory
• Design classifiers to make decisions subject to minimizing an expected "risk".
• The simplest risk is the classification error (i.e., assuming that misclassification costs are equal).
• When misclassification costs are not equal, the risk can include the cost associated with different misclassifications.
Terminology
• State of nature ω (class label):
  • e.g., ω1 for sea bass, ω2 for salmon
• Probabilities P(ω1) and P(ω2) (priors):
  • e.g., prior knowledge of how likely it is to get a sea bass or a salmon
• Probability density function p(x) (evidence):
  • e.g., how frequently we will measure a pattern with feature value x (e.g., x corresponds to lightness)
Terminology
• Conditional probability density p(x/ωj) (likelihood):
  • e.g., how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj
Terminology
• Conditional probability P(ωj/x) (posterior):
  • e.g., the probability that the fish belongs to class ωj given feature x.
• Ultimately, we are interested in computing P(ωj/x) for each class ωj.
Decision Rule
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

• Favours the most likely class.
• This rule makes the same decision every time.
  • i.e., optimum if no other information is available
Decision Rule
• Using Bayes’ rule:
$$P(\omega_j/x) = \frac{p(x/\omega_j)\,P(\omega_j)}{p(x)} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

where

$$p(x) = \sum_{j=1}^{2} p(x/\omega_j)\,P(\omega_j)$$

Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2
or
Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2
or
Decide ω1 if p(x/ω1)/p(x/ω2) > P(ω2)/P(ω1); otherwise decide ω2
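To make the rule concrete, here is a minimal Python sketch of the two-class decision, assuming hypothetical Gaussian class-conditional densities for the lightness feature in the fish example; all parameter values below are made up for illustration.

```python
from scipy.stats import norm

# Hypothetical class-conditional densities p(x/w) for the lightness
# feature x; the Gaussian parameters are illustrative only.
prior = {"sea bass": 2/3, "salmon": 1/3}              # P(w1), P(w2)
likelihood = {"sea bass": norm(loc=7.0, scale=1.0),   # p(x/w1)
              "salmon":   norm(loc=4.0, scale=1.2)}   # p(x/w2)

def decide(x):
    """Decide w1 if p(x/w1)P(w1) > p(x/w2)P(w2); otherwise decide w2."""
    scores = {w: likelihood[w].pdf(x) * prior[w] for w in prior}
    return max(scores, key=scores.get)

print(decide(6.5))   # near the sea bass mean -> "sea bass"
print(decide(4.2))   # near the salmon mean  -> "salmon"
```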
Decision Rule
P(ω1) = 2/3, P(ω2) = 1/3

[Figure: class-conditional densities p(x/ωj) and the resulting posteriors P(ωj/x) for these priors]
Probability of Error
• The probability of error is defined as:

$$P(\text{error}/x) = \begin{cases} P(\omega_1/x) & \text{if we decide } \omega_2 \\ P(\omega_2/x) & \text{if we decide } \omega_1 \end{cases}$$

or

$$P(\text{error}/x) = \min[P(\omega_1/x),\, P(\omega_2/x)]$$

• What is the average probability of error?

$$P(\text{error}) = \int P(\text{error}/x)\,p(x)\,dx$$

• The Bayes rule is optimum, that is, it minimizes the average probability of error!
Where do Probabilities come from?
• There are two competing answers:
  • Relative frequency (objective) approach: probabilities can only come from experiments.
  • Bayesian (subjective) approach: probabilities may reflect degrees of belief and can be based on opinion.
Example: Objective approach
• Classify cars by whether they cost more or less than $50K:
  • Classes: C1 if price > $50K, C2 if price ≤ $50K
  • Features: x, the height of a car
• Use Bayes' rule to compute the posterior probabilities:

$$P(C_i/x) = \frac{p(x/C_i)\,P(C_i)}{p(x)}$$

• We need to estimate p(x/C1), p(x/C2), P(C1), P(C2)
Example: Objective approach
• Collect data
  • Ask drivers how much their car cost and measure its height.
• Determine prior probabilities P(C1), P(C2)
  • e.g., 1209 samples: #C1 = 221, #C2 = 988

$$P(C_1) = \frac{221}{1209} = 0.183 \qquad P(C_2) = \frac{988}{1209} = 0.817$$
Example: Objective approach
• Determine the class-conditional probabilities (likelihood) p(x/Ci)
  • Discretize car height into bins and use the normalized histogram
• Calculate the posterior probability for each bin, e.g., for x = 1.0:

$$P(C_1/x=1.0) = \frac{p(x=1.0/C_1)\,P(C_1)}{p(x=1.0/C_1)\,P(C_1) + p(x=1.0/C_2)\,P(C_2)} = \frac{0.2081 \times 0.183}{0.2081 \times 0.183 + 0.0597 \times 0.817} = 0.438$$
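The same arithmetic written as a short Python check; the bin likelihoods 0.2081 and 0.0597 are the slide's normalized-histogram values for the bin containing x = 1.0.

```python
# Posterior for the bin containing x = 1.0, using the slide's numbers.
p_x_C1, p_x_C2 = 0.2081, 0.0597       # p(x=1.0/C1), p(x=1.0/C2)
P_C1, P_C2 = 221 / 1209, 988 / 1209   # priors: 0.183, 0.817

evidence = p_x_C1 * P_C1 + p_x_C2 * P_C2   # p(x=1.0)
posterior_C1 = p_x_C1 * P_C1 / evidence
print(round(posterior_C1, 3))              # 0.438
```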
Outline
• What's Bayesian Decision Theory?
• A More General Theory
• Discriminant Function and Decision Boundary
• Multivariate Gaussian Density
A More General Theory
• Use more than one feature.
• Allow more than two categories.
• Allow actions other than classifying the input into one of the possible categories (e.g., rejection).
• Employ a more general error function (i.e., expected "risk") by associating a "cost" (based on a "loss" function) with different errors.
Terminology
• Features form a vector x ∈ R^d
• A set of c categories ω1, ω2, …, ωc
• A finite set of l actions α1, α2, …, αl
• A loss function λ(αi/ωj)
  • the cost associated with taking action αi when the correct classification category is ωj
Conditional Risk (or Expected Loss)
• Suppose we observe x and take action αi.
• The conditional risk (or expected loss) of taking action αi is defined as:

$$R(\alpha_i/\mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i/\omega_j)\,P(\omega_j/\mathbf{x})$$

• Example: from a medical image, we want to determine whether it contains cancerous tissue or not.
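A sketch of this computation for the medical example; the loss matrix and posterior values are invented for illustration (a missed cancer is costed far more heavily than a false alarm).

```python
import numpy as np

# Rows are actions (a1 = report cancer, a2 = report healthy);
# columns are true states (w1 = cancer, w2 = healthy).
# Loss values are assumed: a miss (a2 when w1 is true) costs the most.
loss = np.array([[0.0,  1.0],    # lambda(a1/w1), lambda(a1/w2)
                 [10.0, 0.0]])   # lambda(a2/w1), lambda(a2/w2)

posterior = np.array([0.3, 0.7]) # P(w1/x), P(w2/x) for some observed x

# R(ai/x) = sum_j lambda(ai/wj) P(wj/x)
risk = loss @ posterior
print(risk)                                       # [0.7 3.0]
print("take action a%d" % (np.argmin(risk) + 1))  # a1: report cancer
```

Note that even with only a 30% cancer posterior, the high cost of a miss makes "report cancer" the minimum-risk action.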
Overall Risk
• Suppose α(x) is a general decision rule that determines which action α1, α2, …, αl to take for every x.
• The overall risk is defined as:

$$R = \int R(\alpha(\mathbf{x})/\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$$

• The optimum decision rule is the Bayes rule.
Overall Risk
• The Bayes rule minimizes R by:
  (i) Computing R(αi/x) for every αi given an x
  (ii) Choosing the action αi with the minimum R(αi/x)
• The resulting minimum R* is called the Bayes risk and is the best (i.e., optimum) performance that can be achieved:

$$R^* = \min R$$
Example: Two-category classification
• Define:
  • α1: decide ω1
  • α2: decide ω2
  • λij = λ(αi/ωj)
• The conditional risks are:

$$R(\alpha_1/\mathbf{x}) = \lambda_{11}P(\omega_1/\mathbf{x}) + \lambda_{12}P(\omega_2/\mathbf{x})$$

$$R(\alpha_2/\mathbf{x}) = \lambda_{21}P(\omega_1/\mathbf{x}) + \lambda_{22}P(\omega_2/\mathbf{x})$$
Example: Two-category classification
• Minimum risk decision rule: decide ω1 if R(α1/x) < R(α2/x), i.e.,

$$(\lambda_{21} - \lambda_{11})\,P(\omega_1/\mathbf{x}) > (\lambda_{12} - \lambda_{22})\,P(\omega_2/\mathbf{x})$$

or (i.e., using the likelihood ratio)

$$\frac{p(\mathbf{x}/\omega_1)}{p(\mathbf{x}/\omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$

where the left-hand side is the likelihood ratio and the right-hand side is a threshold independent of x.
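A quick numeric sanity check that the likelihood-ratio form agrees with directly comparing the two conditional risks; all values below are assumed.

```python
# lambda_ij = lambda(ai/wj): losses (assumed values)
l11, l12, l21, l22 = 0.0, 5.0, 1.0, 0.0
P1, P2 = 0.4, 0.6              # priors
px_w1, px_w2 = 0.40, 0.05      # p(x/w1), p(x/w2) at some observed x

# Direct comparison of risks; using the unnormalized posteriors
# p(x/wj)P(wj) is fine, since dividing by p(x) changes nothing.
R1 = l11 * px_w1 * P1 + l12 * px_w2 * P2
R2 = l21 * px_w1 * P1 + l22 * px_w2 * P2

# Likelihood-ratio form with its threshold
threshold = (l12 - l22) / (l21 - l11) * P2 / P1
print(R1 < R2, px_w1 / px_w2 > threshold)   # True True: the rules agree
```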
Special Case: Zero-One Loss Function
• Assign the same loss to all errors:

$$\lambda(\alpha_i/\omega_j) = \begin{cases} 0 & i = j \\ 1 & i \ne j \end{cases} \qquad i, j = 1, \dots, c$$

• The conditional risk corresponding to this loss function:

$$R(\alpha_i/\mathbf{x}) = \sum_{j \ne i} P(\omega_j/\mathbf{x}) = 1 - P(\omega_i/\mathbf{x})$$
Special Case: Zero-One Loss Function
• The decision rule becomes:

Decide ωi if P(ωi/x) > P(ωj/x) for all j ≠ i

or, for two categories:

Decide ω1 if p(x/ω1)P(ω1) > p(x/ω2)P(ω2); otherwise decide ω2
or
Decide ω1 if p(x/ω1)/p(x/ω2) > P(ω2)/P(ω1); otherwise decide ω2

• The overall risk turns out to be the average probability of error!
Example
• Assuming general loss:

Decide ω1 if p(x/ω1)/p(x/ω2) > θb; otherwise decide ω2, where

$$\theta_b = \frac{P(\omega_2)\,(\lambda_{12} - \lambda_{22})}{P(\omega_1)\,(\lambda_{21} - \lambda_{11})}$$

• Assuming zero-one loss:

Decide ω1 if p(x/ω1)/p(x/ω2) > θa; otherwise decide ω2, where

$$\theta_a = \frac{P(\omega_2)}{P(\omega_1)}$$

(assume λ12 > λ21)
Outline
• What's Bayesian Decision Theory?
• A More General Theory
• Discriminant Function and Decision Boundary
• Multivariate Gaussian Density
• Error Bound, ROC, Missing Features and Compound Bayesian Decision Theory
• Summary
Discriminant Functions
• A useful way to represent a classifier is through discriminant functions gi(x), i = 1, …, c, where a feature vector x is assigned to class ωi if

gi(x) > gj(x) for all j ≠ i

[Figure: a network computing the discriminants g1(x), …, gc(x) and selecting the max]
Discriminants for Bayes Classifier
• For minimum error-rate classification, we can take:

gi(x) = P(ωi/x)

• Is the choice of gi unique?
  • Replacing gi(x) with f(gi(x)), where f(·) is monotonically increasing, does not change the classification results.
• Equivalent discriminants:

$$g_i(\mathbf{x}) = \frac{p(\mathbf{x}/\omega_i)\,P(\omega_i)}{p(\mathbf{x})}$$

$$g_i(\mathbf{x}) = p(\mathbf{x}/\omega_i)\,P(\omega_i)$$

$$g_i(\mathbf{x}) = \ln p(\mathbf{x}/\omega_i) + \ln P(\omega_i)$$

• We'll use this last discriminant extensively!
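A tiny check, with made-up scores, that a monotonically increasing f (here ln) leaves the winning class unchanged:

```python
import numpy as np

# Unnormalized posteriors p(x/wi)P(wi) for three classes at some x
scores = np.array([0.02, 0.15, 0.08])    # made-up values
print(np.argmax(scores))                 # 1
print(np.argmax(np.log(scores)))         # 1: the same class wins
```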
Case of two categories
• It is more common to use a single discriminant function (dichotomizer) instead of two:

Decide ω1 if g(x) > 0; otherwise decide ω2

• Examples:

$$g(\mathbf{x}) = P(\omega_1/\mathbf{x}) - P(\omega_2/\mathbf{x})$$

$$g(\mathbf{x}) = \ln\frac{p(\mathbf{x}/\omega_1)}{p(\mathbf{x}/\omega_2)} + \ln\frac{P(\omega_1)}{P(\omega_2)}$$
Decision Regions and Boundaries
• Discriminants divide the feature space into decision regions R1, R2, …, Rc, separated by decision boundaries.
• A decision boundary is defined by: g1(x) = g2(x)
Outline
• What's Bayesian Decision Theory?
• A More General Theory
• Discriminant Function and Decision Boundary
• Multivariate Gaussian Density
Why are Gaussians so Useful?
• They represent many probability distributions in nature quite accurately.
  • In our case: when patterns can be represented as random variations of an ideal prototype (represented by the mean feature vector).
• Everyday examples: height and weight of a population.
Multivariate Gaussian Density
• A normal distribution over two or more variables (d variables/dimensions):

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right]$$

where μ is the d-dimensional mean vector and Σ is the d×d covariance matrix.
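A sketch evaluating this density in Python, once with scipy and once directly from the formula; μ, Σ, and x are arbitrary example values.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Example parameters for d = 2 (arbitrary values)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([1.5, 1.0])

# Library evaluation
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))

# Direct evaluation of the density formula
d = len(mu)
diff = x - mu
norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
quad = diff @ np.linalg.solve(Sigma, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
print(norm_const * np.exp(-0.5 * quad))     # same value as above
```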
The Covariance Matrix
• For our purposes, assume the matrix Σ is positive definite, so that its determinant is always positive.
• Matrix elements:
  • Main diagonal: the variances of each individual variable
  • Off-diagonal: the covariances of each variable pairing i & j (note: values are repeated, as the matrix is symmetric)
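A quick numpy illustration of this structure on synthetic data: the diagonal of a sample covariance matrix holds the per-feature variances, and the matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 1000 samples of a 3-dimensional feature vector,
# mixed to introduce correlations between features.
X = rng.normal(size=(1000, 3)) @ np.array([[1.0, 0.3, 0.0],
                                           [0.0, 1.0, 0.5],
                                           [0.0, 0.0, 1.0]])

S = np.cov(X, rowvar=False)   # 3x3 sample covariance matrix
print(np.diag(S))             # main diagonal: per-feature variances
print(np.allclose(S, S.T))    # True: the matrix is symmetric
```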
Discriminant Function for Multivariate Gaussian Density
• We will consider three special cases for:
  • normally distributed features, and
  • minimum error-rate classification (0-1 loss)
• Recall the discriminant: gi(x) = ln p(x/ωi) + ln P(ωi)
Minimum Error-Rate Discriminant Function for Multivariate Gaussian Feature Distributions
• Taking the ln (natural log) of the multivariate Gaussian density p(x/ωi) gives a general form for our discriminant functions:

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
Special Cases for Binary Classification
• Purpose: overview of commonly assumed cases for the feature likelihood densities.
• Goal: eliminate common additive constants in the discriminant functions; these do not affect the classification decision (i.e., keep "just the differences").
• Also, look at the resulting decision surfaces (gi(x) = gj(x)).
• Three special cases:
  ① Statistically independent features, identically distributed Gaussians for each class (Σi = σ²I)
  ② Identical covariances for each class (Σi = Σ)
  ③ Arbitrary covariances
Case I: Σi = σ²I
• This satisfies two conditions: (1) the features are statistically independent, and (2) each feature has the same variance σ².
• Remove the terms that are the same across classes ("unimportant additive constants"): here both −(d/2) ln 2π and −½ ln|Σi| are class-independent.
• Inverse of the covariance matrix: Σi⁻¹ = (1/σ²) I
  • its only effect is to scale the vector product by 1/σ²
• Discriminant function:

$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x}-\boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(\omega_i)$$
Case I: Σi = σ²I
• Linear discriminant function, produced by expanding the previous form and dropping the class-independent xᵀx term:

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \frac{\boldsymbol{\mu}_i}{\sigma^2}, \qquad w_{i0} = -\frac{\boldsymbol{\mu}_i^T \boldsymbol{\mu}_i}{2\sigma^2} + \ln P(\omega_i)$$

• w_{i0} is the threshold or bias for class i.
• A change in the prior translates the decision boundary.
Case I: Σi = σ²I
• Decision boundary: wᵀ(x − x0) = 0, where

$$\mathbf{w} = \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2, \qquad \mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) - \frac{\sigma^2}{\|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|^2} \ln\frac{P(\omega_1)}{P(\omega_2)} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

• The decision boundary goes through x0 along the line between the means, orthogonal to this line.
• If the priors are equal, x0 lies midway between the means (minimum-distance classifier); otherwise x0 is shifted (see the sketch below).
• If the variance is small relative to the distance between the means, the priors have limited effect on the boundary location.
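A small Python sketch of the Case I quantities, under assumed means, variance, and priors; it verifies that x0 lies on the boundary and is shifted away from the class with the larger prior.

```python
import numpy as np

# Case I sketch: Sigma_i = sigma^2 I, two classes (all values assumed)
mu1, mu2 = np.array([0.0, 0.0]), np.array([4.0, 0.0])
sigma2 = 1.0                  # shared variance sigma^2
P1, P2 = 0.7, 0.3             # priors

# Linear discriminant g_i(x) = w_i^T x + w_i0
w1, w10 = mu1 / sigma2, -mu1 @ mu1 / (2 * sigma2) + np.log(P1)
w2, w20 = mu2 / sigma2, -mu2 @ mu2 / (2 * sigma2) + np.log(P2)

# Boundary point x0 on the line between the means
d2 = np.sum((mu1 - mu2) ** 2)
x0 = 0.5 * (mu1 + mu2) - sigma2 / d2 * np.log(P1 / P2) * (mu1 - mu2)

print(x0)                            # approx [2.21 0.]: shifted toward mu2
print(w1 @ x0 + w10, w2 @ x0 + w20)  # equal: x0 lies on the boundary
```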
Case I: Statistically Independent Features with Identical Variances
Example: Translation of Decision Boundaries Through Changing Priors
Case II: Identical Covariances
• As in Case I, remove the terms that are the same across classes: with Σi = Σ, both −(d/2) ln 2π and −½ ln|Σ| can be ignored, leaving:

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(\omega_i)$$

• The quadratic term is the squared Mahalanobis distance: the distance from x to the mean of class i, taking the covariance into account; it defines contours of fixed density.
Case II: Identical Covariances
• Expansion of the squared Mahalanobis distance:

$$(\mathbf{x}-\boldsymbol{\mu}_i)^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) = \mathbf{x}^T\Sigma^{-1}\mathbf{x} - 2\boldsymbol{\mu}_i^T\Sigma^{-1}\mathbf{x} + \boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i$$

  where the middle step uses the symmetry of the covariance matrix and thus of its inverse: xᵀΣ⁻¹μi = μiᵀΣ⁻¹x.
• Once again, the term xᵀΣ⁻¹x is an additive constant independent of the class and can be removed.
Case II: Identical Covariances
• Linear discriminant function:

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \Sigma^{-1}\boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^T\Sigma^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i)$$

• Decision boundary: wᵀ(x − x0) = 0, where

$$\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \qquad \mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) - \frac{\ln[P(\omega_1)/P(\omega_2)]}{(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T\Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$
Case II: Identical Covariances
• Notes on the decision boundary:
  • As in Case I, it passes through the point x0 lying on the line between the two class means; again, x0 is in the middle if the priors are identical.
  • The hyperplane defined by the boundary is generally not orthogonal to the line between the two means.
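A similar sketch for Case II with an assumed shared covariance: with equal priors the boundary passes through the midpoint of the means, but the weight vector Σ⁻¹(μ1 − μ2) is generally not parallel to μ1 − μ2, so the boundary is tilted.

```python
import numpy as np

# Case II sketch: shared covariance Sigma (all values assumed)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
P1, P2 = 0.5, 0.5

Si = np.linalg.inv(Sigma)
# Linear discriminant g_i(x) = w_i^T x + w_i0, with w_i = Sigma^{-1} mu_i
w1, w10 = Si @ mu1, -0.5 * mu1 @ Si @ mu1 + np.log(P1)
w2, w20 = Si @ mu2, -0.5 * mu2 @ Si @ mu2 + np.log(P2)

x0 = 0.5 * (mu1 + mu2)               # midpoint (equal priors)
print(w1 @ x0 + w10, w2 @ x0 + w20)  # equal: x0 lies on the boundary

w = Si @ (mu1 - mu2)                 # normal vector of the boundary
print(w, mu1 - mu2)                  # not parallel: hyperplane is tilted
```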
Case III: Arbitrary Σi
• Only the class-independent term −(d/2) ln 2π can be removed; the discriminant function remains quadratic:

$$g_i(\mathbf{x}) = \mathbf{x}^T W_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$$

where

$$W_i = -\frac{1}{2}\Sigma_i^{-1}, \qquad \mathbf{w}_i = \Sigma_i^{-1}\boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^T\Sigma_i^{-1}\boldsymbol{\mu}_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
Case III: Arbitrary Σi
• Decision boundaries are hyperquadrics: they can be hyperplanes, hyperplane pairs, hyperspheres, hyperellipsoids, hyperparaboloids, or hyperhyperboloids.
• Decision regions need not be simply connected, even in one dimension (next slide).
Case III: Arbitrary Σi
Case III: Arbitrary Σi
• Nonlinear decision boundaries
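A sketch of the quadratic discriminant with assumed per-class parameters, dropping only the shared −(d/2) ln 2π term:

```python
import numpy as np

def g(x, mu, Sigma, prior):
    """Quadratic discriminant for arbitrary Sigma_i; the constant
    -d/2 ln(2 pi), shared by all classes, is dropped."""
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Example parameters (assumed): unequal covariances give a curved boundary
mu1, S1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu2, S2 = np.array([2.0, 0.0]), np.array([[3.0, 0.0], [0.0, 0.3]])
P1 = P2 = 0.5

x = np.array([1.0, 1.5])
print("decide w1" if g(x, mu1, S1, P1) > g(x, mu2, S2, P2) else "decide w2")
```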
Example: Case III
• P(ω1) = P(ω2)
• Decision boundary:
• The boundary does not pass through the midpoint of μ1 and μ2.