Probabilistic inference
• Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence E = e
 – Partially observable, stochastic, episodic environment
 – Examples: X = {spam, not spam}, e = email message
   X = {zebra, giraffe, hippo}, e = image features
• Bayes decision theory:
 – The agent has a loss function, which is 0 if the value of X is guessed correctly and 1 otherwise
– The estimate of X that minimizes expected loss is the one that has the greatest posterior probability P(X = x | e)
– This is the Maximum a Posteriori (MAP) decision
MAP decision
• Value of x that has the highest posterior probability given the evidence e:

  x* = argmax_x P(x | e) = argmax_x P(e | x) P(x) / P(e) = argmax_x P(e | x) P(x)

  posterior ∝ likelihood × prior:  P(x | e) ∝ P(e | x) P(x)

• Maximum likelihood (ML) decision:

  x* = argmax_x P(e | x)
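As a concrete sketch of the two decision rules, here is a small Python example using the animal-recognition setting from the first slide. The priors and likelihoods are made-up numbers, chosen only so that the MAP and ML decisions come out differently:

```python
# Made-up priors P(x) and likelihoods P(e | x) for the zebra/giraffe/hippo
# example; the numbers are illustrative, not from any real model.
priors = {"zebra": 0.1, "giraffe": 0.6, "hippo": 0.3}          # P(x)
likelihoods = {"zebra": 0.40, "giraffe": 0.30, "hippo": 0.05}  # P(e | x)

# MAP decision: argmax over x of P(e | x) P(x); P(e) is constant and cancels.
map_decision = max(priors, key=lambda x: likelihoods[x] * priors[x])

# ML decision: argmax over x of P(e | x) alone, ignoring the prior.
ml_decision = max(likelihoods, key=lambda x: likelihoods[x])

print(map_decision)  # giraffe (0.30 * 0.6 = 0.18 is the largest product)
print(ml_decision)   # zebra (0.40 is the largest likelihood)
```

Note how the strong prior on giraffe flips the decision away from the likelihood-only answer.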
Example: Naïve Bayes model
• Suppose we have many different types of observations (symptoms, features) F1, …, Fn that we want to use to obtain evidence about an underlying hypothesis H
• MAP decision involves estimating

  P(H | F1, …, Fn) ∝ P(F1, …, Fn | H) P(H)

 – If each feature can take on k values, how many entries are in the joint probability table?
• We can make the simplifying assumption that the different features are conditionally independent given the hypothesis:

  P(F1, …, Fn | H) = ∏_{i=1}^n P(Fi | H)

 – If each feature can take on k values, what is the complexity of storing the resulting distributions?
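To make the storage question above concrete, the following sketch counts table entries for illustrative sizes (the values of n, k, and m are arbitrary, not from the slides): the full joint conditional table needs k^n entries per hypothesis value, while the naïve Bayes factorization needs only n·k.

```python
# Illustrative sizes: n features, each taking k values, m hypothesis values.
n, k, m = 10, 20, 2

# Full joint table P(F1, ..., Fn | H): one entry per (hypothesis, feature tuple).
joint_entries = m * k ** n

# Naive Bayes factorization: n small tables P(Fi | H), each of size k.
factored_entries = m * n * k

print(joint_entries)     # 2 * 20**10 = 20,480,000,000,000
print(factored_entries)  # 2 * 10 * 20 = 400
```

This exponential-versus-linear gap is the whole point of the conditional-independence assumption.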
Naïve Bayes Spam Filter
• MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message)
• We have P(spam | message) ∝ P(message | spam) P(spam) and P(¬spam | message) ∝ P(message | ¬spam) P(¬spam)
Naïve Bayes Spam Filter
• We need to find P(message | spam) P(spam) and P(message | ¬spam) P(¬spam)
• The message is a sequence of words (w1, …, wn)
• Bag of words representation
 – The order of the words in the message is not important
 – Each word is conditionally independent of the others given the message class (spam or not spam)

  P(message | spam) = P(w1, …, wn | spam) = ∏_{i=1}^n P(wi | spam)

• Our filter will classify the message as spam if

  P(spam) ∏_{i=1}^n P(wi | spam) > P(¬spam) ∏_{i=1}^n P(wi | ¬spam)
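A minimal sketch of this decision rule in Python, using a tiny invented vocabulary and made-up word probabilities (in practice one works in log space, since multiplying many small probabilities underflows floating point):

```python
import math

# Hypothetical parameters for a tiny vocabulary; all numbers are invented
# for illustration only.
prior = {"spam": 0.33, "not_spam": 0.67}
likelihood = {
    "spam":     {"free": 0.05,  "money": 0.04,  "meeting": 0.001},
    "not_spam": {"free": 0.005, "money": 0.003, "meeting": 0.02},
}

def classify(words):
    # Compare log P(class) + sum_i log P(w_i | class); taking logs turns
    # the product of word probabilities into a numerically safe sum.
    def score(c):
        return math.log(prior[c]) + sum(math.log(likelihood[c][w]) for w in words)
    return max(prior, key=score)

print(classify(["free", "money"]))  # spam
print(classify(["meeting"]))        # not_spam
```

Because log is monotonic, comparing log-scores gives the same decision as comparing the products on the slide.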
Bag of words illustration
US Presidential Speeches Tag Cloud: http://chir.ag/projects/preztags/
Parameter estimation
• In order to classify a message, we need to know the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam)
 – These are the parameters of the probabilistic model
 – How do we obtain the values of these parameters?

[Figure: example parameter tables — prior (spam: 0.33, ¬spam: 0.67) and likelihood tables P(word | spam), P(word | ¬spam)]
Parameter estimation
• How do we obtain the prior P(spam) and the likelihoods P(word | spam) and P(word | ¬spam)?
 – Empirically: use training data

  P(word | spam) = (# of word occurrences in spam messages) / (total # of words in spam messages)

 – This is the maximum likelihood (ML) estimate, or estimate that maximizes the likelihood of the training data:

  ∏_{d=1}^D ∏_{i=1}^{n_d} P(w_{d,i} | class_d)

  (d: index of training document, i: index of a word)
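The ML estimate above is just counting. A sketch on a tiny invented training set (the messages and labels are made up for illustration):

```python
from collections import Counter

# A tiny invented training set of (message, label) pairs.
train = [
    ("free money now", "spam"),
    ("free offer", "spam"),
    ("meeting at noon", "not_spam"),
]

# Count word occurrences per class.
counts = {"spam": Counter(), "not_spam": Counter()}
for message, label in train:
    counts[label].update(message.split())

def p_word(word, label):
    # ML estimate: occurrences of word in the class / total words in the class.
    return counts[label][word] / sum(counts[label].values())

print(p_word("free", "spam"))  # 2 occurrences / 5 spam words = 0.4
```

The class prior P(spam) would be estimated the same way, as the fraction of training messages labeled spam.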
Parameter estimation
• Parameter smoothing: dealing with words that were never seen or seen too few times
 – Laplacian smoothing: pretend you have seen every vocabulary word one more time than you actually did

  P(word | spam) = (# of word occurrences in spam messages + 1) / (total # of words in spam messages + V), where V is the vocabulary size
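A sketch of the Laplacian (add-one) smoothed estimate, with invented counts; adding 1 to every word count and V to the denominator keeps the probabilities summing to 1 over the vocabulary while giving unseen words a small nonzero probability:

```python
# Invented counts for illustration.
spam_counts = {"free": 2, "money": 1}
total_spam_words = sum(spam_counts.values())  # 3
vocab = {"free", "money", "meeting", "noon"}  # vocabulary size V = 4

def p_word_smoothed(word):
    # Add 1 to every word's count and V to the denominator, so that the
    # smoothed distribution still sums to 1 over the vocabulary.
    return (spam_counts.get(word, 0) + 1) / (total_spam_words + len(vocab))

print(p_word_smoothed("free"))     # (2 + 1) / (3 + 4) = 3/7
print(p_word_smoothed("meeting"))  # (0 + 1) / (3 + 4) = 1/7, not zero
```

Without smoothing, a single never-seen word would zero out the entire product in the classification rule.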
Summary of model and parameters
• Naïve Bayes model:

  P(spam | message) ∝ P(spam) ∏_{i=1}^n P(wi | spam)
  P(¬spam | message) ∝ P(¬spam) ∏_{i=1}^n P(wi | ¬spam)

• Model parameters:
 – prior: P(spam), P(¬spam)
 – likelihood of spam: P(w1 | spam), P(w2 | spam), …, P(wn | spam)
 – likelihood of ¬spam: P(w1 | ¬spam), P(w2 | ¬spam), …, P(wn | ¬spam)
Bag-of-word models for images
Csurka et al. (2004), Willamowski et al. (2005), Grauman & Darrell (2005), Sivic et al. (2003, 2005)
Bag-of-word models for images
1. Extract image features