Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

Mar 21, 2016

Transcript
Page 1: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

Page 2: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

The Burglar Alarm Problem

A burglar alarm is sensitive to both burglaries and earthquakes. In California, earthquakes happen fairly frequently. You are at a conference far from California but in phone contact with the alarm: you observe the alarm ring.

A = the alarm rings; A^c = the alarm does not ring

Page 3: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

Alarm problem (continued)

The alarm could be due to a burglary or an earthquake (these are assumed not to be observed):

b = 1 if a burglary takes place, 0 otherwise

e = 1 if there is an earthquake, 0 otherwise

Page 4: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

Likelihood

The likelihood is concerned with what is observed – we observe whether the alarm goes off or not:

P(A|b=1,e=1)= .607 (the chance that the alarm goes off given a burglary and an earthquake)

P(A|b=0,e=1)=.135 (the chance that the alarm goes off given no burglary but an earthquake)

P(A|b=1,e=0)=.368 (the chance that the alarm goes off given a burglary but no earthquake)

P(A|b=0,e=0)= .001 (the chance that the alarm goes off given no burglary and no earthquake).

Page 5: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

PRIOR

The prior governs probability distributions over the presence/absence of burglaries, earthquakes:

P(b=1)=.1, P(e=1)=.1; b and e are mutually independent.

Priors characterize the available information about burglaries and earthquakes.

Page 6: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

Bayes Theorem

Bayes' rule and related results:

Conditional Probability Rule: P(C | D) = P(C and D) / P(D)

Multiplication Rule: P(C and D) = P(D) P(C | D)

Bayes' theorem (a consequence of the above) serves to combine prior expertise with the likelihood. Suppose D stands for data and Θ for parameters. Then:

P(Θ | D) = P(D | Θ) P(Θ) / [ P(D | Θ) P(Θ) + P(D | Θ^c) P(Θ^c) ]
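
As an illustration (not part of the original slides), this two-hypothesis form can be written as a small Python function; the function and argument names here are my own:

def bayes_posterior(lik_h, prior_h, lik_not_h):
    # P(H | D) = P(D | H) P(H) / [ P(D | H) P(H) + P(D | H^c) P(H^c) ]
    num = lik_h * prior_h
    return num / (num + lik_not_h * (1 - prior_h))

# e.g. bayes_posterior(.3919, .1, .0144) returns about .751,
# the burglary posterior worked out on Page 8 below.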

Page 7: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

Bayes Theorem

We use Bayes’ theorem to find the probability that there was a burglary given that the alarm went off and the probability that there was an earthquake given that the alarm went off. To do this, we need to make use of two quantities:

a) the likelihood: the probability that the alarm went off given that the burglary did/didn’t take place and/or the earthquake did or did not take place.

b) the prior: the probability that the burglary did/didn’t take place and/or the earthquake did or didn’t take place.

Page 8: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

Bayes Theorem for burglaries

We first update the likelihood relative to earthquakes and then use Bayes' rule to calculate the probability of interest:

So, about 75% of the time when the alarm goes off, there is a burglary.

P(A | b = 1) = P(A | b = 1, e = 0) P(e = 0) + P(A | b = 1, e = 1) P(e = 1) = .9 * .368 + .1 * .607 = .3919

P(A | b = 0) = P(A | b = 0, e = 0) P(e = 0) + P(A | b = 0, e = 1) P(e = 1) = .9 * .001 + .1 * .135 = .0144

P(b = 1 | A) = P(A | b = 1) P(b = 1) / [ P(A | b = 1) P(b = 1) + P(A | b = 0) P(b = 0) ]
             = (.3919 * .1) / (.3919 * .1 + .0144 * .9) = .751
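
As a quick check (my own sketch, not slide content), the same numbers can be reproduced in a few lines of Python, first marginalizing the likelihood over e and then applying Bayes' rule:

p_alarm = {(1, 1): .607, (1, 0): .368, (0, 1): .135, (0, 0): .001}   # P(A | b, e)
p_b1, p_e1 = .1, .1                                                  # priors

# Marginalize the likelihood over earthquakes: P(A | b) = sum over e of P(A | b, e) P(e)
p_A_given_b1 = p_alarm[(1, 0)] * (1 - p_e1) + p_alarm[(1, 1)] * p_e1   # .3919
p_A_given_b0 = p_alarm[(0, 0)] * (1 - p_e1) + p_alarm[(0, 1)] * p_e1   # .0144

# Bayes' rule for the burglary posterior
p_b1_given_A = p_A_given_b1 * p_b1 / (p_A_given_b1 * p_b1 + p_A_given_b0 * (1 - p_b1))
print(round(p_b1_given_A, 3))   # 0.751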

Page 9: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

[Diagram: updating the likelihood relative to earthquakes. The prior probabilities P(e=0) = .9 and P(e=1) = .1 are combined with the likelihood probabilities P(A | b=1, e=0) = .368 and P(A | b=1, e=1) = .607 to give the new likelihood probability P(A | b=1) = .1 * .607 + .9 * .368 = .3919.]

Page 10: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

[Diagram: Bayes' law as a tree. The new likelihoods P(A | b=0) = .0144, P(A^c | b=0) = .9856, P(A | b=1) = .3919, P(A^c | b=1) = .6081 are combined with the priors P(b=0) = .9 and P(b=1) = .1 to give the joint probabilities P(b=1 & A) = .1 * .3919 = .03919 and P(b=0 & A) = .9 * .0144 = .01296, so P(A) = .05215 and P(b=1 | A) = P(b=1 & A) / P(A) = .03919 / .05215 = .751.]

Page 11: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

Bayes Theorem for earthquakes

We first update the likelihood relative to burglaries and then calculate the probability of interest:

So, about 35% of the time when the alarm goes off, there is an earthquake.

P(A | e = 1) = P(A | b = 1, e = 1) P(b = 1) + P(A | b = 0, e = 1) P(b = 0) = .1 * .607 + .9 * .135 = .1822

P(A | e = 0) = P(A | b = 0, e = 0) P(b = 0) + P(A | b = 1, e = 0) P(b = 1) = .9 * .001 + .1 * .368 = .0377

P(e = 1 | A) = P(A | e = 1) P(e = 1) / [ P(A | e = 1) P(e = 1) + P(A | e = 0) P(e = 0) ]
             = (.1822 * .1) / (.1822 * .1 + .0377 * .9) = .349
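
The same kind of sketch with the roles of b and e swapped (again my own check, not slide content) reproduces the earthquake posterior:

p_alarm = {(1, 1): .607, (1, 0): .368, (0, 1): .135, (0, 0): .001}   # P(A | b, e)
p_b1, p_e1 = .1, .1                                                  # priors

# Marginalize the likelihood over burglaries: P(A | e) = sum over b of P(A | b, e) P(b)
p_A_given_e1 = p_alarm[(1, 1)] * p_b1 + p_alarm[(0, 1)] * (1 - p_b1)   # .1822
p_A_given_e0 = p_alarm[(1, 0)] * p_b1 + p_alarm[(0, 0)] * (1 - p_b1)   # .0377

# Bayes' rule for the earthquake posterior
p_e1_given_A = p_A_given_e1 * p_e1 / (p_A_given_e1 * p_e1 + p_A_given_e0 * (1 - p_e1))
print(round(p_e1_given_A, 3))   # 0.349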

Page 12: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

Expectation Maximization Algorithm

The EM algorithm concerns how to make inferences about parameters in the presence of latent variables. Such variables indicate the state of the process to which an observation belongs. Inference is based on estimating these parameters.

Page 13: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

EM algorithm for the alarm problem

Let b, e be latent variables. We now represent the prior on these latent variables by

P(b, e) = ξ1^b (1 - ξ1)^(1-b) ξ2^e (1 - ξ2)^(1-e)

The EM algorithm yields estimates of the parameters by maximizing:

Q = (log ξ1) P(b = 1 | A) + log(1 - ξ1) P(b = 0 | A) + (log ξ2) P(e = 1 | A) + log(1 - ξ2) P(e = 0 | A)
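
Setting the derivative of Q with respect to each ξ to zero (a step the slides leave implicit; filled in here) gives

dQ/dξ1 = P(b = 1 | A)/ξ1 - P(b = 0 | A)/(1 - ξ1) = 0, so ξ1 = P(b = 1 | A),

and likewise ξ2 = P(e = 1 | A); these are the estimates shown on the next slide.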

Page 14: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

EM estimate

The EM estimates are,

ξ1 = P(b = 1 | A) = .751;  ξ2 = P(e = 1 | A) = .349

Page 15: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

MCMC

Gibbs sampling is one example of Markov chain Monte Carlo. The idea behind MCMC is to simulate a posterior distribution by visiting the values of the parameters in proportion to their posterior probability. In Gibbs sampling, the visits depend entirely on conditional posterior probabilities. Another form of MCMC is Metropolis-Hastings (MH). In MH, the visits depend on a Markov kernel.

Page 16: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

Gibbs Sampling (MCMC)

Gibbs sampling is an iterative algorithm which successively visits the parameter values of b and e in proportion to their posterior probabilities.

GIBBS SAMPLER

1. Simulate b, e according to their priors.

2. For the given b, simulate e as a Bernoulli variable with probability P(e=1 | b, A), where

P(e = 1 | b, A) = P(A | b, e = 1) P(e = 1) / [ P(A | b, e = 1) P(e = 1) + P(A | b, e = 0) P(e = 0) ]
                = .155 if b = 1
                = .938 if b = 0

Page 17: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

Gibbs sampling (continued)

3. For the e obtained from step 2, simulate b from P(b=1 | e, A), where

P(b = 1 | e, A) = P(A | b = 1, e) P(b = 1) / [ P(A | b = 1, e) P(b = 1) + P(A | b = 0, e) P(b = 0) ]
                = .333 if e = 1
                = .976 if e = 0

4. Iterate steps 2 and 3. The proportion of times b = 1 in this chain estimates P(b = 1 | A); the proportion of times e = 1 in this chain estimates P(e = 1 | A).
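
A minimal Python sketch of this sampler (my own code, not from the slides) plugs in the rounded conditional probabilities from these two slides; because the conditionals are rounded to three decimals, the long-run proportions come out close to, but not exactly equal to, the exact posteriors .751 and .349:

import random

def gibbs_alarm(n_iter=100000, seed=0):
    # Conditional posterior probabilities taken from steps 2 and 3 above
    p_e1_given_b = {1: .155, 0: .938}   # P(e=1 | b, A)
    p_b1_given_e = {1: .333, 0: .976}   # P(b=1 | e, A)

    rng = random.Random(seed)

    # Step 1: simulate b and e from their priors, P(b=1) = P(e=1) = .1
    b = 1 if rng.random() < .1 else 0
    e = 1 if rng.random() < .1 else 0

    b_count = e_count = 0
    for _ in range(n_iter):
        # Step 2: draw e as a Bernoulli variable with probability P(e=1 | b, A)
        e = 1 if rng.random() < p_e1_given_b[b] else 0
        # Step 3: draw b with probability P(b=1 | e, A)
        b = 1 if rng.random() < p_b1_given_e[e] else 0
        b_count += b
        e_count += e

    # Step 4: the proportions of b=1 and e=1 estimate P(b=1 | A) and P(e=1 | A)
    return b_count / n_iter, e_count / n_iter

print(gibbs_alarm())   # roughly (0.75, 0.35)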

Page 18: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

Gibbs sampler convergence: cumulative proportion of times that b=1 (burglary) in the sampler

Page 19: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

Gibbs sampler convergence: cumulative proportion of times that e=1 (earthquake) in the sampler

Page 20: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

Appendix: Derivation of EM

For data D, latent variables Z, and parameters Θ, Bayes' theorem shows that

P(Θ | Z, D) P(Z | D) = P(Z | Θ, D) P(Θ | D)

Solving for P(Θ | D) and taking logs we get

log P(Θ | D) = log P(Θ | D, Z) + log P(Z | D) - log P(Z | Θ, D)

Integrating both sides against P(Z | D, Θ#) we get

log P(Θ | D) = ∫ log P(Θ | D, Z) P(Z | D, Θ#) dZ + ∫ log P(Z | D) P(Z | D, Θ#) dZ - ∫ log P(Z | Θ, D) P(Z | D, Θ#) dZ

Page 21: Bayesian Statistics, MCMC, and the Expectation Maximization Algorithm

EM (continued)

It follows, since the second term on the right does not depend on Θ and the third term is minimized at Θ = Θ#, that log P(Θ | D) is improved in Θ whenever the first term on the right is improved.

The resulting quantity to be maximized in Θ is:

Q(Θ, Θ#) = ∫ log P(Θ | D, Z) P(Z | D, Θ#) dZ