CS B553: Algorithms for Optimization and Learning
Monte Carlo Methods for Probabilistic Inference

Feb 23, 2016

Transcript
Page 1: CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Monte Carlo Methods for Probabilistic Inference

Page 2: AGENDA

Monte Carlo methods
  O(1/sqrt(N)) standard deviation
For Bayesian inference:
  Likelihood weighting
  Gibbs sampling

Page 3: MONTE CARLO INTEGRATION

Estimate large integrals/sums:
  $I = \int f(x)\,p(x)\,dx$   or   $I = \sum_x f(x)\,p(x)$

Using N i.i.d. samples $x^{(1)},\dots,x^{(N)}$ drawn from $p(x)$:
  $I \approx \frac{1}{N}\sum_{i=1}^N f(x^{(i)})$

Examples:
  $\int_a^b f(x)\,dx \approx \frac{b-a}{N}\sum_{i=1}^N f(x^{(i)})$, with $x^{(i)}$ drawn uniformly from $[a,b]$
  $E[X] = \int x\,p(x)\,dx \approx \frac{1}{N}\sum_{i=1}^N x^{(i)}$
  Volume of a set in $\mathbb{R}^n$
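To make the estimator concrete, here is a minimal Python sketch; the test functions and distributions are illustrative choices, not from the slides. It averages f over i.i.d. draws from p, then applies the same recipe to a definite integral over [a, b] using uniform samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sample_p, n):
    """I_N = (1/N) * sum_i f(x^(i)) over n i.i.d. draws x^(i) ~ p."""
    return np.mean(f(sample_p(n)))

# E[X^2] = 1 for X ~ N(0, 1); the estimate should land close to 1.
print(mc_estimate(lambda x: x**2, rng.standard_normal, 10_000))

# Same recipe for a definite integral: int_a^b f = (b-a) * E_uniform[f],
# so int_0^pi sin(x) dx ~= ((b-a)/N) * sum_i sin(x^(i)), x^(i) ~ U[a, b].
a, b = 0.0, np.pi
x = rng.uniform(a, b, 10_000)
print((b - a) * np.mean(np.sin(x)))  # true value is 2
```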

Pages 4-13: MEAN & VARIANCE OF ESTIMATE

Let $I_N$ be the random variable denoting the estimate of the integral with N samples.

What is the bias (mean error) $E[I - I_N]$?
  $E[I - I_N] = I - E[I_N]$  (linearity of expectation)
  $= E[f(x)] - \frac{1}{N}\sum_{i=1}^N E[f(x^{(i)})]$  (definitions of $I$ and $I_N$)
  $= \frac{1}{N}\sum_{i=1}^N \left(E[f(x)] - E[f(x^{(i)})]\right) = \frac{1}{N}\sum_{i=1}^N 0$  ($x$ and $x^{(i)}$ are both distributed according to $p(x)$)
  $= 0$
So $I_N$ is an unbiased estimator.

What is the variance $Var[I_N]$?
  $Var[I_N] = Var\left[\frac{1}{N}\sum_{i=1}^N f(x^{(i)})\right]$  (definition)
  $= \frac{1}{N^2}\, Var\left[\sum_{i=1}^N f(x^{(i)})\right]$  (scaling of variance)
  $= \frac{1}{N^2} \sum_{i=1}^N Var[f(x^{(i)})]$  (variance of a sum of independent variables)
  $= \frac{1}{N}\, Var[f(x)]$  (i.i.d. samples)

Standard deviation: $O(1/\sqrt{N})$
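A quick empirical check of this result, again an illustrative sketch: repeating the estimator many times at increasing N, the measured standard deviation should shrink roughly as 1/sqrt(N), i.e., halve for every 4x increase in N.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: x**2  # E[f(X)] = 1 for X ~ N(0, 1)

for n in (100, 400, 1600, 6400):
    # Spread of the estimator across 2000 independent runs at this N
    estimates = [np.mean(f(rng.standard_normal(n))) for _ in range(2000)]
    print(n, np.std(estimates))  # each 4x in N roughly halves the std
```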

Pages 14-15: APPROXIMATE INFERENCE THROUGH SAMPLING

Unconditional simulation: to estimate the probability of a coin flipping heads, I can flip it a huge number of times and count the fraction of heads observed.

Conditional simulation: to estimate the probability P(H) that a coin picked out of bucket B flips heads, repeat for i = 1, ..., N:
  1. Pick a coin C out of a random bucket b(i) chosen with probability P(B)
  2. h(i) = flip C according to probability P(H | b(i))
  3. The sample (h(i), b(i)) comes from the distribution P(H, B)

The result approximates P(H, B); a minimal sketch follows below.
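Here is that procedure in Python; the bucket probabilities and per-bucket heads probabilities are made-up illustrative numbers, not from the slides.

```python
import random

P_BUCKET = {"b1": 0.3, "b2": 0.7}   # P(B): assumed bucket probabilities
P_HEADS = {"b1": 0.9, "b2": 0.4}    # P(H=heads | B): assumed per bucket

def sample_pair():
    # Step 1: pick a bucket b(i) with probability P(B)
    b = random.choices(list(P_BUCKET), weights=list(P_BUCKET.values()))[0]
    # Step 2: flip the coin according to P(H | b(i))
    h = 1 if random.random() < P_HEADS[b] else 0
    return h, b                      # step 3: (h(i), b(i)) ~ P(H, B)

samples = [sample_pair() for _ in range(100_000)]
print(sum(h for h, _ in samples) / len(samples))  # ~= 0.3*0.9 + 0.7*0.4 = 0.55
```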

Page 16: MONTE CARLO INFERENCE IN BAYES NETS

BN over variables X. Repeat for i = 1, ..., N:
  In top-down (topological) order, generate $x^{(i)}$ as follows:
    Sample $x_j^{(i)} \sim P(X_j \mid pa_{X_j}^{(i)})$
    (the RHS is obtained by plugging the sampled parent values into the CPT for $X_j$)

The sample set $x^{(1)}, \dots, x^{(N)}$ approximates the distribution over X.

Page 17: APPROXIMATE INFERENCE: MONTE CARLO SIMULATION

Sample from the joint distribution of the burglary network:
  Burglary -> Alarm <- Earthquake;  Alarm -> JohnCalls;  Alarm -> MaryCalls

P(B=1) = 0.001    P(E=1) = 0.002

B E | P(A=1|B,E)
T T | 0.95
T F | 0.94
F T | 0.29
F F | 0.001

A | P(J=1|A)      A | P(M=1|A)
T | 0.90          T | 0.70
F | 0.05          F | 0.01

Example sample: B=0, E=0, A=0, J=1, M=0
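A minimal Python sketch of this top-down (ancestral) sampler for the burglary network above; the dictionary-of-CPTs representation is an illustrative choice, not the course's.

```python
import random

P_B, P_E = 0.001, 0.002                                          # P(B=1), P(E=1)
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1|B,E)
P_J = {1: 0.90, 0: 0.05}                                         # P(J=1|A)
P_M = {1: 0.70, 0: 0.01}                                         # P(M=1|A)

def bern(p):
    """Draw 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

def sample_joint():
    """Generate one sample (b, e, a, j, m) in top-down order."""
    b = bern(P_B)
    e = bern(P_E)
    a = bern(P_A[(b, e)])  # CPT row selected by the sampled parent values
    j = bern(P_J[a])
    m = bern(P_M[a])
    return b, e, a, j, m

samples = [sample_joint() for _ in range(100_000)]
print(sum(s[2] for s in samples) / len(samples))  # empirical P(A=1)
```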

Page 18: APPROXIMATE INFERENCE: MONTE CARLO SIMULATION

As more samples are generated, the distribution of the samples approaches the joint distribution:
  B=0, E=0, A=0, J=1, M=0
  B=0, E=0, A=0, J=0, M=0
  B=0, E=0, A=0, J=0, M=0
  B=1, E=0, A=1, J=1, M=0

Page 19: BASIC METHOD FOR HANDLING EVIDENCE

Inference: given evidence E=e (e.g., J=1), approximate $P(X \setminus E \mid E = e)$.

Remove the samples that conflict with the evidence (here, those with J=0):
  B=0, E=0, A=0, J=1, M=0   (kept)
  B=0, E=0, A=0, J=0, M=0   (removed)
  B=0, E=0, A=0, J=0, M=0   (removed)
  B=1, E=0, A=1, J=1, M=0   (kept)

The distribution of the remaining samples approximates the conditional distribution; a sketch of this rejection step follows below.
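A sketch of the rejection step for the evidence J=1; the network definition from the previous sketch is repeated to keep the block self-contained.

```python
import random

P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
P_J = {1: 0.90, 0: 0.05}
P_M = {1: 0.70, 0: 0.01}
bern = lambda p: 1 if random.random() < p else 0

def sample_joint():
    b, e = bern(P_B), bern(P_E)
    a = bern(P_A[(b, e)])
    return b, e, a, bern(P_J[a]), bern(P_M[a])

# Keep only the samples consistent with the evidence J=1.
kept = [s for s in (sample_joint() for _ in range(1_000_000)) if s[3] == 1]
print(len(kept))                            # most samples get thrown away
print(sum(s[0] for s in kept) / len(kept))  # ~= P(B=1 | J=1)
```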

Page 20: RARE EVENT PROBLEM

What if some events are really rare (e.g., burglary AND earthquake)? The number of samples must be huge to get a reasonable estimate.

Solution: likelihood weighting
  Enforce that each sample agrees with the evidence.
  While generating a sample, keep track of the ratio
    w = (how likely the sampled value is to occur in the real world) / (how likely you were to generate the sampled value)

Pages 21-30: LIKELIHOOD WEIGHTING (WORKED EXAMPLE)

Suppose the evidence is Alarm=1 and MaryCalls=1 in the burglary network of Page 17 (same CPTs; the slides repeat the tables on every step). Sample B and E each with P=0.5; evidence variables are enforced rather than sampled, with the weight updated to reflect the likelihood that this occurs. Each sample starts at w=1.

Sample 1:
  B=0, E=1 sampled: w = (0.999 * 0.002) / (0.5 * 0.5) = 0.008
  A=1 enforced: w = 0.008 * P(A=1|B=0,E=1) = 0.008 * 0.29 = 0.0023
  M=1 enforced: w = 0.0023 * P(M=1|A=1) = 0.0023 * 0.7 = 0.0016; J=1 sampled freely from P(J|A=1)

Sample 2:
  B=0, E=0 sampled: w = (0.999 * 0.998) / (0.5 * 0.5) = 3.988
  A=1 enforced: w = 3.988 * P(A=1|B=0,E=0) = 3.988 * 0.001 = 0.004
  M=1 enforced: w = 0.004 * 0.7 = 0.0028; J=1 sampled

Sample 3:
  B=1, E=0 sampled, then A=1 enforced: w = ((0.001 * 0.998) / 0.25) * 0.94 = 0.00375
  M=1 enforced: w = 0.00375 * 0.7 = 0.0026; J=1 sampled

Sample 4:
  B=1, E=1 sampled, then A=1, M=1 enforced and J=1 sampled: w = 5e-7 (both rare priors make the weight negligible)

Page 31: LIKELIHOOD WEIGHTING

The four weighted samples:
  B=0, E=1, A=1, M=1, J=1   w = 0.0016
  B=0, E=0, A=1, M=1, J=1   w = 0.0028
  B=1, E=0, A=1, M=1, J=1   w = 0.0026
  B=1, E=1, A=1, M=1, J=1   w ~= 0

N=4 gives P(B|A,M) ~= 0.0026 / (0.0016 + 0.0028 + 0.0026) ~= 0.371
Exact inference gives P(B|A,M) = 0.375
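A Python sketch of likelihood weighting for this query. One deviation from the walk-through above: B and E are sampled from their true priors rather than with P=0.5, so the prior terms cancel out of the weight and only the evidence CPT entries remain; the estimate converges to the same posterior.

```python
import random

P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1|B,E)
P_M = {1: 0.70, 0: 0.01}                                         # P(M=1|A)
bern = lambda p: 1 if random.random() < p else 0

def weighted_sample():
    b, e = bern(P_B), bern(P_E)  # non-evidence: sampled, no weight term
    a = 1                        # evidence A=1 is enforced ...
    w = P_A[(b, e)]              # ... contributing P(A=1 | b, e) to the weight
    w *= P_M[a]                  # evidence M=1 contributes P(M=1 | A=1)
    return b, w                  # J would be sampled freely; not needed here

samples = [weighted_sample() for _ in range(1_000_000)]
num = sum(w for b, w in samples if b == 1)
den = sum(w for _, w in samples)
print(num / den)  # ~= P(B=1 | A=1, M=1); the slides report 0.375 exactly
```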

Page 32: ANOTHER RARE-EVENT PROBLEM

[Figure: a two-layer network with parents A1, A2, ..., A10 and children B1, B2, ..., B10]

B=b is given as evidence. Each observed value bi is rare given all but one setting of Ai (say, Ai=1).

The chance of sampling all 1's is very low, so most likelihood weights will be too low.

Problem: the evidence is not being used to sample the A's effectively (i.e., near P(Ai|b)).

Page 33: GIBBS SAMPLING

Idea: reduce the computational burden of sampling from a multidimensional distribution $P(x) = P(x_1, \dots, x_n)$ by doing repeated draws of individual attributes:
  Cycle through $j = 1, \dots, n$
  Sample $x_j \sim P(x_j \mid x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_n)$

Over the long run, the random walk taken by $x$ approaches the true distribution $P(x)$.

Page 34: GIBBS SAMPLING IN BNS

Each Gibbs sampling step: 1) pick a variable $X_i$, 2) sample $x_i \sim P(X_i \mid X \setminus X_i)$.

Look at the values of the "Markov blanket" of $X_i$:
  Parents $Pa_{X_i}$
  Children $Y_1, \dots, Y_k$
  Parents of children, excluding $X_i$: $Pa_{Y_1} \setminus X_i, \dots, Pa_{Y_k} \setminus X_i$

$X_i$ is independent of the rest of the network given its Markov blanket, so sample
  $x_i \sim P(X_i \mid Pa_{X_i}, Y_1, Pa_{Y_1} \setminus X_i, \dots, Y_k, Pa_{Y_k} \setminus X_i) = \frac{1}{Z}\, P(X_i \mid Pa_{X_i}) \, P(Y_1 \mid Pa_{Y_1}) \cdots P(Y_k \mid Pa_{Y_k})$

This is the product of $X_i$'s factor and the factors of its children; a sketch on the burglary network follows below.
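A minimal Gibbs sampler for the burglary network with evidence A=1, M=1, using exactly this Markov-blanket product: each non-evidence variable is resampled from its own CPT entry times the CPT entries of its children. Burn-in is ignored for simplicity; this is a sketch, not the course's reference implementation.

```python
import random

P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}  # P(A=1|B,E)
P_J = {1: 0.90, 0: 0.05}                                         # P(J=1|A)

def bern(p):
    return 1 if random.random() < p else 0

def gibbs(n_steps):
    a, m = 1, 1        # evidence: clamped, never resampled
    b, e, j = 0, 0, 0  # arbitrary initialization of non-evidence variables
    hits = 0
    for _ in range(n_steps):
        # P(B | blanket) is proportional to P(B) * P(a=1 | B, e):
        # B's own factor times the factor of its child A.
        p1 = P_B * P_A[(1, e)]
        p0 = (1 - P_B) * P_A[(0, e)]
        b = bern(p1 / (p0 + p1))
        # P(E | blanket) is proportional to P(E) * P(a=1 | b, E).
        p1 = P_E * P_A[(b, 1)]
        p0 = (1 - P_E) * P_A[(b, 0)]
        e = bern(p1 / (p0 + p1))
        # J's Markov blanket is just its parent A, so sample from its CPT.
        # M is evidence and, given A, affects neither B nor E.
        j = bern(P_J[a])
        hits += b
    return hits / n_steps

print(gibbs(500_000))  # ~= P(B=1 | A=1, M=1), after a long random walk
```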

Page 35: HANDLING EVIDENCE

Simply set each evidence variable to its appropriate value, and don't sample it.

The resulting walk approximates the distribution $P(X \setminus E \mid E = e)$.

This uses evidence more efficiently than likelihood weighting.

Page 36: GIBBS SAMPLING ISSUES

Demonstrating correctness and convergence requires examining the Markov chain random walk (more later).

Many steps may be needed before the effects of poor initialization wear off (the mixing time), and it is difficult to tell a priori how many are needed.

Numerous variants exist, known collectively as Markov chain Monte Carlo (MCMC) techniques.

Page 37: NEXT TIME

Continuous and hybrid distributions