UIUC CS 497: Section EA Lecture #7 Reasoning in Artificial Intelligence Professor: Eyal Amir Spring…

UIUC CS 497: Section EALecture #7

Reasoning in Artificial IntelligenceProfessor: Eyal Amir

Spring Semester 2004

(Based on slides by Gal Elidan (Hebrew U))

Last Time

• Probabilistic graphical models• Exact reasoning

– Variable elimination– Junction tree algorithm

• Applications (of Bayes Networks):– Sensor networks, medical diagnosis,

computer diagnosis (MS Windows), classification

Approximate Inference

• Large treewidth– Large, highly connected graphical models– Treewidth may be large (>40) in sparse networks

• In many applications, approximation are sufficient– Example: P(X = x|e) = 0.3183098861838– Maybe P(X = x|e) 0.3 is a good enough

approximation– e.g., we take action only if P(X = x|e) > 0.5

Today: Approximate reasoning via sampling

1. Monte Carlo techniques1. Rejection sampling2. Likelihood weighting3. Importance sampling

2. Markov Chain Monte Carlo (MCMC)1. Gibbs sampling2. Metropolis-Hastings

3. Applications du jour: ?

Types of ApproximationsAbsolute error• An estimate q of P(X = x | e) has

absolute error , ifP(X = x|e) - q P(X = x|e) +

equivalentlyq - P(X = x|e) q +

• Not always what we want: error 0.001– Unacceptable if P(X = x | e) = 0.0001,– Overly precise if P(X = x | e) = 0.3 0

1

q2

Types of ApproximationsRelative error• An estimate q of P(X = x | e) has

relative error , ifP(X = x|e)(1-) q P(X = x|e)

(1+)equivalentlyq/(1+) P(X = x|e) q/(1-)

• Sensitivity of approximation depends on actual value of desired result

0

1

q

q/(1+)

q/(1-)

Complexity

• Recall, exact inference is NP-hard• Is approximate inference any easier?

• Construction for exact inference:– Input: a 3-SAT problem – Output: a BN such that P(X=t) > 0 iff is

satisfiable

Complexity: Relative Error

• Suppose that q is a relative error estimate of P(X = t),

• If is not satisfiable, thenP(X = t)(1 - ) q P(X = t)(1 + )0 = P(X = t)(1 - ) q P(X = t)(1 + ) = 0

Thus, if q > 0, then is satisfiable

An immediate consequence:

Thm: Given , finding an -relative error approximation is NP-hard

Complexity: Absolute error

• Thm: If < 0.5, then finding an estimate of P(X=x|e) with absulote error approximation is NP-Hard

Search Algorithms

Idea: search for high probability instances• Suppose x[1], …, x[N] are instances with

high mass• We can approximate:

• If x[i] is a complete instantiation, then P(e|x[i]) is 0 or 1

i

i

[i])[i])P(|P

[i])[i])P(|yYP)yYP

xxe

xxee (

,(|(

Search Algorithms (cont)

• Instances that do not satisfy e, do not play a role in approximation

• We need to focus the search to find instances that do satisfy e

• Clearly, in some cases this is hard (e.g., the construction from our NP-hardness result

i

i

[i])[i])P(|P

[i])[i])P(|yYP)yYP

xxe

xxee (

,(|(

Stochastic Simulation

• Suppose we can sample instances <x1,…,xn> according to P(X1,…,Xn)

• What is the probability that a random sample <x1,…,xn> satisfies e?– This is exactly P(e)

• We can view each sample as tossing a biased coin with probability P(e) of “Heads”

Stochastic Sampling

• Intuition: given a sufficient number of samples x[1],…,x[N], we can estimate

• Law of large number implies that as N grows, our estimate will converge to p whp

• How many samples do we need to get a reliable estimation?Use Chernof’s bound for binomial distributions

N

[i])|P

NHeads)P i

xee

(#(

Sampling a Bayesian Network

• If P(X1,…,Xn) is represented by a Bayesian network, can we efficiently sample from it?

• Idea: sample according to structure of the network– Write distribution using the chain rule, and

then sample each variable given its parents

Samples:

B E A C R

Logic samplingP(b) 0.03

P(e) 0.001

P(a)b e b e b e b e0.98 0.40.7 0.01

P(c)a a

0.8 0.05

P(r)e e

0.3 0.001

b

Earthquake

Radio

Burglary

Alarm

Call

0.03

Samples:

B E A C R


P(e) 0.001

P(a)

b e b e b e b e0.98 0.40.7 0.01

P(c)

a a0.8 0.05

P(r)

e e0.3 0.001

eb

Earthquake

Radio

Burglary

Alarm

Call

0.001

Samples:

B E A C R


P(e) 0.001

P(a)

b e b e b e b e0.98 0.40.7 0.01

P(c)

a a0.8 0.05

P(r)e e

0.3 0.001

e ab

0.4

Earthquake

Radio

Burglary

Alarm

Call

Samples:

B E A C R


P(e) 0.001

P(a)

b e b e b e b e0.98 0.40.7 0.01

P(c)

a a0.8 0.05

P(r)

e e0.3 0.001

e a cb

Earthquake

Radio

Burglary

Alarm

Call

0.8

Samples:

B E A C R


P(e) 0.001

P(a)b e b e b e b e0.98 0.40.7 0.01

P(c)a a

0.8 0.05

P(r)e e

0.3 0.001

e a cb r

0.3

Earthquake

Radio

Burglary

Alarm

Call

Logic Sampling

• Let X1, …, Xn be order of variables consistent with arc direction

• for i = 1, …, n do– sample xi from P(Xi | pai )– (Note: since Pai {X1,…,Xi-1}, we already

assigned values to them)

• return x1, …,xn

Logic Sampling

• Sampling a complete instance is linear in number of variables–Regardless of structure of the network

• However, if P(e) is small, we need many samples to get a decent estimate

Can sample from P(X1,…,Xn |e)?

• If evidence is in roots of network, easily• If evidence is in leaves of network, we have

a problem– Our sampling method proceeds according to

order of nodes in graph• Note, we can use arc-reversal to make

evidence nodes root. – In some networks, however, this will create

exponentially large tables...

Likelihood Weighting

• Can we ensure that all of our sample satisfy e?

• One simple solution:–When we need to sample a variable that is

assigned value by e, use the specified value• For example: we know Y = 1

–Sample X from P(X)–Then take Y = 1

• Is this a sample from P(X,Y |Y = 1) ?

X Y


• Problem: these samples of X from P(X)• Solution:

– Penalize samples in which P(Y=1|X) is small• We now sample as follows:

– Let x[i] be a sample from P(X)– Let w[i] be P(Y = 1|X = x [i])

X Y

i

i

iw

[i])x|XPiw)xXP ][

(][1|(

xY


• Why does this make sense?• When N is large, we expect to sample

NP(X = x) samples with x[i] = x• Thus,

• When we normalize, we get approximation of the conditional probability

)1,(

)|1()(][,

YxXNP

xXYPxXNPwxixi

i

Samples:

B E A C R

Likelihood WeightingP(b) 0.03

P(e) 0.001

P(a)b e b e b e b e0.98 0.40.7 0.01

P(c)a

0.8 0.05

P(r)e e

0.3 0.001

b

Earthquake

Radio

Burglary

Alarm

Call

0.03

Weight

= r

a

= a

Samples:

B E A C R


P(e) 0.001

P(a)

b e b e b e b e0.98 0.40.7 0.01

P(c)

a a0.8 0.05

P(r)

e e0.3 0.001

eb

Earthquake

Radio

Burglary

Alarm

Call

0.001

Weight

= r = a

Samples:

B E A C R


P(e) 0.001

P(a)

b e b e b e b e0.98 0.40.7 0.01

P(c)

a a0.8 0.05

P(r)e e

0.3 0.001

eb

0.4

Earthquake

Radio

Burglary

Alarm

Call

Weight

= r = a

0.6a

Samples:

B E A C R


P(e) 0.001

P(a)

b e b e b e b e0.98 0.40.7 0.01

P(c)

a a0.8 0.05

P(r)

e e0.3 0.001

e cb

Earthquake

Radio

Burglary

Alarm

Call

0.05Weight

= r = a

a 0.6

Samples:

B E A C R


P(e) 0.001

P(a)b e b e b e b e0.98 0.40.7 0.01

P(c)a a

0.8 0.05

P(r)e e

0.3 0.001

e cb r

0.3

Earthquake

Radio

Burglary

Alarm

Call

Weight

= r = a

a 0.6*0.3


• Let X1, …, Xn be order of variables consistent with arc direction

• w = 1• for i = 1, …, n do

–if Xi = xi has been observed• w w* P(Xi = xi | pai )

–else• sample xi from P(Xi | pai )

• return x1, …,xn, and w

Importance Sampling

• A method for evaluating expectation of f under P(x), <f>P(X)

• Discrete:

• Continuous:• If we could sample from P

dxxPxff

xPxff

XP

xXP

)()(

)()(

)(

)(

r

XP rxfR

f ])[(1)(

Importance SamplingA general method for evaluating <f>P(X) when we cannot sample from P(X).Idea: Choose an approximating distribution

Q(X) and sample from it

Using this we can now sample from Q and then

x XQx

XP XQXPxfdx

XQXQxPxfdxxPxfxf

)()( )(

)()()()()()()()()(

W(X)

M

m

M

mXP

mwmxfM

mXfM

xf1 1

)( )(])[(1])[(1)(

If we could generate samples from P(X)

Now that we generate the samples from Q(X)

(Unnormalized) Importance Sampling1. For m=1:M

Sample X[m] from Q(X)Calculate W(m) = P(X)/Q(X)

2. Estimate the expectation of f(X) using

Requirements: P(X)>0 Q(X)>0 (don’t ignore possible scenarios) Possible to calculate P(X),Q(X) for a specific X=x It is possible to sample from Q(X)

M

mXP

mwmxfM

xf1

)( )(])[(1)(

Normalized Importance SamplingAssume that we cannot evalute P(X=x) but can evaluate P’(X=x) = P(X=x)(for example we can evaluate P(X) but not P(X|e) in a Bayesian network)

We define w’(X) = P’(X)/Q(X). We can then evaluate :

and then:

In the last step we simply replace with the above equation

xx

XQαxP

XQXPXQXw )(')()(')()(' )(

)(

)()(

)(

)(')(')(

)(')(1)()()(')(1

)()()()()()()(

XQ

XQXQ

x

xxXP

XwXwXf

XwXfα

dxXQXQxPxf

α

dxXQXQxPxfdxxPxfxf

Normalized Importance SamplingWe can now estimate the expectation of f(X) similarly to unnormalized importance sampling by sampling x[m] from Q(X) and then

(hence the name “normalized”)

M

m

M

mXP

mw

mwmxfxf

1

1)(

)('

)('])[()(

Importance Sampling Weaknesses

• Important to choose sampling distribution with heavy tails– Not to “miss” large values of f

• Many-dimensional I-S:– “Typical set” of P may take a long time to find,

unless Q good approximation to P– Weights vary by factors exponential in N

• Similar for Likelihood Weighting

Today: Approximate Reasoning via Sampling

1. Monte Carlo techniques1. Rejection sampling2. Likelihood weighting3. Importance sampling

2. Markov Chain Monte Carlo (MCMC)1. Gibbs sampling2. Metropolis-Hastings

3. Applications du jour: ?

Stochastic Sampling

• Previously: independent samples to estimate P(X = x |e )

• Problem: difficult to sample from P(X1, …. Xn |e )

• We had to use likelihood weighting – Introduces bias in estimation

• In some case, such as when the evidence is on leaves, these methods are inefficient

MCMC Methods

• Sampling methods that are based on Markov Chain– Markov Chain Monte Carlo (MCMC) methods

• Key ideas:– Sampling process as a Markov Chain

• Next sample depends on the previous one– Approximate any posterior distribution

• Next: review theory of Markov chains

Markov Chains• Suppose X1, X2, … take some set of values

– wlog. These values are 1, 2, ...• A Markov chain is a process that corresponds

to the network:

• To quantify the chain, we need to specify– Initial probability: P(X1)– Transition probability: P(Xt+1|Xt)

• A Markov chain has stationary transition probability: P(Xt+1|Xt) same for all times t

X1 X2 X3 Xn... ...

Irreducible Chains

• A state j is accessible from state i if there is an n such that P(Xn = j | X1 = i) > 0– There is a positive probability of reaching j

from i after some number steps

• A chain is irreducible if every state is accessible from every state

Ergodic Chains

• A state is positively recurrent if there is a finite expected time to get back to state i after being in state i – If X has finite number of states, then this is

suffices that i is accessible from itself

• A chain is ergodic if it is irreducible and every state is positively recurrent

(A)periodic Chains

• A state i is periodic if there is an integer d such that when n is not divisible by d

P(Xn = i | X1 = i ) = 0• Intuition: only every d steps state i may

occur • A chain is aperiodic if it contains no

periodic state

Stationary Probabilities

Thm:• If a chain is ergodic and aperiodic, then the

limit

exists, and does not depend on i• Moreover, let

then, P*(X) is the unique probability satisfying

)|(lim 1 iXXP nn

)|(lim)( 1* iXjXPjXP nn

i

tt iXPiXjXPjXP )()|()( *1

*

Stationary Probabilities

• The probability P*(X) is the stationary probability of the process

• Regardless of the starting point, the process will converge to this probability

• The rate of convergence depends on properties of the transition probability

Sampling from the stationary probability

• This theory suggests how to sample from the stationary probability:– Set X1 = i, for some random/arbitrary i– For t = 1, 2, …, n

• Sample a value xt+1 for Xt+1 from P(Xt+1|Xt=xt)– return xn

• If n is large enough, then this is a sample from P*(X)

Designing Markov Chains

• How do we construct the right chain to sample from?– Ensuring aperiodicity and irreducibility is

usually easy

• Problem is ensuring the desired stationary probability

Designing Markov Chains

Key tool:• If the transition probability satisfies

then, P*(X) = Q(X)• This gives a local criteria for checking that

the chain will have the right stationary distribution

0)|1(whenever)()(

)|()|(

11

itXjtXPiXQjXQ

jXiXPiXjXP

tt

tt

MCMC Methods

• We can use these results to sample from P(X1,…,Xn|e)

Idea:• Construct an ergodic & aperiodic Markov

Chain such that P*(X1,…,Xn) = P(X1,…,Xn|e)

• Simulate the chain n steps to get a sample

MCMC Methods

Notes:• The Markov chain variable Y takes as

value assignments to all variables that are consistent evidence

• For simplicity, we will denote such a state using the vector of variables

}satisfy,...,|)()(,...,{)( 1111 enn xxXVXVxxYV

Gibbs Sampler

• One of the simplest MCMC method• Each transition changes the state of one Xi

• The transition probability defined by P itself as a stochastic procedure:– Input: a state x1,…,xn – Choose i at random (uniform probability)– Sample x’i from P(Xi|x1, …, xi-1, xi+1 ,…, xn, e)– let x’j = xj for all j i– return x’1,…,x’n

Correctness of Gibbs Sampler

• How do we show correctness?

Correctness of Gibbs Sampler

• By chain ruleP(x1,…,xi-1, xi, xi+1,…,xn|e) =P(x1,…,xi-1, xi+1,…,xn|e)P(xi|x1,…,xi-1, xi+1,…,xn, e)

• Thus, we get

• Since we choose i from the same distribution at each stage, this procedure satisfies the ratio criteria

),,,,,,|'(),,,,,,|(

)|,,,',,,()|,,,,,,(

111111

111111

ee

ee

niii

niii

niii

niiixxxxxPxxxxxP

xxxxxPxxxxxP

Transition

Gibbs Sampling for Bayesian Network

• Why is the Gibbs sampler “easy” in BNs?• Recall that the Markov blanket of a variable

separates it from the other variables in the network– P(Xi | X1,…,Xi-1,Xi+1,…,Xn) = P(Xi | Mbi )

• This property allows us to use local computations to perform sampling in each transition

Gibbs Sampling in Bayesian Networks

• How do we evaluate P(Xi | x1,…,xi-1,xi+1,…,xn) ?

• Let Y1, …, Yk be the children of Xi

– By definition of Mbi, the parents of Yj are in Mbi{Xi}

• It is easy to show that

i

j

j

x jyjii

jyjii

ii payPPaxP

payPPaxPMbxP

'

)|()|'(

)|()|()|(

Metropolis-Hastings

• More general than Gibbs (Gibbs is a special case of M-H)

• Proposal distribution arbitrary q(x’|x) that is ergodic and aperiodic (e.g., uniform)

• Transition to x’ happens with probability(x’|x)=min(1, P(x’)q(x|x’)/P(x)q(x’|x))

• Useful when computing P(x) infeasible• q(x’|x)=0 implies P(x’)=0 or q(x|x’)=0

Sampling Strategy

• How do we collect the samples?Strategy I:• Run the chain M times, each for N steps

– each run starts from a different state points• Return the last state in each run

M chains

Sampling Strategy

Strategy II:• Run one chain for a long time• After some “burn in” period, sample points

every some fixed number of steps

“burn in” M samples from one chain

Comparing StrategiesStrategy I:

– Better chance of “covering” the space of pointsespecially if the chain is slow to reach stationarity

– Have to perform “burn in” steps for each chain

Strategy II:– Perform “burn in” only once– Samples might be correlated (although only weakly)

Hybrid strategy: – Run several chains, sample few times each– Combines benefits of both strategies

Summary: Approximate Inference

• Monte Carlo (sampling with positive and negative error) Methods:– Pos: Simplicity of implementation and theoretical

guarantee of convergence– Neg: Can be slow to converge and hard to diagnose

their convergence.• Variational Methods – Your presentation• Loopy Belief Propagation and Generalized Belief

Propagation -- Your presentation

Next Time

• Combining Probabilities with Relations and Objects

THE END

Example: Naïve Bayesian Model

• A common model in early diagnosis:– Symptoms are conditionally independent given the disease (or

fault)• Thus, if

– X1,…,Xp denote whether the symptoms exhibited by the patient (headache, high-fever, etc.) and

– H denotes the hypothesis about the patients health

• then, P(X1,…,Xp,H) = P(H)P(X1|H)…P(Xp|H),

• This naïve Bayesian model allows compact representation– It does embody strong independence assumptions

Elimination on Trees

• Formally, for any tree, there is an elimination ordering with induced width = 1

Thm• Inference on trees is linear in number of

variables

Importance Sampling to LWWe want to compute P(Y=y|e)? (X is the set of random variables in the network and Y is some subset we are interested in)

1) Define a mutilated Bayesian network BZ=z to be a

network where:• all variables in Z are disconnected from their

parents and are deterministically set to z• all other variables remain unchanged

2) Choose Q to be BE=e

convince yourself that P’(X)/Q(X) is exactly P(Y=y|X)

3) Choose f(x) to be 1(Y[m]=y)/M4) Plug into the formula and you get exactly Likelihood Weighting Likelihood weighting is correct!!!

A Word of Caution

• Deterministic nodes– Not ergodic in the simple sense– M-H cannot be used

UIUC CS 497: Section EA Lecture #7 Reasoning in Artificial Intelligence Professor: Eyal Amir Spring…

Documents