UIUC CS 497: Section EA Lecture #7 Reasoning in Artificial Intelligence Professor: Eyal Amir Spring Semester 2004 (Based on slides by Gal Elidan (Hebrew U))
Jan 20, 2018
Last Time
• Probabilistic graphical models
• Exact reasoning
  – Variable elimination
  – Junction tree algorithm
• Applications (of Bayes Networks):
  – Sensor networks, medical diagnosis, computer diagnosis (MS Windows), classification
Approximate Inference
• Large treewidth
  – Large, highly connected graphical models
  – Treewidth may be large (>40) in sparse networks
• In many applications, approximations are sufficient
  – Example: P(X = x|e) = 0.3183098861838
  – Maybe P(X = x|e) ≈ 0.3 is a good enough approximation
  – e.g., we take action only if P(X = x|e) > 0.5
Today: Approximate reasoning via sampling
1. Monte Carlo techniques
   1. Rejection sampling
   2. Likelihood weighting
   3. Importance sampling
2. Markov Chain Monte Carlo (MCMC)
   1. Gibbs sampling
   2. Metropolis-Hastings
3. Applications du jour: ?
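Types of Approximations
Absolute error
• An estimate q of P(X = x | e) has absolute error ε if
    $P(X = x \mid e) - \epsilon \le q \le P(X = x \mid e) + \epsilon$
  equivalently
    $q - \epsilon \le P(X = x \mid e) \le q + \epsilon$
• Not always what we want: error 0.001
  – Unacceptable if P(X = x | e) = 0.0001
  – Overly precise if P(X = x | e) = 0.3
[Figure: number line from 0 to 1 with an interval of width 2ε around q]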
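Types of Approximations
Relative error
• An estimate q of P(X = x | e) has relative error ε if
    $P(X = x \mid e)(1 - \epsilon) \le q \le P(X = x \mid e)(1 + \epsilon)$
  equivalently
    $q / (1 + \epsilon) \le P(X = x \mid e) \le q / (1 - \epsilon)$
• Sensitivity of approximation depends on actual value of desired result
[Figure: number line from 0 to 1 with the interval from q/(1+ε) to q/(1-ε) around q]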
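The two error definitions can be compared with a small sketch. The function names are hypothetical; the numbers reuse the values quoted on the slides (error 0.001, P = 0.0001, and the 0.3183... example).

```python
def within_absolute_error(q, p, eps):
    """True if p - eps <= q <= p + eps."""
    return p - eps <= q <= p + eps


def within_relative_error(q, p, eps):
    """True if p * (1 - eps) <= q <= p * (1 + eps)."""
    return p * (1 - eps) <= q <= p * (1 + eps)


# The estimate q = 0 is within absolute error 0.001 of p = 0.0001,
# although it carries no information about p.
print(within_absolute_error(0.0, 0.0001, 0.001))
# No relative error below 1 accepts q = 0 when p > 0.
print(within_relative_error(0.0, 0.0001, 0.5))
# q = 0.3 approximates 0.3183... to within 10% relative error.
print(within_relative_error(0.3, 0.3183098861838, 0.1))
```

The second call is the observation behind the hardness result for relative error: a relative-error estimate is positive exactly when the true probability is positive.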
Complexity
• Recall, exact inference is NP-hard
• Is approximate inference any easier?
• Construction for exact inference:
  – Input: a 3-SAT problem φ
  – Output: a BN such that P(X = t) > 0 iff φ is satisfiable
Complexity: Relative Error
• Suppose that q is a relative error estimate of P(X = t)
• If φ is not satisfiable, then P(X = t) = 0, so
    $0 = P(X = t)(1 - \epsilon) \le q \le P(X = t)(1 + \epsilon) = 0$
  Thus, if q > 0, then φ is satisfiable
An immediate consequence:
Thm: Given ε, finding an ε-relative error approximation is NP-hard
Complexity: Absolute error
• Thm: If ε < 0.5, then finding an estimate of P(X = x|e) with absolute error ε is NP-hard
Search Algorithms
Idea: search for high probability instances
• Suppose x[1], …, x[N] are instances with high mass
• We can approximate:
    $P(Y = y \mid e) \approx \frac{\sum_i P(Y = y, e \mid x[i])\, P(x[i])}{\sum_i P(e \mid x[i])\, P(x[i])}$
• If x[i] is a complete instantiation, then P(e|x[i]) is 0 or 1
Search Algorithms (cont)
• Instances that do not satisfy e do not play a role in the approximation
• We need to focus the search to find instances that do satisfy e
• Clearly, in some cases this is hard (e.g., the construction from our NP-hardness result)
    $P(Y = y \mid e) \approx \frac{\sum_i P(Y = y, e \mid x[i])\, P(x[i])}{\sum_i P(e \mid x[i])\, P(x[i])}$
Stochastic Simulation
• Suppose we can sample instances <x1,…,xn> according to P(X1,…,Xn)
• What is the probability that a random sample <x1,…,xn> satisfies e?
  – This is exactly P(e)
• We can view each sample as tossing a biased coin with probability P(e) of “Heads”
Stochastic Sampling
• Intuition: given a sufficient number of samples x[1],…,x[N], we can estimate
    $P(e) \approx \frac{\#(\text{Heads})}{N} = \frac{\sum_i P(e \mid x[i])}{N}$
• The law of large numbers implies that as N grows, our estimate will converge to P(e) with high probability
• How many samples do we need to get a reliable estimate?
  – Use Chernoff's bound for binomial distributions
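As a rough illustration of the sample-count question, the sketch below uses the additive Hoeffding form of the bound, N ≥ ln(2/δ) / (2ε²), which is one common variant and not necessarily the exact form intended on the slide. The function names and the values of ε, δ and p are assumptions chosen for illustration.

```python
import math
import random


def hoeffding_samples(eps, delta):
    """Number of tosses sufficient for |estimate - p| <= eps with probability >= 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def estimate_heads(p, n, rng):
    """Fraction of n tosses of a coin with P(Heads) = p that come up Heads."""
    return sum(rng.random() < p for _ in range(n)) / n


n = hoeffding_samples(eps=0.05, delta=0.01)
rng = random.Random(0)
print(n, estimate_heads(0.3, n, rng))
```

Note that this bound is on absolute error, so the required N does not depend on P(e). For a relative error guarantee the count grows as P(e) shrinks.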
Sampling a Bayesian Network
• If P(X1,…,Xn) is represented by a Bayesian network, can we efficiently sample from it?
• Idea: sample according to structure of the network
  – Write distribution using the chain rule, and then sample each variable given its parents
Logic sampling
[Figure: Bayesian network with nodes Burglary (B), Earthquake (E), Alarm (A), Call (C), Radio (R); B and E are parents of A, A is the parent of C, E is the parent of R]
P(b) = 0.03
P(e) = 0.001
P(a | B, E):   b, e: 0.98    b, ¬e: 0.7    ¬b, e: 0.4    ¬b, ¬e: 0.01
P(c | A):      a: 0.8    ¬a: 0.05
P(r | E):      e: 0.3    ¬e: 0.001
Samples (one variable at a time, in the order B, E, A, C, R):
  1. Sample B using P(b) = 0.03
  2. Sample E using P(e) = 0.001
  3. Sample A using the entry 0.4 of P(a | B, E)
  4. Sample C using P(c | a) = 0.8
  5. Sample R using P(r | e) = 0.3
Resulting sample: ¬b, e, a, c, r
Logic Sampling
• Let X1, …, Xn be an order of the variables consistent with arc direction
• for i = 1, …, n do
  – sample xi from P(Xi | pai)
  – (Note: since Pai ⊆ {X1,…,Xi-1}, we already assigned values to them)
• return x1, …, xn
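The procedure above can be sketched for the burglary network used in the example slides. The CPT numbers are those shown on the slides; which parent configuration each of the four P(a | B, E) entries belongs to is an assumption here, and the function names are hypothetical.

```python
import random

# CPT values from the slides. The mapping of the four P(a | B, E) entries to
# (B, E) configurations is an assumption.
P_B = 0.03
P_E = 0.001
P_A = {(True, True): 0.98, (True, False): 0.7,
       (False, True): 0.4, (False, False): 0.01}
P_C = {True: 0.8, False: 0.05}    # P(c | A)
P_R = {True: 0.3, False: 0.001}   # P(r | E)


def logic_sample(rng):
    """Draw one complete instance, sampling each variable after its parents."""
    b = rng.random() < P_B
    e = rng.random() < P_E
    a = rng.random() < P_A[(b, e)]
    c = rng.random() < P_C[a]
    r = rng.random() < P_R[e]
    return {"B": b, "E": e, "A": a, "C": c, "R": r}


def estimate(event, n, rng):
    """Fraction of n independent samples for which event(sample) holds."""
    return sum(event(logic_sample(rng)) for _ in range(n)) / n


rng = random.Random(0)
print(estimate(lambda s: s["C"], 50_000, rng))
```

Each sample costs one random draw per variable, whatever the network structure. Estimating the probability of a rare event such as R = r would need far more samples than estimating P(c).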
Logic Sampling
• Sampling a complete instance is linear in number of variables
  – Regardless of structure of the network
• However, if P(e) is small, we need many samples to get a decent estimate
Can we sample from P(X1,…,Xn | e)?
• If evidence is in roots of network, easily
• If evidence is in leaves of network, we have a problem
  – Our sampling method proceeds according to order of nodes in graph
• Note, we can use arc-reversal to make evidence nodes roots
  – In some networks, however, this will create exponentially large tables...
Likelihood Weighting
• Can we ensure that all of our samples satisfy e?
• One simple solution:
  – When we need to sample a variable that is assigned a value by e, use the specified value
• For example: we know Y = 1
  – Sample X from P(X)
  – Then take Y = 1
• Is this a sample from P(X,Y | Y = 1)?
[Figure: two-node network X → Y]
Likelihood Weighting
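• Problem: these are samples of X from P(X), not from P(X | Y = 1)
• Solution:
  – Penalize samples in which P(Y = 1|X) is small
• We now sample as follows:
  – Let x[i] be a sample from P(X)
  – Let w[i] be P(Y = 1|X = x[i])
    $P(X = x \mid Y = 1) \approx \frac{\sum_i w[i]\, P(X = x \mid x[i])}{\sum_i w[i]}$
  where P(X = x | x[i]) is 1 if x[i] = x and 0 otherwise
[Figure: two-node network X → Y]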
Likelihood Weighting
• Why does this make sense?
• When N is large, we expect to sample N·P(X = x) samples with x[i] = x
• Thus,
    $\sum_{i:\, x[i] = x} w[i] \approx N \cdot P(X = x)\, P(Y = 1 \mid X = x) = N \cdot P(X = x, Y = 1)$
• When we normalize, we get an approximation of the conditional probability
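A minimal sketch of this argument on the two-node network X → Y. The prior and the conditional table are hypothetical numbers chosen for illustration, not values from the slides.

```python
import random

P_X1 = 0.2                        # hypothetical prior P(X = 1)
P_Y1_GIVEN_X = {0: 0.1, 1: 0.9}   # hypothetical P(Y = 1 | X = x)


def lw_posterior_x1(n, rng):
    """Estimate P(X = 1 | Y = 1) from n samples of X drawn from the prior P(X)."""
    weighted_hits = 0.0
    total_weight = 0.0
    for _ in range(n):
        x = 1 if rng.random() < P_X1 else 0
        w = P_Y1_GIVEN_X[x]       # weight w[i] = P(Y = 1 | X = x[i])
        total_weight += w
        if x == 1:
            weighted_hits += w
    return weighted_hits / total_weight


# Exact answer by Bayes' rule, for comparison
exact = P_X1 * P_Y1_GIVEN_X[1] / (
    P_X1 * P_Y1_GIVEN_X[1] + (1 - P_X1) * P_Y1_GIVEN_X[0])
print(lw_posterior_x1(50_000, random.Random(0)), exact)
```

Without the weights, the same loop would return an estimate of the prior P(X = 1) instead of the posterior.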
Likelihood Weighting
[Figure: Bayesian network with nodes Burglary (B), Earthquake (E), Alarm (A), Call (C), Radio (R); B and E are parents of A, A is the parent of C, E is the parent of R]
Evidence: A = ¬a, R = r
P(b) = 0.03
P(e) = 0.001
P(a | B, E):   b, e: 0.98    b, ¬e: 0.7    ¬b, e: 0.4    ¬b, ¬e: 0.01
P(c | A):      a: 0.8    ¬a: 0.05
P(r | E):      e: 0.3    ¬e: 0.001
Samples and weight (one variable at a time, in the order B, E, A, C, R):
  1. Sample B using P(b) = 0.03
  2. Sample E using P(e) = 0.001
  3. A is observed: set A = ¬a; the entry of P(a | B, E) is 0.4, so Weight = 0.6
  4. Sample C using P(c | ¬a) = 0.05
  5. R is observed: set R = r; P(r | e) = 0.3, so Weight = 0.6 * 0.3
Likelihood Weighting
• Let X1, …, Xn be an order of the variables consistent with arc direction
• w = 1
• for i = 1, …, n do
  – if Xi = xi has been observed
    • w ← w · P(Xi = xi | pai)
  – else
    • sample xi from P(Xi | pai)
• return x1, …, xn, and w
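The algorithm above can be sketched on the burglary network from the example slides. The encoding of the network as a list, the function names, and the assignment of the four P(a | B, E) entries to parent configurations are assumptions made for this illustration.

```python
import random

# (name, parents, table of P(name = True | parent values)), parents listed first.
# The mapping of the P(a | B, E) entries to configurations is an assumption.
NETWORK = [
    ("B", (), {(): 0.03}),
    ("E", (), {(): 0.001}),
    ("A", ("B", "E"), {(True, True): 0.98, (True, False): 0.7,
                       (False, True): 0.4, (False, False): 0.01}),
    ("C", ("A",), {(True,): 0.8, (False,): 0.05}),
    ("R", ("E",), {(True,): 0.3, (False,): 0.001}),
]


def weighted_sample(evidence, rng):
    """Return (sample, w): observed variables are clamped, the rest are sampled."""
    sample, w = {}, 1.0
    for name, parents, cpt in NETWORK:
        p_true = cpt[tuple(sample[p] for p in parents)]
        if name in evidence:
            sample[name] = evidence[name]
            w *= p_true if evidence[name] else 1.0 - p_true
        else:
            sample[name] = rng.random() < p_true
    return sample, w


def lw_query(var, evidence, n, rng):
    """Estimate P(var = True | evidence) as a ratio of weight sums."""
    num = den = 0.0
    for _ in range(n):
        sample, w = weighted_sample(evidence, rng)
        den += w
        if sample[var]:
            num += w
    return num / den


print(weighted_sample({"A": False, "R": True}, random.Random(0)))
print(lw_query("A", {"C": True}, 100_000, random.Random(0)))
```

With the evidence A = ¬a, R = r from the example, the weight of a sample is the product of one P(¬a | B, E) entry and one P(r | E) entry, for instance 0.6 * 0.3 when the sample has ¬b and e.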
Importance Sampling
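• A method for evaluating the expectation of f under P(X), <f>_P(X)
• Discrete:
    $\langle f \rangle_{P(X)} = \sum_x f(x) P(x)$
• Continuous:
    $\langle f \rangle_{P(X)} = \int f(x) P(x)\, dx$
• If we could sample from P:
    $\langle f \rangle_{P(X)} \approx \frac{1}{R} \sum_r f(x[r])$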
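Importance Sampling
A general method for evaluating <f>_P(X) when we cannot sample from P(X).
Idea: Choose an approximating distribution Q(X) and sample from it
    $\langle f(x) \rangle_{P(X)} = \int f(x) P(x)\, dx = \int f(x) \frac{P(x)}{Q(x)} Q(x)\, dx = \left\langle f(X) \frac{P(X)}{Q(X)} \right\rangle_{Q(X)}$
  where W(X) = P(X)/Q(X)
Using this we can now sample from Q and then estimate:
  If we could generate samples from P(X):
    $\langle f(x) \rangle_{P(X)} \approx \frac{1}{M} \sum_{m=1}^{M} f(x[m])$
  Now that we generate the samples from Q(X):
    $\langle f(x) \rangle_{P(X)} \approx \frac{1}{M} \sum_{m=1}^{M} f(x[m])\, w(m)$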
(Unnormalized) Importance Sampling
1. For m = 1:M
   – Sample x[m] from Q(X)
   – Calculate w(m) = P(x[m])/Q(x[m])
2. Estimate the expectation of f(X) using
    $\langle f(x) \rangle_{P(X)} \approx \frac{1}{M} \sum_{m=1}^{M} f(x[m])\, w(m)$
Requirements:
• P(X) > 0 ⇒ Q(X) > 0 (don’t ignore possible scenarios)
• Possible to calculate P(X), Q(X) for a specific X = x
• It is possible to sample from Q(X)
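A small sketch of the unnormalized estimator on a discrete domain. The target P, the uniform proposal Q, and the function name are hypothetical choices for illustration; Q is positive wherever P is, as the requirements ask.

```python
import random

P = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}       # hypothetical target distribution
Q = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}   # hypothetical proposal distribution


def importance_estimate(f, m, rng):
    """Estimate <f>_P as (1/M) * sum of f(x[m]) * P(x[m]) / Q(x[m]), with x[m] drawn from Q."""
    values = list(Q)
    probs = [Q[v] for v in values]
    total = 0.0
    for x in rng.choices(values, weights=probs, k=m):
        total += f(x) * P[x] / Q[x]
    return total / m


exact = sum(x * p for x, p in P.items())
print(importance_estimate(lambda x: x, 100_000, random.Random(0)), exact)
```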
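Normalized Importance Sampling
Assume that we cannot evaluate P(X = x) but can evaluate P’(X = x) = αP(X = x) (for example, we can evaluate P(X, e) but not P(X|e) in a Bayesian network)
We define w’(X) = P’(X)/Q(X). We can then evaluate α:
    $\langle w'(X) \rangle_{Q(X)} = \sum_x Q(x) \frac{P'(x)}{Q(x)} = \sum_x P'(x) = \alpha$
and then:
    $\langle f(x) \rangle_{P(X)} = \int f(x) P(x)\, dx = \int f(x) \frac{P(x)}{Q(x)} Q(x)\, dx = \frac{1}{\alpha} \int f(x) \frac{P'(x)}{Q(x)} Q(x)\, dx = \frac{1}{\alpha} \langle f(X)\, w'(X) \rangle_{Q(X)} = \frac{\langle f(X)\, w'(X) \rangle_{Q(X)}}{\langle w'(X) \rangle_{Q(X)}}$
In the last step we simply replace α with the above equation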
Normalized Importance Sampling
We can now estimate the expectation of f(X) similarly to unnormalized importance sampling by sampling x[m] from Q(X) and then
    $\langle f(x) \rangle_{P(X)} \approx \frac{\sum_{m=1}^{M} f(x[m])\, w'(m)}{\sum_{m=1}^{M} w'(m)}$
(hence the name “normalized”)
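The normalized estimator can be sketched the same way when only an unnormalized P’ is available. The table P_UNNORM (here α = 10), the uniform proposal Q, and the function name are hypothetical.

```python
import random

P_UNNORM = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}   # hypothetical P'(x) = alpha * P(x)
Q = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}      # hypothetical proposal distribution


def normalized_importance_estimate(f, m, rng):
    """Estimate <f>_P as sum(f(x[m]) * w'(m)) / sum(w'(m)), with x[m] drawn from Q."""
    values = list(Q)
    probs = [Q[v] for v in values]
    num = den = 0.0
    for x in rng.choices(values, weights=probs, k=m):
        w = P_UNNORM[x] / Q[x]    # w'(m) = P'(x[m]) / Q(x[m])
        num += f(x) * w
        den += w
    return num / den


alpha = sum(P_UNNORM.values())
exact = sum(x * p for x, p in P_UNNORM.items()) / alpha
print(normalized_importance_estimate(lambda x: x, 100_000, random.Random(0)), exact)
```

The constant α never has to be computed: it cancels between the numerator and the denominator.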
Importance Sampling Weaknesses
• Important to choose sampling distribution with heavy tails
  – Not to “miss” large values of f
• Many-dimensional I-S:
  – “Typical set” of P may take a long time to find, unless Q is a good approximation to P
  – Weights vary by factors exponential in N
• Similar for Likelihood Weighting
Today: Approximate Reasoning via Sampling
1. Monte Carlo techniques
   1. Rejection sampling
   2. Likelihood weighting
   3. Importance sampling
2. Markov Chain Monte Carlo (MCMC)
   1. Gibbs sampling
   2. Metropolis-Hastings
3. Applications du jour: ?
Stochastic Sampling
• Previously: independent samples to estimate P(X = x | e)
• Problem: difficult to sample from P(X1, …, Xn | e)
• We had to use likelihood weighting
  – Introduces bias in estimation
• In some cases, such as when the evidence is on leaves, these methods are inefficient
MCMC Methods
• Sampling methods that are based on Markov Chains
  – Markov Chain Monte Carlo (MCMC) methods
• Key ideas:
  – Sampling process as a Markov Chain
    • Next sample depends on the previous one
  – Approximate any posterior distribution
• Next: review theory of Markov chains
Markov Chains
• Suppose X1, X2, … take some set of values
  – wlog. these values are 1, 2, ...
• A Markov chain is a process that corresponds to the network:
  [Figure: chain X1 → X2 → X3 → ... → Xn → ...]
• To quantify the chain, we need to specify
  – Initial probability: P(X1)
  – Transition probability: P(Xt+1|Xt)
• A Markov chain has stationary transition probability if P(Xt+1|Xt) is the same for all times t
Irreducible Chains
• A state j is accessible from state i if there is an n such that P(Xn = j | X1 = i) > 0
  – There is a positive probability of reaching j from i after some number of steps
• A chain is irreducible if every state is accessible from every state
Ergodic Chains
• A state i is positively recurrent if there is a finite expected time to get back to state i after being in state i
  – If X has a finite number of states, then it suffices that i is accessible from itself
• A chain is ergodic if it is irreducible and every state is positively recurrent
(A)periodic Chains
• A state i is periodic if there is an integer d > 1 such that, when n is not divisible by d,
    P(Xn = i | X1 = i) = 0
• Intuition: state i may occur only every d steps
• A chain is aperiodic if it contains no periodic state
Stationary Probabilities
Thm:
• If a chain is ergodic and aperiodic, then the limit
    $\lim_{n \to \infty} P(X_n \mid X_1 = i)$
  exists, and does not depend on i
• Moreover, let
    $P^*(X = j) = \lim_{n \to \infty} P(X_n = j \mid X_1 = i)$
  then, P*(X) is the unique probability satisfying
    $P^*(X = j) = \sum_i P(X_{t+1} = j \mid X_t = i)\, P^*(X = i)$
Stationary Probabilities
• The probability P*(X) is the stationary probability of the process
• Regardless of the starting point, the process will converge to this probability
• The rate of convergence depends on properties of the transition probability
Sampling from the stationary probability
• This theory suggests how to sample from the stationary probability:
  – Set X1 = i, for some random/arbitrary i
  – For t = 1, 2, …, n-1
    • Sample a value xt+1 for Xt+1 from P(Xt+1|Xt=xt)
  – return xn
• If n is large enough, then this is a sample from P*(X)
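The procedure can be tried on a small chain. The 3-state transition matrix below is a hypothetical example (irreducible and aperiodic, with stationary probability (0.25, 0.5, 0.25)); the function names are illustrative only.

```python
import random

# Hypothetical transition matrix: T[i][j] = P(X_{t+1} = j | X_t = i)
T = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
STATES = [0, 1, 2]


def run_chain(start, n, rng):
    """Simulate n transitions starting from `start` and return the final state."""
    x = start
    for _ in range(n):
        x = rng.choices(STATES, weights=T[x])[0]
    return x


def stationary_by_iteration(steps=200):
    """Approximate P* by pushing a point mass at state 0 through T `steps` times."""
    dist = [1.0, 0.0, 0.0]
    for _ in range(steps):
        dist = [sum(dist[i] * T[i][j] for i in STATES) for j in STATES]
    return dist


rng = random.Random(0)
finals = [run_chain(0, 50, rng) for _ in range(5000)]
print([finals.count(s) / len(finals) for s in STATES])
print(stationary_by_iteration())
```

The empirical frequencies of the final states and the iterated distribution should roughly agree, whichever start state is used.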
Designing Markov Chains
• How do we construct the right chain to sample from?
  – Ensuring aperiodicity and irreducibility is usually easy
• Problem is ensuring the desired stationary probability
Designing Markov Chains
Key tool:
• If the transition probability satisfies
    $\frac{P(X_{t+1} = j \mid X_t = i)}{P(X_{t+1} = i \mid X_t = j)} = \frac{Q(X = j)}{Q(X = i)}$ whenever $P(X_{t+1} = j \mid X_t = i) > 0$
  then, P*(X) = Q(X)
• This gives a local criterion for checking that the chain will have the right stationary distribution
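The criterion can be checked numerically in its cross-multiplied form, Q(i) P(j | i) = Q(j) P(i | j). Both transition matrices below are hypothetical examples, and the function names are illustrative.

```python
def satisfies_ratio_criterion(T, Q, tol=1e-12):
    """Check Q[i] * T[i][j] == Q[j] * T[j][i] for every pair of states i, j."""
    n = len(Q)
    return all(abs(Q[i] * T[i][j] - Q[j] * T[j][i]) <= tol
               for i in range(n) for j in range(n))


def is_stationary(T, Q, tol=1e-12):
    """Check that one transition step leaves the distribution Q unchanged."""
    n = len(Q)
    return all(abs(sum(Q[i] * T[i][j] for i in range(n)) - Q[j]) <= tol
               for j in range(n))


# Hypothetical chain that satisfies the criterion for Q = (0.25, 0.5, 0.25)
T_REVERSIBLE = [[0.5, 0.5, 0.0],
                [0.25, 0.5, 0.25],
                [0.0, 0.5, 0.5]]
# Hypothetical chain that drifts around a cycle: uniform Q is stationary,
# but the pairwise criterion fails
T_CYCLIC = [[0.5, 0.5, 0.0],
            [0.0, 0.5, 0.5],
            [0.5, 0.0, 0.5]]
UNIFORM = [1 / 3, 1 / 3, 1 / 3]

print(satisfies_ratio_criterion(T_REVERSIBLE, [0.25, 0.5, 0.25]),
      is_stationary(T_REVERSIBLE, [0.25, 0.5, 0.25]))
print(satisfies_ratio_criterion(T_CYCLIC, UNIFORM), is_stationary(T_CYCLIC, UNIFORM))
```

The second chain shows that the criterion is sufficient but not necessary: a chain can have the desired stationary probability without satisfying it.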
MCMC Methods
• We can use these results to sample from P(X1,…,Xn|e)
Idea:
• Construct an ergodic & aperiodic Markov Chain such that P*(X1,…,Xn) = P(X1,…,Xn|e)
• Simulate the chain n steps to get a sample
MCMC Methods
Notes:
• The Markov chain variable Y takes as values the assignments to all variables that are consistent with the evidence
    $V(Y) = \{ \langle x_1, \ldots, x_n \rangle \in V(X_1) \times \cdots \times V(X_n) \mid x_1, \ldots, x_n \text{ satisfy } e \}$
• For simplicity, we will denote such a state using the vector of variables
Gibbs Sampler
• One of the simplest MCMC methods
• Each transition changes the state of one Xi
• The transition probability is defined by P itself as a stochastic procedure:
  – Input: a state x1,…,xn
  – Choose i at random (uniform probability)
  – Sample x’i from P(Xi|x1, …, xi-1, xi+1, …, xn, e)
  – let x’j = xj for all j ≠ i
  – return x’1,…,x’n
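The transition procedure above can be sketched for a tiny model. The joint table over two binary variables is a hypothetical example standing in for P(X1, X2 | e); the conditional of the chosen variable is obtained by normalizing the joint over its candidate values, which is only practical for small models.

```python
import random

# Hypothetical joint over two binary variables, standing in for P(X1, X2 | e)
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
DOMAINS = [(0, 1), (0, 1)]


def gibbs_step(state, joint, domains, rng):
    """One transition: choose i uniformly, resample X_i given all other variables."""
    i = rng.randrange(len(state))
    # States that differ from `state` only in position i
    candidates = [state[:i] + (v,) + state[i + 1:] for v in domains[i]]
    # Their joint probabilities are proportional to P(X_i | all other variables)
    weights = [joint(c) for c in candidates]
    return rng.choices(candidates, weights=weights)[0]


def gibbs_marginal(n_steps, burn_in, rng):
    """Estimate P(X1 = 1) from one long run, skipping the first burn_in states."""
    state = (0, 0)
    hits = 0
    for t in range(n_steps):
        state = gibbs_step(state, JOINT.get, DOMAINS, rng)
        if t >= burn_in and state[0] == 1:
            hits += 1
    return hits / (n_steps - burn_in)


print(gibbs_marginal(60_000, 1_000, random.Random(0)))
```

In this table P(X1 = 1) = 0.2 + 0.3 = 0.5, which the run should approach.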
Correctness of Gibbs Sampler
• How do we show correctness?
Correctness of Gibbs Sampler
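• By chain rule
    P(x1,…,xi-1, xi, xi+1,…,xn|e) = P(x1,…,xi-1, xi+1,…,xn|e) P(xi|x1,…,xi-1, xi+1,…,xn, e)
• Thus, we get
    $\frac{P(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n \mid e)}{P(x_1, \ldots, x_{i-1}, x'_i, x_{i+1}, \ldots, x_n \mid e)} = \frac{P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n, e)}{P(x'_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n, e)}$
  (the right-hand side is the ratio of the transition probabilities)
• Since we choose i from the same distribution at each stage, this procedure satisfies the ratio criterion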
Gibbs Sampling for Bayesian Network
• Why is the Gibbs sampler “easy” in BNs?
• Recall that the Markov blanket of a variable separates it from the other variables in the network
  – P(Xi | X1,…,Xi-1,Xi+1,…,Xn) = P(Xi | Mbi)
• This property allows us to use local computations to perform sampling in each transition
Gibbs Sampling in Bayesian Networks
• How do we evaluate P(Xi | x1,…,xi-1,xi+1,…,xn)?
• Let Y1, …, Yk be the children of Xi
  – By definition of Mbi, the parents of Yj are in Mbi ∪ {Xi}
• It is easy to show that
    $P(x_i \mid Mb_i) = \frac{P(x_i \mid Pa_i) \prod_j P(y_j \mid pa_{y_j})}{\sum_{x'_i} P(x'_i \mid Pa_i) \prod_j P(y_j \mid pa_{y_j})}$
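The Markov blanket formula can be checked against the full joint on the burglary network from the earlier example slides. The dictionary encoding, the function names, and the assignment of the four P(a | B, E) entries to parent configurations are assumptions made for this sketch.

```python
# name -> (parents, table of P(name = True | parent values)).
# The mapping of the P(a | B, E) entries to configurations is an assumption.
NETWORK = {
    "B": ((), {(): 0.03}),
    "E": ((), {(): 0.001}),
    "A": (("B", "E"), {(True, True): 0.98, (True, False): 0.7,
                       (False, True): 0.4, (False, False): 0.01}),
    "C": (("A",), {(True,): 0.8, (False,): 0.05}),
    "R": (("E",), {(True,): 0.3, (False,): 0.001}),
}


def local_prob(name, state):
    """P(name = state[name] | parents of name, as assigned in state)."""
    parents, cpt = NETWORK[name]
    p_true = cpt[tuple(state[p] for p in parents)]
    return p_true if state[name] else 1.0 - p_true


def blanket_conditional(name, state):
    """P(name = True | Mb), using only the CPTs of name and of its children."""
    children = [c for c, (parents, _) in NETWORK.items() if name in parents]
    score = {}
    for value in (True, False):
        s = {**state, name: value}
        score[value] = local_prob(name, s)
        for child in children:
            score[value] *= local_prob(child, s)
    return score[True] / (score[True] + score[False])


def full_conditional(name, state):
    """The same conditional computed the slow way, from the full joint."""
    joint = {}
    for value in (True, False):
        s = {**state, name: value}
        joint[value] = 1.0
        for var in NETWORK:
            joint[value] *= local_prob(var, s)
    return joint[True] / (joint[True] + joint[False])


state = {"B": False, "E": True, "A": True, "C": True, "R": True}
print(blanket_conditional("A", state), full_conditional("A", state))
```

The two functions agree because every factor that does not mention Xi cancels between numerator and denominator, so a Gibbs transition only touches the CPTs of Xi and its children.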
Metropolis-Hastings
• More general than Gibbs (Gibbs is a special case of M-H)
• Proposal distribution: arbitrary q(x’|x) that is ergodic and aperiodic (e.g., uniform)
• Transition to x’ happens with probability
    $\alpha(x' \mid x) = \min\left(1, \frac{P(x')\, q(x \mid x')}{P(x)\, q(x' \mid x)}\right)$
• Useful when computing P(x) is infeasible: only the ratio P(x’)/P(x) is needed
• q(x’|x) = 0 implies P(x’) = 0 or q(x|x’) = 0
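A minimal Metropolis-Hastings sketch on four states. The unnormalized target WEIGHTS and the ring-shaped symmetric proposal are hypothetical choices; because q(x’|x) = q(x|x’), the acceptance probability reduces to min(1, P’(x’)/P’(x)).

```python
import random

WEIGHTS = [1.0, 2.0, 3.0, 4.0]   # hypothetical unnormalized target P'(x), x = 0..3


def propose(x, rng):
    """Symmetric proposal: step to one of the two neighbours on a ring of states."""
    return (x + rng.choice((-1, 1))) % len(WEIGHTS)


def mh_step(x, rng):
    """One M-H transition; accept the proposal with probability min(1, P'(x') / P'(x))."""
    x_new = propose(x, rng)
    accept = min(1.0, WEIGHTS[x_new] / WEIGHTS[x])
    return x_new if rng.random() < accept else x


def mh_distribution(n_steps, burn_in, rng):
    """Empirical state frequencies of one long run after discarding burn_in steps."""
    counts = [0] * len(WEIGHTS)
    x = 0
    for t in range(n_steps):
        x = mh_step(x, rng)
        if t >= burn_in:
            counts[x] += 1
    total = n_steps - burn_in
    return [c / total for c in counts]


print(mh_distribution(100_000, 1_000, random.Random(0)))
```

The normalizing constant (here 10) is never used. The frequencies should approach 0.1, 0.2, 0.3, 0.4.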
Sampling Strategy
• How do we collect the samples?
Strategy I:
• Run the chain M times, each for N steps
  – each run starts from a different starting point
• Return the last state in each run
[Figure: M chains]
Sampling Strategy
Strategy II:
• Run one chain for a long time
• After some “burn in” period, sample points every fixed number of steps
[Figure: “burn in” followed by M samples from one chain]
Comparing Strategies
Strategy I:
  – Better chance of “covering” the space of points, especially if the chain is slow to reach stationarity
  – Have to perform “burn in” steps for each chain
Strategy II:
  – Perform “burn in” only once
  – Samples might be correlated (although only weakly)
Hybrid strategy:
  – Run several chains, sample few times each
  – Combines benefits of both strategies
Summary: Approximate Inference
• Monte Carlo (sampling with positive and negative error) Methods:
  – Pos: Simplicity of implementation and theoretical guarantee of convergence
  – Neg: Can be slow to converge and hard to diagnose their convergence
• Variational Methods – Your presentation
• Loopy Belief Propagation and Generalized Belief Propagation – Your presentation
Next Time
• Combining Probabilities with Relations and Objects
THE END
Example: Naïve Bayesian Model
• A common model in early diagnosis:
  – Symptoms are conditionally independent given the disease (or fault)
• Thus, if
  – X1,…,Xp denote the symptoms exhibited by the patient (headache, high-fever, etc.) and
  – H denotes the hypothesis about the patient's health
• then, P(X1,…,Xp,H) = P(H)P(X1|H)…P(Xp|H)
• This naïve Bayesian model allows compact representation
  – It does embody strong independence assumptions
Elimination on Trees
• Formally, for any tree, there is an elimination ordering with induced width = 1
Thm
• Inference on trees is linear in number of variables
Importance Sampling to LW
We want to compute P(Y = y|e) (X is the set of random variables in the network and Y is some subset we are interested in)
1) Define a mutilated Bayesian network B_{Z=z} to be a network where:
   • all variables in Z are disconnected from their parents and are deterministically set to z
   • all other variables remain unchanged
2) Choose Q to be B_{E=e}
   – convince yourself that P’(X)/Q(X) is exactly the weight w used by likelihood weighting: the product of P(Ei = ei | pai) over the evidence variables
3) Choose f(x) to be 1(Y[m] = y)/M
4) Plug into the formula and you get exactly Likelihood Weighting
Likelihood weighting is correct!!!
A Word of Caution
• Deterministic nodes
  – Not ergodic in the simple sense
  – M-H cannot be used