Lecture 8
Approximate Inference
Data Analysis and Probabilistic Inference Lecture 8 Slide 1
Highly Dependent Data

Approach 1: Model all the dependencies:

[Figure: Data → Distribution — a table of data points (a1 b3 c2 d1; a3 b2 c4 d2; a5 b1 c1 d3; …) and the network modelling all of their dependencies]

Propagating probabilities is difficult or infeasible!
Highly Dependent Data

Approach 2: Find the maximally weighted spanning tree:

[Figure: Data → Distribution — the same data points (a1 b3 c2 d1; a3 b2 c4 d2; a5 b1 c1 d3; …) modelled by a spanning tree]

Loops are not allowed, so the network no longer models the dependencies accurately.
Exact and Approximate Inference
• If we include all dependencies then computation is exact, but can be computationally infeasible for large networks and large data sets. We will look at exact computation algorithms later in the course.
• If we choose a spanning tree then message passing terminates in one pass and is very fast, but the inference is only approximate.
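The spanning-tree construction can be sketched in a few lines: weight every candidate edge by an estimated dependency between its two variables and run Kruskal's algorithm, skipping any edge that would close a loop. Mutual information is a common choice of weight (as in Chow-Liu trees), but the lecture does not fix one; the variable names and toy data below are illustrative assumptions.

```python
# Sketch: maximally weighted spanning tree over variables, with edge
# weights given by mutual information estimated from data.
from collections import Counter
from itertools import combinations
from math import log

def mutual_information(xs, ys):
    """Estimate I(X;Y) in nats from paired samples."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def max_weight_spanning_tree(data):
    """Kruskal's algorithm on mutual-information edge weights.

    data: dict mapping variable name -> list of observed values.
    Returns a list of (weight, u, v) edges forming the tree.
    """
    parent = {v: v for v in data}
    def find(v):                     # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    edges = sorted(((mutual_information(data[u], data[v]), u, v)
                    for u, v in combinations(data, 2)), reverse=True)
    tree = []
    for w, u, v in edges:            # heaviest edges first,
        ru, rv = find(u), find(v)    # skipping any that close a loop
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

# toy data: A and B perfectly dependent, C independent coin flips
data = {"A": [0, 0, 1, 1, 0, 1, 0, 1],
        "B": [0, 0, 1, 1, 0, 1, 0, 1],
        "C": [0, 1, 0, 1, 1, 0, 1, 0]}
tree = max_weight_spanning_tree(data)
print(sorted((u, v) for _, u, v in tree))  # A-B is the heaviest edge and is kept
```

Note that the tree keeps the strongest dependencies but must discard the rest, which is exactly why the resulting inference is only approximate.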
Problems with Loops in Networks

Issue 1: Looping Messages

When only node F is instantiated there is no condition that stops the messages travelling round the loop B C D E.

Exact propagation can still be carried out if one of nodes B, C or D is instantiated.
Problems with Loops in Networks

Issue 2: Independence of Multiple Parents

When only node A is instantiated, propagation terminates. However, C and D are not independent, and so the π evidence at E and F is not correct.

Exact propagation can still be carried out if one of nodes B, C or D is instantiated. This will make C and D independent.
Approximate Inference Methods
Spanning trees and naive Bayesian networks can be considered approximate inference methods. They model the most important dependencies, though not all.

Their performance can be improved by a number of techniques, including:
1. Node Deletion
2. Allowing Loopy Belief Propagation
3. Hidden (or Latent) Node Placement
Node Deletion
• Given a pair of highly dependent nodes, it has been found that deleting one sometimes improves a network's predictive performance.
• This is a surprising result from which it is difficult to infer any general rule.
• Node deletion is something that can be tested experimentally.
Selective Bayesian Network
The idea here is to use only a subset of the variables.
This can be done by starting with all the variables, then deleting any suspect variable and testing for improvement in performance. Deletion continues until no further improvement can be found.

We can find suspect variables by testing pairs of children for high conditional dependence.
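The deletion loop can be sketched as greedy backward elimination. Here `score` is an assumed placeholder for whatever performance measure is used (e.g. cross-validated classification accuracy); the toy score and variable names are illustrative, not from the lecture.

```python
# Sketch: backward elimination of variables for a selective network.
def select_variables(variables, score):
    """Greedily delete variables while doing so improves the score."""
    current = set(variables)
    best = score(current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        # score every single-variable deletion and keep the best one
        trials = [(score(current - {v}), v) for v in sorted(current)]
        s, v = max(trials)
        if s > best:
            best, current, improved = s, current - {v}, True
    return current, best

# toy score: the variable "noisy" hurts performance, the rest each help
def toy_score(vars_):
    return len(vars_ & {"a", "b", "c"}) - (2 if "noisy" in vars_ else 0)

kept, s = select_variables(["a", "b", "c", "noisy"], toy_score)
print(sorted(kept), s)   # → ['a', 'b', 'c'] 3
```

The same loop run in reverse (adding variables in dependency order and stopping when the score no longer improves) gives the incremental variant on the next slide.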
Selective Bayesian Networks
Equivalently, we can add variables incrementally (in dependency order) and test the performance of each network.
Why can removing variables improve the performance?
• Consider a Bayesian network where the same variable appears twice.
• Clearly conditional independence doesn't hold.
• The network will be biased in favour of C.
• Deleting C will improve performance.

The improvement will depend on the quantity of unaccounted dependency between variables.
Loopy Belief Propagation
Another approximate method is to include all arcs expressing significant dependency, and allow propagation to continue until:

• The probability distributions reach a stable state, or
• A limiting number of iterations has occurred (there may be no termination).

Loopy belief propagation has been shown to be equivalent to a multivariate optimisation problem. It will most likely find a local optimum. We cannot say anything about its accuracy.
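As a sketch of both stopping criteria, the loop below runs synchronous message passing on a triangle of three binary variables (a loop that a single exact pass could not handle) until the messages reach a stable state or an iteration cap is hit. The pairwise potentials, the evidence on A, and the tolerance are illustrative assumptions.

```python
# Sketch: loopy belief propagation on a pairwise network with a loop.
def loopy_bp(phi, psi, max_iters=100, tol=1e-9):
    nodes = list(phi)
    nbrs = {i: [j for j in nodes if (i, j) in psi or (j, i) in psi]
            for i in nodes}
    def pairwise(i, j, xi, xj):
        return psi[(i, j)][xi][xj] if (i, j) in psi else psi[(j, i)][xj][xi]
    # initialise all directed messages to uniform
    msg = {(i, j): [0.5, 0.5] for i in nodes for j in nbrs[i]}
    for it in range(max_iters):
        new = {}
        for (i, j) in msg:
            m = []
            for xj in (0, 1):
                s = 0.0
                for xi in (0, 1):
                    p = phi[i][xi] * pairwise(i, j, xi, xj)
                    for k in nbrs[i]:
                        if k != j:          # product of other incoming messages
                            p *= msg[(k, i)][xi]
                    s += p
                m.append(s)
            z = m[0] + m[1]
            new[(i, j)] = [m[0] / z, m[1] / z]
        delta = max(abs(new[e][x] - msg[e][x]) for e in msg for x in (0, 1))
        msg = new
        if delta < tol:                     # stable state reached
            break
    beliefs = {}
    for i in nodes:
        b = [phi[i][x] for x in (0, 1)]
        for k in nbrs[i]:
            for x in (0, 1):
                b[x] *= msg[(k, i)][x]
        z = b[0] + b[1]
        beliefs[i] = [b[0] / z, b[1] / z]
    return beliefs, it + 1

# triangle A-B-C with attractive couplings; evidence on A favours state 1
phi = {"A": [0.2, 0.8], "B": [0.5, 0.5], "C": [0.5, 0.5]}
att = [[2.0, 1.0], [1.0, 2.0]]
psi = {("A", "B"): att, ("B", "C"): att, ("A", "C"): att}
beliefs, iters = loopy_bp(phi, psi)
print({v: round(b[1], 3) for v, b in beliefs.items()}, iters)
```

On this example the messages settle quickly and the evidence on A pulls B and C towards state 1, but as the slide notes there is no general guarantee of termination or accuracy.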
Hidden Nodes (or Latent Variables)
If any two children of a parent node are not conditionally independent, they can be separated by a hidden node:

The new node represents a common cause that relates B and C. It is called hidden because we have no corresponding measured variable.
Now we look at how to obtain it statistically.
Switch Nodes
Adding hidden nodes which act as switches can simplify complex networks.

Example from Neapolitan:
Advantages of Adding Hidden Nodes

A network can always perform as well with a hidden node as it can without:
Using Hidden Nodes
In order to create a hidden node we need to:
1. decide how many states the hidden node is to have;
2. identify values for the three new link matrices introduced.

It may be possible to obtain hidden node information from an expert (e.g. the eyes example from lecture 2). For example, an expert may:

1. identify a variable corresponding to the hidden node;
2. provide data for training (i.e. calculating the link matrices).

In general, however, this is not often possible.
How Many States?
• We expect that the number of states of a hidden node will be comparable to the number of states of the nodes it is separating.
• From the previous slides we would expect the hidden node to have at least the same number of states as its parent.
• Link matrices with too many states will have very low probabilities for some states, so a possible approach is to start with a large number of states and reduce the number depending on how many low-probability states we have.
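The pruning heuristic in the last bullet can be sketched directly. The link matrix and cutoff below are illustrative assumptions: a hidden state is dropped when it receives low probability under every parent state.

```python
# Sketch: start with many hidden states, then prune the low-probability ones.
def prune_states(p_h_a, cutoff=0.05):
    """p_h_a[h][a] = P(H=h | A=a); keep states h that get probability
    at least `cutoff` under some parent state a."""
    return [h for h, row in enumerate(p_h_a) if max(row) >= cutoff]

# 4 initial hidden states; states 2 and 3 never exceed the cutoff
p_h_a = [[0.55, 0.50],
         [0.41, 0.46],
         [0.03, 0.02],
         [0.01, 0.02]]
print(prune_states(p_h_a))   # → [0, 1]
```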
Calculating the Conditional Probabilities
1. Given estimates of P(H|A), P(B|H), P(C|H) and a set of data points [ai, bj, ck].

2. Use each bj, ck to compute P′(A) from the network, then calculate and accumulate an error:

   E = (P′(A) − P(ai))²

3. Minimise E over the data set by adjusting the elements of P(H|A), P(B|H), P(C|H).
Calculating the Conditional Probabilities
For each conditional probability P(cj|hk) we need to find a value for:

   ∂E/∂P(cj|hk)

Then in each epoch we update the conditional probabilities using:

   P(cj|hk) ⇒ P(cj|hk) − µ ∂E/∂P(cj|hk)

Gradients may be calculated analytically or numerically. A closed-form equation for the gradients was developed by Chee Keong Kwoh.
Gradient Descent and Probabilities
Gradient descent has problems when applied to probability distributions. After one cycle of updating:

• Distributions will no longer sum to 1
• Individual probability values may be greater than 1 or less than 0

The conditional probability matrices must be normalised so that the columns sum to 1. This may compromise finding an optimal solution.
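Putting the last three slides together, a minimal version with numerical gradients and column renormalisation might look like the sketch below. All variables are binary; the uniform prior on A, the one-hot target for the observed state of A, the learning rate, and the toy data are assumptions made for illustration (Chee Keong Kwoh's closed-form gradients are not reproduced here).

```python
# Sketch: train P(H|A), P(B|H), P(C|H) for a hidden node H between
# parent A and children B, C by gradient descent on squared error.
def predict_a(p_h_a, p_b_h, p_c_h, b, c):
    """P'(A | b, c) with an assumed uniform prior on A."""
    scores = [sum(p_h_a[h][a] * p_b_h[b][h] * p_c_h[c][h] for h in (0, 1))
              for a in (0, 1)]
    z = sum(scores)
    return [s / z for s in scores]

def error(params, data):
    p_h_a, p_b_h, p_c_h = params
    e = 0.0
    for a, b, c in data:
        p = predict_a(p_h_a, p_b_h, p_c_h, b, c)
        e += (p[a] - 1.0) ** 2      # target: observed state of A
    return e

def train(params, data, mu=0.02, eps=1e-5, epochs=80):
    for _ in range(epochs):
        for mat in params:
            for r in range(2):
                for col in range(2):
                    old = mat[r][col]
                    mat[r][col] = old + eps     # numerical gradient
                    e_plus = error(params, data)
                    mat[r][col] = old - eps
                    e_minus = error(params, data)
                    g = (e_plus - e_minus) / (2 * eps)
                    mat[r][col] = max(1e-6, old - mu * g)
            # renormalise so each column sums to 1 again
            for col in range(2):
                z = mat[0][col] + mat[1][col]
                mat[0][col] /= z
                mat[1][col] /= z
    return params

# toy data where B and C mostly copy A, plus two noisy points
data = [(0, 0, 0)] * 4 + [(1, 1, 1)] * 4 + [(0, 0, 1), (1, 1, 0)]
params = ([[0.6, 0.4], [0.4, 0.6]],   # P(H|A): columns (A) sum to 1
          [[0.6, 0.4], [0.4, 0.6]],   # P(B|H): columns (H) sum to 1
          [[0.6, 0.4], [0.4, 0.6]])   # P(C|H): columns (H) sum to 1
e0 = error(params, data)
train(params, data)
e1 = error(params, data)
print(round(e0, 3), round(e1, 3))     # error drops after training
```

Updating an entry and then renormalising its column is exactly the compromise the slide describes: the step no longer follows the raw gradient, so the optimum found may not be the unconstrained one.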
Propagation Strategies for Calculating Errors

Strategies may be alternated during the optimisation, and this produces annealing behaviour:
Hidden Nodes for Removing Loops
Suppose we build a network including all the dependencies; we can then use hidden nodes to remove any loops that were formed. In the case of the simple triple we have seen that:

The process is to remove the least dependent link of a multiple parent.
Reducing bigger loops
We can apply the same process to bigger loops. Modelling the dependency between C and D we get:

The training methods still work, since for any instantiation of A or B the probability propagation will finish.
Reducing bigger loops
We can continue by modelling the dependency between B and H1:

This results in a singly connected network, but with two hidden nodes.
Reducing bigger loops
We can always reduce any network to a singly connected form by this method. One possible form of the Asia network is:

However, the large number of hidden nodes makes the method look less attractive.

The performance will become increasingly dependent on the training data and training process.
Reducing bigger loops
Could we simplify things by combining our two hidden nodes into one?

The answer to this is very much data dependent. Clearly the hidden node now has to model the dependency between B and C that comes through both the common parent A and the common child D.
Limitations of the Hidden Node Method
There is clearly going to be a limit to the degree to which we can model dependencies through hidden nodes. As the dependencies become more complex, either:

1. we will need many hidden nodes, or
2. the number of states in the hidden node will become very large.

In either case we may not have enough data to train the new network.
Criteria for Introducing Hidden Nodes
Given a network, we can measure the conditional dependency of each pair of children given the parents.

If this is high, we expect benefits from introducing a hidden node.

However, below a certain threshold we are unlikely to benefit from a hidden node and may choose to ignore the dependency.
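One way to realise this criterion is to estimate the conditional mutual information I(B;C|A) of two children given their parent from data, and introduce a hidden node only when it exceeds a threshold. The lecture does not name a specific measure, so the estimator, threshold value and toy data below are illustrative assumptions.

```python
# Sketch: conditional mutual information of children B, C given parent A.
from collections import Counter
from math import log

def conditional_mi(triples):
    """I(B;C|A) in nats from samples of (a, b, c)."""
    n = len(triples)
    na = Counter(a for a, _, _ in triples)
    nab = Counter((a, b) for a, b, _ in triples)
    nac = Counter((a, c) for a, _, c in triples)
    nabc = Counter(triples)
    # I(B;C|A) = sum p(a,b,c) log [ n_abc * n_a / (n_ab * n_ac) ]
    return sum((c_abc / n) * log(c_abc * na[a] / (nab[(a, b)] * nac[(a, c)]))
               for (a, b, c), c_abc in nabc.items())

# children copy each other regardless of the parent -> high I(B;C|A)
dependent = [(a, x, x) for a in (0, 1) for x in (0, 1) for _ in range(5)]
# children are independent coin flips given the parent -> I(B;C|A) near 0
independent = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)
               for _ in range(5)]

THRESHOLD = 0.1   # assumed; would be tuned on real data
print(conditional_mi(dependent) > THRESHOLD,
      conditional_mi(independent) > THRESHOLD)   # → True False
```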
Other Hidden Node Methodologies

Other, more heuristic methods have been suggested for employing hidden nodes:

1. Start with a naive network.
2. Find all significant conditional dependencies.
3. Model them with hidden nodes.
Other Hidden Node Methodologies
A similar idea can be applied starting with a spanning tree: