Lecture 8
Approximate Inference
Data Analysis and Probabilistic Inference Lecture 8 Slide 1
Highly Dependent Data

Approach 1: Model all the dependencies:

[Figure: Data → Distribution — a table of data points (a1 b3 c2 d1; a3 b2 c4 d2; a5 b1 c1 d3; …) and the network modelling all of their dependencies]

Propagating probabilities is difficult or infeasible!
Highly Dependent Data

Approach 2: Find the maximally weighted spanning tree:

[Figure: Data → Distribution — the same data points (a1 b3 c2 d1; a3 b2 c4 d2; a5 b1 c1 d3; …) modelled by a spanning tree]

Loops are not allowed, so the network no longer models the dependencies accurately.
Exact and Approximate Inference
• If we include all dependencies then computation is exact, but can be computationally infeasible for large networks and large data sets. We will look at exact computation algorithms later in the course.
• If we choose a spanning tree then message passing terminates in one pass and is very fast, but the inference is only approximate.
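The spanning-tree construction can be sketched in a few lines: weight every candidate edge by an estimated dependency between its two variables and run Kruskal's algorithm, skipping any edge that would close a loop. Mutual information is a common choice of weight (as in Chow-Liu trees), but the lecture does not fix one; the variable names and toy data below are illustrative assumptions.

```python
# Sketch: maximally weighted spanning tree over variables, with edge
# weights given by mutual information estimated from data.
from collections import Counter
from itertools import combinations
from math import log

def mutual_information(xs, ys):
    """Estimate I(X;Y) in nats from paired samples."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def max_weight_spanning_tree(data):
    """Kruskal's algorithm on mutual-information edge weights.

    data: dict mapping variable name -> list of observed values.
    Returns a list of (weight, u, v) edges forming the tree.
    """
    parent = {v: v for v in data}
    def find(v):                     # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    edges = sorted(((mutual_information(data[u], data[v]), u, v)
                    for u, v in combinations(data, 2)), reverse=True)
    tree = []
    for w, u, v in edges:            # heaviest edges first,
        ru, rv = find(u), find(v)    # skipping any that close a loop
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

# toy data: A and B perfectly dependent, C independent coin flips
data = {"A": [0, 0, 1, 1, 0, 1, 0, 1],
        "B": [0, 0, 1, 1, 0, 1, 0, 1],
        "C": [0, 1, 0, 1, 1, 0, 1, 0]}
tree = max_weight_spanning_tree(data)
print(sorted((u, v) for _, u, v in tree))  # A-B is the heaviest edge and is kept
```

Note that the tree keeps the strongest dependencies but must discard the rest, which is exactly why the resulting inference is only approximate.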
Problems with Loops in Networks

Issue 1: Looping Messages

When only node F is instantiated there is no condition that stops the messages travelling round the loop B C D E.

Exact propagation can still be carried out if one of nodes B, C or D is instantiated.
Problems with Loops in Networks

Issue 2: Independence of Multiple Parents

When only node A is instantiated, propagation terminates. However, C and D are not independent, and so the π evidence at E and F is not correct.

Exact propagation can still be carried out if one of nodes B, C or D is instantiated. This will make C and D independent.
Approximate Inference Methods
Spanning trees and naive Bayesian networks can be considered approximate inference methods. They model the most important dependencies, though not all.

Their performance can be improved by a number of techniques, including:
1. Node Deletion
2. Allowing Loopy Belief Propagation
3. Hidden (or Latent) Node Placement
Node Deletion
• Given a pair of highly dependent nodes, it has been found that deleting one sometimes improves a network's predictive performance.
• This is a surprising result from which it is difficult to infer any general rule.
• Node deletion is something that can be tested experimentally.
Selective Bayesian Network
The idea here is to use only a subset of the variables.
This can be done by starting with all the variables, then deleting any suspect variable and testing for improvement in performance. Deletion continues until no further improvement can be found.

We can find suspect variables by testing pairs of children for high conditional dependence.
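The deletion loop can be sketched as greedy backward elimination. Here `score` is an assumed placeholder for whatever performance measure is used (e.g. cross-validated classification accuracy); the toy score and variable names are illustrative, not from the lecture.

```python
# Sketch: backward elimination of variables for a selective network.
def select_variables(variables, score):
    """Greedily delete variables while doing so improves the score."""
    current = set(variables)
    best = score(current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        # score every single-variable deletion and keep the best one
        trials = [(score(current - {v}), v) for v in sorted(current)]
        s, v = max(trials)
        if s > best:
            best, current, improved = s, current - {v}, True
    return current, best

# toy score: the variable "noisy" hurts performance, the rest each help
def toy_score(vars_):
    return len(vars_ & {"a", "b", "c"}) - (2 if "noisy" in vars_ else 0)

kept, s = select_variables(["a", "b", "c", "noisy"], toy_score)
print(sorted(kept), s)   # → ['a', 'b', 'c'] 3
```

The same loop run in reverse (adding variables in dependency order and stopping when the score no longer improves) gives the incremental variant on the next slide.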
Selective Bayesian Networks
Equivalently, we can add variables incrementally (in dependency order) and test the performance of each network.
Why can removing variables improve the performance?
• Consider a Bayesian network where the same variable appears twice.
• Clearly conditional independence doesn't hold.
• The network will be biased in favour of C.
• Deleting C will improve performance.

The improvement will depend on the quantity of unaccounted dependency between variables.
Loopy Belief Propagation
Another approximate method is to include all arcs expressing significant dependency, and allow propagation to continue until:

• The probability distributions reach a stable state, or
• A limiting number of iterations has occurred (there may be no termination).

Loopy belief propagation has been shown to be equivalent to a multivariate optimisation problem. It will most likely find a local optimum. We cannot say anything about its accuracy.
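As a sketch of both stopping criteria, the loop below runs synchronous message passing on a triangle of three binary variables (a loop that a single exact pass could not handle) until the messages reach a stable state or an iteration cap is hit. The pairwise potentials, the evidence on A, and the tolerance are illustrative assumptions.

```python
# Sketch: loopy belief propagation on a pairwise network with a loop.
def loopy_bp(phi, psi, max_iters=100, tol=1e-9):
    nodes = list(phi)
    nbrs = {i: [j for j in nodes if (i, j) in psi or (j, i) in psi]
            for i in nodes}
    def pairwise(i, j, xi, xj):
        return psi[(i, j)][xi][xj] if (i, j) in psi else psi[(j, i)][xj][xi]
    # initialise all directed messages to uniform
    msg = {(i, j): [0.5, 0.5] for i in nodes for j in nbrs[i]}
    for it in range(max_iters):
        new = {}
        for (i, j) in msg:
            m = []
            for xj in (0, 1):
                s = 0.0
                for xi in (0, 1):
                    p = phi[i][xi] * pairwise(i, j, xi, xj)
                    for k in nbrs[i]:
                        if k != j:          # product of other incoming messages
                            p *= msg[(k, i)][xi]
                    s += p
                m.append(s)
            z = m[0] + m[1]
            new[(i, j)] = [m[0] / z, m[1] / z]
        delta = max(abs(new[e][x] - msg[e][x]) for e in msg for x in (0, 1))
        msg = new
        if delta < tol:                     # stable state reached
            break
    beliefs = {}
    for i in nodes:
        b = [phi[i][x] for x in (0, 1)]
        for k in nbrs[i]:
            for x in (0, 1):
                b[x] *= msg[(k, i)][x]
        z = b[0] + b[1]
        beliefs[i] = [b[0] / z, b[1] / z]
    return beliefs, it + 1

# triangle A-B-C with attractive couplings; evidence on A favours state 1
phi = {"A": [0.2, 0.8], "B": [0.5, 0.5], "C": [0.5, 0.5]}
att = [[2.0, 1.0], [1.0, 2.0]]
psi = {("A", "B"): att, ("B", "C"): att, ("A", "C"): att}
beliefs, iters = loopy_bp(phi, psi)
print({v: round(b[1], 3) for v, b in beliefs.items()}, iters)
```

On this example the messages settle quickly and the evidence on A pulls B and C towards state 1, but as the slide notes there is no general guarantee of termination or accuracy.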
Hidden Nodes (or Latent Variables)
If any two children of a parent node are not conditionally independent, they can be separated by a hidden node:

The new node represents a common cause that relates B and C. It is called hidden because we have no corresponding measured variable.
Now we look at how to obtain it statistically.
Switch Nodes
Adding hidden nodes which act as switches can simplify complex networks.

Example from Neapolitan:
Advantages of Adding Hidden Nodes

A network can always perform as well with a hidden node as it can without:
Using Hidden Nodes
In order to create a hidden node we need to:
1. decide how many states the hidden node is to have;
2. identify values for the three new link matrices introduced.

It may be possible to obtain hidden node information from an expert (e.g. the eyes example from lecture 2). For example, an expert may:

1. identify a variable corresponding to the hidden node;
2. provide data for training (i.e. calculating the link matrices).

In general, however, this is not often possible.
How Many States?
• We expect that the number of states of a hidden node will be comparable to the number of states of the nodes it is separating.
• From the previous slides we would expect the hidden node to have at least the same number of states as its parent.
• Link matrices with too many states will have very low probabilities for some states, so a possible approach is to start with a large number of states and reduce the number depending on how many low-probability states we have.
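The pruning heuristic in the last bullet can be sketched directly. The link matrix and cutoff below are illustrative assumptions: a hidden state is dropped when it receives low probability under every parent state.

```python
# Sketch: start with many hidden states, then prune the low-probability ones.
def prune_states(p_h_a, cutoff=0.05):
    """p_h_a[h][a] = P(H=h | A=a); keep states h that get probability
    at least `cutoff` under some parent state a."""
    return [h for h, row in enumerate(p_h_a) if max(row) >= cutoff]

# 4 initial hidden states; states 2 and 3 never exceed the cutoff
p_h_a = [[0.55, 0.50],
         [0.41, 0.46],
         [0.03, 0.02],
         [0.01, 0.02]]
print(prune_states(p_h_a))   # → [0, 1]
```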
Calculating the Conditional Probabilities
1. Given estimates of P(H|A), P(B|H), P(C|H) and a set of data points [ai, bj, ck].

2. Use each bj, ck to compute P′(A) from the network, then calculate and accumulate an error:

   E = (P′(A) − P(ai))²

3. Minimise E over the data set by adjusting the elements of P(H|A), P(B|H), P(C|H).
Calculating the Conditional Probabilities
For each conditional probability P(cj|hk) we need to find a value for:

   ∂E/∂P(cj|hk)

Then in each epoch we update the conditional probabilities using:

   P(cj|hk) ⇒ P(cj|hk) − µ ∂E/∂P(cj|hk)

Gradients may be calculated analytically or numerically. A closed-form equation for the gradients was developed by Chee Keong Kwoh.
Gradient Descent and Probabilities
Gradient descent has problems when applied to probability distributions. After one cycle of updating:

• Distributions will no longer sum to 1
• Individual probability values may be greater than 1 or less than 0

The conditional probability matrices must be normalised so that the columns sum to 1. This may compromise finding an optimal solution.
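Putting the last three slides together, a minimal version with numerical gradients and column renormalisation might look like the sketch below. All variables are binary; the uniform prior on A, the one-hot target for the observed state of A, the learning rate, and the toy data are assumptions made for illustration (Chee Keong Kwoh's closed-form gradients are not reproduced here).

```python
# Sketch: train P(H|A), P(B|H), P(C|H) for a hidden node H between
# parent A and children B, C by gradient descent on squared error.
def predict_a(p_h_a, p_b_h, p_c_h, b, c):
    """P'(A | b, c) with an assumed uniform prior on A."""
    scores = [sum(p_h_a[h][a] * p_b_h[b][h] * p_c_h[c][h] for h in (0, 1))
              for a in (0, 1)]
    z = sum(scores)
    return [s / z for s in scores]

def error(params, data):
    p_h_a, p_b_h, p_c_h = params
    e = 0.0
    for a, b, c in data:
        p = predict_a(p_h_a, p_b_h, p_c_h, b, c)
        e += (p[a] - 1.0) ** 2      # target: observed state of A
    return e

def train(params, data, mu=0.02, eps=1e-5, epochs=80):
    for _ in range(epochs):
        for mat in params:
            for r in range(2):
                for col in range(2):
                    old = mat[r][col]
                    mat[r][col] = old + eps     # numerical gradient
                    e_plus = error(params, data)
                    mat[r][col] = old - eps
                    e_minus = error(params, data)
                    g = (e_plus - e_minus) / (2 * eps)
                    mat[r][col] = max(1e-6, old - mu * g)
            # renormalise so each column sums to 1 again
            for col in range(2):
                z = mat[0][col] + mat[1][col]
                mat[0][col] /= z
                mat[1][col] /= z
    return params

# toy data where B and C mostly copy A, plus two noisy points
data = [(0, 0, 0)] * 4 + [(1, 1, 1)] * 4 + [(0, 0, 1), (1, 1, 0)]
params = ([[0.6, 0.4], [0.4, 0.6]],   # P(H|A): columns (A) sum to 1
          [[0.6, 0.4], [0.4, 0.6]],   # P(B|H): columns (H) sum to 1
          [[0.6, 0.4], [0.4, 0.6]])   # P(C|H): columns (H) sum to 1
e0 = error(params, data)
train(params, data)
e1 = error(params, data)
print(round(e0, 3), round(e1, 3))     # error drops after training
```

Updating an entry and then renormalising its column is exactly the compromise the slide describes: the step no longer follows the raw gradient, so the optimum found may not be the unconstrained one.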
Propagation Strategies for Calculating Errors

Strategies may be alternated during the optimisation, and this produces annealing behaviour:
Hidden Nodes for Removing Loops
Suppose we build a network including all the dependencies; we can then use hidden nodes to remove any loops that were formed. In the case of the simple triple we have seen that:

The process is to remove the least dependent link of a multiple parent.
Reducing bigger loops
We can apply the same process to bigger loops. Modelling the dependency between C and D we get:

The training methods still work, since for any instantiation of A or B the probability propagation will finish.
Reducing bigger loops
We can continue by modelling the dependency between B and H1:

This results in a singly connected network, but with two hidden nodes.
Reducing bigger loops
We can always reduce any network to a singly connected form by this method. One possible form of the Asia network is:

However, the large number of hidden nodes makes the method look less attractive.

The performance will become increasingly dependent on the training data and training process.
Reducing bigger loops
Could we simplify things by combining our two hidden nodes into one?

The answer to this is very much data dependent. Clearly the hidden node now has to model the dependency between B and C that comes through both the common parent A and the common child D.
Limitations of the Hidden Node Method
There is clearly going to be a limit to the degree to which we can model dependencies through hidden nodes. As the dependencies become more complex, either:

1. we will need many hidden nodes, or
2. the number of states in the hidden node will become very large.

In either case we may not have enough data to train the new network.
Criteria for Introducing Hidden Nodes
Given a network, we can measure the conditional dependency of each pair of children given the parents.

If this is high, we expect benefits from introducing a hidden node.

However, below a certain threshold we are unlikely to benefit from a hidden node and may choose to ignore the dependency.
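One way to realise this criterion is to estimate the conditional mutual information I(B;C|A) of two children given their parent from data, and introduce a hidden node only when it exceeds a threshold. The lecture does not name a specific measure, so the estimator, threshold value and toy data below are illustrative assumptions.

```python
# Sketch: conditional mutual information of children B, C given parent A.
from collections import Counter
from math import log

def conditional_mi(triples):
    """I(B;C|A) in nats from samples of (a, b, c)."""
    n = len(triples)
    na = Counter(a for a, _, _ in triples)
    nab = Counter((a, b) for a, b, _ in triples)
    nac = Counter((a, c) for a, _, c in triples)
    nabc = Counter(triples)
    # I(B;C|A) = sum p(a,b,c) log [ n_abc * n_a / (n_ab * n_ac) ]
    return sum((c_abc / n) * log(c_abc * na[a] / (nab[(a, b)] * nac[(a, c)]))
               for (a, b, c), c_abc in nabc.items())

# children copy each other regardless of the parent -> high I(B;C|A)
dependent = [(a, x, x) for a in (0, 1) for x in (0, 1) for _ in range(5)]
# children are independent coin flips given the parent -> I(B;C|A) near 0
independent = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)
               for _ in range(5)]

THRESHOLD = 0.1   # assumed; would be tuned on real data
print(conditional_mi(dependent) > THRESHOLD,
      conditional_mi(independent) > THRESHOLD)   # → True False
```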
Other Hidden Node Methodologies

Other, more heuristic methods have been suggested for employing hidden nodes:

1. Start with a naive network.
2. Find all significant conditional dependencies.
3. Model them with hidden nodes.
Other Hidden Node Methodologies
A similar idea can be applied starting with a spanning tree: