Graphical Models: A Brief Introduction
Reference: Pattern Recognition and Machine Learning by C.M. Bishop, Springer, Chapter 8.2
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/05/Bishop-PRML-sample.pdf
1
Probabilistic Model
Real World Data
P(Data | Parameters)
2
Probabilistic Model
Real World Data
P(Data | Parameters)
P(Parameters | Data)
3
Probabilistic Model
Real World Data
P(Data | Parameters)
P(Parameters | Data)
Generative Model, Probability
Inference, Statistics
4
Notation and Definitions
• X is a random variable
– Lower-case x is some possible value for X
– "X = x" is a logical proposition: that X takes value x
– There is uncertainty about the value of X
• e.g., X is the Hang Seng index at 5pm tomorrow
• p(X = x) is the probability that the proposition X = x is true
– often shortened to p(x)
• If the set of possible x's is finite, we have a probability distribution and Σx p(x) = 1
• If the set of possible x's is infinite, p(x) is a density function, and p(x) integrates to 1 over the range of X
5
Multiple Variables
• p(x, y, z)
– Probability that X = x AND Y = y AND Z = z
– Possible values: cross-product of X × Y × Z
– e.g., X, Y, Z each take 10 possible values
• (x, y, z) can take 10³ possible values
• p(x, y, z) is a 3-dimensional array/table
– Defines 10³ probabilities
• Note the exponential increase as we add more variables
– e.g., X, Y, Z are all real-valued
• (x, y, z) lives in a 3-dimensional vector space
• p(x, y, z) is a positive function defined over this space that integrates to 1
6
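As a concrete sketch of the discrete case above, a joint distribution over three 10-valued variables can be stored as a 10×10×10 array whose 1000 entries sum to 1 (the numbers below are illustrative, not from the slides):

```python
import numpy as np

# Illustrative sketch: a joint distribution p(x, y, z) over three
# variables that each take 10 values is a 10x10x10 table.
rng = np.random.default_rng(0)
table = rng.random((10, 10, 10))
p_xyz = table / table.sum()   # normalize so all 1000 entries sum to 1

print(p_xyz.size)             # 1000 probabilities for just 3 variables
print(round(p_xyz.sum(), 6))  # 1.0
```

Adding a fourth 10-valued variable would multiply the table size by 10 again, which is the exponential growth the slide warns about.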
Conditional Probability
• p(x | y, z)
– Probability of x given that Y = y and Z = z
– Could be
• hypothetical, e.g., "if Y = y and if Z = z"
• observational, e.g., we observed values y and z
– can also have p(x, y | z), etc.
– "all probabilities are conditional probabilities"
• Computing conditional probabilities is the basis of many prediction and learning problems, e.g.,
– p(DJI tomorrow | DJI index last week)
– expected value of [DJI tomorrow | DJI index last week]
– most likely value of a parameter given observed data
7
Computing Conditional Probabilities
• Variables A, B, C, D
– All distributions of interest relating to A, B, C, D can be computed from the full joint distribution p(a,b,c,d)
• Examples, using the Law of Total Probability
– p(a) = Σ{b,c,d} p(a, b, c, d)
– p(c,d) = Σ{a,b} p(a, b, c, d)
– p(a,c | d) = Σ{b} p(a, b, c | d)
where p(a, b, c | d) = p(a,b,c,d) / p(d)
• These are standard probability manipulations: however, we will see how to use these to make inferences about parameters and unobserved variables, given data
8
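The three example manipulations above can be sketched directly as array operations on a joint table. Here each of A, B, C, D takes 3 values and the table entries are illustrative random numbers:

```python
import numpy as np

# Sketch of the manipulations above on a random joint table p(a,b,c,d),
# with each variable taking 3 values (all numbers illustrative).
rng = np.random.default_rng(1)
p = rng.random((3, 3, 3, 3))
p /= p.sum()                      # joint distribution p(a,b,c,d)

p_a = p.sum(axis=(1, 2, 3))       # p(a)   = sum over b,c,d of p(a,b,c,d)
p_cd = p.sum(axis=(0, 1))         # p(c,d) = sum over a,b   of p(a,b,c,d)
p_d = p.sum(axis=(0, 1, 2))       # p(d)
p_abc_given_d = p / p_d           # p(a,b,c|d) = p(a,b,c,d)/p(d); broadcasts over d
p_ac_given_d = p_abc_given_d.sum(axis=1)  # p(a,c|d) = sum over b of p(a,b,c|d)

# Each conditional distribution sums to 1 over (a, c) for every value of d:
print(np.allclose(p_ac_given_d.sum(axis=(0, 1)), 1.0))  # True
```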
Two Practical Problems
(Assume for simplicity each variable takes K values)
• Problem 1: Computational Complexity
– Conditional probability computations scale as O(K^N)
• where N is the number of variables being summed over
• Problem 2: Model Specification
– To specify a joint distribution we need a table of O(K^N) numbers
– Where do these numbers come from?
9
Two Key Ideas
• Problem 1: Computational Complexity
– Idea: Graphical models
• Structured probability models lead to tractable inference
• Problem 2: Model Specification
– Idea: Probabilistic learning
• General principles for learning from data
10
Conditional Independence
• A is conditionally independent of B given C iff
p(a | b, c) = p(a | c)
(this also implies that B is conditionally independent of A given C)
• In words, B provides no information about A if the value of C is known
• Example:
– a = "reading ability"
– b = "height"
– c = "age"
• Note that conditional independence does not imply marginal independence
11
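The definition above can be checked numerically. Below is a sketch (with illustrative probability tables) that builds a joint p(a,b,c) = p(c) p(a|c) p(b|c), so that A and B are conditionally independent given C, and then verifies p(a | b, c) = p(a | c):

```python
import numpy as np

# Sketch: build a joint p(a,b,c) = p(c) p(a|c) p(b|c), so A and B are
# conditionally independent given C. All numbers are illustrative.
p_c = np.array([0.3, 0.7])
p_a_given_c = np.array([[0.9, 0.1],    # rows: c, cols: a
                        [0.2, 0.8]])
p_b_given_c = np.array([[0.6, 0.4],    # rows: c, cols: b
                        [0.5, 0.5]])

# joint[a, b, c] = p(c) p(a|c) p(b|c)
joint = np.einsum('c,ca,cb->abc', p_c, p_a_given_c, p_b_given_c)

p_bc = joint.sum(axis=0)                # p(b,c)
p_a_given_bc = joint / p_bc             # p(a|b,c), indexed [a,b,c]
p_ac = joint.sum(axis=1)                # p(a,c)
p_a_given_c2 = p_ac / p_ac.sum(axis=0)  # p(a|c), recovered from the joint

# p(a|b,c) equals p(a|c) for every b: B adds nothing once C is known
print(np.allclose(p_a_given_bc, p_a_given_c2[:, None, :]))  # True
```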
Graphical Models
• Represent dependency structure with a directed graph
– Node <-> random variable
– Edges encode dependencies
• Absence of edge -> conditional independence
– Directed and undirected versions
• Why is this useful?
– A language for communication
– A language for computation
12
Examples of 3‐way Graphical Models
A   B   C

Marginal Independence: p(A,B,C) = p(A) p(B) p(C)
13
Examples of 3‐way Graphical Models
A

B   C

Conditionally independent effects: p(A,B,C) = p(B|A) p(C|A) p(A)

B and C are conditionally independent given A

e.g., A is a disease, and we model B and C as conditionally independent symptoms given A
14
Examples of 3‐way Graphical Models
A   B

C

Independent Causes: p(A,B,C) = p(C|A,B) p(A) p(B)
15
Examples of 3‐way Graphical Models
A   B   C

Markov dependence: p(A,B,C) = p(C|B) p(B|A) p(A)
16
Directed Graphical Models
A B
C
p(A,B,C) = p(C|A,B)p(A)p(B)
17
Directed Graphical Models
A B
C
In general, p(X1, X2, ..., XN) = ∏i p(Xi | parents(Xi))
p(A,B,C) = p(C|A,B)p(A)p(B)
18
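The factored form above can be sketched for the graph A → C ← B, where p(A,B,C) = p(C|A,B) p(A) p(B). All probability tables below are illustrative:

```python
import numpy as np

# Sketch of the factored form for the graph A -> C <- B:
# p(A,B,C) = p(C|A,B) p(A) p(B). Tables are illustrative.
p_a = np.array([0.6, 0.4])
p_b = np.array([0.5, 0.5])
p_c_given_ab = np.array([[[0.9, 0.1],   # indexed [a, b, c]
                          [0.7, 0.3]],
                         [[0.4, 0.6],
                          [0.2, 0.8]]])

# joint[a, b, c] = p(a) p(b) p(c|a,b): the product over nodes of
# p(node | parents), as in the general formula above
joint = p_a[:, None, None] * p_b[None, :, None] * p_c_given_ab

print(round(joint.sum(), 6))                               # 1.0: valid distribution
print(np.allclose(joint.sum(axis=2), np.outer(p_a, p_b)))  # True: A and B marginally independent
```

Note that marginalizing out C recovers p(a)p(b), matching the "independent causes" structure: A and B have no edge between them and no common parent.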
Directed Graphical Models
A B
C
• Probability model has simple factored form
• Directed edges => direct dependence
• Absence of an edge => conditional independence
• Also known as belief networks, Bayesian networks, causal networks
In general, p(X1, X2, ..., XN) = ∏i p(Xi | parents(Xi))
p(A,B,C) = p(C|A,B)p(A)p(B)
19
Reminders from Probability….
• Law of Total Probability
P(a) = Σb P(a, b) = Σb P(a | b) P(b)
– Conditional version:
P(a|c) = Σb P(a, b | c) = Σb P(a | b, c) P(b|c)
• Factorization or Chain Rule
– P(a, b, c, d) = P(a | b, c, d) P(b | c, d) P(c | d) P(d), or
= P(b | a, c, d) P(c | a, d) P(d | a) P(a), or
= ...
20
Probability Calculations on Graphs
• General algorithms exist - beyond trees
– Complexity is typically O(m^(number of parents))
(where m = arity of each node)
– If single parents (e.g., a tree) -> O(m)
– The sparser the graph, the lower the complexity
• Technique can be "automated"
– i.e., a fully general algorithm for arbitrary graphs
– For continuous variables:
• replace sum with integral
– For identification of most likely values:
• replace sum with max operator
21
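To sketch how graph structure keeps these computations cheap, consider the Markov chain A → B → C from earlier, where p(a,b,c) = p(a) p(b|a) p(c|b). Summing out variables one at a time costs O(m²) per step instead of enumerating the full O(m³) joint; replacing sums with maxes finds the probability of the single most likely configuration. The tables below are illustrative:

```python
import numpy as np

# Sketch: marginal inference on the Markov chain A -> B -> C,
# p(a,b,c) = p(a) p(b|a) p(c|b).  Tables are illustrative.
p_a = np.array([0.2, 0.8])
p_b_given_a = np.array([[0.5, 0.5],   # rows: a, cols: b
                        [0.1, 0.9]])
p_c_given_b = np.array([[0.7, 0.3],   # rows: b, cols: c
                        [0.4, 0.6]])

p_b = p_a @ p_b_given_a    # sum over a first: p(b) = sum_a p(a) p(b|a)
p_c = p_b @ p_c_given_b    # then sum over b:  p(c) = sum_b p(b) p(c|b)

# Replacing sum with max gives the probability of the best (a,b,c):
m_b = np.max(p_a[:, None] * p_b_given_a, axis=0)
best = np.max(m_b[:, None] * p_c_given_b)

print(np.allclose(p_c.sum(), 1.0))  # True: p(c) is a valid distribution
```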
Probabilistic Model
Real World Data
P(Data | Parameters)
P(Parameters | Data)
Generative Model, Probability
Inference, Statistics
22
The Likelihood Function
• Likelihood = p(data | parameters)
= p(D | θ) = L(θ)
• The likelihood tells us how likely the observed data are, conditioned on a particular setting of the parameters θ
• Details
– Constants that do not involve θ can be dropped in defining L(θ)
– Often easier to work with log L(θ)
23
Comments on the Likelihood Function
• Constructing a likelihood function L(θ) is the first step in probabilistic modeling
• The likelihood function implicitly assumes an underlying probabilistic model M with parameters θ
• L(θ) connects the model to the observed data
• Graphical models provide a useful language for constructing likelihoods
24
Binomial Likelihood
• Binomial model
– N memoryless trials, 2 outcomes
– probability θ of success at each trial
• Observed data
– r successes in n trials
– Defines a likelihood:
L(θ) = p(D | θ) = p(successes) × p(non-successes)
= θ^r (1-θ)^(n-r)
25
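The binomial log-likelihood above, log L(θ) = r log θ + (n-r) log(1-θ), can be evaluated and maximized directly. Below is a sketch with illustrative counts; the grid maximizer recovers the familiar closed-form answer θ = r/n:

```python
import math

# Sketch of the binomial log-likelihood: r successes in n trials
# (counts illustrative).  The maximizer is theta = r / n.
n, r = 20, 14

def log_likelihood(theta):
    # log of theta^r (1-theta)^(n-r)
    return r * math.log(theta) + (n - r) * math.log(1 - theta)

# Evaluate on a grid over (0, 1) and pick the best theta:
grid = [i / 1000 for i in range(1, 1000)]
theta_ml = max(grid, key=log_likelihood)

print(theta_ml)   # 0.7, i.e. r / n
```

Working in log space avoids underflow for large n and turns the product of per-trial probabilities into a sum, as the earlier slide on log L(θ) suggests.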
Binomial Likelihood Examples
26
Graphical Models
• Left: data points are conditionally independent given θ
• Right: plate notation (same model as on the left); repeated nodes are drawn inside a box (plate), and the number in the lower right-hand corner specifies the number of repetitions of the node
27
• Represent using a graphical model:
• Assume each data case was generated independently from the same distribution
• Data cases are only independent conditional on the parameters θ
• Marginally, the data cases are dependent
• The order in which the data cases arrive makes no difference to the beliefs about θ (all orderings have the same sufficient statistics): the data are exchangeable
28
Graphical Models
• To avoid visual clutter, use a form of syntactic sugar called plates
• Draw a little box around the repeated variables, with the convention that nodes within the box are repeated when the model is unrolled
• Bottom right corner of the box: the number of copies or repetitions
• The corresponding joint distribution has the form:
p(θ, D) = p(θ) ∏i p(wi | θ)
29
Plate Notation
Multinomial Likelihood
• Multinomial model
– N memoryless trials, K outcomes
– Probability vector θ for the outcomes at each trial
• Observed data
– nj occurrences of outcome j in n trials
– Defines a likelihood:
L(θ) = p(D | θ) = ∏j θj^nj
30
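The multinomial log-likelihood, log L(θ) = Σj nj log θj, is maximized by the empirical frequencies θj = nj / n. A small sketch with illustrative counts:

```python
import math

# Sketch of the multinomial likelihood: counts n_j of each of K = 3
# outcomes in n trials (counts illustrative).
counts = [5, 3, 2]    # n_j, with n = 10 trials
n = sum(counts)

def log_likelihood(theta):
    # log of prod_j theta_j^{n_j}
    return sum(nj * math.log(tj) for nj, tj in zip(counts, theta))

# The ML estimate is the vector of empirical frequencies n_j / n:
theta_ml = [nj / n for nj in counts]

print(theta_ml)  # [0.5, 0.3, 0.2]
print(log_likelihood(theta_ml) >= log_likelihood([1/3, 1/3, 1/3]))  # True
```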
Graphical Model for Multinomial
w1   w2   ...   wn

θ = [ p(w1), p(w2), ..., p(wk) ]

θ: parameters
w1, ..., wn: observed data
31
“Plate” Notation
wi
i=1:n
Data = D = {w1,…wn}
Model parameters
Plate (rectangle) indicates replicated nodes in a graphical model
Variables within a plate are conditionally independent given the parent
32
Learning in Graphical Models
wi
i=1:n
Data = D = {w1,…wn}
Model parameters
• Can view learning in a graphical model as computing the most likely value of the parameter node θ given the data nodes
33
Maximum Likelihood (ML) Principle
wi
i=1:n
L(θ) = p(Data | θ) = ∏i p(wi | θ)

Maximum Likelihood: θML = arg maxθ { Likelihood(θ) }

Select the parameters θ that make the observed data most likely
Data = {w1,…wn}
Model parameters
34
The Bayesian Approach to Learning
wi
i=1:n
Fully Bayesian:
p(θ | Data) = p(Data | θ) p(θ) / p(Data)

Maximum A Posteriori:
θMAP = arg maxθ { Likelihood(θ) × Prior(θ) }

Prior(θ) = p(θ)
35
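As a sketch of the MAP formula above for the binomial model, take a Beta(a, b) prior on θ (a standard choice, not specified in the slides; all numbers illustrative). The posterior is then Beta(a + r, b + n - r), and maximizing Likelihood(θ) × Prior(θ) on a grid recovers the known posterior mode (r + a - 1) / (n + a + b - 2):

```python
import math

# Sketch: MAP estimation for the binomial model with a Beta(a, b) prior.
# Numbers are illustrative; a = b = 1 (flat prior) recovers plain ML.
n, r = 20, 14
a, b = 2.0, 2.0   # prior pseudo-counts

def log_posterior(theta):
    # log Likelihood(theta) + log Prior(theta), dropping theta-free constants:
    # (r + a - 1) log theta + (n - r + b - 1) log(1 - theta)
    return (r + a - 1) * math.log(theta) + (n - r + b - 1) * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]
theta_map = max(grid, key=log_posterior)

# Agrees (to grid resolution) with the closed-form Beta posterior mode:
print(abs(theta_map - (r + a - 1) / (n + a + b - 2)) < 1e-3)  # True
```

With these pseudo-counts the MAP estimate is pulled slightly toward 0.5 relative to the ML estimate r/n = 0.7, which is the usual regularizing effect of the prior.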
Summary of Bayesian Learning
• Can use graphical models to describe relationships between parameters and data
• P(data | parameters) = likelihood function
• P(parameters) = prior
– In applications such as text mining, the prior can be "uninformative", i.e., flat
– The prior can also be optimized for prediction (e.g., on validation data)
• We can compute P(parameters | data, prior), or a "point estimate" (e.g., the posterior mode or mean)
• Computation of posterior estimates can be computationally intractable
– Monte Carlo techniques are often used
36