Probabilistic Graphical Models
Structure learning in Bayesian networks
Siamak Ravanbakhsh, Fall 2019
Learning objectives

why is structure learning hard?
two approaches to structure learning:
    constraint-based methods
    score-based methods
MLE vs. Bayesian score
Structure learning in BayesNets: family of methods

constraint-based methods: estimate conditional independencies from the data, then find compatible BayesNets
score-based methods: search over the combinatorial space of $2^{O(n^2)}$ structures, maximizing a score
Bayesian model averaging: integrate over all possible structures
Structure learning in BayesNets: identifiable up to I-equivalence

constraint-based methods estimate conditional independencies from the data and find compatible BayesNets
the ideal output is a perfect map (P-map): a DAG with the same set of conditional independencies (CIs) as the data distribution, $I(G) = I(p_D)$
each CI query $X \perp Y \mid Z$? is answered by hypothesis testing on the data
first attempt: a DAG that is an I-map for $p_D$, i.e., $I(G) \subseteq I(p_D)$
Minimal I-map from CI tests

a minimal I-map is a DAG where removing any edge violates the I-map property

input: CI test oracle; an ordering $X_1, \ldots, X_n$
output: a minimal I-map G
for $i = 1, \ldots, n$:
    find a minimal $U \subseteq \{X_1, \ldots, X_{i-1}\}$ such that $X_i \perp \{X_1, \ldots, X_{i-1}\} \setminus U \mid U$
    set $Pa_{X_i} \leftarrow U$

this relies on the local Markov property $X_i \perp \mathrm{NonDesc}_{X_i} \mid Pa_{X_i}$
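A minimal sketch of this loop in Python, assuming a hypothetical exact `ci_oracle(x, rest, cond)` callable (not part of the slides); it tries candidate parent sets in order of increasing cardinality, so the first separating set it finds is minimal:

```python
from itertools import combinations

def minimal_imap(variables, ci_oracle):
    """Build a minimal I-map for the given variable ordering.

    `variables` is the ordered list X_1, ..., X_n; `ci_oracle(x, rest, cond)`
    (an assumed interface) answers whether x is conditionally independent
    of the variables in `rest` given those in `cond`.
    """
    parents = {}
    for i, x in enumerate(variables):
        predecessors = variables[:i]
        # try candidate parent sets by increasing size, so the first one
        # that separates x from its remaining predecessors is minimal
        for size in range(len(predecessors) + 1):
            done = False
            for u in combinations(predecessors, size):
                rest = [v for v in predecessors if v not in u]
                if ci_oracle(x, rest, list(u)):
                    parents[x] = list(u)
                    done = True
                    break
            if done:
                break
    return parents
```

The loop always terminates: when U is the full predecessor set, the remaining set is empty and the CI holds trivially.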
Minimal I-map from CI tests: problems

CI tests involve many variables
the number of CI tests is exponential
a minimal I-map may be far from a P-map
different orderings give different graphs. Example:
    D, I, S, G, L (a topological ordering)
    L, S, G, I, D
    L, D, S, I, G
Structure learning in BayesNets: identifiable up to I-equivalence

can we find a perfect map with fewer CI tests, involving fewer variables?
second attempt: a DAG that is a P-map for $p_D$, i.e., $I(G) = I(p_D)$
Perfect map from CI tests

a P-map is identifiable only up to I-equivalence: the same set of CIs, i.e., the same skeleton and the same immoralities

procedure:
1. find the undirected skeleton using CI tests
2. identify immoralities in the undirected graph
3. propagate the constraints to orient the remaining edges
Perfect map from CI tests: 1. finding the undirected skeleton

observation: if X and Y are not adjacent, then $X \perp Y \mid Pa_X$ OR $X \perp Y \mid Pa_Y$
assumption: the max number of parents is d
idea: search over all conditioning subsets of size $\leq d$, and check the CI above

input: CI oracle; bound d on the number of parents
output: undirected skeleton
initialize H as a complete undirected graph
for all pairs $X_i, X_j$:
    for all subsets U of size $\leq d$ (within the current neighbors of $X_i, X_j$):
        if $X_i \perp X_j \mid U$ then remove $X_i - X_j$ from H
return H

cost: $O(n^2) \times O((n-2)^d) = O(n^{d+2})$ CI tests
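A sketch of this skeleton search, again assuming a hypothetical exact `ci_oracle(x, y, cond)`. It also records, for each removed edge, the separating set that removed it; that witness set is exactly what the immorality step below needs:

```python
from itertools import combinations

def find_skeleton(variables, ci_oracle, d):
    """Skeleton search sketch (assumed `ci_oracle` interface).
    Returns the undirected edge set and the separating set recorded
    for each removed edge."""
    edges = {frozenset(p) for p in combinations(variables, 2)}
    sepset = {}

    def neighbors(v):
        return {u for e in edges for u in e if v in e and u != v}

    for xi, xj in combinations(variables, 2):
        e = frozenset((xi, xj))
        removed = False
        # condition only on subsets of the current neighbors, up to size d
        candidates = sorted((neighbors(xi) | neighbors(xj)) - {xi, xj})
        for size in range(d + 1):
            for u in combinations(candidates, size):
                if ci_oracle(xi, xj, list(u)):
                    edges.discard(e)
                    sepset[e] = set(u)  # witness set, saved for step 2
                    removed = True
                    break
            if removed:
                break
    return edges, sepset
```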
Perfect map from CI tests: 2. finding the immoralities

potential immorality: $X - Z, Y - Z \in H$ and $X - Y \notin H$
it is not an immorality only if $X \perp Y \mid U \Rightarrow Z \in U$
so save the witness set U used when removing X - Y from the skeleton, and check whether $Z \in U$; if not, we have the immorality $X \rightarrow Z \leftarrow Y$
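A sketch of this check in Python, reusing the `sepset` witness sets returned by the `find_skeleton` sketch above (names are illustrative):

```python
from itertools import combinations

def find_immoralities(variables, edges, sepset):
    """Orient v-structures X -> Z <- Y using the saved witness sets."""
    immoralities = []
    for z in variables:
        nbrs = sorted(u for e in edges for u in e if z in e and u != z)
        for x, y in combinations(nbrs, 2):
            if frozenset((x, y)) in edges:
                continue  # X and Y adjacent: not a potential immorality
            # X - Z - Y with X, Y non-adjacent is an immorality unless Z
            # appeared in the set that separated X from Y
            if z not in sepset.get(frozenset((x, y)), set()):
                immoralities.append((x, z, y))
    return immoralities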
Perfect map from CI tests: 3. propagate the constraints

at this point we have a mix of directed and undirected edges
add directions using a small set of rules (needed to preserve the immoralities and the DAG structure), until convergence
for exact CI tests, this recovers the exact I-equivalence family

Example: ground-truth DAG; its undirected skeleton + immoralities; the remaining edges oriented using rules R1, R2, R3
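The slides name the rules R1, R2, R3 without spelling them out; the sketch below assumes the standard formulation of these orientation rules (as in Koller & Friedman), so the rule bodies are an assumption rather than taken from the slides:

```python
def propagate_constraints(nodes, directed, undirected):
    """Constraint-propagation sketch. `directed` holds (a, b) pairs meaning
    a -> b; `undirected` holds frozenset edges. Rule bodies follow one
    common formulation of R1-R3 (an assumption)."""

    def adjacent(a, b):
        return (a, b) in directed or (b, a) in directed \
            or frozenset((a, b)) in undirected

    changed = True
    while changed:
        changed = False
        for e in list(undirected):
            for x, y in (tuple(e), tuple(e)[::-1]):
                others = [w for w in nodes if w not in (x, y)]
                # R1: w -> x - y with w, y non-adjacent  =>  x -> y
                #     (otherwise a new immorality w -> x <- y would appear)
                r1 = any((w, x) in directed and not adjacent(w, y)
                         for w in others)
                # R2: x -> w -> y with x - y  =>  x -> y (avoid a cycle)
                r2 = any((x, w) in directed and (w, y) in directed
                         for w in others)
                # R3: x - w1 -> y, x - w2 -> y, w1, w2 non-adjacent  =>  x -> y
                r3 = any((w1, y) in directed and (w2, y) in directed
                         and frozenset((x, w1)) in undirected
                         and frozenset((x, w2)) in undirected
                         and not adjacent(w1, w2)
                         for w1 in others for w2 in others if w1 != w2)
                if r1 or r2 or r3:
                    undirected.discard(e)
                    directed.add((x, y))
                    changed = True
                    break
    return directed, undirected
```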
Conditional independence (CI) test

how to decide $X \perp Y \mid Z$ from the dataset D?
measure the deviance of $p_D(X, Y \mid Z)$ from $p_D(X \mid Z)\, p_D(Y \mid Z)$:

conditional mutual information:
$d_I(D) = \mathbb{E}_{p_D(Z)}\big[ D\big( p_D(X, Y \mid Z) \,\|\, p_D(X \mid Z)\, p_D(Y \mid Z) \big) \big]$

$\chi^2$ statistic:
$d_{\chi^2}(D) = |D| \sum_{x,y,z} \frac{\big( p_D(x, y, z) - p_D(z)\, p_D(x \mid z)\, p_D(y \mid z) \big)^2}{p_D(z)\, p_D(x \mid z)\, p_D(y \mid z)}$

where $p_D$ uses frequencies in the dataset

a large deviance rejects the null hypothesis (of conditional independence): pick a threshold t and reject when $d(D) > t$
the p-value is the probability of false rejection, over all possible datasets:
$\mathrm{pvalue}(t) = P(\{D : d(D) > t\} \mid X \perp Y \mid Z)$
it is possible to derive the distribution of the deviance measure, e.g., the $\chi^2$ distribution; reject the hypothesis (CI) for small p-values (e.g., .05)
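A sketch of the $\chi^2$ test above for discrete data, using scipy for the tail probability. The function name and interface are illustrative; a production test would also handle strata with sparse counts more carefully:

```python
import numpy as np
from scipy.stats import chi2

def chi2_ci_test(x, y, z, alpha=0.05):
    """Chi-squared test of X ⊥ Y | Z. x, y, z are equal-length integer
    arrays; z codes the joint configuration of the conditioning set."""
    stat = 0.0
    for zv in np.unique(z):
        mask = z == zv
        # contingency table of X, Y within this stratum of Z
        _, xi = np.unique(x[mask], return_inverse=True)
        _, yi = np.unique(y[mask], return_inverse=True)
        counts = np.zeros((xi.max() + 1, yi.max() + 1))
        np.add.at(counts, (xi, yi), 1)
        # expected counts under CI: |D| p(z) p(x|z) p(y|z)
        expected = (counts.sum(1, keepdims=True)
                    * counts.sum(0, keepdims=True) / mask.sum())
        stat += ((counts - expected) ** 2 / expected).sum()
    # degrees of freedom for the null hypothesis X ⊥ Y | Z
    dof = (len(np.unique(x)) - 1) * (len(np.unique(y)) - 1) * len(np.unique(z))
    pvalue = chi2.sf(stat, dof)
    return stat, pvalue, pvalue < alpha  # reject CI when the p-value is small
```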
Structure learning in BayesNets: family of methods

so far: constraint-based methods
next: score-based methods: search over the combinatorial space, maximizing a score
Mutual information

how much information does X encode about Y?
$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$
the reduction in the uncertainty of X after observing Y
conditional entropy: $H(Y \mid X) = \sum_x p(x)\, H(p(y \mid x))$
symmetric: $I(X, Y) = I(Y, X)$
$I(X, Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} = D_{KL}\big( p(x, y) \,\|\, p(x)\, p(y) \big) \geq 0$ (non-negative)
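A short sketch of estimating $I(X, Y)$ from data using empirical frequencies, exactly as the score-based methods below will need (names illustrative):

```python
import numpy as np

def empirical_mutual_information(x, y):
    """Estimate I(X, Y) in nats from paired samples of two discrete
    variables, using empirical frequencies."""
    n = len(x)
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1)
    joint /= n                                  # p_D(x, y)
    px = joint.sum(axis=1, keepdims=True)       # p_D(x)
    py = joint.sum(axis=0, keepdims=True)       # p_D(y)
    mask = joint > 0                            # 0 log 0 = 0 by convention
    return (joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum()
```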
MLE in Bayes-nets: mutual information form

log-likelihood:
$\ell(D; \theta) = \sum_{x \in D} \sum_i \log p(x_i \mid pa_{x_i}; \theta_{i \mid pa_i})$
$\quad = \sum_i \sum_{(x_i, pa_{x_i}) \in D} \log p(x_i \mid pa_{x_i}; \theta_{i \mid pa_i})$
$\quad = N \sum_i \sum_{x_i, pa_{x_i}} p_D(x_i, pa_{x_i}) \log p(x_i \mid pa_{x_i}; \theta_{i \mid pa_i})$, using the empirical distribution $p_D$

substituting the MLE estimate:
$\ell(D; \theta^*) = N \sum_i \sum_{x_i, pa_{x_i}} p_D(x_i, pa_{x_i}) \log p_D(x_i \mid pa_{x_i})$
$\quad = N \sum_i \sum_{x_i, pa_{x_i}} p_D(x_i, pa_{x_i}) \left( \log \frac{p_D(x_i, pa_{x_i})}{p_D(x_i)\, p_D(pa_{x_i})} + \log p_D(x_i) \right)$
$\quad = N \sum_i \big( I_D(X_i, Pa_{X_i}) - H_D(X_i) \big)$, using the definition of mutual information
Optimal solution for trees

likelihood score: $\ell(D; \theta^*) = N \sum_i \big( I_D(X_i, Pa_{X_i}) - H_D(X_i) \big)$
the entropy term $\sum_i H_D(X_i)$ does not depend on the structure
structure learning algorithms use mutual information in the structure search; in a tree each node has at most one parent, so the score reduces to a sum of pairwise terms $I_D(X_i, X_j)$ over the edges

Chow-Liu algorithm: find the maximum spanning tree with mutual information as the edge weights, then add directions to the edges afterwards (see the sketch below)
make sure each node has at most one parent (i.e., no v-structures); since $I_D(X_j, X_i) = I_D(X_i, X_j)$, any such orientation attains the same score
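A compact Chow-Liu sketch using networkx. Here `mi(a, b)` is assumed to return the empirical mutual information $I_D(X_a, X_b)$, e.g., via the `empirical_mutual_information` sketch above:

```python
import networkx as nx

def chow_liu_tree(variables, mi):
    """Return directed edges (parent, child) of the Chow-Liu tree."""
    g = nx.Graph()
    for i, a in enumerate(variables):
        for b in variables[i + 1:]:
            g.add_edge(a, b, weight=mi(a, b))   # edge weight = mutual information
    tree = nx.maximum_spanning_tree(g)
    # orient all edges away from an arbitrary root: each node gets at most
    # one parent, so no v-structures are created
    return list(nx.bfs_tree(tree, variables[0]).edges())
```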
Bayesian Score for BayesNets

be Bayesian about both the structure G and the parameters $\theta$:
$P(G \mid D) \propto P(D \mid G)\, P(G)$
$\mathrm{score}_B(G, D) = \log P(D \mid G) + \log P(G)$

the marginal likelihood for a structure:
$P(D \mid G) = \int_{\theta \in \Theta_G} P(D \mid \theta, G)\, P(\theta \mid G)\, d\theta$
assuming local and global parameter independence, it factorizes into the marginal likelihood of each node
for the Dirichlet-multinomial it has a closed form

for a large sample size (and any exponential-family member):
$\mathrm{score}_B(G, D) \approx \ell(D; \theta^*_G) - \frac{\log |D|}{2} K$, where K is the number of parameters
this is the Bayesian Information Criterion (BIC)
compare the Akaike Information Criterion (AIC): $\ell(D; \theta^*_G) - K$
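A sketch of the BIC score for a discrete Bayes net, computing the MLE log-likelihood from frequency counts and subtracting the penalty term. The data layout and function names are illustrative:

```python
import numpy as np

def bic_score(data, parents):
    """BIC score of a structure. `data` maps variable names to
    integer-coded 1-d arrays; `parents` maps each variable to its
    (possibly empty) list of parent names."""
    n = len(next(iter(data.values())))
    loglik, num_params = 0.0, 0
    for var, pa in parents.items():
        _, x_idx = np.unique(data[var], return_inverse=True)
        card_x = x_idx.max() + 1
        # encode each joint parent configuration as a single integer
        if pa:
            pa_cols = np.stack([data[p] for p in pa], axis=1)
            _, pa_idx = np.unique(pa_cols, axis=0, return_inverse=True)
        else:
            pa_idx = np.zeros(n, dtype=int)
        num_pa_configs = pa_idx.max() + 1
        counts = np.zeros((num_pa_configs, card_x))
        np.add.at(counts, (pa_idx, x_idx), 1)
        probs = counts / counts.sum(axis=1, keepdims=True)  # MLE CPTs
        loglik += (counts[counts > 0] * np.log(probs[counts > 0])).sum()
        num_params += num_pa_configs * (card_x - 1)
    return loglik - 0.5 * np.log(n) * num_params  # ℓ(D; θ*) − (log|D|/2) K
```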
Bayesian Score for BayesNets: example

the Bayesian score is biased towards simpler structures
[figure: Bayesian score vs. sample size $|D|$, for data sampled from the ICU-Alarm BayesNet, comparing the true model (509 params.) with simplified models (359 and 214 params.)]
Structure search

$\arg\max_G \mathrm{Score}(D, G)$ is NP-hard
use heuristic search algorithms (discussed for MAP inference)
local search using: edge addition, edge deletion, edge reversal
$O(N^2)$ possible moves; for each move, collect sufficient statistics (frequencies) and estimate the score
use the decomposition of the score: a local move only changes the score terms of the affected families, so only those need recomputing (see the sketch below)

example: ICU-Alarm network
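A greedy hill-climbing sketch of this local search; `score_fn(data, parents)` could be the `bic_score` sketch above. For simplicity it rescores the whole structure at each move; a real implementation would exploit the score decomposition and rescore only the changed families:

```python
def is_acyclic(parents):
    """Depth-first check that the parent map encodes a DAG."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in parents}

    def visit(v):
        color[v] = GRAY
        for p in parents[v]:
            if color[p] == GRAY or (color[p] == WHITE and not visit(p)):
                return False  # back edge: a directed cycle
        color[v] = BLACK
        return True

    return all(visit(v) for v in parents if color[v] == WHITE)

def hill_climb(variables, data, score_fn):
    """Greedy local search over DAG structures, using the three moves:
    edge addition, edge deletion, edge reversal."""
    parents = {v: [] for v in variables}
    best = score_fn(data, parents)
    improved = True
    while improved:
        improved = False
        for a in variables:
            for b in variables:
                if a == b:
                    continue

                def copy():
                    return {v: list(ps) for v, ps in parents.items()}

                moves = []
                if a not in parents[b] and b not in parents[a]:
                    m = copy(); m[b].append(a); moves.append(m)  # add a -> b
                if a in parents[b]:
                    m = copy(); m[b].remove(a); moves.append(m)  # delete a -> b
                    m = copy(); m[b].remove(a); m[a].append(b)
                    moves.append(m)                              # reverse to b -> a
                for m in moves:
                    if is_acyclic(m):
                        s = score_fn(data, m)
                        if s > best:
                            parents, best, improved = m, s, True
    return parents, best
```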
Summary

structure learning is NP-hard; make assumptions to simplify
constraint-based methods: limit the max number of parents; rely on CI tests; identify the I-equivalence class