Copyright (c) 2002 by SNU CSE Biointelligence Lab. 1 SURVEY: Foundations of Bayesian Networks O, Jangmin 2002/10/29 Last modified 2002/10/29.


SURVEY: Foundations of Bayesian Networks

O, Jangmin

2002/10/29

Last modified 2002/10/29


Contents

• From DAG to Junction Tree
• From Elimination Tree to Junction Tree
• Junction Tree Algorithms
• Learning Bayesian Networks


Typical Example of DAG

[Figure: Simple DAG on the vertices A, B, C, D, F, G]


1. Topological Sort

Algorithm 4.1 [Topological sort]
• Begin with all vertices unnumbered.
• Set counter i = 1.
• While any vertices remain:
– Select any vertex that has no parents;
– number the selected vertex as i;
– delete the numbered vertex and all its adjacent edges from the graph;
– increment i by 1.

Objective: acquiring a well-ordering.
Well-ordering: the predecessors of any node have lower numbers than the node itself.
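
The steps of Algorithm 4.1 can be sketched in Python as follows. This is a minimal illustration, not the deck's own code: the function name `topological_sort`, the `parents` dictionary layout, and the edge set `DAG_PARENTS` are assumptions (the slides do not list the edges of the Simple DAG; this hypothetical edge set was chosen so that its moral graph has the cliques shown later).

```python
def topological_sort(parents):
    """Number the vertices of a DAG so that every parent precedes its children.

    `parents` maps each vertex to the set of its parents.
    """
    remaining = {v: set(ps) for v, ps in parents.items()}
    numbering = {}
    i = 1
    while remaining:
        # Select any vertex that has no (remaining) parents; ties go to the
        # alphabetically first vertex so the result is reproducible.
        roots = sorted(v for v, ps in remaining.items() if not ps)
        if not roots:
            raise ValueError("graph contains a directed cycle")
        v = roots[0]
        numbering[v] = i
        # Delete the numbered vertex and all its edges from the graph.
        del remaining[v]
        for ps in remaining.values():
            ps.discard(v)
        i += 1
    return numbering


# Hypothetical edge set for the Simple DAG (an assumption, see above).
DAG_PARENTS = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "B"},
    "F": {"B", "C"},
    "G": {"F"},
}
numbering = topological_sort(DAG_PARENTS)
```

Under this assumed edge set the vertices A, B, C, D, F, G receive the numbers 1 to 6 in that order.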


1. Topological Sort (1)

[Figure sequence: the Simple DAG is numbered one vertex per frame; the final frame shows the numbers 1 to 6 assigned to the six vertices.]


2. Moral Graph

• Making the moral graph of a DAG
– Add an undirected edge between every pair of nodes that have the same child.
– Remove directions.
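
A short sketch of the two moralization steps, assuming the DAG is given as a dictionary from each vertex to its parent set. The name `moralize` and the example edge set are hypothetical; the slides do not list the DAG's edges.

```python
from itertools import combinations


def moralize(parents):
    """Return the moral graph of a DAG as a set of undirected edges.

    `parents` maps each vertex to the set of its parents; every undirected
    edge is represented as a frozenset of its two endpoints.
    """
    edges = set()
    for child, ps in parents.items():
        # Remove directions: keep each parent-child link as an undirected edge.
        for p in ps:
            edges.add(frozenset((p, child)))
        # "Marry" the parents: connect every pair of nodes with this child.
        for p, q in combinations(sorted(ps), 2):
            edges.add(frozenset((p, q)))
    return edges


# Hypothetical edge set for the Simple DAG (an assumption).
DAG_PARENTS = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "B"},
    "F": {"B", "C"},
    "G": {"F"},
}
moral_edges = moralize(DAG_PARENTS)
```

In this assumed DAG the only edge added by moralization is B-C, because B and C are both parents of F and are not otherwise linked.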


2. Moral Graph (1), (2)

[Figure sequence: moralization of the Simple DAG, two frames; the vertices carry the topological numbers 1 to 6.]


Junction tree

• Definition
– A tree whose nodes are sets C1, C2, ...
– For any two nodes C1 and C2, the intersection C1 ∩ C2 is contained in every node on the path between C1 and C2.

• Corollaries
– For an undirected graph the following are equivalent: the graph is decomposable; it is chordal; its cliques form a junction tree; it admits a perfect numbering.

Perfect numbering: for every j, ne(vj) ∩ {v1, ..., vj-1} induces a complete subgraph.


3. Maximum Cardinality Search (1)

Algorithm 4.9 [Maximum Cardinality Search]
• Set Output := ‘G is chordal’.
• Set counter i := 1.
• Set L := ∅.
• For all v ∈ V, set c(v) := 0.
• While L ≠ V:
– Set U := V \ L.
– Select any vertex v maximizing c(v) over v ∈ U and label it vi.
– If Πvi := ne(vi) ∩ L is not complete in G:
Set Output := ‘G is not chordal’.
– Otherwise, set c(w) := c(w) + 1 for each vertex w ∈ ne(vi) ∩ U.
– Set L := L ∪ {vi}.
– Increment i by 1.
• Report Output.
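
The search can be sketched as below. The function name, the adjacency-dictionary input, and the alphabetical tie-breaking are assumptions made for illustration; the undirected example graph is the one implied by the cliques listed in these slides.

```python
def maximum_cardinality_search(neighbours):
    """Run maximum cardinality search on an undirected graph.

    `neighbours` maps each vertex to the set of its adjacent vertices.
    Returns (order, pi, chordal): the numbering v_1..v_k, the sets
    Pi_v = ne(v) ∩ L at the time v was numbered, and the chordality verdict.
    """
    chordal = True
    order, pi, labelled = [], {}, set()
    c = {v: 0 for v in neighbours}
    while len(labelled) < len(neighbours):
        unlabelled = [v for v in sorted(neighbours) if v not in labelled]
        # max() returns the first maximiser, so ties are broken alphabetically.
        v = max(unlabelled, key=lambda u: c[u])
        pi[v] = neighbours[v] & labelled
        complete = all(b in neighbours[a]
                       for a in pi[v] for b in pi[v] if a != b)
        if complete:
            for w in neighbours[v] - labelled:
                c[w] += 1
        else:
            chordal = False
        labelled.add(v)
        order.append(v)
    return order, pi, chordal


# Undirected graph implied by the cliques {A,B,C}, {A,B,D}, {B,C,F}, {F,G}.
MORAL_GRAPH = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D", "F"},
    "C": {"A", "B", "F"},
    "D": {"A", "B"},
    "F": {"B", "C", "G"},
    "G": {"F"},
}
order, pi, chordal = maximum_cardinality_search(MORAL_GRAPH)

# A chordless 4-cycle, used here only to show the negative verdict.
SQUARE = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
square_is_chordal = maximum_cardinality_search(SQUARE)[2]
```

With alphabetical tie-breaking the example graph is numbered A, B, C, D, F, G and reported chordal, while the 4-cycle is reported not chordal.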



[Figure sequence: maximum cardinality search on the example graph, one vertex numbered per frame. Each vertex is annotated with its number i and its set Π.]

1, Π = {} (vertex A)
2, Π = {A} (vertex B)
3, Π = {A, B} (vertex C)
4, Π = {A, B} (vertex D)
5, Π = {B, C} (vertex F)
6, Π = {F} (vertex G)

Output = “G is chordal”


4. Cliques of Chordal Graph (1)

Algorithm 4.11 [Finding the Cliques of a Chordal Graph]
• From the numbering (v1, ..., vk) obtained by maximum cardinality search, let πi = cardinality of Πvi.
• Make ladder nodes.
– vi is a ladder node if i = k,
– or if i < k and πi+1 < 1 + πi.
• Define cliques
– Cj = {λj} ∪ Πλj, where λj is the j-th ladder node.

C1, C2, ... possess the RIP (running intersection property).
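
The ladder-node rule is easy to check on the example. The sketch below takes the numbering and the Π sets from the maximum cardinality search frames as literal input; the function name `cliques_from_mcs` is an assumption.

```python
def cliques_from_mcs(order, pi):
    """Return the cliques of a chordal graph from an MCS numbering.

    `order` is the numbering (v_1, ..., v_k) and `pi[v]` is the set of
    already-numbered neighbours of v at the time v was numbered.
    """
    k = len(order)
    card = [len(pi[v]) for v in order]          # pi_i = |Pi_{v_i}|
    cliques = []
    for idx, v in enumerate(order):             # idx is 0-based
        is_last = idx == k - 1
        # Ladder node: the last vertex, or the cardinality fails to grow.
        if is_last or card[idx + 1] < 1 + card[idx]:
            cliques.append({v} | pi[v])
    return cliques


# Numbering and Pi sets as shown in the MCS example of these slides.
ORDER = ["A", "B", "C", "D", "F", "G"]
PI = {
    "A": set(),
    "B": {"A"},
    "C": {"A", "B"},
    "D": {"A", "B"},
    "F": {"B", "C"},
    "G": {"F"},
}
cliques = cliques_from_mcs(ORDER, PI)
```

Here the cardinalities are 0, 1, 2, 2, 2, 1, so the ladder nodes are C, D, F and G, which yields the four cliques of the example.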


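4. Cliques of Chordal Graph (2)

[Figure: the example graph with its maximum cardinality search labels (1, Π = {}; 2, Π = {A}; 3, Π = {A, B}; 4, Π = {A, B}; 5, Π = {B, C}; 6, Π = {F}) and the resulting cliques.]

C1 = {A, B, C}
C2 = {A, B, D}
C3 = {B, C, F}
C4 = {F, G}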


Running Intersection Property

• RIP: definition
– Given (C1, C2, ..., Ck),
– for all 1 < j ≤ k, there is an i < j such that Cj ∩ (C1 ∪ ... ∪ Cj-1) ⊆ Ci.
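
The definition translates directly into a check over an ordered list of sets. The helper name `has_rip` and the counter-example ordering are assumptions for illustration.

```python
def has_rip(cliques):
    """Check the running intersection property of an ordered list of sets."""
    for j in range(1, len(cliques)):
        earlier = set().union(*cliques[:j])       # C_1 ∪ ... ∪ C_{j-1}
        separator = cliques[j] & earlier
        # Some earlier C_i must contain the whole separator.
        if not any(separator <= cliques[i] for i in range(j)):
            return False
    return True


# Clique ordering from the example in these slides.
example = [{"A", "B", "C"}, {"A", "B", "D"}, {"B", "C", "F"}, {"F", "G"}]
example_ok = has_rip(example)

# A hypothetical ordering that violates the property: {B, C} is not
# contained in either of the two earlier sets.
bad_order_ok = has_rip([{"A", "B"}, {"C", "D"}, {"B", "C"}])
```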


5. Junction Tree Construction (1)

Algorithm 4.8 [Junction Tree Construction]
• From the cliques (C1, ..., Cp) of a chordal graph ordered with RIP,
• Associate a node of the tree with each clique Cj.
• For j = 2, ..., p, add an edge between Cj and Ci, where i is any one value in {1, ..., j-1} such that Cj ∩ (C1 ∪ ... ∪ Cj-1) ⊆ Ci.
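
A minimal sketch of Algorithm 4.8, assuming the cliques are already ordered with RIP. The function name `junction_tree` and the choice of the smallest valid i are assumptions; the algorithm allows any valid i.

```python
def junction_tree(cliques):
    """Build junction tree edges from cliques that are ordered with RIP.

    Returns a list of (i, j) pairs of 0-based clique indices with i < j.
    """
    edges = []
    for j in range(1, len(cliques)):
        separator = cliques[j] & set().union(*cliques[:j])
        # Pick the first earlier clique that contains the separator.
        candidates = [i for i in range(j) if separator <= cliques[i]]
        if not candidates:
            raise ValueError("the clique ordering does not have RIP")
        edges.append((candidates[0], j))
    return edges


# Cliques C1..C4 from the example in these slides (0-based in the code).
CLIQUES = [{"A", "B", "C"}, {"A", "B", "D"}, {"B", "C", "F"}, {"F", "G"}]
tree_edges = junction_tree(CLIQUES)
```

For the example this links ABD and BCF to ABC, and FG to BCF.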



[Figure sequence: the junction tree is built step by step from the cliques C1 = {A, B, C}, C2 = {A, B, D}, C3 = {B, C, F}, C4 = {F, G}; its nodes are drawn as ABC (C1), ABD (C2), BCF (C3), FG (C4).]


Contents

• From DAG to Junction Tree
• From Elimination Tree to Junction Tree
• Junction Tree Algorithms
• Learning Bayesian Networks


Triangulation (1)

• When is triangulation needed?
– When MCS (Maximum Cardinality Search) fails, i.e. the graph is not chordal.

• Triangulation
– introduces fill-in edges.
– produces a perfect numbering.

• Optimal triangulation: NP-hard
– The size of each clique matters...


Triangulation (2)

Algorithm 4.13 [One-step Look Ahead Triangulation]
• Start with all vertices unnumbered, set counter i := k.
• While there are still some unnumbered vertices:
– Select an unnumbered vertex v to optimize the criterion c(v), or
– select v as the i-th vertex of a given order.
– Label it with the number i.
– Form the set Ci consisting of vi and its unnumbered neighbours.
– Fill in edges where none exist between all pairs of vertices in Ci.
– Eliminate vi and decrement i by 1.
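
The fixed-order variant of this procedure can be sketched as follows. The function name `eliminate`, the return values, and the 4-cycle used to show a fill-in are assumptions; the first example graph and order are the ones used in these slides.

```python
def eliminate(neighbours, order):
    """Triangulate by elimination along a fixed order (v_1, ..., v_k).

    Vertices are eliminated from v_k down to v_1. Returns the elimination
    sets {i: C_i} and the set of fill-in edges that were added.
    """
    adj = {v: set(ns) for v, ns in neighbours.items()}
    elim_sets, fill_ins = {}, set()
    for i in range(len(order), 0, -1):
        v = order[i - 1]
        c_i = {v} | adj[v]              # v_i and its unnumbered neighbours
        # Fill in edges between all pairs of vertices in C_i.
        for a in c_i:
            for b in c_i:
                if a != b and b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill_ins.add(frozenset((a, b)))
        # Eliminate v_i from the working graph.
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
        elim_sets[i] = c_i
    return elim_sets, fill_ins


GRAPH = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D", "F"},
    "C": {"A", "B", "F"},
    "D": {"A", "B"},
    "F": {"B", "C", "G"},
    "G": {"F"},
}
elim_sets, fill_ins = eliminate(GRAPH, ["A", "B", "C", "D", "F", "G"])

# A chordless 4-cycle (hypothetical) needs one fill-in edge.
SQUARE = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
square_fill = eliminate(SQUARE, ["a", "b", "c", "d"])[1]
```

The example graph is already chordal, so no fill-in is produced for it; eliminating d first in the 4-cycle adds the edge a-c.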


Triangulation (3) to (8)

[Figure sequence: elimination on the example graph with the order (A, B, C, D, F, G), one vertex per frame, from i = 6 down to i = 1.]

6, C6 = {F, G}
5, C5 = {B, C, F}
4, C4 = {A, B, D}
3, C3 = {A, B, C}
2, C2 = {A, B}
1, C1 = {A}

Elimination set
• Cj contains vj.
• vj ∉ Cl for all l < j.
• (C1, ..., Ck) has RIP.
• The cliques of the triangulated graph G’ are contained in (C1, ..., Ck).


Elimination Tree Construction (1)

Algorithm 4.14 [Elimination Tree Construction]
• Associate a node of the tree with each set Ci.
• For j = 1, ..., k, if Cj contains more than one vertex, add an edge between Cj and Ci, where i is the largest index of a vertex in Cj \ {vj}.
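
As a sketch, the construction needs only the order and the elimination sets. The function name `elimination_tree` is an assumption; the input is the example from the triangulation frames.

```python
def elimination_tree(order, elim_sets):
    """Return elimination tree edges as (j, i) pairs of 1-based set indices.

    `order` is (v_1, ..., v_k) and `elim_sets[j]` is the set C_j.
    """
    index = {v: i for i, v in enumerate(order, start=1)}
    edges = []
    for j in range(1, len(order) + 1):
        rest = elim_sets[j] - {order[j - 1]}      # C_j \ {v_j}
        if rest:                                  # C_j has more than one vertex
            i = max(index[u] for u in rest)       # largest index in C_j \ {v_j}
            edges.append((j, i))
    return edges


ORDER = ["A", "B", "C", "D", "F", "G"]
ELIM_SETS = {
    1: {"A"},
    2: {"A", "B"},
    3: {"A", "B", "C"},
    4: {"A", "B", "D"},
    5: {"B", "C", "F"},
    6: {"F", "G"},
}
etree_edges = elimination_tree(ORDER, ELIM_SETS)
```

For the example, C2 attaches to C1, C3 and C4 attach to C2, C5 attaches to C3, and C6 attaches to C5.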



[Figure sequence: the elimination tree is built step by step. Each node is written as "eliminated vertex : remaining members": A: (C1), B:A (C2), C:AB (C3), D:AB (C4), F:BC (C5), G:F (C6).]


From etree to jtree (1)

Lemma 4.16
– Let C1, ..., Ck be a sequence of sets with RIP.
– Assume that Ct ⊆ Cp for some t ≠ p and that p is minimal with this property for fixed t. Then:
(i) If t > p, then C1, ..., Ct-1, Ct+1, ..., Ck has the RIP.
(ii) If t < p, then C1, ..., Ct-1, Cp, Ct+1, ..., Cp-1, Cp+1, ..., Ck has the RIP.

Simply removing a redundant elimination set might destroy the RIP.



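[Figure sequence: redundant sets are removed from the elimination tree with nodes A: (C1), B:A (C2), C:AB (C3), D:AB (C4), F:BC (C5), G:F (C6).]

Condition (ii): t = 1, p = 2
Remaining nodes: B:A (C2), C:AB (C3), D:AB (C4), F:BC (C5), G:F (C6)

Condition (ii): t = 2, p = 3
Remaining nodes: C:AB (C3), D:AB (C4), F:BC (C5), G:F (C6)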


MST for making jtree (1)

Algorithm
• From the elimination sets (C1, ..., Ck)
• Remove redundant Ci's.
• Make the junction graph.
– If |Ci ∩ Cj| > 0, add an edge between Ci and Cj.
– Set the weight of the edge to |Ci ∩ Cj|.
• Construct the MST (here: maximum weight spanning tree).

The resulting tree is a junction tree. The clique set also has RIP.
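
The procedure above can be sketched with a Kruskal-style maximum weight spanning tree. The function names, the union-find helper, and the tie-breaking by input order are assumptions, not part of the slides.

```python
from itertools import combinations


def remove_redundant(sets):
    """Drop every set that is strictly contained in another set."""
    return [s for s in sets if not any(s < t for t in sets)]


def junction_tree_by_mst(cliques):
    """Return maximum weight spanning tree edges (i, j, weight) of the junction graph."""
    # Junction graph: one weighted edge per pair of intersecting cliques.
    candidates = [(len(cliques[i] & cliques[j]), i, j)
                  for i, j in combinations(range(len(cliques)), 2)
                  if cliques[i] & cliques[j]]
    candidates.sort(key=lambda e: -e[0])          # heaviest first (stable sort)

    parent = list(range(len(cliques)))            # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, i, j in candidates:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:                      # the edge joins two components
            parent[root_i] = root_j
            tree.append((i, j, weight))
    return tree


# Elimination sets C1..C6 from the example in these slides.
ELIM_SETS = [{"A"}, {"A", "B"}, {"A", "B", "C"}, {"A", "B", "D"},
             {"B", "C", "F"}, {"F", "G"}]
cliques = remove_redundant(ELIM_SETS)
tree = junction_tree_by_mst(cliques)
```

On the example the junction graph has edge weights 2, 2, 1, 1 and the spanning tree keeps the edges of weight 2, 2, 1.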



[Figure: left, the junction graph on ABC (C1), ABD (C2), BCF (C3), FG (C4) with edge weights 2, 2, 1, 1; right, its MST, which keeps the edges of weight 2, 2, 1.]



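• Optimal jtree (for a fixed elimination ordering)
– Cost of edge e = (v, w): a function of $q_v$ and $q_w$ [exact formula lost in extraction], where

$$q_v = \prod_{X_i \in v} q_i, \qquad q_i:\ \text{\# of discrete values } X_i \text{ can take on.}$$

– Use the cost of an edge to break ties when constructing the MST (minimum preferred).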


Contents

• From DAG to Junction Tree
• From Elimination Tree to Junction Tree
• Junction Tree Algorithms
• Learning Bayesian Networks


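Collect phase

• From leaf to root

[Figure: clique Cj with its parent Ck and its children Ci, Ci’]

$$\phi_j^{*} = \phi_j \prod_{i \in \mathrm{child}(j)} \mu_{ij}$$

$$\mu_{jk} = \phi_j^{*} \downarrow S_{jk}$$

– $\phi_j$: initial potential
– $\phi_j^{*}$: updated potential
– $S_{jk}$: separator
– $\downarrow$: projection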


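Distribute phase

• From root to leaf

[Figure: clique Cj with its parent Ck and its children Ci, Ci’]

$$\phi_j^{**} = \phi_j^{*}\, \mu_{kj}$$

$$\mu_{ji} = \Big( \phi_j\, \mu_{kj} \prod_{i' \in \mathrm{child}(j),\ i' \neq i} \mu_{i'j} \Big) \downarrow S_{ij}$$

• $\phi_j^{**}$ contains the marginal distribution of clique j.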


Contents

• From DAG to Junction Tree
• From Elimination Tree to Junction Tree
• Junction Tree Algorithms
• Learning Bayesian Networks


Learning Paradigm

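• Known structure or unknown structure
• Full observability or partial observability
• Frequentist or Bayesian

The slide titles below abbreviate these choices: Ks/Us = known/unknown structure, Fo/Po = full/partial observability, Fr/Ba = frequentist/Bayesian.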


Ks, Fo, Fr (1)

• Given training set D = {D1, ..., DM}
• MLE of the parameters of each CPD
– MLE (Maximum Likelihood Estimate)
– CPD (Conditional Probability Distribution)

$$L = \log \prod_{m=1}^{M} \Pr(D_m \mid G) = \sum_{i=1}^{n} \sum_{m=1}^{M} \log P(X_i \mid Pa(X_i), D_m)$$

– The log-likelihood decomposes into one term per node.
– n: # of nodes
– M: # of data


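Ks, Fo, Fr (2)

• Multinomial distributions
– $\theta_{ijk} \overset{\text{def}}{=} P(X_i = k \mid Pa(X_i) = j)$, for tabular CPD
– Log-likelihood

$$L = \sum_i \sum_m \log \prod_{j,k} \theta_{ijk}^{I_{ijkm}} = \sum_i \sum_m \sum_{j,k} I_{ijkm} \log \theta_{ijk} = \sum_{ijk} N_{ijk} \log \theta_{ijk}$$

$$I_{ijkm} \overset{\text{def}}{=} I(X_i = k, Pa(X_i) = j \mid D_m), \qquad N_{ijk} \overset{\text{def}}{=} \sum_m I(X_i = k, Pa(X_i) = j \mid D_m)$$

– MLE

$$\hat{\theta}_{ijk} = \frac{N_{ijk}}{\sum_{k'} N_{ijk'}}, \qquad \text{constraint: } \sum_k \theta_{ijk} = 1 \text{ for all } i, j$$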
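
The MLE is just a normalised count table. The sketch below estimates one tabular CPD from fully observed cases; the function name `mle_cpt`, the data layout (one dictionary per case), and the toy data are assumptions.

```python
from collections import Counter


def mle_cpt(data, child, parents):
    """Maximum likelihood estimate of P(child = k | parents = j).

    `data` is a list of fully observed cases (dicts variable -> value).
    Returns {(j, k): theta_hat}, where j is the tuple of parent values.
    """
    n_jk = Counter()    # N_ijk for the chosen node i
    n_j = Counter()     # sum over k' of N_ijk'
    for case in data:
        j = tuple(case[p] for p in parents)
        k = case[child]
        n_jk[(j, k)] += 1
        n_j[j] += 1
    return {(j, k): n / n_j[j] for (j, k), n in n_jk.items()}


# Hypothetical toy data for a two-node network A -> B.
DATA = [
    {"A": 0, "B": 0},
    {"A": 0, "B": 1},
    {"A": 0, "B": 1},
    {"A": 1, "B": 1},
]
theta_b = mle_cpt(DATA, child="B", parents=["A"])
```

Configurations that never occur in the data get no entry, which is one practical reason to prefer the Bayesian estimate with pseudo counts discussed later.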


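Ks, Fo, Fr (3)

• MLE of multinomial distr.
– Constrained optimization

$$O = \sum_{ijk} N_{ijk} \log \theta_{ijk} + \sum_{ij} \lambda_{ij} \Big( 1 - \sum_k \theta_{ijk} \Big)$$

– Derivatives w.r.t. $\theta_{ijk}$

$$\frac{dO}{d\theta_{ijk}} = \frac{N_{ijk}}{\theta_{ijk}} - \lambda_{ij}$$

– Setting the derivatives w.r.t. $\theta_{ijk}$ to zero

$$\theta_{ijk} = \frac{N_{ijk}}{\lambda_{ij}}, \qquad \sum_k \theta_{ijk} = 1 \ \Rightarrow\ \lambda_{ij} = \sum_k N_{ijk}$$

$$\hat{\theta}_{ijk} = \frac{N_{ijk}}{\sum_{k'} N_{ijk'}}$$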


Ks, Fo, Fr (4)

• Conditional linear Gaussian distributions


Ks, Fo, Ba (1)

• Frequentist: point estimation
• Bayesian: distributional estimation


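Ks, Fo, Ba (2)

• Multinomial distributions
– Two assumptions on the prior
• Global independence: $P(\theta) = \prod_{i=1}^{n} P(\theta_i)$, with $\theta_i = \{\theta_{ijk},\ j = 1, \ldots, q_i,\ k = 1, \ldots, r_i\}$
• Local independence: $P(\theta_i) = \prod_{j=1}^{q_i} P(\theta_{ij})$, with $\theta_{ij} = \{\theta_{ijk},\ k = 1, \ldots, r_i\}$
– Global independence + likelihood equivalence leads to the Dirichlet prior, the conjugate prior for the multinomial.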


Ks, Fo, Ba (3)

• Remark on Bayesian
– $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$ (posterior ∝ likelihood × prior)
– Conjugate priors
• The posterior has the same form as the prior distribution.
• Many exponential-family distributions have conjugate priors.


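Ks, Fo, Ba (4)

• Multinomial distributions
– Dirichlet prior on tabular CPDs

$$P(X_i \mid Pa(X_i) = j) = \theta_{ij}$$

$\theta_{ij}$: parameters of a multinomial r.v. with $r_i$ possible values

$$\theta_{ij} \sim \mathrm{Dirichlet}(\alpha_{ij1}, \ldots, \alpha_{ijr_i})$$

$$P(\theta_{ij} \mid \alpha_{ij}) = \frac{1}{B(\alpha_{ij1}, \ldots, \alpha_{ijr_i})} \prod_{k=1}^{r_i} \theta_{ijk}^{\alpha_{ijk} - 1}$$

$$B(\alpha_1, \ldots, \alpha_r) = \frac{\prod_{k} \Gamma(\alpha_k)}{\Gamma\big(\sum_{k} \alpha_k\big)}, \qquad \Gamma(n) = (n - 1)!$$

• Posterior distribution

$$\theta_{ij} \mid D \sim \mathrm{Dirichlet}(\alpha_{ij1} + N_{ij1}, \ldots, \alpha_{ijr_i} + N_{ijr_i})$$

• Posterior mean

$$E[\theta_{ijk} \mid D] = \frac{\alpha_{ijk} + N_{ijk}}{\sum_{l=1}^{r_i} (\alpha_{ijl} + N_{ijl})}$$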
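
As a small numerical illustration of the posterior mean, the sketch below handles one parent configuration (fixed i and j). The function name `posterior_mean` and the numbers are assumptions.

```python
def posterior_mean(alpha, counts):
    """Posterior mean of a multinomial parameter under a Dirichlet prior.

    `alpha[k]` are the hyperparameters alpha_ijk and `counts[k]` the observed
    counts N_ijk for one fixed node i and parent configuration j.
    """
    total = sum(a + n for a, n in zip(alpha, counts))
    return [(a + n) / total for a, n in zip(alpha, counts)]


# Hypothetical example: uniform prior over three values, five observations.
mean = posterior_mean(alpha=[1, 1, 1], counts=[3, 0, 2])
```

The value that was never observed still gets probability 1/8 here, whereas the plain MLE would assign it zero.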


Ks, Fo, Ba (5)

• Dirichlet distribution
– Hyperparameter $\alpha_{ijk}$
• Positive number
• Pseudo count
• # of imaginary cases: $\alpha_{ijk} - 1$
– Posterior distribution
• Combines the pseudo counts with the # of observed data
• Simple sum

$$\theta_{ij} \sim \mathrm{Dirichlet}(\alpha_{ij1}, \ldots, \alpha_{ijr_i})$$

$$\theta_{ij} \mid D \sim \mathrm{Dirichlet}(\alpha_{ij1} + N_{ij1}, \ldots, \alpha_{ijr_i} + N_{ijr_i})$$


Ks, Fo, Ba (6)

• Gaussian distributions


Ks, Po, Fr (1)

• Log likelihood

$$L = \sum_m \log P(D_m) = \sum_m \log \sum_h P(H = h, V_m)$$

– H: hidden, V: visible (observed)

• Not decomposable into a sum of local terms, one per node
– EM algorithm


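Ks, Po, Fr (2)

• EM algorithm
– From Jensen’s inequality

$$\log \sum_j \lambda_j y_j \ \ge\ \sum_j \lambda_j \log y_j, \qquad \sum_j \lambda_j = 1$$

$$
\begin{aligned}
L &= \sum_m \log \sum_h P(H = h, V_m) \\
  &= \sum_m \log \sum_h q(h \mid V_m)\, \frac{P(H = h, V_m)}{q(h \mid V_m)} \\
  &\ge \sum_m \sum_h q(h \mid V_m) \log \frac{P(H = h, V_m)}{q(h \mid V_m)} \\
  &= \sum_m \sum_h q(h \mid V_m) \log P(H = h, V_m) - \sum_m \sum_h q(h \mid V_m) \log q(h \mid V_m)
\end{aligned}
$$

constraint: $\sum_h q(h \mid V_m) = 1$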


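Ks, Po, Fr (3)

– Maximizing w.r.t. q (E-step)

$$O = \sum_m \sum_h q(h \mid V_m) \log P(H = h, V_m) - \sum_m \sum_h q(h \mid V_m) \log q(h \mid V_m) + \sum_m \lambda_m \Big( 1 - \sum_h q(h \mid V_m) \Big)$$

$$\frac{dO}{dq(h \mid V_m)} = \log P(H = h, V_m) - \log q(h \mid V_m) - 1 - \lambda_m = 0$$

$$q(h \mid V_m) = \frac{P(H = h, V_m)}{e^{1 + \lambda_m}}$$

$$\sum_h q(h \mid V_m) = 1 \ \Rightarrow\ e^{1 + \lambda_m} = \sum_h P(H = h, V_m) = P(V_m)$$

$$q(h \mid V_m) = P(h \mid V_m)$$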


Ks, Po, Fr (4)

– Maximizing w.r.t. θ (M-step)
• After q is maximized to $p(h \mid V_m)$
• Maximizing the expected complete-data log-likelihood

$$Q(\theta' \mid \theta) = \sum_m \sum_h p(h \mid V_m, \theta) \log P(H = h, V_m \mid \theta')$$

• Iteration until convergence
– E-step
• Calculate the expected complete-data log-likelihood
– M-step
• Get θ* maximizing the expected complete-data log-likelihood

$$\theta^{*} = \arg\max_{\theta'} Q(\theta' \mid \theta)$$


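Ks, Po, Fr (5)

• Multinomial distribution
– E-step

$$Q(\theta' \mid \theta) = \sum_{ijk} E[N_{ijk}] \log \theta'_{ijk} \qquad \Big(\text{cf. } L = \sum_{ijk} N_{ijk} \log \theta_{ijk}\Big)$$

$$I_{ijkm} \overset{\text{def}}{=} I(X_i = k, Pa(X_i) = j \mid D_m), \qquad N_{ijk} \overset{\text{def}}{=} \sum_m I(X_i = k, Pa(X_i) = j \mid D_m)$$

$$E[N_{ijk}] = \sum_m P(X_i = k, Pa(X_i) = j \mid D_m, \theta)$$

– M-step

$$\hat{\theta} = \arg\max_{\theta'} Q(\theta' \mid \theta), \qquad \hat{\theta}_{ijk} = \frac{E[N_{ijk}]}{\sum_{k'} E[N_{ijk'}]}$$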


Ks, Po, Ba (1)

• Gibbs sampling: stochastic version of EM
• Variational Bayes: $P(\theta, H \mid V) \approx q(\theta \mid V)\, q(H \mid V)$


Us, Fo, Fr (1)

• Issues
– Hypothesis space
– Evaluation function
– Search algorithm


Us, Fo, Fr (2)

• Search space
– DAG
• # of DAGs ~ $O(2^{n^2})$
• 10 nodes ~ $O(10^{18})$ DAGs
• Finding the optimal DAG by exhaustive search is infeasible.
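
The order of magnitude quoted for 10 nodes can be checked with Robinson's recurrence for the number of labelled DAGs. The recurrence is a known result that is not stated in the slides, and the function name `count_dags` is an assumption.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled DAGs on n nodes (Robinson's recurrence).

    a(0) = 1,
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k)
    """
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))


dags_on_10_nodes = count_dags(10)   # about 4.2 * 10**18
```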


Us, Fo, Fr (3)

• Search algorithm
– Local search
• Operators: adding, deleting, reversing a single arc

Choose G somehow
While not converged
    For each G’ in nbd(G)
        Compute score(G’)
    G* := arg maxG’ score(G’)
    If score(G*) > score(G)
        then G := G*
        else converged := true

Pseudo-code for hill-climbing. nbd(G) is the neighborhood of G, i.e., the models that can be reached by applying a single local change operator.
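
A runnable sketch of the hill-climbing pseudo-code over DAGs is given below. The names `is_acyclic`, `nbd` and `hill_climb`, the representation of a graph as a frozenset of (parent, child) arcs, and especially the toy score (closeness to a known target graph) are assumptions made so the example is self-contained; a real application would plug in a data-based score such as BIC.

```python
from itertools import permutations


def is_acyclic(nodes, edges):
    """Check acyclicity by repeatedly removing vertices without parents."""
    remaining, es = set(nodes), set(edges)
    while remaining:
        roots = [v for v in remaining if not any(c == v for _, c in es)]
        if not roots:
            return False
        remaining -= set(roots)
        es = {(p, c) for p, c in es if p in remaining}
    return True


def nbd(nodes, edges):
    """Yield all DAGs reachable by adding, deleting or reversing one arc."""
    for p, c in permutations(nodes, 2):
        if (p, c) in edges:
            yield edges - {(p, c)}                          # delete the arc
            flipped = (edges - {(p, c)}) | {(c, p)}         # reverse the arc
            if is_acyclic(nodes, flipped):
                yield flipped
        elif (c, p) not in edges:
            added = edges | {(p, c)}                        # add an arc
            if is_acyclic(nodes, added):
                yield added


def hill_climb(initial, neighbours, score):
    """Greedy local search: move to the best neighbour while it improves."""
    current = initial
    while True:
        candidates = list(neighbours(current))
        if not candidates:
            return current
        best = max(candidates, key=score)
        if score(best) > score(current):
            current = best
        else:
            return current


NODES = ["X1", "X2", "X3"]
TARGET = frozenset({("X1", "X2"), ("X2", "X3")})      # hypothetical target


def toy_score(graph):
    """Toy score: minus the number of arcs that differ from TARGET."""
    return -len(graph ^ TARGET)


result = hill_climb(frozenset(), lambda g: nbd(NODES, g), toy_score)
```

Like any greedy search, this can stop in a local maximum for realistic scores; the toy score was chosen so that it cannot.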


Us, Fo, Fr (4)

• Search algorithm
– PC algorithm
• Starts with a fully connected undirected graph
• CI (conditional independence) test
– If X ⊥ Y | S, the arc between X and Y is removed.


Us, Fo, Fr (5)

• Scoring function
– MLE selects the fully connected graph.
– score(G) ∝ P(D|G) P(G), from the MAP model:

$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$$

– The marginal likelihood has an automatic penalizing effect on complex models:

$$\mathrm{score}(G) = P(D \mid G) = \int P(D \mid G, \theta)\, P(\theta \mid G)\, d\theta$$

• A complex model has more parameters.
• It puts little probability mass on the region where the data actually lie.


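Us, Fo, Fr (6)

• Scoring function
– Under global independence and conjugate priors
– Integration in closed form

$$P(D \mid G) = \prod_{i=1}^{n} \int P(X_i \mid Pa(X_i), \theta_i)\, P(\theta_i)\, d\theta_i \ \overset{\text{def}}{=}\ \prod_{i=1}^{n} \mathrm{score}(X_i, Pa(X_i))$$

– The score decomposes into a product of one factor per node.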


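Us, Fo, Fr (7)

• Scoring function
– With non-conjugate priors: approximation
– Laplace approximation: BIC (Bayesian Information Criterion)

$$\log P(D \mid G) \approx \log P(D \mid G, \hat{\theta}_G) - \frac{d}{2} \log M$$

d: dim. of the model, $\hat{\theta}_G$: ML estimate of params.

– Case of multinomial distribution

$$\mathrm{score}_{BIC}(G) = \sum_i \Big[ \sum_m \log P(X_i \mid Pa(X_i), \hat{\theta}_i, D_m) - \frac{d_i}{2} \log M \Big] = \sum_i \Big[ \sum_{jk} N_{ijk} \log \hat{\theta}_{ijk} - \frac{d_i}{2} \log M \Big]$$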
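
The contribution of a single node to the BIC score can be computed from its count table alone. In the sketch below the function name `bic_family_score` is an assumption, and so is the parameter count d_i = q_i (r_i - 1), the usual number of free parameters of a tabular CPD, which the slides do not spell out.

```python
from math import log


def bic_family_score(counts, num_cases):
    """BIC contribution of one node: sum_jk N_ijk log(theta_hat_ijk) - d_i/2 log M.

    `counts[j][k]` is N_ijk for parent configuration j and child value k;
    `num_cases` is M. theta_hat_ijk is the MLE N_ijk / sum_k' N_ijk'.
    """
    loglik = 0.0
    for row in counts:
        n_j = sum(row)
        for n in row:
            if n > 0:                       # 0 * log(0) is treated as 0
                loglik += n * log(n / n_j)
    q_i, r_i = len(counts), len(counts[0])
    d_i = q_i * (r_i - 1)                   # assumed free-parameter count
    return loglik - 0.5 * d_i * log(num_cases)


# Hypothetical example: a binary root node observed as 2 zeros and 2 ones.
score = bic_family_score([[2, 2]], num_cases=4)
```

Because the score is a sum over nodes, a local change to the graph only requires recomputing the terms of the nodes whose parent sets changed.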


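Us, Fo, Fr (8)

• Scoring function
– Advantage of a decomposed score
– The marginal likelihoods of two graphs that differ in a single link differ in at most two terms.
• Ex) G1: X1 → X2 → X3 → X4, G2: X1 → X2 ← X3 → X4

$$\frac{P(D \mid G_1)}{P(D \mid G_2)} = \frac{\mathrm{score}(X_1)\, \mathrm{score}(X_2, X_1)\, \mathrm{score}(X_3, X_2)\, \mathrm{score}(X_4, X_3)}{\mathrm{score}(X_1)\, \mathrm{score}(X_2, X_1, X_3)\, \mathrm{score}(X_3)\, \mathrm{score}(X_4, X_3)}$$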


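Us, Fo, Fr (9)

• Scoring function
– Marginal likelihood for the multinomial distribution with Dirichlet prior
– Bayesian Dirichlet (BD) score

$$P(D \mid \theta, G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}$$

$$P(D \mid G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{B(\alpha_{ij1} + N_{ij1}, \ldots, \alpha_{ijr_i} + N_{ijr_i})}{B(\alpha_{ij1}, \ldots, \alpha_{ijr_i})} = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$

where $\alpha_{ij} = \sum_k \alpha_{ijk}$ and $N_{ij} = \sum_k N_{ijk}$.

[Annotation on the slide: posterior mean]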
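
The BD score is usually evaluated in log space to avoid overflow of the Gamma function. The sketch below computes the factor of one node, assuming the hyperparameters and counts are given as tables indexed by parent configuration j and value k; the function name `log_bd_family` is an assumption.

```python
from math import exp, lgamma


def log_bd_family(alpha, counts):
    """Log of one node's factor in the BD score.

    Computes sum_j [ lgamma(a_j) - lgamma(a_j + N_j)
                     + sum_k ( lgamma(a_jk + N_jk) - lgamma(a_jk) ) ],
    where a_j and N_j are the row sums of `alpha[j]` and `counts[j]`.
    """
    total = 0.0
    for a_row, n_row in zip(alpha, counts):
        a_j, n_j = sum(a_row), sum(n_row)
        total += lgamma(a_j) - lgamma(a_j + n_j)
        for a, n in zip(a_row, n_row):
            total += lgamma(a + n) - lgamma(a)
    return total


# Hypothetical example: binary root node, uniform Dirichlet(1, 1) prior,
# observed twice in state 0 and once in state 1.
log_score = log_bd_family(alpha=[[1, 1]], counts=[[2, 1]])
marginal_likelihood = exp(log_score)      # equals B(3, 2) / B(1, 1) = 1/12
```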


Us, Fo, Ba (1)

• The posterior over all models is intractable
– Focus on some features
• Bayesian model averaging

$$P(f \mid D) = \sum_G P(G \mid D)\, f(G), \qquad f(G) = 1 \text{ if } G \text{ contains a certain edge}$$

• Needs to calculate P(G|D)

$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{\sum_{G'} P(D \mid G')\, P(G')}$$

The normalization over all G’ is intractable.

– Solution MCMC: Metropolis-Hastings algorithm
• Only the ratio R is needed, so the intractable normalization is avoided.

$$\frac{P(G_1 \mid D)}{P(G_2 \mid D)} = \frac{P(G_1)}{P(G_2)} \cdot \frac{P(D \mid G_1)}{P(D \mid G_2)}$$


Us, Fo, Ba (2)

• Calculation of P(G|D)
– Sampling G

Choose G somehow
While not converged
    Pick a G’ u.a.r. from nbd(G)
    Compute R = P(G’|D) q(G|G’) / (P(G|D) q(G’|G))
    Sample u ~ uniform(0,1)
    If u < min{1, R}
        then G := G’

Pseudo-code for the MC3 algorithm. u.a.r. means uniformly at random.
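
The sampler above can be sketched generically. To keep the example small and checkable, the "graphs" here are just three abstract states with known unnormalised weights; this toy state space, the function name `metropolis_hastings`, and the fixed number of steps (instead of a convergence test) are assumptions.

```python
import random


def metropolis_hastings(initial, neighbours, weight, steps, rng):
    """Metropolis-Hastings with a uniform proposal over nbd(G).

    `weight(G)` is an unnormalised posterior such as P(D|G) P(G), so only
    ratios are ever needed. With G' picked u.a.r. from nbd(G),
    q(G'|G) = 1/|nbd(G)| and q(G|G') = 1/|nbd(G')|.
    """
    current = initial
    samples = []
    for _ in range(steps):
        nb = neighbours(current)
        proposal = rng.choice(nb)
        r = (weight(proposal) / len(neighbours(proposal))) / (weight(current) / len(nb))
        if rng.random() < min(1.0, r):
            current = proposal
        samples.append(current)
    return samples


# Toy state space standing in for graphs: states 0, 1, 2 with weights 1, 2, 3.
WEIGHTS = [1.0, 2.0, 3.0]


def toy_neighbours(state):
    return [s for s in range(3) if s != state]


samples = metropolis_hastings(0, toy_neighbours, lambda s: WEIGHTS[s],
                              steps=60000, rng=random.Random(0))
frequencies = [samples.count(s) / len(samples) for s in range(3)]
```

The visit frequencies approach the normalised weights 1/6, 2/6 and 3/6 although the normalising constant is never computed, which is the point made on the previous slide.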


Us, Po, Fr (1)

• Partially observable
– Computation of the marginal likelihood: intractable

$$P(V \mid G) = \sum_Z \int P(V, Z \mid \theta, G)\, P(\theta \mid G)\, d\theta \qquad (Z: \text{hidden variables})$$

– Not decomposable into a product of local terms
– Solutions
• Approximating the marginal likelihood
• Structural EM


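Us, Po, Fr (2)

• Approximating the marginal likelihood
– Candidate’s method

$$P(D \mid G) = \frac{P(D \mid G, \theta_G^{*})\, P(\theta_G^{*} \mid G)}{P(\theta_G^{*} \mid D, G)}$$

• $\theta_G^{*}$: MLE of params.
• $P(D \mid G, \theta_G^{*})$: from the BN’s inference algorithm
• $P(\theta_G^{*} \mid G)$: trivial
• $P(\theta_G^{*} \mid D, G)$: from Gibbs sampling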


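Us, Po, Fr (3)

• Structural EM
– Idea: decomposition of the expected complete-data log-likelihood (BIC score)
– Search inside EM
• (EM inside search is a high-cost process)

$$\mathrm{score}_{BIC}(G) = \sum_i \Big[ \sum_{jk} N_{ijk} \log \hat{\theta}_{ijk} - \frac{d_i}{2} \log M \Big]$$

$$\mathrm{score}_{EBIC}(G) = \sum_i \Big[ \sum_{jk} E[N_{ijk}] \log \hat{\theta}_{ijk} - \frac{d_i}{2} \log M \Big]$$

$\hat{\theta}_{ijk}$: MLE of params.

$$E[N_{ijk}] = \sum_m P(X_i = k, Pa(X_i) = j \mid D_m, \theta)$$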


Us, Po, Ba (1)

• Combined MCMC
– MCMC for Bayesian model averaging
– MCMC over the values of the unobserved nodes.


Conclusion

• Is structure learning really important?
– In papers, yes.
– In engineering, no.
• What can AI do for humans?
• What can humans do for machine learning algorithms?
