Transcript
Page 1: Learning Bayesian Network Structure from Massive Datasets: The ``Sparse Candidate'' Algorithm

Learning Bayesian Network Structure from Massive Datasets:

The ``Sparse Candidate'' Algorithm

Nir Friedman, Dana Pe'er, Iftach Nachman

Institute of Computer Science

Hebrew University

Jerusalem

Page 2

Learning Bayesian Network Structure (Complete Data)

Set a scoring function that evaluates networks
Find the highest-scoring network

This optimization problem is NP-hard [Chickering]

use heuristic search

[Figure: Data → Inducer → Bayesian network over variables E, B, R, A, C]

Page 3

Our Contribution

We suggest a new heuristic that
builds on simple ideas
is easy to implement
can be combined with existing heuristic search procedures
reduces learning time significantly

We also gain some insight into the complexity of the learning problem

Page 4

Learning Bayesian Network Structure: Score

There are various variants of scores; we focus here on the Bayesian score
[Cooper & Herskovits; Heckerman, Geiger & Chickering]

Key property for search: the score decomposes:

Score(G : D) = Σi Score(Xi : PaiG, N(Xi, PaiG))

where N(Xi, PaiG) is a vector of counts of the joint values of Xi and its parents in G, computed from the data
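The decomposition can be sketched in code. A minimal illustration, using a plain log-likelihood score in place of the Bayesian (BDe) score from the slides; the data layout (a list of dicts mapping variable names to values) and function names are illustrative:

```python
# Sketch of a decomposable score. Hedged: this is plain log-likelihood,
# not the BDe score; `data` and `parents` layouts are illustrative.
from collections import Counter
from math import log

def family_counts(data, x, pa):
    """N(x, pa): counts of the joint values of x and its parents."""
    return Counter((row[x], tuple(row[p] for p in pa)) for row in data)

def family_score(data, x, pa):
    """Log-likelihood term for one family; depends only on N(x, pa)."""
    n_joint = family_counts(data, x, pa)
    n_pa = Counter(tuple(row[p] for p in pa) for row in data)
    return sum(n * log(n / n_pa[pav]) for (xv, pav), n in n_joint.items())

def score(data, parents):
    """The score decomposes into a sum of per-family terms."""
    return sum(family_score(data, x, pa) for x, pa in parents.items())
```

Because the score is a sum over families, a local move (changing one variable's parent set) only requires recomputing one term.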

Page 5

Heuristic Search in Learning Networks

Search over network structures
Standard operations: add, delete, or reverse an arc
Need to check acyclicity

[Figure: local moves on a three-node network over A, B, C — Add A→B, Reverse B→C, Remove B→C]

Use standard search methods in this space: greedy hill climbing, simulated annealing, etc.
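The three local moves and the acyclicity check can be sketched as follows, assuming an illustrative graph representation (a node set plus a set of directed arcs); these names are not from the paper:

```python
# Minimal sketch of the add/delete/reverse move space with an acyclicity check.
def has_cycle(nodes, edges):
    """DFS-based cycle test on a directed graph given as a set of (u, v) arcs."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}
    def visit(u):
        color[u] = GRAY          # on the current DFS path
        for v in adj[u]:
            if color[v] == GRAY or (color[v] == WHITE and visit(v)):
                return True      # back edge found -> cycle
        color[u] = BLACK
        return False
    return any(color[n] == WHITE and visit(n) for n in nodes)

def legal_moves(nodes, edges):
    """Enumerate add/delete/reverse moves that keep the graph acyclic."""
    moves = []
    for u, v in edges:
        moves.append(('delete', u, v))          # deletion never creates a cycle
        if not has_cycle(nodes, (edges - {(u, v)}) | {(v, u)}):
            moves.append(('reverse', u, v))
    for u in nodes:
        for v in nodes:
            if u != v and (u, v) not in edges:
                if not has_cycle(nodes, edges | {(u, v)}):
                    moves.append(('add', u, v))
    return moves
```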

Page 6

Computational Problem

Cost of evaluating a single move
Collecting the counts N(xi, pai) is O(M) (M = number of examples)
Using caching we can save some of these computations

Number of possible moves
The number of possible moves is O(N²) (N = number of variables)
After performing a move, O(N) new moves need to be evaluated

Total
Each iteration of greedy HC costs O(MN)

Most of the time is spent on evaluating irrelevant moves

Page 7

Idea #1: Restrict to Few Candidates

For each X, select a small set of candidates C(X)
Consider arcs Y → X only if Y is in C(X)

A B

C

C(A) = { B }

C(B) = {A}

C(C) = {A, B}

If we restrict to k candidates for each variable, then:
there are only O(kN) possible moves for each network
in greedy HC, only O(k) new moves need to be evaluated in each iteration
the cost of each iteration is O(Mk)

[Figure: B → A allowed, C → A blocked; A → C allowed, C → B blocked]
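The restriction above can be sketched as a filter on the add moves; the function name and dict layout are illustrative:

```python
# Hedged sketch: only Y -> X arcs with Y in C(X) survive the filter,
# shrinking the add-move space from O(N^2) to O(kN).
def restricted_adds(nodes, edges, candidates):
    """Add moves allowed under candidate sets: Y -> X only if Y is in C(X)."""
    return [('add', y, x)
            for x in sorted(nodes)
            for y in candidates.get(x, [])
            if y != x and (y, x) not in edges]
```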

Page 8

How to Select Candidates?

Simple proposal: rank candidates by mutual information with X

This measures how many bits we can save in the encoding of X if we take Y into account

Select top k ranking variables for C(X)

I(X ; Y) = E[ log ( P(X,Y) / (P(X) P(Y)) ) ] = H(X) − H(X | Y)
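This Idea-#1 selection can be sketched in a few lines; the empirical-estimate details, data layout, and function names are illustrative (the slides do not fix an implementation):

```python
# Sketch: rank variables by empirical mutual information with X, keep top k.
from collections import Counter
from math import log

def mutual_info(data, x, y):
    """Empirical I(X;Y) in nats, from a list of {var: value} rows."""
    m = len(data)
    pxy = Counter((r[x], r[y]) for r in data)
    px = Counter(r[x] for r in data)
    py = Counter(r[y] for r in data)
    return sum((n / m) * log(n * m / (px[xv] * py[yv]))
               for (xv, yv), n in pxy.items())

def select_candidates(data, x, variables, k):
    """C(X): the k variables with highest I(X;Y)."""
    others = [y for y in variables if y != x]
    return sorted(others, key=lambda y: mutual_info(data, x, y), reverse=True)[:k]
```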

Page 9

Effect of Candidate Number on Search

[Figure: Score (BDe/M) vs. Time (sec), 0–1200 sec; curves: HC, HC k=5, HC k=10, HC k=15; C+L and Empty shown as baselines]

Text domain with 100 vars, 10,000 instances

Computation of all pairwise statistics

Page 10

Problems with Candidate Selection

[Figure: fragment of the “alarm” network, including INTUBATION, SHUNT, MINVOL, VENTLUNG, VENTALV, ARTCO2, EXPCO2, SAO2, PVSAT, FIO2, PULMEMBOLUS, PAP, CATECHOL, HR, BP, CO, TPR, LVFAILURE, HYPOVOLEMIA]

Page 11

Idea #2: Iteratively Improve Candidates

Once we have a partial understanding of the domain, we might use it to select new candidates:

“current” parents + most promising candidates given the current structure

If INTUBATION is a parent of SHUNT, then MINVOL is less informative about SHUNT

[Figure: fragment of the “alarm” network around INTUBATION, SHUNT, MINVOL, VENTLUNG, VENTALV, ARTCO2, EXPCO2, SAO2, PVSAT, FIO2, PULMEMBOLUS, PAP]
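The alternation between candidate selection and search (Idea #2) can be sketched as an outer loop. This is a schematic only: `select`, `search`, and `score` are caller-supplied stand-ins, not functions from the paper:

```python
# Schematic Sparse Candidate outer loop: alternate (re)selecting candidate
# sets given the current structure with searching within those candidates,
# stopping when the score no longer improves.
def sparse_candidate(select, search, score, network, max_iters=10):
    best = score(network)
    for _ in range(max_iters):
        candidates = select(network)           # restrict: C(X) per variable
        new = search(network, candidates)      # maximize: search within C
        if score(new) <= best:                 # no improvement -> stop
            break
        network, best = new, score(new)
    return network
```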

Page 12

Comparing Potential Candidates

Intuition: X should be Markov shielded by its parents PaX

Shielding: use conditional information
Does adding Y to X’s parents improve prediction?
I(X ; Y | PaX) = 0 iff X is independent of Y given PaX

Score: use difference in score

Use Score(X | Y) as an estimate of −H(X | Y) under the generating distribution

I(X ; Y | PaX) = H(X | PaX) − H(X | Y, PaX)

S(X ; Y | PaX) = Score(X | Y, PaX) − Score(X | PaX)
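The shielding criterion can be sketched as an empirical conditional mutual information; the data layout and function name are illustrative:

```python
# Sketch of I(X; Y | Pa_X): zero iff X and Y are independent given Pa_X
# in the empirical distribution. Rows are {var: value} dicts (illustrative).
from collections import Counter
from math import log

def cond_mutual_info(data, x, y, pa):
    m = len(data)
    key = lambda r, vs: tuple(r[v] for v in vs)
    nxyz = Counter((r[x], r[y], key(r, pa)) for r in data)
    nxz = Counter((r[x], key(r, pa)) for r in data)
    nyz = Counter((r[y], key(r, pa)) for r in data)
    nz = Counter(key(r, pa) for r in data)
    # I = sum p(x,y,z) log( p(x,y,z) p(z) / (p(x,z) p(y,z)) )
    return sum((n / m) * log(n * nz[z] / (nxz[(xv, z)] * nyz[(yv, z)]))
               for (xv, yv, z), n in nxyz.items())
```

With an empty conditioning set this reduces to the plain mutual information used in Idea #1.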

Page 13

“Alarm” example revisited

[Figure: the “alarm” network]

Page 14

“Alarm” example revisited

[Figure: the “alarm” network, nodes as on the previous slide]

Page 15

Alternative Criterion: Discrepancy

Idea: measure how well the network B models the joint P(X,Y)
We can improve this prediction by making X a candidate parent of Y
Natural definition:

d(X ; Y | B) = E[ log ( P(X,Y) / PB(X,Y) ) ] = KL( P(X,Y) || PB(X,Y) )

Note: if PB(X,Y) = P(X) P(Y), then d(X ; Y | B) = I(X ; Y)
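The discrepancy measure is just a KL divergence between two pairwise joints; a minimal sketch, assuming both joints are passed in as dicts over value pairs (an illustrative layout, not the paper's):

```python
# Sketch of d(X; Y | B) = KL( P(X,Y) || P_B(X,Y) ): how poorly the current
# network B models the empirical pairwise joint.
from math import log

def discrepancy(p_joint, pb_joint):
    """KL divergence between the empirical joint and the network's joint."""
    return sum(p * log(p / pb_joint[xy]) for xy, p in p_joint.items() if p > 0)
```

When the network makes X and Y independent, this reduces to the mutual information I(X;Y), matching the note above.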

Page 16

Text with 100 words

[Figure: Score (BDe/M) vs. Time (sec), 0–2000 sec; curves: Greedy HC, Disc k=15, Score k=15, Shld k=15]

Page 17

Text with 200 words

[Figure: Score (BDe/L) vs. Time (sec), 0–9000 sec; curves: Greedy HC, Disc k=15, Score k=15, Shld k=15]

Page 18

Cell Cycle (800 vars)

[Figure: Score (BDe/L) vs. Time (sec), 0–20,000 sec; curves: Greedy HC, Disc k=20, Score k=20, Shld k=20; inset zooms on scores −418 to −414 over 4000–8000 sec]

Page 19

Complexity of Structure Learning

Without restriction of the candidate sets:
Restricting |Pai| ≤ 1: the problem is easy [Chow & Liu; Heckerman et al.]
No restriction: the problem is NP-hard [Chickering], even when restricting |Pai| ≤ 2

We do not know of interesting intermediate problems
Such behavior is often called the “exponential cliff”

Page 20

Complexity with Small Candidate Sets

In each iteration, we solve an optimization problem: given candidate sets C(X1), …, C(XN), find the best-scoring network that respects these candidates

Is this problem easier than unconstrained structure learning?

Page 21

Complexity with Small Candidate Sets

Theorem: if |C(Xi)| > 1, finding the best-scoring structure is NP-hard

But… the complexity grows gradually: there is a parameter c such that the time complexity is
exponential in c
linear in N

Fix d: there is a polynomial procedure that can solve all instances with c < d
The situation is similar in inference: exponential in the size of the largest clique in the triangulated graph, linear in N

Page 22

Complexity Proof Outline

In fact, the algorithm is motivated by inference

Define the “candidate graph”, where Y → X if Y ∈ C(X)

Then, create a clique tree (moralize & triangulate)

We then define a dynamic programming algorithm for constructing the best scoring structure

Messages assign values to different ordering of variables in a separator

Ordering ensures acyclicity of the network

Example messages over the separators:

Separator {A,B}:  order AB → −18.5,  order BA → −13.2
Separator {B,E}:  order EB → −4.7,   order BE → −12.1

[Figure: clique tree with cliques {A,B,E,F}, {A,B,C,D}, {B,E,G} and separators {A,B}, {B,E}]

Page 23

Future Work

Quadratic cost of candidate selection
The initial step requires O(N²) pairwise statistics
Can we select candidates by looking at a smaller number, e.g., O(N log N), of pairwise statistics?

Choice of the number of candidates
We used a fixed number of candidates
Can we decide on the number of candidates more intelligently?
Deal with variables that have large in+out degree

Combine candidates with PDAG search

Page 24

Summary

Heuristic for structure search:
Incorporates understanding of BNs into blind search
Drastically reduces the size of the search space
Faster search that requires fewer statistics

Empirical evaluation:
We present evaluation on several datasets
Variants of the algorithm were used in
[Boyen, Friedman & Koller] for temporal models with SEM
[Friedman, Getoor, Koller & Pfeffer] for relational models

Complexity analysis:
A computational subproblem where structure search might be tractable even beyond trees