Transcript
Page 1: Learning Bayesian Network Structure from Massive Datasets: The ``Sparse Candidate'' Algorithm

Learning Bayesian Network Structure from Massive Datasets:

The ``Sparse Candidate'' Algorithm

Nir Friedman, Dana Pe'er, Iftach Nachman

Institute of Computer Science

Hebrew University

Jerusalem

Page 2

Learning Bayesian Network Structure (Complete Data)

Set a scoring function that evaluates networks
Find the highest-scoring network

This optimization problem is NP-hard [Chickering]

use heuristic search

[Figure: Data → Inducer → Bayesian network over variables E, B, R, A, C]

Page 3

Our Contribution

We suggest a new heuristic that
builds on simple ideas
is easy to implement
can be combined with existing heuristic search procedures
reduces learning time significantly

We also gain some insight into the complexity of the learning problem

Page 4

Learning Bayesian Network Structure: Score

There are various variants of scores; we focus here on the Bayesian score
[Cooper & Herskovits; Heckerman, Geiger & Chickering]

Key property for search: the score decomposes:

Score(G : D) = Σi Score(Xi : PaiG, N(Xi, PaiG))

where N(Xi, PaiG) is a vector of counts of the joint values of Xi and its parents in G, computed from the data
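The decomposition can be sketched in code. A minimal illustration, using a plain log-likelihood score in place of the Bayesian (BDe) score from the slides; the data layout (a list of dicts mapping variable names to values) and function names are illustrative:

```python
# Sketch of a decomposable score. Hedged: this is plain log-likelihood,
# not the BDe score; `data` and `parents` layouts are illustrative.
from collections import Counter
from math import log

def family_counts(data, x, pa):
    """N(x, pa): counts of the joint values of x and its parents."""
    return Counter((row[x], tuple(row[p] for p in pa)) for row in data)

def family_score(data, x, pa):
    """Log-likelihood term for one family; depends only on N(x, pa)."""
    n_joint = family_counts(data, x, pa)
    n_pa = Counter(tuple(row[p] for p in pa) for row in data)
    return sum(n * log(n / n_pa[pav]) for (xv, pav), n in n_joint.items())

def score(data, parents):
    """The score decomposes into a sum of per-family terms."""
    return sum(family_score(data, x, pa) for x, pa in parents.items())
```

Because the score is a sum over families, a local move (changing one variable's parent set) only requires recomputing one term.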

Page 5

Heuristic Search in Learning Networks

Search over network structures
Standard operations: add, delete, or reverse an arc
Need to check acyclicity

[Figure: local moves on a three-node network over A, B, C — Add A→B, Reverse B→C, Remove B→C]

Use standard search methods in this space: greedy hill climbing, simulated annealing, etc.
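The three local moves and the acyclicity check can be sketched as follows, assuming an illustrative graph representation (a node set plus a set of directed arcs); these names are not from the paper:

```python
# Minimal sketch of the add/delete/reverse move space with an acyclicity check.
def has_cycle(nodes, edges):
    """DFS-based cycle test on a directed graph given as a set of (u, v) arcs."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in nodes}
    def visit(u):
        color[u] = GRAY          # on the current DFS path
        for v in adj[u]:
            if color[v] == GRAY or (color[v] == WHITE and visit(v)):
                return True      # back edge found -> cycle
        color[u] = BLACK
        return False
    return any(color[n] == WHITE and visit(n) for n in nodes)

def legal_moves(nodes, edges):
    """Enumerate add/delete/reverse moves that keep the graph acyclic."""
    moves = []
    for u, v in edges:
        moves.append(('delete', u, v))          # deletion never creates a cycle
        if not has_cycle(nodes, (edges - {(u, v)}) | {(v, u)}):
            moves.append(('reverse', u, v))
    for u in nodes:
        for v in nodes:
            if u != v and (u, v) not in edges:
                if not has_cycle(nodes, edges | {(u, v)}):
                    moves.append(('add', u, v))
    return moves
```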

Page 6

Computational Problem

Cost of evaluating a single move
Collecting the counts N(xi, pai) is O(M) (M = number of examples)
Using caching we can save some of these computations

Number of possible moves
The number of possible moves is O(N²) (N = number of variables)
After performing a move, O(N) new moves need to be evaluated

Total
Each iteration of greedy HC costs O(MN)

Most of the time is spent on evaluating irrelevant moves

Page 7

Idea #1: Restrict to Few Candidates

For each X, select a small set of candidates C(X)
Consider arcs Y → X only if Y is in C(X)

A B

C

C(A) = { B }

C(B) = {A}

C(C) = {A, B}

If we restrict to k candidates for each variable, then:
there are only O(kN) possible moves for each network
in greedy HC, only O(k) new moves need to be evaluated in each iteration
the cost of each iteration is O(Mk)

[Figure: B → A allowed, C → A blocked; A → C allowed, C → B blocked]
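The restriction above can be sketched as a filter on the add moves; the function name and dict layout are illustrative:

```python
# Hedged sketch: only Y -> X arcs with Y in C(X) survive the filter,
# shrinking the add-move space from O(N^2) to O(kN).
def restricted_adds(nodes, edges, candidates):
    """Add moves allowed under candidate sets: Y -> X only if Y is in C(X)."""
    return [('add', y, x)
            for x in sorted(nodes)
            for y in candidates.get(x, [])
            if y != x and (y, x) not in edges]
```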

Page 8

How to Select Candidates?

Simple proposal: rank candidates by mutual information with X

This measures how many bits we can save in the encoding of X if we take Y into account

Select top k ranking variables for C(X)

I(X ; Y) = E[ log ( P(X,Y) / (P(X) P(Y)) ) ] = H(X) − H(X | Y)
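This Idea-#1 selection can be sketched in a few lines; the empirical-estimate details, data layout, and function names are illustrative (the slides do not fix an implementation):

```python
# Sketch: rank variables by empirical mutual information with X, keep top k.
from collections import Counter
from math import log

def mutual_info(data, x, y):
    """Empirical I(X;Y) in nats, from a list of {var: value} rows."""
    m = len(data)
    pxy = Counter((r[x], r[y]) for r in data)
    px = Counter(r[x] for r in data)
    py = Counter(r[y] for r in data)
    return sum((n / m) * log(n * m / (px[xv] * py[yv]))
               for (xv, yv), n in pxy.items())

def select_candidates(data, x, variables, k):
    """C(X): the k variables with highest I(X;Y)."""
    others = [y for y in variables if y != x]
    return sorted(others, key=lambda y: mutual_info(data, x, y), reverse=True)[:k]
```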

Page 9

Effect of Candidate Number on Search

[Figure: Score (BDe/M) vs. Time (sec), 0–1200 sec; curves: HC, HC k=5, HC k=10, HC k=15; C+L and Empty shown as baselines]

Text domain with 100 vars, 10,000 instances

Computation of all pairwise statistics

Page 10

Problems with Candidate Selection

[Figure: fragment of the “alarm” network, including INTUBATION, SHUNT, MINVOL, VENTLUNG, VENTALV, ARTCO2, EXPCO2, SAO2, PVSAT, FIO2, PULMEMBOLUS, PAP, CATECHOL, HR, BP, CO, TPR, LVFAILURE, HYPOVOLEMIA]

Page 11

Idea #2: Iteratively Improve Candidates

Once we have a partial understanding of the domain, we might use it to select new candidates:

“current” parents + most promising candidates given the current structure

If INTUBATION is a parent of SHUNT, then MINVOL is less informative about SHUNT

[Figure: fragment of the “alarm” network around INTUBATION, SHUNT, MINVOL, VENTLUNG, VENTALV, ARTCO2, EXPCO2, SAO2, PVSAT, FIO2, PULMEMBOLUS, PAP]
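The alternation between candidate selection and search (Idea #2) can be sketched as an outer loop. This is a schematic only: `select`, `search`, and `score` are caller-supplied stand-ins, not functions from the paper:

```python
# Schematic Sparse Candidate outer loop: alternate (re)selecting candidate
# sets given the current structure with searching within those candidates,
# stopping when the score no longer improves.
def sparse_candidate(select, search, score, network, max_iters=10):
    best = score(network)
    for _ in range(max_iters):
        candidates = select(network)           # restrict: C(X) per variable
        new = search(network, candidates)      # maximize: search within C
        if score(new) <= best:                 # no improvement -> stop
            break
        network, best = new, score(new)
    return network
```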

Page 12

Comparing Potential Candidates

Intuition: X should be Markov shielded by its parents PaX

Shielding: use conditional information
Does adding Y to X’s parents improve prediction?
I(X ; Y | PaX) = 0 iff X is independent of Y given PaX

Score: use difference in score

Use Score(X | Y) as an estimate of −H(X | Y) under the generating distribution

I(X ; Y | PaX) = H(X | PaX) − H(X | Y, PaX)

S(X ; Y | PaX) = Score(X | Y, PaX) − Score(X | PaX)
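The shielding criterion can be sketched as an empirical conditional mutual information; the data layout and function name are illustrative:

```python
# Sketch of I(X; Y | Pa_X): zero iff X and Y are independent given Pa_X
# in the empirical distribution. Rows are {var: value} dicts (illustrative).
from collections import Counter
from math import log

def cond_mutual_info(data, x, y, pa):
    m = len(data)
    key = lambda r, vs: tuple(r[v] for v in vs)
    nxyz = Counter((r[x], r[y], key(r, pa)) for r in data)
    nxz = Counter((r[x], key(r, pa)) for r in data)
    nyz = Counter((r[y], key(r, pa)) for r in data)
    nz = Counter(key(r, pa) for r in data)
    # I = sum p(x,y,z) log( p(x,y,z) p(z) / (p(x,z) p(y,z)) )
    return sum((n / m) * log(n * nz[z] / (nxz[(xv, z)] * nyz[(yv, z)]))
               for (xv, yv, z), n in nxyz.items())
```

With an empty conditioning set this reduces to the plain mutual information used in Idea #1.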

Page 13

“Alarm” example revisited

[Figure: the “alarm” network]

Page 14

“Alarm” example revisited

[Figure: the “alarm” network, nodes as on the previous slide]

Page 15

Alternative Criterion: Discrepancy

Idea: measure how well the network B models the joint P(X,Y)
We can improve this prediction by making X a candidate parent of Y
Natural definition:

d(X ; Y | B) = E[ log ( P(X,Y) / PB(X,Y) ) ] = KL( P(X,Y) || PB(X,Y) )

Note: if PB(X,Y) = P(X) P(Y), then d(X ; Y | B) = I(X ; Y)
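The discrepancy measure is just a KL divergence between two pairwise joints; a minimal sketch, assuming both joints are passed in as dicts over value pairs (an illustrative layout, not the paper's):

```python
# Sketch of d(X; Y | B) = KL( P(X,Y) || P_B(X,Y) ): how poorly the current
# network B models the empirical pairwise joint.
from math import log

def discrepancy(p_joint, pb_joint):
    """KL divergence between the empirical joint and the network's joint."""
    return sum(p * log(p / pb_joint[xy]) for xy, p in p_joint.items() if p > 0)
```

When the network makes X and Y independent, this reduces to the mutual information I(X;Y), matching the note above.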

Page 16

Text with 100 words

[Figure: Score (BDe/M) vs. Time (sec), 0–2000 sec; curves: Greedy HC, Disc k=15, Score k=15, Shld k=15]

Page 17

Text with 200 words

[Figure: Score (BDe/L) vs. Time (sec), 0–9000 sec; curves: Greedy HC, Disc k=15, Score k=15, Shld k=15]

Page 18

Cell Cycle (800 vars)

[Figure: Score (BDe/L) vs. Time (sec), 0–20,000 sec; curves: Greedy HC, Disc k=20, Score k=20, Shld k=20; inset zooms on scores −418 to −414 over 4000–8000 sec]

Page 19

Complexity of Structure Learning

Without restriction of the candidate sets:
Restricting |Pai| ≤ 1: the problem is easy [Chow & Liu; Heckerman et al.]
No restriction: the problem is NP-hard [Chickering], even when restricting |Pai| ≤ 2

We do not know of interesting intermediate problems
Such behavior is often called the “exponential cliff”

Page 20

Complexity with Small Candidate Sets

In each iteration, we solve an optimization problem: given candidate sets C(X1), …, C(XN), find the best-scoring network that respects these candidates

Is this problem easier than unconstrained structure learning?

Page 21

Complexity with Small Candidate Sets

Theorem: if |C(Xi)| > 1, finding the best-scoring structure is NP-hard

But… the complexity grows gradually: there is a parameter c such that the time complexity is
exponential in c
linear in N

Fix d: there is a polynomial procedure that can solve all instances with c < d
The situation is similar in inference: exponential in the size of the largest clique in the triangulated graph, linear in N

Page 22

Complexity Proof Outline

In fact, the algorithm is motivated by inference

Define the “candidate graph”, where Y → X if Y ∈ C(X)

Then, create a clique tree (moralize & triangulate)

We then define a dynamic programming algorithm for constructing the best scoring structure

Messages assign values to different ordering of variables in a separator

Ordering ensures acyclicity of the network

Example messages over the separators:

Separator {A,B}:  order AB → −18.5,  order BA → −13.2
Separator {B,E}:  order EB → −4.7,   order BE → −12.1

[Figure: clique tree with cliques {A,B,E,F}, {A,B,C,D}, {B,E,G} and separators {A,B}, {B,E}]

Page 23

Future Work

Quadratic cost of candidate selection
The initial step requires O(N²) pairwise statistics
Can we select candidates by looking at a smaller number, e.g., O(N log N), of pairwise statistics?

Choice of the number of candidates
We used a fixed number of candidates
Can we decide on the number of candidates more intelligently?
Deal with variables that have large in+out degree

Combine candidates with PDAG search

Page 24

Summary

Heuristic for structure search:
Incorporates understanding of BNs into blind search
Drastically reduces the size of the search space
Faster search that requires fewer statistics

Empirical evaluation:
We present evaluation on several datasets
Variants of the algorithm were used in
[Boyen, Friedman & Koller] for temporal models with SEM
[Friedman, Getoor, Koller & Pfeffer] for relational models

Complexity analysis:
A computational subproblem where structure search might be tractable even beyond trees