A Tutorial on Inference and Learning in Bayesian Networks
Irina Rish and Moninder Singh
IBM T.J. Watson Research Center, rish,[email protected]
May 10, 2015
“Road map”
Introduction: Bayesian networks
  What are BNs: representation, types, etc.
  Why use BNs: applications (classes) of BNs
  Information sources, software, etc.
Probabilistic inference
  Exact inference
  Approximate inference
Learning Bayesian Networks
  Learning parameters
  Learning graph structure
Summary
Bayesian Networks
BN = (G, Θ): a directed acyclic graph G plus conditional probability distributions (CPDs) Θ
Conditional independencies => efficient representation
P(A, S, T, L, B, C, D) = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)
Example CPD P(D|T,L,B):
  T L B | D=0  D=1
  0 0 0 | 0.1  0.9
  0 0 1 | 0.7  0.3
  0 1 0 | 0.8  0.2
  0 1 1 | 0.9  0.1
  ...
[Figure: the “Asia” network with nodes Visit to Asia (A), Smoking (S), Tuberculosis (T), Lung Cancer (L), Bronchitis (B), Chest X-ray (C), Dyspnoea (D), each annotated with its CPD: P(A), P(S), P(T|A), P(L|S), P(B|S), P(C|T,L), P(D|T,L,B)]
[Lauritzen & Spiegelhalter, 95]
Bayesian Networks
Structured, graphical representation of probabilistic relationships between several random variables
Explicit representation of conditional independencies
Missing arcs encode conditional independence
Efficient representation of the joint pdf
Allows arbitrary queries to be answered:
P(lung cancer=yes | smoking=no, dyspnoea=yes) = ?
Example: Printer Troubleshooting (Microsoft Windows 95)
[Figure: printer-troubleshooting Bayesian network with nodes such as Print Output OK, Correct Driver, Uncorrupted Driver, Correct Printer Path, Net Cable Connected, Net/Local Printing, Printer On and Online, Correct Local Port, Correct Printer Selected, Local Cable Connected, Application Output OK, Print Spooling On, Correct Driver Settings, Printer Memory Adequate, Network Up, Spooled Data OK, GDI Data Input OK, GDI Data Output OK, Print Data OK, PC to Printer Transport OK, Printer Data OK, Spool Process OK, Net Path OK, Local Path OK, Paper Loaded, Local Disk Space Adequate]
[Heckerman, 95]
Example: Microsoft Pregnancy and Child Care
[Heckerman, 95]
Independence Assumptions
Head-to-tail: Visit to Asia -> Tuberculosis -> Chest X-ray
Tail-to-tail: Lung Cancer <- Smoking -> Bronchitis
Head-to-head: Lung Cancer -> Dyspnoea <- Bronchitis
Independence Assumptions
Nodes X and Y are d-connected by nodes in Z along a trail from X to Y if:
  every head-to-head node along the trail is in Z or has a descendant in Z, and
  every other node along the trail is not in Z
Nodes X and Y are d-separated by Z if they are not d-connected by Z along any trail from X to Y
If X and Y are d-separated by Z, then X and Y are conditionally independent given Z
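To make the definition concrete, here is a minimal sketch (not part of the original slides) of a d-separation test using the standard construction: restrict the DAG to X, Y, Z and their ancestors, moralize it (connect co-parents, drop arc directions), delete Z, and check whether X and Y remain connected. The Asia-network encoding and the two queries below are illustrative.

```python
from collections import deque

def ancestors(dag, nodes):
    """Return the given nodes plus all of their ancestors in the DAG."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for parent in dag.get(stack.pop(), []):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result

def d_separated(dag, X, Y, Z):
    """dag maps each node to the list of its parents.
    Returns True iff X and Y are d-separated given Z."""
    keep = ancestors(dag, {X, Y} | set(Z))
    # Moralize: undirected parent-child edges plus edges between co-parents.
    adj = {v: set() for v in keep}
    for child in keep:
        parents = [p for p in dag.get(child, []) if p in keep]
        for p in parents:
            adj[child].add(p); adj[p].add(child)
        for i in range(len(parents)):
            for j in range(i + 1, len(parents)):
                adj[parents[i]].add(parents[j]); adj[parents[j]].add(parents[i])
    # Remove the conditioning set and test reachability from X to Y.
    blocked = set(Z)
    frontier, seen = deque([X]), {X}
    while frontier:
        v = frontier.popleft()
        if v == Y:
            return False            # still connected => d-connected
        for u in adj[v]:
            if u not in seen and u not in blocked:
                seen.add(u); frontier.append(u)
    return True

# The "Asia" network: node -> parents (A = Visit to Asia, S = Smoking, etc.)
asia = {"A": [], "S": [], "T": ["A"], "L": ["S"], "B": ["S"],
        "C": ["T", "L"], "D": ["T", "L", "B"]}
print(d_separated(asia, "L", "B", ["S"]))        # True: the tail-to-tail path is blocked
print(d_separated(asia, "L", "B", ["S", "D"]))   # False: conditioning on D opens the head-to-head
```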
Independence Assumptions
A variable (node) is conditionally independent of its non-descendants given its parents
[Figure: the Asia network (Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Chest X-ray, Dyspnoea)]
Independence Assumptions
[Figure: network with nodes Age, Gender, Exposure to Toxins, Smoking, Diet, Cancer, Serum Calcium, Lung Tumor]
Cancer is independent of Diet given Exposure to Toxins and Smoking
[Breese & Koller, 97]
Independence Assumptions
What this means is that the joint pdf can be represented as a product of local distributions:
P(A,S,T,L,B,C,D) = P(A) P(S|A) P(T|A,S) P(L|A,S,T) P(B|A,S,T,L) P(C|A,S,T,L,B) P(D|A,S,T,L,B,C)
                 = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)
[Figure: the Asia network]
Independence Assumptions
Thus, the general product rule for Bayesian networks is
P(X1, X2, …, Xn) = Π_{i=1}^{n} P(Xi | Pa(Xi))
where Pa(Xi) is the set of parents of Xi
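As an illustration of the product rule, a minimal sketch (not from the slides) that evaluates a joint probability as the product of local CPD entries, on a simplified fragment of the example network (S, L, B, D) with made-up numbers:

```python
# Each node: (list of parents, CPD given as a dict mapping
# (value, tuple of parent values) -> probability). Numbers are illustrative only.
network = {
    "S": ([], {(1, ()): 0.3, (0, ()): 0.7}),
    "L": (["S"], {(1, (1,)): 0.1, (0, (1,)): 0.9,
                  (1, (0,)): 0.01, (0, (0,)): 0.99}),
    "B": (["S"], {(1, (1,)): 0.4, (0, (1,)): 0.6,
                  (1, (0,)): 0.2, (0, (0,)): 0.8}),
    "D": (["L", "B"], {(1, (1, 1)): 0.9, (0, (1, 1)): 0.1,
                       (1, (1, 0)): 0.8, (0, (1, 0)): 0.2,
                       (1, (0, 1)): 0.7, (0, (0, 1)): 0.3,
                       (1, (0, 0)): 0.1, (0, (0, 0)): 0.9}),
}

def joint_probability(network, assignment):
    """P(x1,...,xn) = product over i of P(xi | Pa(xi))."""
    prob = 1.0
    for node, (parents, cpd) in network.items():
        parent_values = tuple(assignment[p] for p in parents)
        prob *= cpd[(assignment[node], parent_values)]
    return prob

# P(S=0, L=0, B=1, D=1) = P(S=0) P(L=0|S=0) P(B=1|S=0) P(D=1|L=0,B=1)
print(joint_probability(network, {"S": 0, "L": 0, "B": 1, "D": 1}))
# 0.7 * 0.99 * 0.2 * 0.7 = 0.09702
```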
The Knowledge Acquisition Task
Variables:
  collectively exhaustive, mutually exclusive values
  clarity test: a value should be knowable in principle
Structure:
  if data are available, can be learned
  constructed by hand (using “expert” knowledge)
  variable ordering matters: causal knowledge usually simplifies construction
Probabilities:
  can be learned from data
  the second decimal usually does not matter; relative probabilities do
  sensitivity analysis
The Knowledge Acquisition Task
[Figure: two networks over Battery, TurnOver, Start, Fuel, Gauge built with different variable orderings]
Variable order is important
Causal knowledge simplifies construction
The Knowledge Acquisition Task
Naive Bayesian Classifiers [Duda & Hart; Langley 92]
Selective Naive Bayesian Classifiers [Langley & Sage 94]
Conditional Trees [Geiger 92; Friedman et al 97]
Selective Bayesian Networks [Singh & Provan, 95; 96]
What are BNs useful for?
Diagnosis: P(cause | symptom) = ?
Prediction: P(symptom | cause) = ?
Classification: max_class P(class | data)
Decision-making (given a cost function)
Data mining: induce the best model from the data
Application areas: medicine, bio-informatics, computer troubleshooting, stock market, text classification, speech recognition
What are BNs useful for?
[Figure: predictive inference (cause -> effect) and diagnostic reasoning (effect -> cause); decision making by maximum expected utility, combining known predisposing factors, unknown but important variables, imperfect observations, a decision node and a value node]
[Figure: troubleshooting loop: salient observations -> assignment of belief over Fault 1, Fault 2, Fault 3, ... -> act now or halt? If not, choose the next best observation (value of information) and repeat; expected utility of “Do nothing”, “Action 1”, “Action 2” plotted against the probability of fault “i”]
Why use BNs?
Explicit management of uncertainty
Modularity implies maintainability
Better, flexible and robust decision making - MEU, VOI
Can be used to answer arbitrary queries - multiple-fault problems
Easy to incorporate prior knowledge
Easy to understand
Application Examples
Intellipath: commercial version of Pathfinder; lymph-node diseases (60), 100 findings
APRI system developed at AT&T Bell Labs: learns and uses Bayesian networks from data to identify customers liable to default on bill payments
NASA Vista system: predicts failures in propulsion systems, considers time criticality and suggests the highest-utility action, and dynamically decides what information to show
Application Examples
Answer Wizard in MS Office 95 / MS Project: Bayesian-network-based free-text help facility; uses naive Bayesian classifiers
Office Assistant in MS Office 97: extension of the Answer Wizard; uses naive Bayesian networks; help is based on past experience (keyboard/mouse use) and the task the user is doing currently; this is the “smiley face” you get in your MS Office applications
Application Examples
Microsoft Pregnancy and Child-Care: available on MSN in the Health section
Frequently occurring children’s symptoms are linked to expert modules that repeatedly ask parents relevant questions
Asks the next best question based on the information provided
Presents articles that are deemed relevant based on the information provided
Application Examples
Printer troubleshooting: HP bought a 40% stake in HUGIN and is developing printer troubleshooters for HP printers
Microsoft has 70+ online troubleshooters on their web site; these use Bayesian networks with multiple-fault models and incorporate utilities
Fax machine troubleshooting: Ricoh uses Bayesian-network-based troubleshooters at call centers; this enabled Ricoh to answer twice the number of calls in half the time
Online/Print Resources on BNs
Conferences & journals: UAI, ICML, AAAI, AISTAT, KDD; MLJ, DM&KD, JAIR, IEEE KDD, IJAR, IEEE PAMI
Books and papers:
  Bayesian Networks without Tears by Eugene Charniak. AI Magazine, Winter 1991.
  Probabilistic Reasoning in Intelligent Systems by Judea Pearl. Morgan Kaufmann, 1988.
  Probabilistic Reasoning in Expert Systems by Richard Neapolitan. Wiley, 1990.
  CACM special issue on real-world applications of BNs, March 1995
Online/Print Resources on BNs
Wealth of online information at www.auai.org
Links to:
  Electronic proceedings of the UAI conferences
  Other sites with information on BNs and reasoning under uncertainty
  Several tutorials and important articles
  Research groups & companies working in this area
  Other societies, mailing lists and conferences
Publicly available software for BNs
List of BN software maintained by Russell Almond at bayes.stat.washington.edu/almond/belief.html
Several free packages: generally for research only
Commercial packages: the most powerful (and expensive) is HUGIN; others include Netica and Dxpress
We are working on developing a Java-based BN toolkit here at Watson; it will also work within ABLE
“Road map”
Introduction: Bayesian networks
  What are BNs: representation, types, etc.
  Why use BNs: applications (classes) of BNs
  Information sources, software, etc.
Probabilistic inference
  Exact inference
  Approximate inference
Learning Bayesian Networks
  Learning parameters
  Learning graph structure
Summary
Probabilistic Inference Tasks
Belief updating:
  BEL(Xi) = P(Xi = xi | evidence)
Finding the most probable explanation (MPE):
  x* = argmax_x P(x, e)
Finding the maximum a-posteriori (MAP) hypothesis:
  (a1*, ..., ak*) = argmax_a Σ_{X\A} P(x, e),  where A ⊆ X are the hypothesis variables
Finding the maximum-expected-utility (MEU) decision:
  (d1*, ..., dk*) = argmax_d Σ_{X\D} P(x, e) U(x),  where D ⊆ X are the decision variables and U(x) is the utility function
Belief Updating
[Figure: network with Smoking, Lung Cancer, Bronchitis, X-ray, Dyspnoea]
P(lung cancer=yes | smoking=no, dyspnoea=yes) = ?
Belief updating: P(X | evidence) = ?
Example (“moral” graph over A, B, C, D, E):
P(a | e=0) ∝ P(a, e=0) = Σ_{b,c,d, e=0} P(a) P(b|a) P(c|a) P(d|b,a) P(e|b,c)
           = P(a) Σ_{e=0} Σ_d Σ_c P(c|a) Σ_b P(b|a) P(d|b,a) P(e|b,c)
           = P(a) Σ_{e=0} Σ_d Σ_c P(c|a) h^B(a, d, c, e)
Variable Elimination: sum variables out one at a time, recording the intermediate functions h
Bucket elimination: algorithm elim-bel (Dechter 1996)
bucket B:  P(b|a), P(d|b,a), P(e|b,c)   -- eliminate b (elimination operator Σ_b), producing h^B(a, d, c, e)
bucket C:  P(c|a), h^B(a, d, c, e)      -- producing h^C(a, d, e)
bucket D:  h^C(a, d, e)                 -- producing h^D(a, e)
bucket E:  e=0, h^D(a, e)               -- producing h^E(a)
bucket A:  P(a), h^E(a)                 -- yields P(a | e=0)
W* = 4, the “induced width” (max clique size)
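For concreteness, a minimal sketch (not part of the slides) of summation-based variable elimination on the five-variable example above, with made-up CPDs; it computes P(A, e=0) by summing out B, C, D, E and then normalizes to obtain P(A | e=0).

```python
from itertools import product

# A factor is (variables, table); table maps tuples of 0/1 values to numbers.
def make_factor(variables, probs):
    return (variables, dict(zip(product([0, 1], repeat=len(variables)), probs)))

def multiply(f, g):
    fv, ft = f; gv, gt = g
    var = list(dict.fromkeys(fv + gv))               # ordered union of scopes
    tab = {}
    for vals in product([0, 1], repeat=len(var)):
        a = dict(zip(var, vals))
        tab[vals] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return var, tab

def sum_out(f, x):
    fv, ft = f
    var = [v for v in fv if v != x]
    tab = {}
    for vals, p in ft.items():
        key = tuple(v for v, n in zip(vals, fv) if n != x)
        tab[key] = tab.get(key, 0.0) + p
    return var, tab

def eliminate(factors, order):
    """Sum out the variables in 'order'; return the product of what remains."""
    for x in order:
        bucket = [f for f in factors if x in f[0]]    # the bucket of x
        rest = [f for f in factors if x not in f[0]]
        if not bucket:
            continue
        prod = bucket[0]
        for f in bucket[1:]:
            prod = multiply(prod, f)
        factors = rest + [sum_out(prod, x)]           # record h and pass it down
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# Network A -> B, A -> C, {A,B} -> D, {B,C} -> E with illustrative CPDs.
pa   = make_factor(["A"], [0.4, 0.6])                          # P(a=0), P(a=1)
pba  = make_factor(["B", "A"], [0.7, 0.2, 0.3, 0.8])           # P(b|a)
pca  = make_factor(["C", "A"], [0.9, 0.5, 0.1, 0.5])           # P(c|a)
pdba = make_factor(["D", "B", "A"], [0.6, 0.1, 0.3, 0.4, 0.4, 0.9, 0.7, 0.6])
pebc = make_factor(["E", "B", "C"], [0.8, 0.3, 0.5, 0.2, 0.2, 0.7, 0.5, 0.8])

# Clamp the evidence e=0 with an indicator factor, then eliminate B, C, D, E.
evidence = (["E"], {(0,): 1.0, (1,): 0.0})
joint_a = eliminate([pa, pba, pca, pdba, pebc, evidence], ["B", "C", "D", "E"])
z = sum(joint_a[1].values())
print({a[0]: p / z for a, p in joint_a[1].items()})            # P(A | e=0)
```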
Finding the MPE: algorithm elim-mpe (Dechter 1996)
MPE = max_x P(x)
Σ is replaced by max:  MPE = max_{a,e,d,c,b} P(a) P(c|a) P(b|a) P(d|b,a) P(e|b,c)
bucket B:  P(b|a), P(d|b,a), P(e|b,c)   -- eliminate b (elimination operator max_b), producing h^B(a, d, c, e)
bucket C:  P(c|a), h^B(a, d, c, e)      -- producing h^C(a, d, e)
bucket D:  h^C(a, d, e)                 -- producing h^D(a, e)
bucket E:  e=0, h^D(a, e)               -- producing h^E(a)
bucket A:  P(a), h^E(a)                 -- yields the MPE value
W* = 4, the “induced width” (max clique size)
Generating the MPE-tuple
bucket B:  P(b|a), P(d|b,a), P(e|b,c)
bucket C:  P(c|a), h^B(a, d, c, e)
bucket D:  h^C(a, d, e)
bucket E:  e=0, h^D(a, e)
bucket A:  P(a), h^E(a)
1. a' = argmax_a P(a) · h^E(a)
2. e' = 0
3. d' = argmax_d h^C(a', d, e')
4. c' = argmax_c P(c|a') · h^B(a', d', c, e')
5. b' = argmax_b P(b|a') P(d'|b, a') P(e'|b, c')
Return (a', b', c', d', e')
Complexity of inference
O(n · exp(w*(d))), where w*(d) is the induced width of the moral graph along ordering d
The effect of the ordering: w*(d1) = 4, w*(d2) = 2
[Figure: the “moral” graph over A, B, C, D, E and two elimination orderings, shown as node stacks B, C, D, E, A and E, D, C, B, A]
Other tasks and algorithms
MAP and MEU tasks: similar bucket-elimination algorithms, elim-map and elim-meu (Dechter 1996)
Elimination operation: either summation or maximization
Restriction on the variable ordering: summation must precede maximization (i.e., hypothesis or decision variables are eliminated last)
Other inference algorithms: join-tree clustering, Pearl’s poly-tree propagation, conditioning, etc.
Relationship with join-tree clustering
Ordering: A, B, C, D, E
bucket(E):  P(e|b,c)
bucket(D):  P(d|a,b)
bucket(C):  P(c|a) || h^E(b,c)
bucket(B):  P(b|a) || h^D(a,b), h^C(a,b)
bucket(A):  P(a) || h^B(a)
The bucket scopes correspond to the join-tree clusters ABC, BCE, ADB
A cluster is a set of buckets (a “super-bucket”)
Relationship with Pearl’s belief propagation in poly-trees
Pearl’s belief propagation for a single-root query
[Figure: poly-tree with parents U1, U2, U3 of X1, children Z1, Z2, Z3 of the U’s, and child Y1 of X1; “causal support” messages π(x1) flow along the arcs and “diagnostic support” messages λ_Z1(u1), λ_Z2(u2), λ_Z3(u3), λ_Y1(x1) flow against them, e.g. using P(z1|u1)]
Equivalent to elim-bel using a topological ordering and super-buckets for families
Elim-bel, elim-mpe, and elim-map are linear for poly-trees.
“Road map”
Introduction: Bayesian networks
Probabilistic inference
  Exact inference
  Approximate inference
Learning Bayesian Networks
  Learning parameters
  Learning graph structure
Summary
Inference is NP-hard => approximations
Exact inference: O(n · exp(w*))
Approximations:
  Local inference
  Stochastic simulations
  Variational approximations
  etc.
[Figure: a small network over S, X, D, B, C illustrating the local inference idea]
Local Inference Idea
Bucket-elimination approximation: “mini-buckets”
Local inference idea: bound the size of recorded dependencies
Computation in a bucket is time and space exponential in the number of variables involved
Therefore, partition the functions in a bucket into “mini-buckets” over smaller numbers of variables
Mini-bucket approximation: MPE task
Split a bucket into mini-buckets => bound complexity
Exponential complexity decrease: O(e^n) -> O(e^r) + O(e^(n-r)), replacing one function over n variables by functions h and g over r and n-r variables
Approx-mpe(i)
Input: i - max number of variables allowed in a mini-bucket
Output: [lower bound (P of a sub-optimal solution), upper bound]
Example: approx-mpe(3) versus elim-mpe
[Figure: the mini-bucket partition has induced width w* = 2, versus w* = 4 for the exact buckets]
Properties of approx-mpe(i)
Complexity: O(exp(2i)) time and O(exp(i)) space.
Accuracy: determined by the upper/lower (U/L) bound.
As i increases, both accuracy and complexity increase.
Possible uses of mini-bucket approximations:
  As anytime algorithms (Dechter and Rish, 1997)
  As heuristics in best-first search (Kask and Dechter, 1999)
Other tasks: similar mini-bucket approximations for belief updating, MAP and MEU (Dechter and Rish, 1997)
Anytime Approximation
anytime-mpe(ε):
Initialize: i = 0
While time and space resources are available:
  i = i + 1  (step)
  U = upper bound computed by approx-mpe(i)
  L = lower bound computed by approx-mpe(i)
  keep the best solution found so far
  if U/L ≤ 1 + ε, return the solution
end
Return the largest L and the smallest U found so far
Empirical Evaluation (Dechter and Rish, 1997; Rish, 1999)
Benchmarks:
  Randomly generated networks: uniform random probabilities, random noisy-OR
  CPCS networks
  Probabilistic decoding
Comparing approx-mpe and anytime-mpe versus elim-mpe
Random networks, uniform random: 60 nodes, 90 edges (200 instances)
  In 80% of cases, 10-100 times speed-up while U/L < 2
Noisy-OR: even better results
  Exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.
  Noisy-OR CPD: P(x = 0 | y1, ..., yn) = Π_i q_i^(y_i), where the q_i are random noise parameters
[Figure: anytime-mpe(0.0001) U/L error vs. time (sec, log scale) and parameter i (i = 1 to 21), on cpcs422b and cpcs360b]
CPCS networks - medical diagnosis (noisy-OR model)
Test case: no evidence
Algorithm             Time (sec), cpcs360   Time (sec), cpcs422
anytime-mpe(10^-4)    70.3                  505.2
anytime-mpe(10^-1)    70.3                  110.5
elim-mpe              115.8                 1697.6
[Figure: histograms of log(U/L) for i=10 on 1000 instances of random evidence and on 1000 instances of likely evidence]
Effect of evidence
Likely evidence versus random (unlikely) evidence
More likely evidence => higher MPE => higher accuracy (why?)
Probabilistic decoding
Error-correcting linear block codes
State of the art: an approximate algorithm, iterative belief propagation (IBP) (Pearl’s poly-tree algorithm applied to loopy networks)
approx-mpe vs. IBP (bit error rate (BER) as a function of noise sigma):
  IBP is better on randomly generated (high-w*) codes
  approx-mpe is better on low-w* codes
Mini-buckets: summary
Mini-buckets: a local inference approximation
Idea: bound the size of the recorded functions
Approx-mpe(i): a mini-bucket algorithm for MPE
  Better results for noisy-OR than for random problems
  Accuracy increases with decreasing noise
  Accuracy increases for likely evidence
  Sparser graphs -> higher accuracy
  Coding networks: approx-mpe outperforms IBP on low-induced-width codes
“Road map”
Introduction: Bayesian networks
Probabilistic inference
  Exact inference
  Approximate inference
    Local inference
    Stochastic simulations
    Variational approximations
Learning Bayesian Networks
Summary
Approximation via Sampling
1. Generate samples (s_1, ..., s_N) from P(X), where s_i = (x_1^i, x_2^i, ..., x_n^i)
2. Estimate probabilities by frequencies: P(Y = y) ≈ (# samples with Y = y) / N
3. How to handle evidence E?
   - acceptance-rejection (e.g., forward sampling)
   - “clamping” evidence nodes to their values:
     * likelihood weighting
     * Gibbs sampling (MCMC)
Forward Sampling (logic sampling (Henrion, 1988))
Input: a Bayesian network, evidence E, N = # of samples, an ancestral ordering (X_1, ..., X_n) of the nodes
Output: samples consistent with E
1. For sample = 1 to N:
2.   For i = 1 to n:
3.     Sample a value x_i for X_i from P(x_i | pa_i)
4.     If X_i ∈ E and x_i is inconsistent with the evidence, reject the sample
5.       and go back to steps 1 and 2 (start a new sample)
Forward sampling (example)
[Figure: network X1 -> X2, X1 -> X3, {X2, X3} -> X4, with CPDs P(x1), P(x2|x1), P(x3|x1), P(x4|x2,x3)]
Evidence: x3 = 0
// generate sample k
1. Sample x1 from P(x1)
2. Sample x2 from P(x2 | x1)
3. Sample x3 from P(x3 | x1)
4. If x3 contradicts the evidence, reject the sample and start again from 1; otherwise
5. Sample x4 from P(x4 | x2, x3)
Drawback: high rejection rate!
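A minimal sketch (binary variables, made-up CPDs for the four-node example above; not part of the slides) of forward sampling with rejection:

```python
import random

# CPDs for the 4-node example: X1 -> X2, X1 -> X3, (X2, X3) -> X4. Numbers are illustrative.
p_x1 = 0.6                                     # P(X1 = 1)
p_x2 = {0: 0.3, 1: 0.8}                        # P(X2 = 1 | X1)
p_x3 = {0: 0.1, 1: 0.5}                        # P(X3 = 1 | X1)
p_x4 = {(0, 0): 0.2, (0, 1): 0.6,              # P(X4 = 1 | X2, X3)
        (1, 0): 0.7, (1, 1): 0.9}

def bernoulli(p):
    return 1 if random.random() < p else 0

def forward_sample():
    """Sample all nodes in ancestral order X1, X2, X3, X4."""
    x1 = bernoulli(p_x1)
    x2 = bernoulli(p_x2[x1])
    x3 = bernoulli(p_x3[x1])
    x4 = bernoulli(p_x4[(x2, x3)])
    return {"X1": x1, "X2": x2, "X3": x3, "X4": x4}

def rejection_estimate(query_var, evidence, n_samples=100_000):
    """Estimate P(query_var = 1 | evidence) by forward sampling with rejection."""
    kept = hits = 0
    for _ in range(n_samples):
        s = forward_sample()
        if any(s[v] != val for v, val in evidence.items()):
            continue                            # reject samples violating the evidence
        kept += 1
        hits += s[query_var]
    return hits / kept if kept else float("nan")

random.seed(0)
print(rejection_estimate("X1", {"X3": 0}))      # approx P(X1=1 | X3=0)
```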
Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990)
“Clamping” evidence + forward sampling + weighting samples by the evidence likelihood
1. For each X_i ∈ E, assign x_i = e_i
2. Find an ancestral ordering (X_1, ..., X_n) of the nodes
3. For sample = 1 to N:
4.   For each X_i ∉ E:
5.     Sample x_i from P(x_i | pa_i)
6.   score(sample) = Π_{X_i ∈ E} P(e_i | pa_i)
7. Normalize the scores; then P(Y = y | e) ≈ Σ_{samples s with Y = y} score(s) / Σ_s score(s)
Works well for likely evidence!
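Continuing the same illustrative example (this reuses the CPDs p_x1..p_x4 and the bernoulli helper from the forward-sampling sketch above), a minimal likelihood-weighting sketch: evidence nodes are clamped and each sample is weighted by the probability of the evidence given its parents, so no samples are rejected.

```python
def likelihood_weighting(query_var, evidence, n_samples=100_000):
    """Estimate P(query_var = 1 | evidence) by likelihood weighting."""
    total = hit = 0.0
    for _ in range(n_samples):
        sample, weight = dict(evidence), 1.0
        # Walk the nodes in ancestral order X1, X2, X3, X4.
        for var, cond in (("X1", lambda s: p_x1),
                          ("X2", lambda s: p_x2[s["X1"]]),
                          ("X3", lambda s: p_x3[s["X1"]]),
                          ("X4", lambda s: p_x4[(s["X2"], s["X3"])])):
            p1 = cond(sample)                   # P(var = 1 | parents in this sample)
            if var in evidence:
                weight *= p1 if evidence[var] == 1 else 1.0 - p1
            else:
                sample[var] = bernoulli(p1)
        total += weight
        hit += weight * sample[query_var]
    return hit / total

print(likelihood_weighting("X1", {"X3": 0}))    # approx P(X1=1 | X3=0), no rejections
```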
Gibbs Sampling (Geman and Geman, 1984)
Markov Chain Monte Carlo (MCMC): create a Markov chain of samples
1. For each X_i ∈ E, assign x_i = e_i
2. For each X_i ∉ E, assign a random value x_i
3. For sample = 1 to N:
4.   For each X_i ∉ E:
5.     Sample x_i from P(x_i | X \ {X_i})
Advantage: guaranteed to converge to P(X)
Disadvantage: convergence may be slow
Gibbs Sampling (cont’d) (Pearl, 1988)
Important: P(x_i | X \ {X_i}) is computed locally:
P(x_i | X \ {X_i}) ∝ P(x_i | pa_i) · Π_{X_j ∈ ch(X_i)} P(x_j | pa_j)
Markov blanket: M(X_i) = pa_i ∪ ch(X_i) ∪ { parents of ch(X_i) }
Given its Markov blanket (parents, children, and their parents), X_i is independent of all other nodes
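A minimal Gibbs-sampling sketch for the same four-node example (again reusing p_x1..p_x4 and bernoulli from the forward-sampling sketch; not part of the slides). Each conditional is obtained by evaluating the joint for both values of the resampled variable and normalizing, which equals the Markov-blanket computation since factors outside the blanket cancel.

```python
def gibbs_conditional(var, state):
    """P(var = 1 | all other variables), by evaluating the joint for var = 0, 1
    and normalizing (factors outside var's Markov blanket cancel)."""
    weights = []
    for v in (0, 1):
        s = dict(state, **{var: v})
        w = p_x1 if s["X1"] else 1 - p_x1
        w *= p_x2[s["X1"]] if s["X2"] else 1 - p_x2[s["X1"]]
        w *= p_x3[s["X1"]] if s["X3"] else 1 - p_x3[s["X1"]]
        w *= p_x4[(s["X2"], s["X3"])] if s["X4"] else 1 - p_x4[(s["X2"], s["X3"])]
        weights.append(w)
    return weights[1] / (weights[0] + weights[1])

def gibbs_estimate(query_var, evidence, n_samples=50_000, burn_in=1_000):
    """Estimate P(query_var = 1 | evidence) with a single Gibbs chain."""
    state = {v: evidence.get(v, random.randint(0, 1))
             for v in ("X1", "X2", "X3", "X4")}
    hits = 0
    for t in range(n_samples + burn_in):
        for var in state:
            if var in evidence:
                continue                         # evidence stays clamped
            state[var] = bernoulli(gibbs_conditional(var, state))
        if t >= burn_in:
            hits += state[query_var]
    return hits / n_samples

print(gibbs_estimate("X1", {"X3": 0}))           # approx P(X1=1 | X3=0)
```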
“Road map”
Introduction: Bayesian networks
Probabilistic inference
  Exact inference
  Approximate inference
    Local inference
    Stochastic simulations
    Variational approximations
Learning Bayesian Networks
Summary
Variational Approximations
Idea: a variational transformation of the CPDs simplifies inference
Advantages: computes upper and lower bounds on P(Y); usually faster than sampling techniques
Disadvantages: more complex and less general; must be derived for each particular form of CPD function
Variational bounds: example
log(x) ≤ λx − log λ − 1, and log(x) = min_λ { λx − log λ − 1 }, where λ is a variational parameter
This approach can be generalized to any concave (convex) function in order to compute its upper (lower) bounds: convex duality (Jaakkola and Jordan, 1997)
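A quick check (not in the slides) that minimizing the bound over λ indeed recovers log x:

\[
\frac{d}{d\lambda}\left(\lambda x - \log\lambda - 1\right) = x - \frac{1}{\lambda} = 0
\;\Rightarrow\; \lambda^{*} = \frac{1}{x},
\qquad
\lambda^{*}x - \log\lambda^{*} - 1 = 1 + \log x - 1 = \log x ,
\]

so λx − log λ − 1 ≥ log x for every λ > 0, with equality at λ = 1/x.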
Convex duality (Jaakkola and Jordan, 1997)
1. If f(x) is concave, it has a concave dual function f*(λ) such that
   f(x) = min_λ { λ^T x − f*(λ) }
   f*(λ) = min_x { λ^T x − f(x) }
   and we get upper bounds: f(x) ≤ λ^T x − f*(λ)
2. For convex f(x), we get lower bounds.
Example: QMR-DT network (Quick Medical Reference - Decision-Theoretic (Shwe et al., 1991))
[Figure: two-layer network with ~600 disease nodes d1, d2, ..., dk and ~4000 finding nodes f1, f2, f3, ..., fn]
Noisy-OR model:
P(f_i = 0 | d) = (1 − q_i0) Π_{j ∈ pa_i} (1 − q_ij)^(d_j)
               = exp( −θ_i0 − Σ_{j ∈ pa_i} θ_ij d_j ),  where θ_ij = −log(1 − q_ij)
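A small sketch (not from the slides; variable names and numbers are illustrative) of evaluating a noisy-OR CPD in both the product and exponential forms:

```python
import math

def noisy_or_p_f0(d, q, q0=0.0):
    """P(f = 0 | d) for a noisy-OR finding with leak probability q0 and
    link probabilities q[j]; d[j] in {0,1} marks whether disease j is present."""
    p = 1.0 - q0
    for dj, qj in zip(d, q):
        if dj:
            p *= 1.0 - qj
    return p

def noisy_or_p_f0_exp(d, q, q0=0.0):
    """Equivalent exponential form with theta_ij = -log(1 - q_ij)."""
    theta0 = -math.log(1.0 - q0) if q0 else 0.0
    theta = [-math.log(1.0 - qj) for qj in q]
    return math.exp(-theta0 - sum(t * dj for t, dj in zip(theta, d)))

d = [1, 0, 1]            # diseases 0 and 2 present
q = [0.6, 0.3, 0.2]      # made-up link probabilities
print(noisy_or_p_f0(d, q), noisy_or_p_f0_exp(d, q))   # both ~0.32
```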
Inference in QMR-DT
Inference complexity: O(exp(min{p, k})), where p = # of positive findings and k = max family size (Heckerman, 1989 (“Quickscore”); Rish and Dechter, 1998)
P(d | f) ∝ P(f | d) P(d) = [ Π_{i: f_i = 1} P(f_i = 1 | d) ] · [ Π_{i: f_i = 0} P(f_i = 0 | d) ] · Π_j P(d_j)
Negative findings keep the factorized exponential form P(f_i = 0 | d) = e^( −θ_i0 − Σ_{j ∈ pa_i} θ_ij d_j )
Positive findings contribute factors 1 − e^( −θ_i0 − Σ_{j ∈ pa_i} θ_ij d_j ): positive evidence “couples” the disease nodes
Inference: P(d_1 | f) ∝ Σ_{d_2, ..., d_k} P(d, f), where the negative-finding part remains factorized
Variational approach to QMR-DT (Jaakkola and Jordan, 1997)
f(x) = ln(1 − e^(−x)) is concave and has a concave dual f*(λ) = −λ ln λ + (λ + 1) ln(λ + 1)
Then P(f_i = 1 | d) = 1 − e^( −θ_i0 − Σ_{j ∈ pa_i} θ_ij d_j ) can be bounded by:
P(f_i = 1 | d) ≤ e^( λ_i (θ_i0 + Σ_{j ∈ pa_i} θ_ij d_j) − f*(λ_i) ) = e^( λ_i θ_i0 − f*(λ_i) ) Π_{j ∈ pa_i} [ e^( λ_i θ_ij ) ]^(d_j)
The effect of positive evidence is now factorized (the diseases are “decoupled”)
Variational approximations
Bounds on the local CPDs yield a bound on the posterior
Two approaches: sequential and block
Sequential: applies the variational transformation to (a subset of) nodes sequentially during inference, using a heuristic node ordering; then optimizes over the variational parameters
Block: selects in advance the nodes to be transformed, then selects the variational parameters minimizing the KL-distance between the true and approximate posteriors
Block approach
P(Y | E): the exact posterior of Y given evidence E
Q(Y | E, λ): the approximation obtained after replacing some CPDs with their variational bounds
Find λ* = argmin_λ D(Q || P), where D(Q || P) is the Kullback-Leibler (KL) distance:
D(Q || P) = Σ_S Q(S) log [ Q(S) / P(S) ]
Inference in BN: summary
Exact inference is often intractable => need approximations
Approximation principles:
  Approximating elimination: local inference, bounding the size of dependencies among variables (cliques in the problem’s graph) - mini-buckets, IBP
  Other approximations: stochastic simulations, variational techniques, etc.
Further research:
  Combining “orthogonal” approximation approaches
  Better understanding of “what works well where”: which approximation suits which problem structure
  Other approximation paradigms (e.g., other ways of approximating probabilities, constraints, cost functions)
“Road map”
Introduction: Bayesian networks
Probabilistic inference
  Exact inference
  Approximate inference
Learning Bayesian Networks
  Learning parameters
  Learning graph structure
Summary
Why learn Bayesian networks?
Incremental learning: combine a prior P(H) with new data records
Learning causal relationships (e.g., S -> C)
Efficient representation and inference
Handling missing data, e.g. records such as <1.3  2.8  ??  0  1> and <??  5.6  0  10  ??>
Combining domain expert knowledge with data
Learning Bayesian Networks
Known graph - learn parameters
  Complete data: parameter estimation (ML, MAP)
  Incomplete data: non-linear parametric optimization (gradient descent, EM)
  [Figure: a fixed network over S, B, C, X, D with CPDs P(S), P(B|S), P(C|S), P(X|C,S), P(D|C,B)]
Unknown graph - learn graph and parameters
  Complete data: optimization (search in the space of graphs), Ĝ = argmax_G Score(G)
  Incomplete data: EM plus multiple imputation, structural EM, mixture models
Learning Parameters: complete data
ML-estimate: max_Θ log P(D | Θ) - decomposable!
MAP-estimate (Bayesian statistics): max_Θ log P(D | Θ) P(Θ)
  Conjugate priors - Dirichlet: Dir(θ_{X|pa_X} | α_1, ..., α_m)
For a multinomial CPD P(X | pa_X) (e.g., node X with parents such as C and B):
  ML(θ_{x|pa_X}) = N_{x,pa_X} / N_{pa_X}    (counts)
  MAP(θ_{x|pa_X}) = (N_{x,pa_X} + α_{x,pa_X}) / (N_{pa_X} + α_{pa_X})
  where α is the equivalent sample size (prior knowledge)
Complete data => local computations
Incomplete data (score non-decomposable) => stochastic methods
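A minimal sketch (not from the slides) of counting-based ML/MAP estimation of a single CPD from complete data, with a uniform Dirichlet pseudo-count alpha playing the role of the equivalent sample size; the variables and data are illustrative:

```python
from collections import Counter

def estimate_cpd(data, child, parents, values=(0, 1), alpha=1.0):
    """ML/MAP estimate of P(child | parents) from complete data.
    data: list of dicts mapping variable name -> value.
    alpha: Dirichlet pseudo-count per cell (alpha = 0 gives the pure ML estimate)."""
    joint = Counter()          # N_{x, pa}
    parent_count = Counter()   # N_{pa}
    for record in data:
        pa = tuple(record[p] for p in parents)
        joint[(record[child], pa)] += 1
        parent_count[pa] += 1
    cpd = {}
    for pa in parent_count:
        for x in values:
            cpd[(x, pa)] = (joint[(x, pa)] + alpha) / \
                           (parent_count[pa] + alpha * len(values))
    return cpd

# Tiny made-up data set over S (smoking) and B (bronchitis).
data = [{"S": 1, "B": 1}, {"S": 1, "B": 0}, {"S": 1, "B": 1},
        {"S": 0, "B": 0}, {"S": 0, "B": 0}, {"S": 0, "B": 1}]
print(estimate_cpd(data, "B", ["S"], alpha=0.0))   # ML: P(B|S) from raw counts
print(estimate_cpd(data, "B", ["S"], alpha=1.0))   # MAP with a uniform Dirichlet prior
```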
Learning graph structure
Find Ĝ = argmax_G Score(G)
NP-hard optimization
Heuristic search: local moves over candidate graphs, e.g. over C, S, B: add S->B, delete S->B, reverse S->B
Constraint-based methods: the data impose independence relations (constraints)
Learning BNs: incomplete data
Learning parameters:
  EM algorithm [Lauritzen, 95]
  Gibbs sampling [Heckerman, 96]
  Gradient descent [Russell et al., 96]
Learning both structure and parameters:
  Sum over missing values [Cooper & Herskovits, 92; Cooper, 95]
  Monte-Carlo approaches [Heckerman, 96]
  Gaussian approximation [Heckerman, 96]
  Structural EM [Friedman, 98]
  EM and multiple imputation [Singh 97, 98, 00]
Learning Parameters: incomplete data
Non-decomposable marginal likelihood (hidden nodes)
EM algorithm: iterate until convergence
  Start from initial parameters; current model (G, Θ)
  Expectation: inference, e.g. P(S | X=0, D=1, C=0, B=1), converts data records with missing entries (e.g. <? 0 1 0 1>) into expected counts
  Maximization: update the parameters (ML, MAP) from the expected counts
Learning Parameters: incomplete data (Lauritzen, 95)
The complete-data log-likelihood is Σ_{ijk} N_ijk log θ_ijk
E step: compute E(N_ijk | Y_obs, Θ)
M step: θ_ijk = E(N_ijk | Y_obs, Θ) / E(N_ij | Y_obs, Θ)
Learning structure: incomplete data
Depends on the type of missing data - missing independent of anything else (MCAR) OR missing based on values of other variables (MAR)
While MCAR can be resolved by decomposable scores, MAR cannot
For likelihood-based methods, no need to explicitly model missing data mechanism
Very few attempts at MAR: stochastic methods
Learning structure: incomplete data
Approximate EM using multiple imputation to yield an efficient Monte-Carlo method [Singh 97, 98, 00]
  trade-off between performance & quality; the learned network is almost optimal
  approximates the complete-data log-likelihood function using multiple imputation
  yields a decomposable score, dependent only on each node & its parents
  converges to a local maximum of the observed-data likelihood
Learning structure: incomplete data
Scoring functions: Minimum Description Length (MDL)
MDL(BN | D) = −log P(D | G, Θ) + (|Θ| / 2) log N
            =  DL(Data | model)  +  DL(Model)
Learning <=> data compression
Other scores: MDL = −BIC (Bayesian Information Criterion); Bayesian score (BDe), asymptotically equivalent to MDL
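For complete data, the MDL score above can be computed directly from counts. A minimal sketch (binary variables assumed, ML parameters from raw counts; the structures and data are illustrative, not from the slides):

```python
import math
from collections import Counter

def mdl_score(data, structure, values=(0, 1)):
    """MDL(BN|D) = -log P(D | G, theta_ML) + (|theta|/2) log N   (lower is better).
    structure: dict mapping each variable to a list of its parents.
    data: list of complete, discrete records (dicts). Binary variables assumed."""
    n = len(data)
    loglik = 0.0
    n_params = 0
    for child, parents in structure.items():
        joint, parent_count = Counter(), Counter()
        for r in data:
            pa = tuple(r[p] for p in parents)
            joint[(r[child], pa)] += 1
            parent_count[pa] += 1
        # Log-likelihood term: sum over (x, pa) of N(x,pa) log( N(x,pa) / N(pa) )
        for (x, pa), nxpa in joint.items():
            loglik += nxpa * math.log(nxpa / parent_count[pa])
        # (|values|-1) free parameters per parent configuration
        n_params += (len(values) - 1) * len(values) ** len(parents)
    return -loglik + 0.5 * n_params * math.log(n)

data = [{"S": 1, "B": 1}, {"S": 1, "B": 0}, {"S": 1, "B": 1},
        {"S": 0, "B": 0}, {"S": 0, "B": 0}, {"S": 0, "B": 1}]
print(mdl_score(data, {"S": [], "B": []}))      # independent model
print(mdl_score(data, {"S": [], "B": ["S"]}))   # model with edge S -> B
```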
Learning Structure plus Parameters
p(Y | D) = Σ_M p(Y | M, D) p(M | D)
The number of models is super-exponential
Alternatives: model selection or model averaging
Model Selection
Generally, choose a single model M*; this is equivalent to saying P(M* | D) = 1
p(Y | D) ≈ p(Y | M*, D)
The task is now to: 1) define a metric to decide which model is best; 2) search for that model through the space of all models
One Reasonable Score: Posterior Probability of a Structure
p(S^h | D) ∝ p(S^h) p(D | S^h)
p(D | S^h) = ∫ p(D | θ_s, S^h) p(θ_s | S^h) dθ_s
(structure prior × parameter prior × likelihood)
Global and Local Predictive Scores [Spiegelhalter et al 93]
Global: log p(D | S^h) = Σ_{l=1}^{m} log p(x_l | x_1, ..., x_{l-1}, S^h)
                       = log p(x_1 | S^h) + log p(x_2 | x_1, S^h) + log p(x_3 | x_1, x_2, S^h) + ...
Bayes factor: p(D | S^h) / p(D | S_0^h)
The local score is useful for diagnostic problems
Local Predictive Score (Spiegelhalter et al., 1993)
pred(S^h) = Σ_{l=1}^{m} log p(y_l | x_l, d_1, ..., d_{l-1}, S^h)
[Figure: structure with a disease node Y and symptom nodes X1, X2, ..., Xn]
Exact computation of p(D | S^h) [Cooper & Herskovits, 92]
Assumptions: no missing data; cases are independent, given the model; uniform priors on parameters; discrete variables
p(D | S^h) = Π_{i=1}^{n} g(i, Pa_i)
Bayesian Dirichlet Score (Cooper and Herskovits, 1991)
p(D | S^h) = Π_{i=1}^{n} Π_{j=1}^{q_i} [ Γ(N'_ij) / Γ(N'_ij + N_ij) ] Π_{k=1}^{r_i} [ Γ(N'_ijk + N_ijk) / Γ(N'_ijk) ]
where
  N_ijk: # cases where X_i = k and Pa_i takes its j-th configuration
  r_i:   number of states of X_i
  q_i:   number of instantiations of the parents of X_i
  N_ij = Σ_{k=1}^{r_i} N_ijk,  N'_ij = Σ_{k=1}^{r_i} N'_ijk
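A minimal sketch (not from the slides) of the log Bayesian Dirichlet score for complete discrete data, with uniform pseudo-counts N'_ijk = prior (the K2-style choice when prior = 1); the structure dictionary and the tiny data set are illustrative:

```python
from math import lgamma
from collections import Counter

def log_bd_score(data, structure, values=(0, 1), prior=1.0):
    """log p(D | S^h) under the Bayesian Dirichlet score with uniform
    pseudo-counts N'_ijk = prior. structure: variable -> list of parents."""
    score = 0.0
    for child, parents in structure.items():
        counts = Counter()         # N_ijk, keyed by (parent config, child value)
        parent_counts = Counter()  # N_ij
        for r in data:
            pa = tuple(r[p] for p in parents)
            counts[(pa, r[child])] += 1
            parent_counts[pa] += 1
        r_i = len(values)
        for pa, n_ij in parent_counts.items():
            # Gamma(N'_ij) / Gamma(N'_ij + N_ij), with N'_ij = r_i * prior
            score += lgamma(r_i * prior) - lgamma(r_i * prior + n_ij)
            for x in values:
                n_ijk = counts[(pa, x)]
                # Gamma(N'_ijk + N_ijk) / Gamma(N'_ijk)
                score += lgamma(prior + n_ijk) - lgamma(prior)
    return score

data = [{"S": 1, "B": 1}, {"S": 1, "B": 0}, {"S": 1, "B": 1},
        {"S": 0, "B": 0}, {"S": 0, "B": 0}, {"S": 0, "B": 1}]
print(log_bd_score(data, {"S": [], "B": []}))     # independent structure
print(log_bd_score(data, {"S": [], "B": ["S"]}))  # structure with S -> B
```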
Learning BNs without specifying an ordering [Singh & Valtorta, 95]
There are n! orderings, and the ordering greatly affects the quality of the learned network.
Use conditional independence tests and d-separation to obtain an ordering.
Learning BNs via the MDL principle [Lam & Bacchus, 93]
Idea: the best model is the one that gives the most compact representation of the data.
So, encode the data using the model plus encode the model, and minimize this total length.
Learning BNs: summary
Bayesian networks - graphical probabilistic models
Efficient representation and inference
Expert knowledge + learning from data
Learning:
  parameters (parameter estimation, EM)
  structure (optimization with score functions, e.g., MDL)
Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TAN-BLT (SRI))
Future directions: causality, time, model evaluation criteria, approximate inference/learning, on-line learning, etc.