Transcript
Page 1: A Tutorial on Inference and Learning

A Tutorial on Inference and Learning in Bayesian Networks

Irina Rish, Moninder Singh

IBM T.J. Watson Research Center, rish,[email protected]

Page 2: A Tutorial on Inference and Learning

"Road map"

Introduction: Bayesian networks
  What are BNs: representation, types, etc.
  Why use BNs: applications (classes) of BNs
  Information sources, software, etc.
Probabilistic inference
  Exact inference
  Approximate inference
Learning Bayesian Networks
  Learning parameters
  Learning graph structure
Summary

Page 3: A Tutorial on Inference and Learning

Bayesian Networks

P(A, S, T, L, B, C, D) = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)

Conditional independencies => efficient representation: BN = (G, Θ)

Example CPD, P(D | T, L, B):
T L B | P(D=0) P(D=1)
0 0 0 |  0.1    0.9
0 0 1 |  0.7    0.3
0 1 0 |  0.8    0.2
0 1 1 |  0.9    0.1
...

[Figure: the "Asia" network. Nodes: Visit to Asia (A), Smoking (S), Tuberculosis (T), Lung Cancer (L), Bronchitis (B), Chest X-ray (C), Dyspnoea (D), each annotated with its CPD: P(A), P(S), P(T|A), P(L|S), P(B|S), P(C|T,L), P(D|T,L,B).]

[Lauritzen & Spiegelhalter, 95]

Page 4: A Tutorial on Inference and Learning

Bayesian Networks

Structured, graphical representation of probabilistic relationships between several random variables.
Explicit representation of conditional independencies: missing arcs encode conditional independence.
Efficient representation of the joint pdf.
Allows arbitrary queries to be answered, e.g. P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?

Page 5: A Tutorial on Inference and Learning

Example: Printer Troubleshooting (Microsoft Windows 95)

[Figure: printer troubleshooting Bayesian network, with nodes including Application Output OK, GDI Data Input OK, GDI Data Output OK, Print Data OK, Spooled Data OK, Spool Process OK, Print Spooling On, Correct Driver, Uncorrupted Driver, Correct Driver Settings, Correct Printer Selected, Correct Printer Path, Correct Local Port, Net/Local Printing, Net Cable Connected, Local Cable Connected, Network Up, Net Path OK, Local Path OK, PC to Printer Transport OK, Printer On and Online, Printer Data OK, Printer Memory Adequate, Paper Loaded, Local Disk Space Adequate, Print Output OK.]

[Heckerman, 95]

Page 6: A Tutorial on Inference and Learning

[Heckerman, 95]

Example: Microsoft Pregnancy and Child Care

Page 7: A Tutorial on Inference and Learning

[Heckerman, 95]

Example: Microsoft Pregnancy and Child Care

Page 8: A Tutorial on Inference and Learning

Independence Assumptions

Head-to-tail: Visit to Asia -> Tuberculosis -> Chest X-ray
Tail-to-tail: Lung Cancer <- Smoking -> Bronchitis
Head-to-head: Lung Cancer -> Dyspnoea <- Bronchitis

Page 9: A Tutorial on Inference and Learning

Independence Assumptions

Nodes X and Y are d-connected by nodes in Z along a trail from X to Y if every head-to-head node along the trail is in Z or has a descendant in Z, and every other node along the trail is not in Z.

Nodes X and Y are d-separated by nodes in Z if they are not d-connected by Z along any trail from X to Y.

If nodes X and Y are d-separated by Z, then X and Y are conditionally independent given Z.
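To make the criterion concrete, here is a minimal Python sketch (an illustration, not code from the tutorial) that tests d-separation using the equivalent ancestral-moralization criterion: X and Y are d-separated by Z exactly when they are disconnected in the moralized ancestral graph of {X, Y} and Z once the nodes in Z are removed. The "Asia" network from the earlier slide is encoded as a node-to-parents dictionary.

```python
from collections import deque

def ancestors(dag, nodes):
    """The given nodes plus all their ancestors (dag: node -> list of parents)."""
    result, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in result:
            result.add(n)
            stack.extend(dag[n])
    return result

def d_separated(dag, x, y, z):
    keep = ancestors(dag, {x, y} | set(z))
    # Moralize the ancestral graph: link each node to its parents and "marry" co-parents.
    adj = {n: set() for n in keep}
    for n in keep:
        pars = [p for p in dag[n] if p in keep]
        for p in pars:
            adj[n].add(p); adj[p].add(n)
        for i in range(len(pars)):
            for j in range(i + 1, len(pars)):
                adj[pars[i]].add(pars[j]); adj[pars[j]].add(pars[i])
    # Remove Z, then check whether x can still reach y.
    blocked, queue, seen = set(z), deque([x]), {x}
    while queue:
        n = queue.popleft()
        if n == y:
            return False                     # still connected => d-connected
        for m in adj[n]:
            if m not in seen and m not in blocked:
                seen.add(m); queue.append(m)
    return True                              # disconnected => d-separated

# The "Asia" network, written as node -> parents.
asia = {'A': [], 'S': [], 'T': ['A'], 'L': ['S'], 'B': ['S'],
        'C': ['T', 'L'], 'D': ['T', 'L', 'B']}
print(d_separated(asia, 'A', 'S', []))       # True: every trail is blocked by an unobserved collider
print(d_separated(asia, 'A', 'S', ['D']))    # False: observing head-to-head node D activates a trail
```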

Page 10: A Tutorial on Inference and Learning

Independence Assumptions

A variable (node) is conditionally independent of its non-descendants given its parents.

[Figure: the "Asia" network; e.g., Dyspnoea is independent of Visit to Asia and Smoking given its parents Tuberculosis, Lung Cancer, and Bronchitis.]

Page 11: A Tutorial on Inference and Learning

Independence Assumptions

[Figure: a network with nodes Age, Gender, Exposure to Toxins, Smoking, Diet, Cancer, Serum Calcium, Lung Tumor.]

Cancer is independent of Diet given Exposure to Toxins and Smoking.

[Breese & Koller, 97]

Page 12: A Tutorial on Inference and Learning

Independence Assumptions

This means that the joint pdf can be represented as a product of local distributions:

P(A,S,T,L,B,C,D) = P(A) P(S|A) P(T|A,S) P(L|A,S,T) P(B|A,S,T,L) P(C|A,S,T,L,B) P(D|A,S,T,L,B,C)
                 = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)

[Figure: the "Asia" network.]

Page 13: A Tutorial on Inference and Learning

Independence Assumptions

Thus, the general product rule for Bayesian networks is

P(X1, X2, ..., Xn) = Π_{i=1}^{n} P(Xi | Pa(Xi)),

where Pa(Xi) is the set of parents of Xi.
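To make the product rule concrete, here is a tiny Python sketch (illustrative only; the CPT numbers are made up): the joint probability of a full assignment is just a product of one local CPD lookup per node.

```python
def joint_prob(assignment, parents, cpts):
    """P(x1,...,xn) = prod_i P(xi | pa(xi)) for a full assignment {node: value}."""
    p = 1.0
    for node, value in assignment.items():
        pa_values = tuple(assignment[pa] for pa in parents[node])
        p *= cpts[node][pa_values][value]
    return p

# Two-node fragment Smoking (S) -> Bronchitis (B); CPT entries are made up.
parents = {'S': [], 'B': ['S']}
cpts = {'S': {(): {0: 0.7, 1: 0.3}},                    # P(S)
        'B': {(0,): {0: 0.9, 1: 0.1},                   # P(B | S=0)
              (1,): {0: 0.4, 1: 0.6}}}                  # P(B | S=1)
print(joint_prob({'S': 1, 'B': 1}, parents, cpts))      # 0.3 * 0.6 = 0.18
```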

Page 14: A Tutorial on Inference and Learning

The Knowledge Acquisition Task

Variables:
  collectively exhaustive, mutually exclusive values
  clarity test: a value should be knowable in principle

Structure:
  if data are available, it can be learned
  constructed by hand (using "expert" knowledge)
  variable ordering matters: causal knowledge usually simplifies the network

Probabilities:
  can be learned from data
  the second decimal usually does not matter; relative probabilities do
  sensitivity analysis

Page 15: A Tutorial on Inference and Learning

The Knowledge Acquisition Task

Variable order is important.

[Figure: two networks over the variables Fuel, Gauge, Start, Battery, TurnOver, built from two different variable orderings.]

Causal knowledge simplifies construction.

Page 16: A Tutorial on Inference and Learning

The Knowledge Acquisition Task

Naive Bayesian classifiers [Duda & Hart; Langley 92]
Selective naive Bayesian classifiers [Langley & Sage 94]
Conditional trees [Geiger 92; Friedman et al. 97]

Page 17: A Tutorial on Inference and Learning

The Knowledge Acquisition Task

Selective Bayesian Networks [Singh & Provan, 95;96]

Page 18: A Tutorial on Inference and Learning

What are BNs useful for?

Diagnosis: P(cause | symptom) = ?
Prediction: P(symptom | cause) = ?
Classification: max over classes of P(class | data)
Decision-making (given a cost function)
Data mining: induce the best model from data

Application domains: medicine, bio-informatics, computer troubleshooting, stock market, text classification, speech recognition.

Page 19: A Tutorial on Inference and Learning

What are BNs useful for?

[Figure: predictive inference (Cause -> Effect), diagnostic reasoning (Effect -> Cause), and decision making by maximum expected utility, with a Decision node, a Value node, known predisposing factors, imperfect observations, and an unknown but important variable.]

Page 20: A Tutorial on Inference and Learning

What are BNs useful for?

[Figure: value-of-information loop. Salient observations produce an assignment of belief over Fault 1, Fault 2, Fault 3, ... (the probability of fault "i"). Based on the expected utility of each action (Action 1, Action 2, Do nothing), decide whether to halt and act now; if not, acquire the next best observation (value of information) and repeat with the new observation.]

Value of Information

Page 21: A Tutorial on Inference and Learning

Why use BNs?

Explicit management of uncertainty
Modularity implies maintainability
Better, flexible and robust decision making (MEU, VOI)
Can be used to answer arbitrary queries, including multiple-fault problems
Easy to incorporate prior knowledge
Easy to understand

Page 22: A Tutorial on Inference and Learning

Application Examples

Intellipath
  commercial version of Pathfinder
  lymph-node diseases (60), 100 findings

APRI system developed at AT&T Bell Labs
  learns and uses Bayesian networks from data to identify customers liable to default on bill payments

NASA Vista system
  predicts failures in propulsion systems
  considers time criticality and suggests the highest-utility action
  dynamically decides what information to show

Page 23: A Tutorial on Inference and Learning

Application Examples

Answer Wizard in MS Office 95 / MS Project
  Bayesian-network-based free-text help facility
  uses naive Bayesian classifiers

Office Assistant in MS Office 97
  extension of the Answer Wizard
  uses naive Bayesian networks
  help based on past experience (keyboard/mouse use) and the task the user is currently doing
  this is the "smiley face" you get in your MS Office applications

Page 24: A Tutorial on Inference and Learning

Application Examples

Microsoft Pregnancy and Child Care
  available on MSN in the Health section
  frequently occurring children's symptoms are linked to expert modules that repeatedly ask parents relevant questions
  asks the next best question based on the information provided
  presents articles that are deemed relevant based on the information provided

Page 25: A Tutorial on Inference and Learning

Application Examples

Printer troubleshooting
  HP bought a 40% stake in HUGIN; developing printer troubleshooters for HP printers
  Microsoft has 70+ online troubleshooters on their web site; they use Bayesian networks with multiple-fault models and incorporate utilities

Fax machine troubleshooting
  Ricoh uses Bayesian-network-based troubleshooters at call centers
  enabled Ricoh to answer twice the number of calls in half the time

Page 26: A Tutorial on Inference and Learning

Application Examples

Page 27: A Tutorial on Inference and Learning

Application Examples

Page 28: A Tutorial on Inference and Learning

Application Examples

Page 29: A Tutorial on Inference and Learning

Online/print resources on BNs

Conferences & journals:
  UAI, ICML, AAAI, AISTAT, KDD
  MLJ, DM&KD, JAIR, IEEE KDD, IJAR, IEEE PAMI

Books and papers:
  "Bayesian Networks without Tears" by Eugene Charniak, AI Magazine, Winter 1991
  "Probabilistic Reasoning in Intelligent Systems" by Judea Pearl, Morgan Kaufmann, 1988
  "Probabilistic Reasoning in Expert Systems" by Richard Neapolitan, Wiley, 1990
  CACM special issue on real-world applications of BNs, March 1995

Page 30: A Tutorial on Inference and Learning

Online/Print Resources on BNs

Wealth of online information at www.auai.org, with links to:
  electronic proceedings of the UAI conferences
  other sites with information on BNs and reasoning under uncertainty
  several tutorials and important articles
  research groups & companies working in this area
  other societies, mailing lists and conferences

Page 31: A Tutorial on Inference and Learning

Publicly available s/w for BNs

A list of BN software is maintained by Russell Almond at bayes.stat.washington.edu/almond/belief.html
  several free packages: generally research only
  commercial packages: the most powerful (and expensive) is HUGIN; others include Netica and Dxpress
  we are working on developing a Java-based BN toolkit here at Watson; it will also work within ABLE

Page 32: A Tutorial on Inference and Learning

"Road map"

Introduction: Bayesian networks
  What are BNs: representation, types, etc.
  Why use BNs: applications (classes) of BNs
  Information sources, software, etc.
Probabilistic inference
  Exact inference
  Approximate inference
Learning Bayesian Networks
  Learning parameters
  Learning graph structure
Summary

Page 33: A Tutorial on Inference and Learning

Probabilistic Inference Tasks

Belief updating:
  BEL(Xi) = P(Xi = xi | evidence)

Finding the most probable explanation (MPE):
  x* = argmax_x P(x, e)

Finding the maximum a-posteriori hypothesis (MAP):
  (a1*, ..., ak*) = argmax_a Σ_{X\A} P(x, e),   A ⊆ X: hypothesis variables

Finding the maximum-expected-utility (MEU) decision:
  (d1*, ..., dk*) = argmax_d Σ_{X\D} P(x, e) U(x),   D ⊆ X: decision variables, U(x): utility function

Page 34: A Tutorial on Inference and Learning

Belief Updating

[Figure: the "Asia" network fragment with Smoking, Lung Cancer, Bronchitis, X-ray, Dyspnoea.]

P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?

Page 35: A Tutorial on Inference and Learning

Belief updating: P(X | evidence) = ?

Example network: A -> B, A -> C, {A,B} -> D, {B,C} -> E ("moral" graph shown on the slide).

P(a | e=0) ∝ P(a, e=0) = Σ_{b,c,d,e=0} P(a) P(b|a) P(c|a) P(d|b,a) P(e|b,c)
  = P(a) Σ_{e=0} Σ_d Σ_c P(c|a) Σ_b P(b|a) P(d|b,a) P(e|b,c)
  = P(a) Σ_{e=0} Σ_d Σ_c P(c|a) h^B(a,d,c,e)

Variable elimination: push the summations inside the product and record intermediate functions such as h^B.

Page 36: A Tutorial on Inference and Learning

Bucket elimination: algorithm elim-bel (Dechter 1996)

bucket B:  P(b|a), P(d|b,a), P(e|b,c)   --Σ_b-->   h^B(a,d,c,e)
bucket C:  P(c|a), h^B(a,d,c,e)         --Σ_c-->   h^C(a,d,e)
bucket D:  h^C(a,d,e)                   --Σ_d-->   h^D(a,e)
bucket E:  e=0, h^D(a,e)                           h^E(a)
bucket A:  P(a), h^E(a)                 -->        P(a|e=0)

Elimination operator: summation over the bucket's variable.
W* = 4: "induced width" (max clique size).
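As an illustration of the bucket/variable-elimination mechanics (a sketch, not the tutorial's own code), the following Python snippet computes P(A | E=0) on the example network A -> B, A -> C, {A,B} -> D, {B,C} -> E. Factors are plain dictionaries, and the CPT entries are generated at random purely for illustration.

```python
from itertools import product
import random

# A factor is (variables, table); table maps a tuple of values (ordered as in
# `variables`) to a number. All variables are binary {0, 1}.
def multiply(f1, f2):
    v1, t1 = f1
    v2, t2 = f2
    vs = list(v1) + [v for v in v2 if v not in v1]
    table = {}
    for assign in product([0, 1], repeat=len(vs)):
        a = dict(zip(vs, assign))
        table[assign] = t1[tuple(a[v] for v in v1)] * t2[tuple(a[v] for v in v2)]
    return (vs, table)

def sum_out(f, var):
    vs, t = f
    keep = [v for v in vs if v != var]
    table = {}
    for assign, val in t.items():
        key = tuple(a for v, a in zip(vs, assign) if v != var)
        table[key] = table.get(key, 0.0) + val
    return (keep, table)

def restrict(f, var, value):
    """Clamp evidence var = value and drop that variable from the factor."""
    vs, t = f
    keep = [v for v in vs if v != var]
    table = {tuple(a for v, a in zip(vs, assign) if v != var): val
             for assign, val in t.items() if dict(zip(vs, assign))[var] == value}
    return (keep, table)

def eliminate(factors, ordering):
    """Process one bucket per variable: multiply its factors and sum the variable out."""
    for var in ordering:
        bucket = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        if not bucket:
            continue
        prod_f = bucket[0]
        for f in bucket[1:]:
            prod_f = multiply(prod_f, f)
        factors = rest + [sum_out(prod_f, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

def cpt(child, pars):
    """A random CPT P(child | pars), purely for illustration."""
    table = {}
    for pa in product([0, 1], repeat=len(pars)):
        p1 = random.random()
        table[(0,) + pa] = 1 - p1
        table[(1,) + pa] = p1
    return ([child] + pars, table)

random.seed(0)
factors = [cpt('A', []), cpt('B', ['A']), cpt('C', ['A']),
           cpt('D', ['A', 'B']), cpt('E', ['B', 'C'])]
factors = [restrict(f, 'E', 0) if 'E' in f[0] else f for f in factors]   # evidence e = 0
vs, t = eliminate(factors, ['D', 'C', 'B'])      # leaves a factor over A, proportional to P(A, e=0)
z = sum(t.values())
print({a[0]: v / z for a, v in t.items()})       # P(A | e=0)
```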

Page 37: A Tutorial on Inference and Learning

Finding MPE: algorithm elim-mpe (Dechter 1996)

MPE = max_x P(x); Σ is replaced by max:
MPE = max_{a,e,d,c,b} P(a) P(c|a) P(b|a) P(d|b,a) P(e|b,c)

bucket B:  P(b|a), P(d|b,a), P(e|b,c)   --max_b-->  h^B(a,d,c,e)
bucket C:  P(c|a), h^B(a,d,c,e)         --max_c-->  h^C(a,d,e)
bucket D:  h^C(a,d,e)                   --max_d-->  h^D(a,e)
bucket E:  e=0, h^D(a,e)                            h^E(a)
bucket A:  P(a), h^E(a)                 -->         MPE

Elimination operator: maximization over the bucket's variable.
W* = 4: "induced width" (max clique size).

Page 38: A Tutorial on Inference and Learning

Generating the MPE tuple (process the buckets in reverse, from A back to B):

bucket B:  P(b|a), P(d|b,a), P(e|b,c)
bucket C:  P(c|a), h^B(a,d,c,e)
bucket D:  h^C(a,d,e)
bucket E:  e=0, h^D(a,e)
bucket A:  P(a), h^E(a)

1. a' = argmax_a P(a) h^E(a)
2. e' = 0
3. d' = argmax_d h^C(a', d, e')
4. c' = argmax_c P(c|a') h^B(a', d', c, e')
5. b' = argmax_b P(b|a') P(d'|b, a') P(e'|b, c')
Return (a', b', c', d', e')

Page 39: A Tutorial on Inference and Learning

Complexity of inference: O(n exp(w*(d))), where w*(d) is the induced width of the moral graph along ordering d.

The effect of the ordering:

[Figure: the same "moral" graph processed along two different orderings, one with w*(d1) = 4 and one with w*(d2) = 2.]

Page 40: A Tutorial on Inference and Learning

Other tasks and algorithms

MAP and MEU tasks:
  similar bucket-elimination algorithms: elim-map, elim-meu (Dechter 1996)
  elimination operation: either summation or maximization
  restriction on the variable ordering: summation must precede maximization (i.e., hypothesis or decision variables are eliminated last)

Other inference algorithms:
  join-tree clustering
  Pearl's poly-tree propagation
  conditioning, etc.

Page 41: A Tutorial on Inference and Learning

Relationship with join-tree clustering

[Figure: the buckets for ordering A, B, C, D, E (bucket(E): P(e|b,c); bucket(D): P(d|a,b); bucket(C): P(c|a); bucket(B): P(b|a); bucket(A): P(a), together with the recorded h functions) grouped into clusters such as ABC, BCE, ADB.]

A cluster is a set of buckets (a "super-bucket").

Page 42: A Tutorial on Inference and Learning

Relationship with Pearl’s belief propagation in poly-trees

Pearl's belief propagation for a single-root query is equivalent to elim-bel using a topological ordering and super-buckets for families.

[Figure: a poly-tree with nodes Z1, Z2, Z3, U1, U2, U3, X1, Y1 and the "causal support" and "diagnostic support" messages (π and λ) passed between them.]

elim-bel, elim-mpe, and elim-map are linear for poly-trees.

Page 43: A Tutorial on Inference and Learning

"Road map"

Introduction: Bayesian networks
Probabilistic inference
  Exact inference
  Approximate inference
Learning Bayesian Networks
  Learning parameters
  Learning graph structure
Summary

Page 44: A Tutorial on Inference and Learning

Inference is NP-hard => approximations

Exact inference is O(n exp(w*)).

Approximations:
  local inference
  stochastic simulations
  variational approximations
  etc.

Page 45: A Tutorial on Inference and Learning

Local Inference Idea

Page 46: A Tutorial on Inference and Learning

Bucket-elimination approximation: "mini-buckets"

Local inference idea: bound the size of recorded dependencies.

Computation in a bucket is time and space exponential in the number of variables involved. Therefore, partition the functions in a bucket into "mini-buckets" over smaller numbers of variables.

Page 47: A Tutorial on Inference and Learning

Mini-bucket approximation: MPE task

Split a bucket into mini-buckets => bound complexity:

max_X Π_i h_i Π_j g_j  <=  (max_X Π_i h_i) (max_X Π_j g_j)

Exponential complexity decrease: O(e^n) becomes O(e^r) + O(e^(n-r)).
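The bound behind the mini-bucket split can be seen in one line of arithmetic; the numbers below are made up:

```python
# For functions that share the bucket variable b:
#   max_b f(b) * g(b)  <=  (max_b f(b)) * (max_b g(b)),
# so processing mini-buckets separately yields an upper bound on the exact bucket result.
f = {0: 0.2, 1: 0.9}     # e.g. f(b) = P(d | b, a) for some fixed a, d (illustrative values)
g = {0: 0.8, 1: 0.1}     # e.g. g(b) = P(e | b, c) for some fixed c, e

exact = max(f[b] * g[b] for b in (0, 1))      # full bucket: 0.16
upper = max(f.values()) * max(g.values())     # mini-buckets: 0.72 >= 0.16
print(exact, upper)
```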

Page 48: A Tutorial on Inference and Learning

Approx-mpe(i)

Input: i, the maximum number of variables allowed in a mini-bucket.
Output: [lower bound (the probability of a sub-optimal solution), upper bound].

Example: approx-mpe(3) versus elim-mpe: the approximation behaves as if the induced width were w* = 2 instead of w* = 4.

Page 49: A Tutorial on Inference and Learning

Properties of approx-mpe(i)

Complexity: O(exp(2i)) time and O(exp(i)) space.
Accuracy: determined by the upper/lower (U/L) bound.
As i increases, both accuracy and complexity increase.

Possible uses of mini-bucket approximations:
  as anytime algorithms (Dechter and Rish, 1997)
  as heuristics in best-first search (Kask and Dechter, 1999)

Other tasks: similar mini-bucket approximations for belief updating, MAP and MEU (Dechter and Rish, 1997).

Page 50: A Tutorial on Inference and Learning

Anytime Approximation

anytime-mpe(ε):
  Initialize: i = 0
  While time and space resources are available:
    i = i + 1
    U = upper bound computed by approx-mpe(i)
    L = lower bound computed by approx-mpe(i); keep the best solution found so far
    if U/L <= 1 + ε, return the solution
  Return the solution with the largest L and the smallest U found so far.

Page 51: A Tutorial on Inference and Learning

Empirical Evaluation (Dechter and Rish, 1997; Rish, 1999)

Randomly generated networks: uniform random probabilities, random noisy-OR
CPCS networks
Probabilistic decoding

Comparing approx-mpe and anytime-mpe versus elim-mpe.

Page 52: A Tutorial on Inference and Learning

Random networks

Uniform random: 60 nodes, 90 edges (200 instances).
  In 80% of the cases, a 10-100 times speed-up while U/L < 2.
Noisy-OR: even better results.
  Exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.

Noisy-OR CPD: P(x = 0 | y_1, ..., y_n) = Π_i q_i^{y_i}, where q_i is a random noise parameter.

Page 53: A Tutorial on Inference and Learning

CPCS networks – medical diagnosis (noisy-OR model)

[Figure: anytime-mpe(0.0001) U/L error versus time and parameter i (i = 1 to 21) for cpcs360b and cpcs422b.]

Test case: no evidence.

Algorithm            | Time (sec), cpcs360 | Time (sec), cpcs422
anytime-mpe(10^-4)   | 70.3                | 505.2
anytime-mpe(10^-1)   | 70.3                | 110.5
elim-mpe             | 115.8               | 1697.6

Page 54: A Tutorial on Inference and Learning

Effect of evidence

[Figure: histograms of log(U/L) for i=10 on 1000 instances, comparing likely evidence versus random (unlikely) evidence.]

More likely evidence => higher MPE => higher accuracy (why?).

Page 55: A Tutorial on Inference and Learning

Probabilistic decoding

Error-correcting linear block code

State-of-the-art: approximate algorithm – iterative belief propagation (IBP) (Pearl’s poly-tree algorithm applied to loopy networks)

Page 56: A Tutorial on Inference and Learning

approx-mpe vs. IBP

Bit error rate (BER) as a function of noise (sigma):
  IBP is better on randomly generated (high-w*) codes;
  approx-mpe is better on low-w* codes.

Page 57: A Tutorial on Inference and Learning

Mini-buckets: summary

Mini-buckets: a local-inference approximation.
Idea: bound the size of the recorded functions.
Approx-mpe(i): a mini-bucket algorithm for MPE.
  Better results for noisy-OR than for random problems.
  Accuracy increases with decreasing noise.
  Accuracy increases for likely evidence.
  Sparser graphs -> higher accuracy.
  Coding networks: approx-mpe outperforms IBP on low-induced-width codes.

Page 58: A Tutorial on Inference and Learning

"Road map"

Introduction: Bayesian networks
Probabilistic inference
  Exact inference
  Approximate inference
    Local inference
    Stochastic simulations
    Variational approximations
Learning Bayesian Networks
Summary

Page 59: A Tutorial on Inference and Learning

Approximation via Sampling

1. Generate samples s_1, ..., s_N from P(X), where s_i = (x_1^i, ..., x_n^i).
2. Estimate probabilities by frequencies:
   P(Y = y) ≈ #{samples with Y = y} / N
3. How to handle evidence E?
   acceptance-rejection (e.g., forward sampling)
   "clamping" evidence nodes to their values:
     likelihood weighting
     Gibbs sampling (MCMC)

Page 60: A Tutorial on Inference and Learning

Forward Sampling (logic sampling, Henrion 1988)

Input: evidence E, N = number of samples, an ancestral ordering X_1, ..., X_n of the nodes.
Output: samples consistent with E.

1. For sample = 1 to N:
2.   For i = 1 to n:
3.     sample x_i from P(X_i | pa_i)
4.     if X_i ∈ E and x_i is inconsistent with E, reject the sample and
5.     go to step 1.

Page 61: A Tutorial on Inference and Learning

Forward sampling (example)

Network: X1 -> X2, X1 -> X3, {X2, X3} -> X4, with CPDs P(x1), P(x2|x1), P(x3|x1), P(x4|x2,x3). Evidence: X3 = 0.

// generate sample k
1. Sample x1 from P(x1).
2. Sample x2 from P(x2|x1).
3. Sample x3 from P(x3|x1).
4. If x3 is inconsistent with the evidence, reject the sample and start again from step 1;
5. otherwise, sample x4 from P(x4|x2,x3).

Drawback: high rejection rate!
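Here is a short Python sketch of forward (logic) sampling with rejection on this four-node example; the CPT numbers are invented for illustration, and the accepted samples are used to estimate P(X4 = 1 | X3 = 0).

```python
import random

# Forward (logic) sampling with rejection on the example X1 -> X2, X1 -> X3,
# {X2, X3} -> X4, with evidence X3 = 0. All CPT numbers are made up.
P_x1 = 0.6                                        # P(X1 = 1)
P_x2 = {0: 0.3, 1: 0.8}                           # P(X2 = 1 | X1)
P_x3 = {0: 0.2, 1: 0.7}                           # P(X3 = 1 | X1)
P_x4 = {(0, 0): 0.1, (0, 1): 0.5,                 # P(X4 = 1 | X2, X3)
        (1, 0): 0.4, (1, 1): 0.9}

def bern(p):
    return 1 if random.random() < p else 0

def forward_sample(evidence, n_samples=100000):
    kept = []
    for _ in range(n_samples):
        x1 = bern(P_x1)
        x2 = bern(P_x2[x1])
        x3 = bern(P_x3[x1])
        if 'X3' in evidence and x3 != evidence['X3']:
            continue                               # reject: sample disagrees with the evidence
        x4 = bern(P_x4[(x2, x3)])
        kept.append({'X1': x1, 'X2': x2, 'X3': x3, 'X4': x4})
    return kept

random.seed(0)
samples = forward_sample({'X3': 0})
print(sum(s['X4'] for s in samples) / len(samples))   # estimate of P(X4 = 1 | X3 = 0)
```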

Page 62: A Tutorial on Inference and Learning

Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990)

"Clamping" evidence + forward sampling + weighing samples by the likelihood of the evidence.

1. For each X_i ∈ E, assign x_i = e_i.
2. Find an ancestral ordering X_1, ..., X_n of the nodes.
3. For sample = 1 to N:
4.   For each X_i ∉ E:
5.     sample x_i from P(X_i | pa_i)
6.   score(sample) = Π_{X_i ∈ E} P(e_i | pa_i)
7. Normalize the scores:
   P(Y = y | E) ≈ Σ_{samples where Y = y} score(sample) / Σ_{all samples} score(sample)

Works well for likely evidence!
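A matching Python sketch of likelihood weighting on the same example (illustrative CPTs again): the evidence node X3 is clamped rather than sampled, and each sample is weighted by the likelihood of that evidence given its sampled parents.

```python
import random

P_x1 = 0.6
P_x2 = {0: 0.3, 1: 0.8}
P_x3 = {0: 0.2, 1: 0.7}
P_x4 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}

def bern(p):
    return 1 if random.random() < p else 0

def likelihood_weighting(evidence, n_samples=100000):
    weighted = []
    for _ in range(n_samples):
        w = 1.0
        x1 = bern(P_x1)
        x2 = bern(P_x2[x1])
        if 'X3' in evidence:
            x3 = evidence['X3']                    # clamp the evidence node
            w *= P_x3[x1] if x3 == 1 else 1 - P_x3[x1]
        else:
            x3 = bern(P_x3[x1])
        x4 = bern(P_x4[(x2, x3)])
        weighted.append(({'X1': x1, 'X2': x2, 'X3': x3, 'X4': x4}, w))
    return weighted

random.seed(0)
samples = likelihood_weighting({'X3': 0})
num = sum(w for s, w in samples if s['X4'] == 1)
print(num / sum(w for _, w in samples))            # estimate of P(X4 = 1 | X3 = 0)
```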

Page 63: A Tutorial on Inference and Learning

Gibbs Sampling (Geman and Geman, 1984)

Markov Chain Monte Carlo (MCMC): create a Markov chain of samples.

1. For each X_i ∈ E, assign x_i = e_i.
2. For each X_i ∉ E, assign a random value x_i.
3. For sample = 1 to N:
4.   For each X_i ∉ E:
5.     sample x_i from P(x_i | X \ {X_i})

Advantage: guaranteed to converge to P(X).
Disadvantage: convergence may be slow.

Page 64: A Tutorial on Inference and Learning

Gibbs Sampling (cont'd) (Pearl, 1988)

Important: P(x_i | X \ {X_i}) is computed locally:

P(x_i | X \ {X_i}) ∝ P(x_i | pa_i) Π_{X_j ∈ ch_i} P(x_j | pa_j)

Markov blanket: M_i = pa_i ∪ ch_i ∪ pa(ch_i).
Given its Markov blanket (parents, children, and their parents), X_i is independent of all other nodes.
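A small Python sketch of Gibbs sampling for the same four-node example (illustrative CPTs): each non-evidence variable is resampled from its conditional given its Markov blanket, computed locally as on the slide.

```python
import random

parents = {'X1': [], 'X2': ['X1'], 'X3': ['X1'], 'X4': ['X2', 'X3']}
children = {'X1': ['X2', 'X3'], 'X2': ['X4'], 'X3': ['X4'], 'X4': []}
cpt = {'X1': {(): 0.6},                              # P(node = 1 | parent values), made up
       'X2': {(0,): 0.3, (1,): 0.8},
       'X3': {(0,): 0.2, (1,): 0.7},
       'X4': {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}}

def local_prob(node, value, state):
    p1 = cpt[node][tuple(state[p] for p in parents[node])]
    return p1 if value == 1 else 1 - p1

def gibbs(evidence, n_samples=20000, burn_in=1000):
    state = dict(evidence)
    free = [n for n in parents if n not in evidence]
    for n in free:
        state[n] = random.randint(0, 1)              # random initialization
    samples = []
    for t in range(n_samples + burn_in):
        for n in free:
            weights = []
            for v in (0, 1):                         # unnormalized P(n = v | Markov blanket)
                state[n] = v
                w = local_prob(n, v, state)
                for c in children[n]:
                    w *= local_prob(c, state[c], state)
                weights.append(w)
            p1 = weights[1] / (weights[0] + weights[1])
            state[n] = 1 if random.random() < p1 else 0
        if t >= burn_in:
            samples.append(dict(state))
    return samples

random.seed(0)
samples = gibbs({'X3': 0})
print(sum(s['X4'] for s in samples) / len(samples))   # estimate of P(X4 = 1 | X3 = 0)
```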

Page 65: A Tutorial on Inference and Learning

"Road map"

Introduction: Bayesian networks
Probabilistic inference
  Exact inference
  Approximate inference
    Local inference
    Stochastic simulations
    Variational approximations
Learning Bayesian Networks
Summary

Page 66: A Tutorial on Inference and Learning

Variational Approximations

Idea: a variational transformation of the CPDs simplifies inference.

Advantages:
  compute upper and lower bounds on P(Y)
  usually faster than sampling techniques

Disadvantages:
  more complex and less general: the bounds must be derived for each particular form of CPD

Page 67: A Tutorial on Inference and Learning

Variational bounds: example

log(x) <= λ x - log λ - 1 for any λ > 0, with equality at λ = 1/x, i.e.

log(x) = min_λ { λ x - log λ - 1 },   λ: variational parameter.

This approach can be generalized to any concave (convex) function in order to compute its upper (lower) bounds: convex duality (Jaakkola and Jordan, 1997).
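A quick numeric check of this bound (illustrative values of x and λ): the bound holds for every λ > 0 and is tight at λ = 1/x.

```python
import math

x = 3.0
for lam in (0.1, 1.0 / x, 1.0):
    bound = lam * x - math.log(lam) - 1
    print(lam, bound, ">=", math.log(x))   # tight (equal) when lam = 1/x
```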

Page 68: A Tutorial on Inference and Learning

Convex duality (Jaakkola and Jordan, 1997)

1. If f(x) is concave, it has a dual (conjugate) function f*(λ) such that

   f(x)  = min_λ { λᵀx - f*(λ) }
   f*(λ) = min_x { λᵀx - f(x) }

   and we get the family of upper bounds f(x) <= λᵀx - f*(λ).

2. For convex f(x) we get lower bounds.

Page 69: A Tutorial on Inference and Learning

Example: QMR-DT network (Quick Medical Reference – Decision-Theoretic, Shwe et al., 1991)

A two-layer network: about 600 disease nodes d_1, ..., d_k on top and about 4000 finding nodes f_1, ..., f_n below.

Noisy-OR model:

P(f_i = 0 | d) = (1 - q_i0) Π_{j ∈ pa_i} (1 - q_ij)^{d_j}
             = exp( -θ_i0 - Σ_{j ∈ pa_i} θ_ij d_j ),   where θ_ij = -log(1 - q_ij).
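A few lines of Python make the noisy-OR CPD and its exponential form concrete (the q parameters are made up):

```python
import math

# Noisy-OR CPD used in QMR-DT: a finding stays "off" only if the leak and every active
# parent disease independently fail to activate it. The q values below are made up.
def noisy_or_p_f0(d, q, q0):
    """P(f = 0 | d) = (1 - q0) * prod_j (1 - q_j)^(d_j)."""
    p = 1.0 - q0
    for d_j, q_j in zip(d, q):
        if d_j:
            p *= 1.0 - q_j
    return p

d  = [1, 0, 1]           # disease indicators d_j
q  = [0.8, 0.5, 0.3]     # q_ij: probability that disease j alone activates finding i
q0 = 0.05                # leak probability q_i0
print(noisy_or_p_f0(d, q, q0))

# The equivalent exponential form from the slide, with theta_ij = -log(1 - q_ij):
theta0 = -math.log(1 - q0)
theta = [-math.log(1 - qj) for qj in q]
print(math.exp(-theta0 - sum(t * dj for t, dj in zip(theta, d))))   # same value
```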

Page 70: A Tutorial on Inference and Learning

Inference in QMR-DT

Inference complexity: O(exp(min{p, k})), where p = # of positive findings, k = max family size (Heckerman, 1989 ("Quickscore"); Rish and Dechter, 1998).

P(d | f) ∝ P(d, f) = Π_i P(f_i | d) Π_j P(d_j)

Negative findings are factorized: P(f_i = 0 | d) = exp( -θ_i0 - Σ_{j ∈ pa_i} θ_ij d_j ) splits into a product of per-disease factors.
Positive findings are not: P(f_i = 1 | d) = 1 - exp( -θ_i0 - Σ_{j ∈ pa_i} θ_ij d_j ).
Positive evidence "couples" the disease nodes.

Inference: P(d_1 | f) ∝ Σ_{d_2, ..., d_k} P(d, f).

Page 71: A Tutorial on Inference and Learning

Variational approach to QMR-DT (Jaakkola and Jordan, 1997)

f(x) = ln(1 - e^{-x}) is concave and has the dual f*(λ) = -λ ln λ + (λ + 1) ln(λ + 1).

Then P(f_i = 1 | d) = 1 - exp( -θ_i0 - Σ_{j ∈ pa_i} θ_ij d_j ) can be bounded by

P(f_i = 1 | d) <= exp( λ_i (θ_i0 + Σ_{j ∈ pa_i} θ_ij d_j) - f*(λ_i) )
             = exp( λ_i θ_i0 - f*(λ_i) ) Π_{j ∈ pa_i} exp( λ_i θ_ij d_j )

The effect of positive evidence is now factorized (the diseases are "decoupled").

Page 72: A Tutorial on Inference and Learning

Variational approximations

Bounds on the local CPDs yield a bound on the posterior.

Two approaches: sequential and block.
  Sequential: applies the variational transformation to (a subset of) nodes sequentially during inference, using a heuristic node ordering; then optimizes over the variational parameters.
  Block: selects in advance the nodes to be transformed, then selects the variational parameters minimizing the KL-distance between the true and approximate posteriors.

Page 73: A Tutorial on Inference and Learning

Block approach

P(Y | E): the exact posterior of Y given evidence E.
Q(Y | E, λ): the approximation after replacing some CPDs with their variational bounds.

Find λ* = argmin_λ D(Q || P), where D(Q || P) is the Kullback-Leibler (KL) distance:

D(Q || P) = Σ_S Q(S) log [ Q(S) / P(S) ]

Page 74: A Tutorial on Inference and Learning

Inference in BN: summary

Exact inference is often intractable => need approximations.
Approximation principles:
  approximating elimination: local inference, bounding the size of dependencies among variables (cliques in a problem's graph); mini-buckets, IBP
  other approximations: stochastic simulations, variational techniques, etc.
Further research:
  combining "orthogonal" approximation approaches
  better understanding of "what works well where": which approximation suits which problem structure
  other approximation paradigms (e.g., other ways of approximating probabilities, constraints, cost functions)

Page 75: A Tutorial on Inference and Learning

"Road map"

Introduction: Bayesian networks
Probabilistic inference
  Exact inference
  Approximate inference
Learning Bayesian Networks
  Learning parameters
  Learning graph structure
Summary

Page 76: A Tutorial on Inference and Learning

Why learn Bayesian networks?

Combining domain expert knowledge with data.
Efficient representation and inference.
Incremental learning: the current network serves as the prior P(H) for new data.
Learning causal relationships (e.g., does smoking S cause cancer C?).
Handling missing data, e.g. records such as <1.3 2.8 ?? 0 1> and <?? 5.6 0 10 ??>.

Page 77: A Tutorial on Inference and Learning

Learning Bayesian Networks

Known graph (e.g., the structure over S, C, B, X, D with CPDs P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)): learn parameters.
  Complete data: parameter estimation (ML, MAP).
  Incomplete data: non-linear parametric optimization (gradient descent, EM).

Unknown graph: learn both graph and parameters, Ĝ = argmax_G Score(G).
  Complete data: optimization (search in the space of graphs).
  Incomplete data: EM plus Multiple Imputation, structural EM, mixture models.

Page 78: A Tutorial on Inference and Learning

Learning Parameters: complete data

ML estimate:   max_Θ log P(D | Θ)   (decomposable!)

For a multinomial CPD P(X = x | pa_X) with counts N_{x, pa_X}:

  θ_{x, pa_X} (ML) = N_{x, pa_X} / N_{pa_X}

MAP estimate (Bayesian statistics):   max_Θ log P(D | Θ) P(Θ)

Conjugate priors: Dirichlet, Dir(θ_{pa_X} | α_{1, pa_X}, ..., α_{m, pa_X}):

  θ_{x, pa_X} (MAP) = (N_{x, pa_X} + α_{x, pa_X}) / (N_{pa_X} + α_{pa_X})

Equivalent sample size (prior knowledge).
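A minimal Python sketch of these estimates (the tiny dataset is made up, and only a single parent is handled, for brevity): the ML estimate is a normalized count, and the MAP estimate simply adds Dirichlet pseudo-counts alpha.

```python
from collections import Counter

def estimate_cpt(data, child, parent, values=(0, 1), alpha=0.0):
    """Return P(child | parent) as {parent_value: {child_value: prob}}; alpha=0 gives ML."""
    counts = Counter((row[parent], row[child]) for row in data)
    cpt = {}
    for pa in values:
        total = sum(counts[(pa, x)] + alpha for x in values)
        cpt[pa] = {x: (counts[(pa, x)] + alpha) / total for x in values}
    return cpt

data = [{'S': 1, 'B': 1}, {'S': 1, 'B': 0}, {'S': 1, 'B': 1},
        {'S': 0, 'B': 0}, {'S': 0, 'B': 0}, {'S': 0, 'B': 1}]
print(estimate_cpt(data, 'B', 'S'))               # ML:  P(B=1|S=1) = 2/3, P(B=1|S=0) = 1/3
print(estimate_cpt(data, 'B', 'S', alpha=1.0))    # MAP with Dirichlet(1, 1) pseudo-counts
```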

Page 79: A Tutorial on Inference and Learning

Learning graph structure

Find Ĝ = argmax_G Score(G): an NP-hard optimization problem.

Heuristic search over structures, using local moves such as Add S->B, Delete S->B, Reverse S->B (see the search sketch below).
  Complete data: local computations (decomposable scores).
  Incomplete data (score non-decomposable): stochastic methods.

Constraint-based methods: the data impose independence relations (constraints).
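A greedy hill-climbing search over structures could look like the following Python sketch (an assumed setup, not the method of any particular paper): neighbors are generated by single-edge additions, deletions and reversals, acyclicity is checked, and `score` is any structure score where higher is better (e.g., a log Bayesian score, or minus the MDL score sketched on a later slide).

```python
from itertools import permutations

# A structure is a dict node -> list of parents.
def is_acyclic(structure):
    color = {}
    def dfs(n):
        color[n] = 'gray'
        for p in structure[n]:
            if color.get(p) == 'gray':
                return False                    # back edge => cycle
            if color.get(p) is None and not dfs(p):
                return False
        color[n] = 'black'
        return True
    return all(dfs(n) for n in structure if n not in color)

def neighbors(structure):
    """All structures reachable by one edge addition, deletion or reversal."""
    nodes = list(structure)
    for a, b in permutations(nodes, 2):
        s = {n: list(ps) for n, ps in structure.items()}
        if a in s[b]:
            s[b].remove(a)                      # delete a -> b
            yield s
            s2 = {n: list(ps) for n, ps in s.items()}
            s2[a].append(b)                     # reverse: add b -> a instead
            yield s2
        else:
            s[b].append(a)                      # add a -> b
            yield s

def hill_climb(data, nodes, score):
    current = {n: [] for n in nodes}            # start from the empty graph
    current_score = score(data, current)
    while True:
        best, best_score = None, current_score
        for cand in neighbors(current):
            if not is_acyclic(cand):
                continue
            s = score(data, cand)
            if s > best_score:
                best, best_score = cand, s
        if best is None:
            return current, current_score       # local maximum reached
        current, current_score = best, best_score

# Usage sketch: hill_climb(data, ['S', 'C', 'B'], score=some_structure_score),
# where some_structure_score(data, structure) returns, e.g., a log Bayesian score.
```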

Page 80: A Tutorial on Inference and Learning

Learning BNs: incomplete data

Learning parameters:
  EM algorithm [Lauritzen, 95]
  Gibbs sampling [Heckerman, 96]
  Gradient descent [Russell et al., 96]

Learning both structure and parameters:
  Sum over missing values [Cooper & Herskovits, 92; Cooper, 95]
  Monte-Carlo approaches [Heckerman, 96]
  Gaussian approximation [Heckerman, 96]
  Structural EM [Friedman, 98]
  EM and Multiple Imputation [Singh 97, 98, 00]

Page 81: A Tutorial on Inference and Learning

Learning Parameters: incomplete data

Non-decomposable marginal likelihood (hidden nodes).

EM algorithm: start from initial parameters and iterate until convergence, given the current model (G, Θ).

Expectation: run inference on each incomplete record (e.g., P(S | X=0, D=1, C=0, B=1)) to compute expected counts.
Maximization: update the parameters (ML, MAP) from the expected counts.

Data (with missing values):      Expected counts (completed data):
S X D C B                        S X D C B
? 0 1 0 1                        1 0 1 0 1
1 1 ? 0 1                        1 1 1 0 1
0 0 0 ? ?                        0 0 0 0 0
? ? 0 ? 1                        1 0 0 0 1
.........                        .........

Page 82: A Tutorial on Inference and Learning

Learning Parameters: incomplete data (Lauritzen, 95)

The complete-data log-likelihood is Σ_{ijk} N_ijk log θ_ijk.

E step: compute the expected counts E(N_ijk | Y_obs, Θ).

M step: set θ_ijk = E(N_ijk | Y_obs, Θ) / E(N_ij | Y_obs, Θ).
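As a concrete (and deliberately simplified) illustration of these E and M steps, here is a Python sketch of EM for a tiny network with one hidden node H and two observed children X1, X2, all binary; the data and initial parameters are made up.

```python
# EM sketch for the network H -> X1, H -> X2 (H hidden; X1, X2 observed; all binary).
# E-step: expected counts via the posterior P(H | x1, x2); M-step: ML re-estimation.
data = [(1, 1)] * 30 + [(0, 0)] * 30 + [(1, 0)] * 5 + [(0, 1)] * 5   # observed (x1, x2) pairs

pH = 0.6                                            # initial P(H = 1), made up
pX = {1: {0: 0.3, 1: 0.7}, 2: {0: 0.4, 1: 0.8}}     # initial P(Xk = 1 | H = h), made up

for iteration in range(50):
    # E-step: responsibility r = P(H = 1 | x1, x2) for each case.
    resp = []
    for x1, x2 in data:
        like = {}
        for h in (0, 1):
            p = pH if h == 1 else 1 - pH
            for k, x in ((1, x1), (2, x2)):
                p *= pX[k][h] if x == 1 else 1 - pX[k][h]
            like[h] = p
        resp.append(like[1] / (like[0] + like[1]))
    # M-step: ML update from the expected counts.
    n1 = sum(resp)
    n0 = len(data) - n1
    pH = n1 / len(data)
    for k, idx in ((1, 0), (2, 1)):
        pX[k][1] = sum(r for r, case in zip(resp, data) if case[idx] == 1) / n1
        pX[k][0] = sum(1 - r for r, case in zip(resp, data) if case[idx] == 1) / n0

print(pH, pX)    # converged parameters of the hidden-variable model
```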

Page 83: A Tutorial on Inference and Learning

Learning structure: incomplete data

Depends on the type of missing data - missing independent of anything else (MCAR) OR missing based on values of other variables (MAR)

While MCAR can be resolved by decomposable scores, MAR cannot

For likelihood-based methods, no need to explicitly model missing data mechanism

Very few attempts at MAR: stochastic methods

Page 84: A Tutorial on Inference and Learning

Learning structure: incomplete data

Approximate EM by using Multiple Imputation to yield an efficient Monte-Carlo method [Singh 97, 98, 00]:
  trade-off between performance & quality; the learned network is almost optimal
  approximates the complete-data log-likelihood function using Multiple Imputation
  yields a decomposable score, dependent only on each node & its parents
  converges to a local maximum of the observed-data likelihood

Page 85: A Tutorial on Inference and Learning

Learning structure: incomplete data

Page 86: A Tutorial on Inference and Learning

Scoring functions: Minimum Description Length (MDL)

Learning <=> data compression:

  MDL(BN | D) = -log P(D | G, Θ)  +  (|Θ| / 2) log N
                  DL(Data | model)     DL(Model)

Other scores:
  MDL = -BIC (Bayesian Information Criterion)
  Bayesian score (BDe): asymptotically equivalent to MDL
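A compact Python sketch of an MDL-style score for a candidate structure over complete binary data (the toy data and candidate structures are made up; lower is better):

```python
import math
from collections import Counter

# MDL / BIC-style score for a candidate structure over complete binary data:
#   MDL(BN | D) = -log P(D | G, theta_ML) + (|theta| / 2) * log N      (lower is better)
def mdl_score(data, structure):
    """structure: dict node -> list of parents; data: list of dicts node -> 0/1."""
    N = len(data)
    loglik, n_params = 0.0, 0
    for node, pars in structure.items():
        counts = Counter((tuple(row[p] for p in pars), row[node]) for row in data)
        pa_counts = Counter(tuple(row[p] for p in pars) for row in data)
        for (pa, x), c in counts.items():
            loglik += c * math.log(c / pa_counts[pa])   # ML fit of this family
        n_params += 2 ** len(pars)                      # one free parameter per parent configuration
    return -loglik + (n_params / 2.0) * math.log(N)

data = [{'S': 1, 'B': 1}, {'S': 1, 'B': 1}, {'S': 1, 'B': 0},
        {'S': 0, 'B': 0}, {'S': 0, 'B': 0}, {'S': 0, 'B': 1}] * 10
print(mdl_score(data, {'S': [], 'B': ['S']}))   # with the edge S -> B
print(mdl_score(data, {'S': [], 'B': []}))      # empty graph: fewer parameters, worse fit
```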

Page 87: A Tutorial on Inference and Learning

Learning Structure plus Parameters

p(Y | D) = Σ_M p(Y | M, D) p(M | D)

The number of models is super-exponential.

Alternatives: model selection or model averaging.

Page 88: A Tutorial on Inference and Learning

Model Selection

Generally, choose a single model M*. This is equivalent to saying P(M* | D) = 1, so that

p(Y | D) ≈ p(Y | M*, D)

The task is now to:
  1) define a metric to decide which model is best
  2) search for that model through the space of all models

Page 89: A Tutorial on Inference and Learning

One Reasonable Score: Posterior Probability of a Structure

p(S^h | D) ∝ p(S^h) p(D | S^h)
           = p(S^h) ∫ p(D | Θ_s, S^h) p(Θ_s | S^h) dΘ_s

(structure prior) x (likelihood, marginalized over the parameter prior)

Page 90: A Tutorial on Inference and Learning

Global and Local Predictive Scores [Spiegelhalter et al 93]

Global score:

log p(D | S^h) = Σ_{l=1}^{m} log p(x_l | x_1, ..., x_{l-1}, S^h)
               = log p(x_1 | S^h) + log p(x_2 | x_1, S^h) + log p(x_3 | x_1, x_2, S^h) + ...

Bayes' factor: p(D | S^h) / p(D | S_0^h)

The local score is useful for diagnostic problems.

Page 91: A Tutorial on Inference and Learning

Local Predictive Score (Spiegelhalter et al., 1993)

pred(S^h) = Σ_{l=1}^{m} log p(y_l | x_l, d_1, ..., d_{l-1}, S^h)

[Figure: a diagnostic network with a disease node Y and symptom nodes X1, X2, ..., Xn.]

Page 92: A Tutorial on Inference and Learning

Exact computation of p(D | S^h) [Cooper & Herskovits, 92]

Assumptions:
  no missing data
  cases are independent, given the model
  uniform priors on parameters
  discrete variables

Then p(D | S^h) = Π_{i=1}^{n} g(i, Pa_i): the score decomposes into a product of local terms, one per family.

Page 93: A Tutorial on Inference and Learning

Bayesian Dirichlet Score (Cooper and Herskovits, 1991)

p(D | S^h) = Π_{i=1}^{n} Π_{j=1}^{q_i} [ Γ(N'_ij) / Γ(N'_ij + N_ij) ] Π_{k=1}^{r_i} [ Γ(N'_ijk + N_ijk) / Γ(N'_ijk) ]

where
  N_ijk : the number of cases where X_i = x_i^k and Pa_i = pa_i^j
  r_i   : the number of states of X_i
  q_i   : the number of instances of the parents of X_i
  N_ij  = Σ_{k=1}^{r_i} N_ijk,   N'_ij = Σ_{k=1}^{r_i} N'_ijk
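The score is usually computed in log form with log-gamma functions; here is a Python sketch with uniform priors N'_ijk = 1 (the K2 special case), on made-up binary data:

```python
import math
from collections import Counter

# log p(D | S) with uniform Dirichlet priors N'_ijk = 1 (all variables binary here):
#   sum over nodes i and parent configurations j of
#   lgamma(N'_ij) - lgamma(N'_ij + N_ij) + sum_k [lgamma(N'_ijk + N_ijk) - lgamma(N'_ijk)]
def log_bd_score(data, structure, states=(0, 1)):
    score = 0.0
    for node, pars in structure.items():
        counts = Counter((tuple(row[p] for p in pars), row[node]) for row in data)
        pa_configs = set(tuple(row[p] for p in pars) for row in data)
        for pa in pa_configs:
            n_ijk = [counts[(pa, x)] for x in states]
            prior = [1.0] * len(states)              # uniform Dirichlet pseudo-counts
            score += math.lgamma(sum(prior)) - math.lgamma(sum(prior) + sum(n_ijk))
            score += sum(math.lgamma(a + n) - math.lgamma(a)
                         for a, n in zip(prior, n_ijk))
    return score

data = [{'S': 1, 'B': 1}, {'S': 1, 'B': 1}, {'S': 1, 'B': 0},
        {'S': 0, 'B': 0}, {'S': 0, 'B': 0}, {'S': 0, 'B': 1}] * 10
print(log_bd_score(data, {'S': [], 'B': ['S']}))   # structure with the edge S -> B
print(log_bd_score(data, {'S': [], 'B': []}))      # empty structure, for comparison
```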

Page 94: A Tutorial on Inference and Learning

Learning BNs without specifying an ordering [Singh & Valtorta, 95]

There are n! orderings, and the ordering greatly affects the quality of the network learned.

Use conditional independence tests and d-separation to get an ordering.

Page 95: A Tutorial on Inference and Learning

Learning BNs via the MDL principle [Lam & Bacchus, 93]

Idea: the best model is the one that gives the most compact representation of the data.

So, encode the data using the model, plus encode the model itself; minimize the total description length.

Page 96: A Tutorial on Inference and Learning

Learning BNs: summary

Bayesian networks: graphical probabilistic models.
  Efficient representation and inference.
  Expert knowledge + learning from data.
Learning:
  parameters (parameter estimation, EM)
  structure (optimization with score functions, e.g., MDL)
Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TAN-BLT (SRI)).
Future directions: causality, time, model evaluation criteria, approximate inference/learning, on-line learning, etc.