A Tutorial on Inference and Learning in Bayesian Networks
Irina Rish and Moninder Singh
IBM T.J. Watson Research Center, rish,[email protected]
May 10, 2015
“Road map”
Introduction: Bayesian networks
  What are BNs: representation, types, etc.
  Why use BNs: applications (classes) of BNs
  Information sources, software, etc.
Probabilistic inference
  Exact inference
  Approximate inference
Learning Bayesian Networks
  Learning parameters
  Learning graph structure
Summary
Bayesian Networks
BN = (G, Θ): a directed acyclic graph G plus conditional probability distributions (CPDs) Θ
Conditional independencies => efficient representation
P(A, S, T, L, B, C, D) = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)
Example CPD P(D|T,L,B):
  T L B | D=0  D=1
  0 0 0 | 0.1  0.9
  0 0 1 | 0.7  0.3
  0 1 0 | 0.8  0.2
  0 1 1 | 0.9  0.1
  ...
[Figure: the “Asia” network with nodes Visit to Asia (A), Smoking (S), Tuberculosis (T), Lung Cancer (L), Bronchitis (B), Chest X-ray (C), Dyspnoea (D), each annotated with its CPD: P(A), P(S), P(T|A), P(L|S), P(B|S), P(C|T,L), P(D|T,L,B)]
[Lauritzen & Spiegelhalter, 95]
Bayesian Networks
Structured, graphical representation of probabilistic relationships between several random variables
Explicit representation of conditional independencies
Missing arcs encode conditional independence
Efficient representation of the joint pdf
Allows arbitrary queries to be answered:
P(lung cancer=yes | smoking=no, dyspnoea=yes) = ?
Example: Printer Troubleshooting (Microsoft Windows 95)
[Figure: printer-troubleshooting Bayesian network with nodes such as Print Output OK, Correct Driver, Uncorrupted Driver, Correct Printer Path, Net Cable Connected, Net/Local Printing, Printer On and Online, Correct Local Port, Correct Printer Selected, Local Cable Connected, Application Output OK, Print Spooling On, Correct Driver Settings, Printer Memory Adequate, Network Up, Spooled Data OK, GDI Data Input OK, GDI Data Output OK, Print Data OK, PC to Printer Transport OK, Printer Data OK, Spool Process OK, Net Path OK, Local Path OK, Paper Loaded, Local Disk Space Adequate]
[Heckerman, 95]
Example: Microsoft Pregnancy and Child Care
[Heckerman, 95]
Independence Assumptions
Head-to-tail: Visit to Asia -> Tuberculosis -> Chest X-ray
Tail-to-tail: Lung Cancer <- Smoking -> Bronchitis
Head-to-head: Lung Cancer -> Dyspnoea <- Bronchitis
Independence Assumptions
Nodes X and Y are d-connected by nodes in Z along a trail from X to Y if:
  every head-to-head node along the trail is in Z or has a descendant in Z, and
  every other node along the trail is not in Z
Nodes X and Y are d-separated by Z if they are not d-connected by Z along any trail from X to Y
If X and Y are d-separated by Z, then X and Y are conditionally independent given Z
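To make the definition concrete, here is a minimal sketch (not part of the original slides) of a d-separation test using the standard construction: restrict the DAG to X, Y, Z and their ancestors, moralize it (connect co-parents, drop arc directions), delete Z, and check whether X and Y remain connected. The Asia-network encoding and the two queries below are illustrative.

```python
from collections import deque

def ancestors(dag, nodes):
    """Return the given nodes plus all of their ancestors in the DAG."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for parent in dag.get(stack.pop(), []):
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result

def d_separated(dag, X, Y, Z):
    """dag maps each node to the list of its parents.
    Returns True iff X and Y are d-separated given Z."""
    keep = ancestors(dag, {X, Y} | set(Z))
    # Moralize: undirected parent-child edges plus edges between co-parents.
    adj = {v: set() for v in keep}
    for child in keep:
        parents = [p for p in dag.get(child, []) if p in keep]
        for p in parents:
            adj[child].add(p); adj[p].add(child)
        for i in range(len(parents)):
            for j in range(i + 1, len(parents)):
                adj[parents[i]].add(parents[j]); adj[parents[j]].add(parents[i])
    # Remove the conditioning set and test reachability from X to Y.
    blocked = set(Z)
    frontier, seen = deque([X]), {X}
    while frontier:
        v = frontier.popleft()
        if v == Y:
            return False            # still connected => d-connected
        for u in adj[v]:
            if u not in seen and u not in blocked:
                seen.add(u); frontier.append(u)
    return True

# The "Asia" network: node -> parents (A = Visit to Asia, S = Smoking, etc.)
asia = {"A": [], "S": [], "T": ["A"], "L": ["S"], "B": ["S"],
        "C": ["T", "L"], "D": ["T", "L", "B"]}
print(d_separated(asia, "L", "B", ["S"]))        # True: the tail-to-tail path is blocked
print(d_separated(asia, "L", "B", ["S", "D"]))   # False: conditioning on D opens the head-to-head
```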
Independence Assumptions
A variable (node) is conditionally independent of its non-descendants given its parents
[Figure: the Asia network (Visit to Asia, Smoking, Tuberculosis, Lung Cancer, Bronchitis, Chest X-ray, Dyspnoea)]
Independence Assumptions
[Figure: network with nodes Age, Gender, Exposure to Toxins, Smoking, Diet, Cancer, Serum Calcium, Lung Tumor]
Cancer is independent of Diet given Exposure to Toxins and Smoking
[Breese & Koller, 97]
Independence Assumptions
What this means is that the joint pdf can be represented as a product of local distributions:
P(A,S,T,L,B,C,D) = P(A) P(S|A) P(T|A,S) P(L|A,S,T) P(B|A,S,T,L) P(C|A,S,T,L,B) P(D|A,S,T,L,B,C)
                 = P(A) P(S) P(T|A) P(L|S) P(B|S) P(C|T,L) P(D|T,L,B)
[Figure: the Asia network]
Independence Assumptions
Thus, the general product rule for Bayesian networks is
P(X1, X2, …, Xn) = Π_{i=1}^{n} P(Xi | Pa(Xi))
where Pa(Xi) is the set of parents of Xi
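As an illustration of the product rule, a minimal sketch (not from the slides) that evaluates a joint probability as the product of local CPD entries, on a simplified fragment of the example network (S, L, B, D) with made-up numbers:

```python
# Each node: (list of parents, CPD given as a dict mapping
# (value, tuple of parent values) -> probability). Numbers are illustrative only.
network = {
    "S": ([], {(1, ()): 0.3, (0, ()): 0.7}),
    "L": (["S"], {(1, (1,)): 0.1, (0, (1,)): 0.9,
                  (1, (0,)): 0.01, (0, (0,)): 0.99}),
    "B": (["S"], {(1, (1,)): 0.4, (0, (1,)): 0.6,
                  (1, (0,)): 0.2, (0, (0,)): 0.8}),
    "D": (["L", "B"], {(1, (1, 1)): 0.9, (0, (1, 1)): 0.1,
                       (1, (1, 0)): 0.8, (0, (1, 0)): 0.2,
                       (1, (0, 1)): 0.7, (0, (0, 1)): 0.3,
                       (1, (0, 0)): 0.1, (0, (0, 0)): 0.9}),
}

def joint_probability(network, assignment):
    """P(x1,...,xn) = product over i of P(xi | Pa(xi))."""
    prob = 1.0
    for node, (parents, cpd) in network.items():
        parent_values = tuple(assignment[p] for p in parents)
        prob *= cpd[(assignment[node], parent_values)]
    return prob

# P(S=0, L=0, B=1, D=1) = P(S=0) P(L=0|S=0) P(B=1|S=0) P(D=1|L=0,B=1)
print(joint_probability(network, {"S": 0, "L": 0, "B": 1, "D": 1}))
# 0.7 * 0.99 * 0.2 * 0.7 = 0.09702
```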
The Knowledge Acquisition Task
Variables:
  collectively exhaustive, mutually exclusive values
  clarity test: a value should be knowable in principle
Structure:
  if data are available, can be learned
  constructed by hand (using “expert” knowledge)
  variable ordering matters: causal knowledge usually simplifies construction
Probabilities:
  can be learned from data
  the second decimal usually does not matter; relative probabilities do
  sensitivity analysis
The Knowledge Acquisition Task
[Figure: two networks over Battery, TurnOver, Start, Fuel, Gauge built with different variable orderings]
Variable order is important
Causal knowledge simplifies construction
The Knowledge Acquisition Task
Naive Bayesian Classifiers [Duda & Hart; Langley 92]
Selective Naive Bayesian Classifiers [Langley & Sage 94]
Conditional Trees [Geiger 92; Friedman et al 97]
Selective Bayesian Networks [Singh & Provan, 95; 96]
What are BNs useful for?
Diagnosis: P(cause | symptom) = ?
Prediction: P(symptom | cause) = ?
Classification: max_class P(class | data)
Decision-making (given a cost function)
Data mining: induce the best model from the data
Application areas: medicine, bio-informatics, computer troubleshooting, stock market, text classification, speech recognition
What are BNs useful for?
[Figure: predictive inference (cause -> effect) and diagnostic reasoning (effect -> cause); decision making by maximum expected utility, combining known predisposing factors, unknown but important variables, imperfect observations, a decision node and a value node]
[Figure: troubleshooting loop: salient observations -> assignment of belief over Fault 1, Fault 2, Fault 3, ... -> act now or halt? If not, choose the next best observation (value of information) and repeat; expected utility of “Do nothing”, “Action 1”, “Action 2” plotted against the probability of fault “i”]
Why use BNs?
Explicit management of uncertainty
Modularity implies maintainability
Better, flexible and robust decision making - MEU, VOI
Can be used to answer arbitrary queries - multiple-fault problems
Easy to incorporate prior knowledge
Easy to understand
Application Examples
Intellipath: commercial version of Pathfinder; lymph-node diseases (60), 100 findings
APRI system developed at AT&T Bell Labs: learns and uses Bayesian networks from data to identify customers liable to default on bill payments
NASA Vista system: predicts failures in propulsion systems, considers time criticality and suggests the highest-utility action, and dynamically decides what information to show
Application Examples
Answer Wizard in MS Office 95 / MS Project: Bayesian-network-based free-text help facility; uses naive Bayesian classifiers
Office Assistant in MS Office 97: extension of the Answer Wizard; uses naive Bayesian networks; help is based on past experience (keyboard/mouse use) and the task the user is doing currently; this is the “smiley face” you get in your MS Office applications
Application Examples
Microsoft Pregnancy and Child-Care: available on MSN in the Health section
Frequently occurring children’s symptoms are linked to expert modules that repeatedly ask parents relevant questions
Asks the next best question based on the information provided
Presents articles that are deemed relevant based on the information provided
Application Examples
Printer troubleshooting: HP bought a 40% stake in HUGIN and is developing printer troubleshooters for HP printers
Microsoft has 70+ online troubleshooters on their web site; these use Bayesian networks with multiple-fault models and incorporate utilities
Fax machine troubleshooting: Ricoh uses Bayesian-network-based troubleshooters at call centers; this enabled Ricoh to answer twice the number of calls in half the time
Online/Print Resources on BNs
Conferences & journals: UAI, ICML, AAAI, AISTAT, KDD; MLJ, DM&KD, JAIR, IEEE KDD, IJAR, IEEE PAMI
Books and papers:
  Bayesian Networks without Tears by Eugene Charniak. AI Magazine, Winter 1991.
  Probabilistic Reasoning in Intelligent Systems by Judea Pearl. Morgan Kaufmann, 1988.
  Probabilistic Reasoning in Expert Systems by Richard Neapolitan. Wiley, 1990.
  CACM special issue on real-world applications of BNs, March 1995
Online/Print Resources on BNs
Wealth of online information at www.auai.org
Links to:
  Electronic proceedings of the UAI conferences
  Other sites with information on BNs and reasoning under uncertainty
  Several tutorials and important articles
  Research groups & companies working in this area
  Other societies, mailing lists and conferences
Publicly available software for BNs
List of BN software maintained by Russell Almond at bayes.stat.washington.edu/almond/belief.html
Several free packages: generally for research only
Commercial packages: the most powerful (and expensive) is HUGIN; others include Netica and Dxpress
We are working on developing a Java-based BN toolkit here at Watson; it will also work within ABLE
“Road map”
Introduction: Bayesian networks
  What are BNs: representation, types, etc.
  Why use BNs: applications (classes) of BNs
  Information sources, software, etc.
Probabilistic inference
  Exact inference
  Approximate inference
Learning Bayesian Networks
  Learning parameters
  Learning graph structure
Summary
Probabilistic Inference Tasks
Belief updating:
  BEL(Xi) = P(Xi = xi | evidence)
Finding the most probable explanation (MPE):
  x* = argmax_x P(x, e)
Finding the maximum a-posteriori (MAP) hypothesis:
  (a1*, ..., ak*) = argmax_a Σ_{X\A} P(x, e),  where A ⊆ X are the hypothesis variables
Finding the maximum-expected-utility (MEU) decision:
  (d1*, ..., dk*) = argmax_d Σ_{X\D} P(x, e) U(x),  where D ⊆ X are the decision variables and U(x) is the utility function
Belief Updating
[Figure: network with Smoking, Lung Cancer, Bronchitis, X-ray, Dyspnoea]
P(lung cancer=yes | smoking=no, dyspnoea=yes) = ?
Belief updating: P(X | evidence) = ?
Example (“moral” graph over A, B, C, D, E):
P(a | e=0) ∝ P(a, e=0) = Σ_{b,c,d, e=0} P(a) P(b|a) P(c|a) P(d|b,a) P(e|b,c)
           = P(a) Σ_{e=0} Σ_d Σ_c P(c|a) Σ_b P(b|a) P(d|b,a) P(e|b,c)
           = P(a) Σ_{e=0} Σ_d Σ_c P(c|a) h^B(a, d, c, e)
Variable Elimination: sum variables out one at a time, recording the intermediate functions h
Bucket elimination: algorithm elim-bel (Dechter 1996)
bucket B:  P(b|a), P(d|b,a), P(e|b,c)   -- eliminate b (elimination operator Σ_b), producing h^B(a, d, c, e)
bucket C:  P(c|a), h^B(a, d, c, e)      -- producing h^C(a, d, e)
bucket D:  h^C(a, d, e)                 -- producing h^D(a, e)
bucket E:  e=0, h^D(a, e)               -- producing h^E(a)
bucket A:  P(a), h^E(a)                 -- yields P(a | e=0)
W* = 4, the “induced width” (max clique size)
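For concreteness, a minimal sketch (not part of the slides) of summation-based variable elimination on the five-variable example above, with made-up CPDs; it computes P(A, e=0) by summing out B, C, D, E and then normalizes to obtain P(A | e=0).

```python
from itertools import product

# A factor is (variables, table); table maps tuples of 0/1 values to numbers.
def make_factor(variables, probs):
    return (variables, dict(zip(product([0, 1], repeat=len(variables)), probs)))

def multiply(f, g):
    fv, ft = f; gv, gt = g
    var = list(dict.fromkeys(fv + gv))               # ordered union of scopes
    tab = {}
    for vals in product([0, 1], repeat=len(var)):
        a = dict(zip(var, vals))
        tab[vals] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return var, tab

def sum_out(f, x):
    fv, ft = f
    var = [v for v in fv if v != x]
    tab = {}
    for vals, p in ft.items():
        key = tuple(v for v, n in zip(vals, fv) if n != x)
        tab[key] = tab.get(key, 0.0) + p
    return var, tab

def eliminate(factors, order):
    """Sum out the variables in 'order'; return the product of what remains."""
    for x in order:
        bucket = [f for f in factors if x in f[0]]    # the bucket of x
        rest = [f for f in factors if x not in f[0]]
        if not bucket:
            continue
        prod = bucket[0]
        for f in bucket[1:]:
            prod = multiply(prod, f)
        factors = rest + [sum_out(prod, x)]           # record h and pass it down
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# Network A -> B, A -> C, {A,B} -> D, {B,C} -> E with illustrative CPDs.
pa   = make_factor(["A"], [0.4, 0.6])                          # P(a=0), P(a=1)
pba  = make_factor(["B", "A"], [0.7, 0.2, 0.3, 0.8])           # P(b|a)
pca  = make_factor(["C", "A"], [0.9, 0.5, 0.1, 0.5])           # P(c|a)
pdba = make_factor(["D", "B", "A"], [0.6, 0.1, 0.3, 0.4, 0.4, 0.9, 0.7, 0.6])
pebc = make_factor(["E", "B", "C"], [0.8, 0.3, 0.5, 0.2, 0.2, 0.7, 0.5, 0.8])

# Clamp the evidence e=0 with an indicator factor, then eliminate B, C, D, E.
evidence = (["E"], {(0,): 1.0, (1,): 0.0})
joint_a = eliminate([pa, pba, pca, pdba, pebc, evidence], ["B", "C", "D", "E"])
z = sum(joint_a[1].values())
print({a[0]: p / z for a, p in joint_a[1].items()})            # P(A | e=0)
```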
Finding the MPE: algorithm elim-mpe (Dechter 1996)
MPE = max_x P(x)
Σ is replaced by max:  MPE = max_{a,e,d,c,b} P(a) P(c|a) P(b|a) P(d|b,a) P(e|b,c)
bucket B:  P(b|a), P(d|b,a), P(e|b,c)   -- eliminate b (elimination operator max_b), producing h^B(a, d, c, e)
bucket C:  P(c|a), h^B(a, d, c, e)      -- producing h^C(a, d, e)
bucket D:  h^C(a, d, e)                 -- producing h^D(a, e)
bucket E:  e=0, h^D(a, e)               -- producing h^E(a)
bucket A:  P(a), h^E(a)                 -- yields the MPE value
W* = 4, the “induced width” (max clique size)
Generating the MPE-tuple
bucket B:  P(b|a), P(d|b,a), P(e|b,c)
bucket C:  P(c|a), h^B(a, d, c, e)
bucket D:  h^C(a, d, e)
bucket E:  e=0, h^D(a, e)
bucket A:  P(a), h^E(a)
1. a' = argmax_a P(a) · h^E(a)
2. e' = 0
3. d' = argmax_d h^C(a', d, e')
4. c' = argmax_c P(c|a') · h^B(a', d', c, e')
5. b' = argmax_b P(b|a') P(d'|b, a') P(e'|b, c')
Return (a', b', c', d', e')
Complexity of inference
O(n · exp(w*(d))), where w*(d) is the induced width of the moral graph along ordering d
The effect of the ordering: w*(d1) = 4, w*(d2) = 2
[Figure: the “moral” graph over A, B, C, D, E and two elimination orderings, shown as node stacks B, C, D, E, A and E, D, C, B, A]
Other tasks and algorithms
MAP and MEU tasks: similar bucket-elimination algorithms, elim-map and elim-meu (Dechter 1996)
Elimination operation: either summation or maximization
Restriction on the variable ordering: summation must precede maximization (i.e., hypothesis or decision variables are eliminated last)
Other inference algorithms: join-tree clustering, Pearl’s poly-tree propagation, conditioning, etc.
Relationship with join-tree clustering
Ordering: A, B, C, D, E
bucket(E):  P(e|b,c)
bucket(D):  P(d|a,b)
bucket(C):  P(c|a) || h^E(b,c)
bucket(B):  P(b|a) || h^D(a,b), h^C(a,b)
bucket(A):  P(a) || h^B(a)
The bucket scopes correspond to the join-tree clusters ABC, BCE, ADB
A cluster is a set of buckets (a “super-bucket”)
Relationship with Pearl’s belief propagation in poly-trees
Pearl’s belief propagation for a single-root query
[Figure: poly-tree with parents U1, U2, U3 of X1, children Z1, Z2, Z3 of the U’s, and child Y1 of X1; “causal support” messages π(x1) flow along the arcs and “diagnostic support” messages λ_Z1(u1), λ_Z2(u2), λ_Z3(u3), λ_Y1(x1) flow against them, e.g. using P(z1|u1)]
Equivalent to elim-bel using a topological ordering and super-buckets for families
Elim-bel, elim-mpe, and elim-map are linear for poly-trees.
“Road map”
Introduction: Bayesian networks
Probabilistic inference
  Exact inference
  Approximate inference
Learning Bayesian Networks
  Learning parameters
  Learning graph structure
Summary
Inference is NP-hard => approximations
Exact inference: O(n · exp(w*))
Approximations:
  Local inference
  Stochastic simulations
  Variational approximations
  etc.
[Figure: a small network over S, X, D, B, C illustrating the local inference idea]
Local Inference Idea
Bucket-elimination approximation: “mini-buckets”
Local inference idea: bound the size of recorded dependencies
Computation in a bucket is time and space exponential in the number of variables involved
Therefore, partition the functions in a bucket into “mini-buckets” over smaller numbers of variables
Mini-bucket approximation: MPE task
Split a bucket into mini-buckets => bound complexity
Exponential complexity decrease: O(e^n) -> O(e^r) + O(e^(n-r)), replacing one function over n variables by functions h and g over r and n-r variables
Approx-mpe(i)
Input: i - max number of variables allowed in a mini-bucket
Output: [lower bound (P of a sub-optimal solution), upper bound]
Example: approx-mpe(3) versus elim-mpe
[Figure: the mini-bucket partition has induced width w* = 2, versus w* = 4 for the exact buckets]
Properties of approx-mpe(i)
Complexity: O(exp(2i)) time and O(exp(i)) space.
Accuracy: determined by the upper/lower (U/L) bound.
As i increases, both accuracy and complexity increase.
Possible uses of mini-bucket approximations:
  As anytime algorithms (Dechter and Rish, 1997)
  As heuristics in best-first search (Kask and Dechter, 1999)
Other tasks: similar mini-bucket approximations for belief updating, MAP and MEU (Dechter and Rish, 1997)
Anytime Approximation
anytime-mpe(ε):
Initialize: i = 0
While time and space resources are available:
  i = i + 1  (step)
  U = upper bound computed by approx-mpe(i)
  L = lower bound computed by approx-mpe(i)
  keep the best solution found so far
  if U/L ≤ 1 + ε, return the solution
end
Return the largest L and the smallest U found so far
Empirical Evaluation (Dechter and Rish, 1997; Rish, 1999)
Benchmarks:
  Randomly generated networks: uniform random probabilities, random noisy-OR
  CPCS networks
  Probabilistic decoding
Comparing approx-mpe and anytime-mpe versus elim-mpe
Random networks, uniform random: 60 nodes, 90 edges (200 instances)
  In 80% of cases, 10-100 times speed-up while U/L < 2
Noisy-OR: even better results
  Exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.
  Noisy-OR CPD: P(x = 0 | y1, ..., yn) = Π_i q_i^(y_i), where the q_i are random noise parameters
[Figure: anytime-mpe(0.0001) U/L error vs. time (sec, log scale) and parameter i (i = 1 to 21), on cpcs422b and cpcs360b]
CPCS networks - medical diagnosis (noisy-OR model)
Test case: no evidence
Algorithm             Time (sec), cpcs360   Time (sec), cpcs422
anytime-mpe(10^-4)    70.3                  505.2
anytime-mpe(10^-1)    70.3                  110.5
elim-mpe              115.8                 1697.6
[Figure: histograms of log(U/L) for i=10 on 1000 instances of random evidence and on 1000 instances of likely evidence]
Effect of evidence
Likely evidence versus random (unlikely) evidence
More likely evidence => higher MPE => higher accuracy (why?)
Probabilistic decoding
Error-correcting linear block codes
State of the art: an approximate algorithm, iterative belief propagation (IBP) (Pearl’s poly-tree algorithm applied to loopy networks)
approx-mpe vs. IBP (bit error rate (BER) as a function of noise sigma):
  IBP is better on randomly generated (high-w*) codes
  approx-mpe is better on low-w* codes
Mini-buckets: summary
Mini-buckets: a local inference approximation
Idea: bound the size of the recorded functions
Approx-mpe(i): a mini-bucket algorithm for MPE
  Better results for noisy-OR than for random problems
  Accuracy increases with decreasing noise
  Accuracy increases for likely evidence
  Sparser graphs -> higher accuracy
  Coding networks: approx-mpe outperforms IBP on low-induced-width codes
“Road map”
Introduction: Bayesian networks
Probabilistic inference
  Exact inference
  Approximate inference
    Local inference
    Stochastic simulations
    Variational approximations
Learning Bayesian Networks
Summary
Approximation via Sampling
1. Generate samples (s_1, ..., s_N) from P(X), where s_i = (x_1^i, x_2^i, ..., x_n^i)
2. Estimate probabilities by frequencies: P(Y = y) ≈ (# samples with Y = y) / N
3. How to handle evidence E?
   - acceptance-rejection (e.g., forward sampling)
   - “clamping” evidence nodes to their values:
     * likelihood weighting
     * Gibbs sampling (MCMC)
Forward Sampling (logic sampling (Henrion, 1988))
Input: a Bayesian network, evidence E, N = # of samples, an ancestral ordering (X_1, ..., X_n) of the nodes
Output: samples consistent with E
1. For sample = 1 to N:
2.   For i = 1 to n:
3.     Sample a value x_i for X_i from P(x_i | pa_i)
4.     If X_i ∈ E and x_i is inconsistent with the evidence, reject the sample
5.       and go back to steps 1 and 2 (start a new sample)
Forward sampling (example)
[Figure: network X1 -> X2, X1 -> X3, {X2, X3} -> X4, with CPDs P(x1), P(x2|x1), P(x3|x1), P(x4|x2,x3)]
Evidence: x3 = 0
// generate sample k
1. Sample x1 from P(x1)
2. Sample x2 from P(x2 | x1)
3. Sample x3 from P(x3 | x1)
4. If x3 contradicts the evidence, reject the sample and start again from 1; otherwise
5. Sample x4 from P(x4 | x2, x3)
Drawback: high rejection rate!
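A minimal sketch (binary variables, made-up CPDs for the four-node example above; not part of the slides) of forward sampling with rejection:

```python
import random

# CPDs for the 4-node example: X1 -> X2, X1 -> X3, (X2, X3) -> X4. Numbers are illustrative.
p_x1 = 0.6                                     # P(X1 = 1)
p_x2 = {0: 0.3, 1: 0.8}                        # P(X2 = 1 | X1)
p_x3 = {0: 0.1, 1: 0.5}                        # P(X3 = 1 | X1)
p_x4 = {(0, 0): 0.2, (0, 1): 0.6,              # P(X4 = 1 | X2, X3)
        (1, 0): 0.7, (1, 1): 0.9}

def bernoulli(p):
    return 1 if random.random() < p else 0

def forward_sample():
    """Sample all nodes in ancestral order X1, X2, X3, X4."""
    x1 = bernoulli(p_x1)
    x2 = bernoulli(p_x2[x1])
    x3 = bernoulli(p_x3[x1])
    x4 = bernoulli(p_x4[(x2, x3)])
    return {"X1": x1, "X2": x2, "X3": x3, "X4": x4}

def rejection_estimate(query_var, evidence, n_samples=100_000):
    """Estimate P(query_var = 1 | evidence) by forward sampling with rejection."""
    kept = hits = 0
    for _ in range(n_samples):
        s = forward_sample()
        if any(s[v] != val for v, val in evidence.items()):
            continue                            # reject samples violating the evidence
        kept += 1
        hits += s[query_var]
    return hits / kept if kept else float("nan")

random.seed(0)
print(rejection_estimate("X1", {"X3": 0}))      # approx P(X1=1 | X3=0)
```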
Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990)
“Clamping” evidence + forward sampling + weighting samples by the evidence likelihood
1. For each X_i ∈ E, assign x_i = e_i
2. Find an ancestral ordering (X_1, ..., X_n) of the nodes
3. For sample = 1 to N:
4.   For each X_i ∉ E:
5.     Sample x_i from P(x_i | pa_i)
6.   score(sample) = Π_{X_i ∈ E} P(e_i | pa_i)
7. Normalize the scores; then P(Y = y | e) ≈ Σ_{samples s with Y = y} score(s) / Σ_s score(s)
Works well for likely evidence!
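Continuing the same illustrative example (this reuses the CPDs p_x1..p_x4 and the bernoulli helper from the forward-sampling sketch above), a minimal likelihood-weighting sketch: evidence nodes are clamped and each sample is weighted by the probability of the evidence given its parents, so no samples are rejected.

```python
def likelihood_weighting(query_var, evidence, n_samples=100_000):
    """Estimate P(query_var = 1 | evidence) by likelihood weighting."""
    total = hit = 0.0
    for _ in range(n_samples):
        sample, weight = dict(evidence), 1.0
        # Walk the nodes in ancestral order X1, X2, X3, X4.
        for var, cond in (("X1", lambda s: p_x1),
                          ("X2", lambda s: p_x2[s["X1"]]),
                          ("X3", lambda s: p_x3[s["X1"]]),
                          ("X4", lambda s: p_x4[(s["X2"], s["X3"])])):
            p1 = cond(sample)                   # P(var = 1 | parents in this sample)
            if var in evidence:
                weight *= p1 if evidence[var] == 1 else 1.0 - p1
            else:
                sample[var] = bernoulli(p1)
        total += weight
        hit += weight * sample[query_var]
    return hit / total

print(likelihood_weighting("X1", {"X3": 0}))    # approx P(X1=1 | X3=0), no rejections
```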
Gibbs Sampling (Geman and Geman, 1984)
Markov Chain Monte Carlo (MCMC): create a Markov chain of samples
1. For each X_i ∈ E, assign x_i = e_i
2. For each X_i ∉ E, assign a random value x_i
3. For sample = 1 to N:
4.   For each X_i ∉ E:
5.     Sample x_i from P(x_i | X \ {X_i})
Advantage: guaranteed to converge to P(X)
Disadvantage: convergence may be slow
Gibbs Sampling (cont’d) (Pearl, 1988)
Important: P(x_i | X \ {X_i}) is computed locally:
P(x_i | X \ {X_i}) ∝ P(x_i | pa_i) · Π_{X_j ∈ ch(X_i)} P(x_j | pa_j)
Markov blanket: M(X_i) = pa_i ∪ ch(X_i) ∪ { parents of ch(X_i) }
Given its Markov blanket (parents, children, and their parents), X_i is independent of all other nodes
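A minimal Gibbs-sampling sketch for the same four-node example (again reusing p_x1..p_x4 and bernoulli from the forward-sampling sketch; not part of the slides). Each conditional is obtained by evaluating the joint for both values of the resampled variable and normalizing, which equals the Markov-blanket computation since factors outside the blanket cancel.

```python
def gibbs_conditional(var, state):
    """P(var = 1 | all other variables), by evaluating the joint for var = 0, 1
    and normalizing (factors outside var's Markov blanket cancel)."""
    weights = []
    for v in (0, 1):
        s = dict(state, **{var: v})
        w = p_x1 if s["X1"] else 1 - p_x1
        w *= p_x2[s["X1"]] if s["X2"] else 1 - p_x2[s["X1"]]
        w *= p_x3[s["X1"]] if s["X3"] else 1 - p_x3[s["X1"]]
        w *= p_x4[(s["X2"], s["X3"])] if s["X4"] else 1 - p_x4[(s["X2"], s["X3"])]
        weights.append(w)
    return weights[1] / (weights[0] + weights[1])

def gibbs_estimate(query_var, evidence, n_samples=50_000, burn_in=1_000):
    """Estimate P(query_var = 1 | evidence) with a single Gibbs chain."""
    state = {v: evidence.get(v, random.randint(0, 1))
             for v in ("X1", "X2", "X3", "X4")}
    hits = 0
    for t in range(n_samples + burn_in):
        for var in state:
            if var in evidence:
                continue                         # evidence stays clamped
            state[var] = bernoulli(gibbs_conditional(var, state))
        if t >= burn_in:
            hits += state[query_var]
    return hits / n_samples

print(gibbs_estimate("X1", {"X3": 0}))           # approx P(X1=1 | X3=0)
```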
“Road map”
Introduction: Bayesian networks
Probabilistic inference
  Exact inference
  Approximate inference
    Local inference
    Stochastic simulations
    Variational approximations
Learning Bayesian Networks
Summary
Variational Approximations
Idea: a variational transformation of the CPDs simplifies inference
Advantages: computes upper and lower bounds on P(Y); usually faster than sampling techniques
Disadvantages: more complex and less general; must be derived for each particular form of CPD function
Variational bounds: example
log(x) ≤ λx − log λ − 1, and log(x) = min_λ { λx − log λ − 1 }, where λ is a variational parameter
This approach can be generalized to any concave (convex) function in order to compute its upper (lower) bounds: convex duality (Jaakkola and Jordan, 1997)
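A quick check (not in the slides) that minimizing the bound over λ indeed recovers log x:

\[
\frac{d}{d\lambda}\left(\lambda x - \log\lambda - 1\right) = x - \frac{1}{\lambda} = 0
\;\Rightarrow\; \lambda^{*} = \frac{1}{x},
\qquad
\lambda^{*}x - \log\lambda^{*} - 1 = 1 + \log x - 1 = \log x ,
\]

so λx − log λ − 1 ≥ log x for every λ > 0, with equality at λ = 1/x.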
Convex duality (Jaakkola and Jordan, 1997)
1. If f(x) is concave, it has a concave dual function f*(λ) such that
   f(x) = min_λ { λ^T x − f*(λ) }
   f*(λ) = min_x { λ^T x − f(x) }
   and we get upper bounds: f(x) ≤ λ^T x − f*(λ)
2. For convex f(x), we get lower bounds.
Example: QMR-DT network (Quick Medical Reference - Decision-Theoretic (Shwe et al., 1991))
[Figure: two-layer network with ~600 disease nodes d1, d2, ..., dk and ~4000 finding nodes f1, f2, f3, ..., fn]
Noisy-OR model:
P(f_i = 0 | d) = (1 − q_i0) Π_{j ∈ pa_i} (1 − q_ij)^(d_j)
               = exp( −θ_i0 − Σ_{j ∈ pa_i} θ_ij d_j ),  where θ_ij = −log(1 − q_ij)
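A small sketch (not from the slides; variable names and numbers are illustrative) of evaluating a noisy-OR CPD in both the product and exponential forms:

```python
import math

def noisy_or_p_f0(d, q, q0=0.0):
    """P(f = 0 | d) for a noisy-OR finding with leak probability q0 and
    link probabilities q[j]; d[j] in {0,1} marks whether disease j is present."""
    p = 1.0 - q0
    for dj, qj in zip(d, q):
        if dj:
            p *= 1.0 - qj
    return p

def noisy_or_p_f0_exp(d, q, q0=0.0):
    """Equivalent exponential form with theta_ij = -log(1 - q_ij)."""
    theta0 = -math.log(1.0 - q0) if q0 else 0.0
    theta = [-math.log(1.0 - qj) for qj in q]
    return math.exp(-theta0 - sum(t * dj for t, dj in zip(theta, d)))

d = [1, 0, 1]            # diseases 0 and 2 present
q = [0.6, 0.3, 0.2]      # made-up link probabilities
print(noisy_or_p_f0(d, q), noisy_or_p_f0_exp(d, q))   # both ~0.32
```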
Inference in QMR-DT
Inference complexity: O(exp(min{p, k})), where p = # of positive findings and k = max family size (Heckerman, 1989 (“Quickscore”); Rish and Dechter, 1998)
P(d | f) ∝ P(f | d) P(d) = [ Π_{i: f_i = 1} P(f_i = 1 | d) ] · [ Π_{i: f_i = 0} P(f_i = 0 | d) ] · Π_j P(d_j)
Negative findings keep the factorized exponential form P(f_i = 0 | d) = e^( −θ_i0 − Σ_{j ∈ pa_i} θ_ij d_j )
Positive findings contribute factors 1 − e^( −θ_i0 − Σ_{j ∈ pa_i} θ_ij d_j ): positive evidence “couples” the disease nodes
Inference: P(d_1 | f) ∝ Σ_{d_2, ..., d_k} P(d, f), where the negative-finding part remains factorized
Variational approach to QMR-DT (Jaakkola and Jordan, 1997)
f(x) = ln(1 − e^(−x)) is concave and has a concave dual f*(λ) = −λ ln λ + (λ + 1) ln(λ + 1)
Then P(f_i = 1 | d) = 1 − e^( −θ_i0 − Σ_{j ∈ pa_i} θ_ij d_j ) can be bounded by:
P(f_i = 1 | d) ≤ e^( λ_i (θ_i0 + Σ_{j ∈ pa_i} θ_ij d_j) − f*(λ_i) ) = e^( λ_i θ_i0 − f*(λ_i) ) Π_{j ∈ pa_i} [ e^( λ_i θ_ij ) ]^(d_j)
The effect of positive evidence is now factorized (the diseases are “decoupled”)
Variational approximations
Bounds on the local CPDs yield a bound on the posterior
Two approaches: sequential and block
Sequential: applies the variational transformation to (a subset of) nodes sequentially during inference, using a heuristic node ordering; then optimizes over the variational parameters
Block: selects in advance the nodes to be transformed, then selects the variational parameters minimizing the KL-distance between the true and approximate posteriors
Block approach
P(Y | E): the exact posterior of Y given evidence E
Q(Y | E, λ): the approximation obtained after replacing some CPDs with their variational bounds
Find λ* = argmin_λ D(Q || P), where D(Q || P) is the Kullback-Leibler (KL) distance:
D(Q || P) = Σ_S Q(S) log [ Q(S) / P(S) ]
Inference in BN: summary
Exact inference is often intractable => need approximations
Approximation principles:
  Approximating elimination: local inference, bounding the size of dependencies among variables (cliques in the problem’s graph) - mini-buckets, IBP
  Other approximations: stochastic simulations, variational techniques, etc.
Further research:
  Combining “orthogonal” approximation approaches
  Better understanding of “what works well where”: which approximation suits which problem structure
  Other approximation paradigms (e.g., other ways of approximating probabilities, constraints, cost functions)
“Road map”
Introduction: Bayesian networks
Probabilistic inference
  Exact inference
  Approximate inference
Learning Bayesian Networks
  Learning parameters
  Learning graph structure
Summary
Why learn Bayesian networks?
Incremental learning: combine a prior P(H) with new data records
Learning causal relationships (e.g., S -> C)
Efficient representation and inference
Handling missing data, e.g. records such as <1.3  2.8  ??  0  1> and <??  5.6  0  10  ??>
Combining domain expert knowledge with data
Learning Bayesian Networks
Known graph - learn parameters
  Complete data: parameter estimation (ML, MAP)
  Incomplete data: non-linear parametric optimization (gradient descent, EM)
  [Figure: a fixed network over S, B, C, X, D with CPDs P(S), P(B|S), P(C|S), P(X|C,S), P(D|C,B)]
Unknown graph - learn graph and parameters
  Complete data: optimization (search in the space of graphs), Ĝ = argmax_G Score(G)
  Incomplete data: EM plus multiple imputation, structural EM, mixture models
Learning Parameters: complete data
ML-estimate: max_Θ log P(D | Θ) - decomposable!
MAP-estimate (Bayesian statistics): max_Θ log P(D | Θ) P(Θ)
  Conjugate priors - Dirichlet: Dir(θ_{X|pa_X} | α_1, ..., α_m)
For a multinomial CPD P(X | pa_X) (e.g., node X with parents such as C and B):
  ML(θ_{x|pa_X}) = N_{x,pa_X} / N_{pa_X}    (counts)
  MAP(θ_{x|pa_X}) = (N_{x,pa_X} + α_{x,pa_X}) / (N_{pa_X} + α_{pa_X})
  where α is the equivalent sample size (prior knowledge)
Complete data => local computations
Incomplete data (score non-decomposable) => stochastic methods
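A minimal sketch (not from the slides) of counting-based ML/MAP estimation of a single CPD from complete data, with a uniform Dirichlet pseudo-count alpha playing the role of the equivalent sample size; the variables and data are illustrative:

```python
from collections import Counter

def estimate_cpd(data, child, parents, values=(0, 1), alpha=1.0):
    """ML/MAP estimate of P(child | parents) from complete data.
    data: list of dicts mapping variable name -> value.
    alpha: Dirichlet pseudo-count per cell (alpha = 0 gives the pure ML estimate)."""
    joint = Counter()          # N_{x, pa}
    parent_count = Counter()   # N_{pa}
    for record in data:
        pa = tuple(record[p] for p in parents)
        joint[(record[child], pa)] += 1
        parent_count[pa] += 1
    cpd = {}
    for pa in parent_count:
        for x in values:
            cpd[(x, pa)] = (joint[(x, pa)] + alpha) / \
                           (parent_count[pa] + alpha * len(values))
    return cpd

# Tiny made-up data set over S (smoking) and B (bronchitis).
data = [{"S": 1, "B": 1}, {"S": 1, "B": 0}, {"S": 1, "B": 1},
        {"S": 0, "B": 0}, {"S": 0, "B": 0}, {"S": 0, "B": 1}]
print(estimate_cpd(data, "B", ["S"], alpha=0.0))   # ML: P(B|S) from raw counts
print(estimate_cpd(data, "B", ["S"], alpha=1.0))   # MAP with a uniform Dirichlet prior
```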
Learning graph structure
Find Ĝ = argmax_G Score(G)
NP-hard optimization
Heuristic search: local moves over candidate graphs, e.g. over C, S, B: add S->B, delete S->B, reverse S->B
Constraint-based methods: the data impose independence relations (constraints)
Learning BNs: incomplete data
Learning parameters:
  EM algorithm [Lauritzen, 95]
  Gibbs sampling [Heckerman, 96]
  Gradient descent [Russell et al., 96]
Learning both structure and parameters:
  Sum over missing values [Cooper & Herskovits, 92; Cooper, 95]
  Monte-Carlo approaches [Heckerman, 96]
  Gaussian approximation [Heckerman, 96]
  Structural EM [Friedman, 98]
  EM and multiple imputation [Singh 97, 98, 00]
Learning Parameters: incomplete data
Non-decomposable marginal likelihood (hidden nodes)
EM algorithm: iterate until convergence
  Start from initial parameters; current model (G, Θ)
  Expectation: inference, e.g. P(S | X=0, D=1, C=0, B=1), converts data records with missing entries (e.g. <? 0 1 0 1>) into expected counts
  Maximization: update the parameters (ML, MAP) from the expected counts
Learning Parameters: incomplete data (Lauritzen, 95)
The complete-data log-likelihood is Σ_{ijk} N_ijk log θ_ijk
E step: compute E(N_ijk | Y_obs, Θ)
M step: θ_ijk = E(N_ijk | Y_obs, Θ) / E(N_ij | Y_obs, Θ)
Learning structure: incomplete data
Depends on the type of missing data - missing independent of anything else (MCAR) OR missing based on values of other variables (MAR)
While MCAR can be resolved by decomposable scores, MAR cannot
For likelihood-based methods, no need to explicitly model missing data mechanism
Very few attempts at MAR: stochastic methods
Learning structure: incomplete data
Approximate EM using multiple imputation to yield an efficient Monte-Carlo method [Singh 97, 98, 00]
  trade-off between performance & quality; the learned network is almost optimal
  approximates the complete-data log-likelihood function using multiple imputation
  yields a decomposable score, dependent only on each node & its parents
  converges to a local maximum of the observed-data likelihood
Learning structure: incomplete data
Scoring functions: Minimum Description Length (MDL)
MDL(BN | D) = −log P(D | G, Θ) + (|Θ| / 2) log N
            =  DL(Data | model)  +  DL(Model)
Learning <=> data compression
Other scores: MDL = −BIC (Bayesian Information Criterion); Bayesian score (BDe), asymptotically equivalent to MDL
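For complete data, the MDL score above can be computed directly from counts. A minimal sketch (binary variables assumed, ML parameters from raw counts; the structures and data are illustrative, not from the slides):

```python
import math
from collections import Counter

def mdl_score(data, structure, values=(0, 1)):
    """MDL(BN|D) = -log P(D | G, theta_ML) + (|theta|/2) log N   (lower is better).
    structure: dict mapping each variable to a list of its parents.
    data: list of complete, discrete records (dicts). Binary variables assumed."""
    n = len(data)
    loglik = 0.0
    n_params = 0
    for child, parents in structure.items():
        joint, parent_count = Counter(), Counter()
        for r in data:
            pa = tuple(r[p] for p in parents)
            joint[(r[child], pa)] += 1
            parent_count[pa] += 1
        # Log-likelihood term: sum over (x, pa) of N(x,pa) log( N(x,pa) / N(pa) )
        for (x, pa), nxpa in joint.items():
            loglik += nxpa * math.log(nxpa / parent_count[pa])
        # (|values|-1) free parameters per parent configuration
        n_params += (len(values) - 1) * len(values) ** len(parents)
    return -loglik + 0.5 * n_params * math.log(n)

data = [{"S": 1, "B": 1}, {"S": 1, "B": 0}, {"S": 1, "B": 1},
        {"S": 0, "B": 0}, {"S": 0, "B": 0}, {"S": 0, "B": 1}]
print(mdl_score(data, {"S": [], "B": []}))      # independent model
print(mdl_score(data, {"S": [], "B": ["S"]}))   # model with edge S -> B
```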
Learning Structure plus Parameters
p(Y | D) = Σ_M p(Y | M, D) p(M | D)
The number of models is super-exponential
Alternatives: model selection or model averaging
Model Selection
Generally, choose a single model M*; this is equivalent to saying P(M* | D) = 1
p(Y | D) ≈ p(Y | M*, D)
The task is now to: 1) define a metric to decide which model is best; 2) search for that model through the space of all models
One Reasonable Score: Posterior Probability of a Structure
p(S^h | D) ∝ p(S^h) p(D | S^h)
p(D | S^h) = ∫ p(D | θ_s, S^h) p(θ_s | S^h) dθ_s
(structure prior × parameter prior × likelihood)
Global and Local Predictive Scores [Spiegelhalter et al 93]
Global: log p(D | S^h) = Σ_{l=1}^{m} log p(x_l | x_1, ..., x_{l-1}, S^h)
                       = log p(x_1 | S^h) + log p(x_2 | x_1, S^h) + log p(x_3 | x_1, x_2, S^h) + ...
Bayes factor: p(D | S^h) / p(D | S_0^h)
The local score is useful for diagnostic problems
Local Predictive Score (Spiegelhalter et al., 1993)
pred(S^h) = Σ_{l=1}^{m} log p(y_l | x_l, d_1, ..., d_{l-1}, S^h)
[Figure: structure with a disease node Y and symptom nodes X1, X2, ..., Xn]
Exact computation of p(D | S^h) [Cooper & Herskovits, 92]
Assumptions: no missing data; cases are independent, given the model; uniform priors on parameters; discrete variables
p(D | S^h) = Π_{i=1}^{n} g(i, Pa_i)
Bayesian Dirichlet Score (Cooper and Herskovits, 1991)
p(D | S^h) = Π_{i=1}^{n} Π_{j=1}^{q_i} [ Γ(N'_ij) / Γ(N'_ij + N_ij) ] Π_{k=1}^{r_i} [ Γ(N'_ijk + N_ijk) / Γ(N'_ijk) ]
where
  N_ijk: # cases where X_i = k and Pa_i takes its j-th configuration
  r_i:   number of states of X_i
  q_i:   number of instantiations of the parents of X_i
  N_ij = Σ_{k=1}^{r_i} N_ijk,  N'_ij = Σ_{k=1}^{r_i} N'_ijk
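A minimal sketch (not from the slides) of the log Bayesian Dirichlet score for complete discrete data, with uniform pseudo-counts N'_ijk = prior (the K2-style choice when prior = 1); the structure dictionary and the tiny data set are illustrative:

```python
from math import lgamma
from collections import Counter

def log_bd_score(data, structure, values=(0, 1), prior=1.0):
    """log p(D | S^h) under the Bayesian Dirichlet score with uniform
    pseudo-counts N'_ijk = prior. structure: variable -> list of parents."""
    score = 0.0
    for child, parents in structure.items():
        counts = Counter()         # N_ijk, keyed by (parent config, child value)
        parent_counts = Counter()  # N_ij
        for r in data:
            pa = tuple(r[p] for p in parents)
            counts[(pa, r[child])] += 1
            parent_counts[pa] += 1
        r_i = len(values)
        for pa, n_ij in parent_counts.items():
            # Gamma(N'_ij) / Gamma(N'_ij + N_ij), with N'_ij = r_i * prior
            score += lgamma(r_i * prior) - lgamma(r_i * prior + n_ij)
            for x in values:
                n_ijk = counts[(pa, x)]
                # Gamma(N'_ijk + N_ijk) / Gamma(N'_ijk)
                score += lgamma(prior + n_ijk) - lgamma(prior)
    return score

data = [{"S": 1, "B": 1}, {"S": 1, "B": 0}, {"S": 1, "B": 1},
        {"S": 0, "B": 0}, {"S": 0, "B": 0}, {"S": 0, "B": 1}]
print(log_bd_score(data, {"S": [], "B": []}))     # independent structure
print(log_bd_score(data, {"S": [], "B": ["S"]}))  # structure with S -> B
```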
Learning BNs without specifying an ordering [Singh & Valtorta, 95]
There are n! orderings, and the ordering greatly affects the quality of the learned network.
Use conditional independence tests and d-separation to obtain an ordering.
Learning BNs via the MDL principle [Lam & Bacchus, 93]
Idea: the best model is the one that gives the most compact representation of the data.
So, encode the data using the model plus encode the model, and minimize this total length.
Learning BNs: summary
Bayesian networks - graphical probabilistic models
Efficient representation and inference
Expert knowledge + learning from data
Learning:
  parameters (parameter estimation, EM)
  structure (optimization with score functions, e.g., MDL)
Applications/systems: collaborative filtering (MSBN), fraud detection (AT&T), classification (AutoClass (NASA), TAN-BLT (SRI))
Future directions: causality, time, model evaluation criteria, approximate inference/learning, on-line learning, etc.