CS B351 LEARNING PROBABILISTIC MODELS

Dec 30, 2015

Transcript
Page 1:

CS B351 LEARNING PROBABILISTIC MODELS

Page 2:

MOTIVATION

Past lectures have studied how to infer characteristics of a distribution, given a fully-specified Bayes net

Next few lectures: where does the Bayes net come from?

Page 3:

[Figure: a small Bayes net relating Win? to Strength and Opponent Strength]

Page 4:

[Figure: a larger Bayes net for Win? with nodes Offense strength, Opp. Off. Strength, Defense strength, Opp. Def. Strength, Pass yds, Rush yds, Rush yds allowed, and Score allowed]

Page 5:

[Figure: the same Bayes net for Win?, extended with Strength of schedule, At Home?, Injuries?, and Opp. injuries?]

Page 6:

[Figure: the same extended Bayes net as on the previous slide]

Page 7:

AGENDA

Learning probability distributions from example data

Influence of structure on performance
Maximum likelihood estimation (MLE)
Bayesian estimation

Page 8:

PROBABILISTIC ESTIMATION PROBLEM

Our setting: given a set of examples drawn from the target distribution. Each example is complete (fully observable).

Goal: produce some representation of a belief state so that we can perform inference and make predictions.

Page 9:

DENSITY ESTIMATION

Given a dataset D = {d[1], …, d[M]} drawn from an underlying distribution P*

Find a distribution that matches P* as "closely" as possible

High-level issues: there is usually not enough data to get an accurate picture of P*, which forces us to approximate. Even if we could recover P* exactly, how do we define "closeness" (both theoretically and in practice)? And how do we maximize "closeness"?

Page 10:

WHAT CLASS OF PROBABILITY MODELS?

For small discrete distributions, just use a tabular representation; very efficient learning techniques exist.

For large discrete distributions or continuous ones, the choice of probability model is crucial. Increasing complexity =>
Can represent complex distributions more accurately
Needs more data to learn well (risk of overfitting)
More expensive to learn and to perform inference

Page 11:

TWO LEARNING PROBLEMS

Parameter learning: what entries should be put into the model's probability tables?

Structure learning:
Which variables should be represented / transformed for inclusion in the model?
What direct / indirect relationships between variables should be modeled?

Structure learning is the more "high level" problem. Once a structure is chosen, a set of (unestimated) parameters emerges; these then need to be estimated using parameter learning.

Page 12:

LEARNING COIN FLIPS

Cherry and lime candies are in an opaque bag.

Observe that c out of N draws are cherries (data).

Page 13:

LEARNING COIN FLIPS

Observe that c out of N draws are cherries (data).

Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag (or it might not, depending on the draw!)

"Intuitive" parameter estimate: the empirical distribution P(cherry) = c/N (this will be justified more thoroughly later).
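To make the empirical estimate concrete, here is a minimal sketch (not from the slides; the draw labels and sample data are made up) that computes c/N from a list of draws:

```python
from collections import Counter

def empirical_estimate(draws, value="cherry"):
    """Empirical probability: fraction of draws equal to `value`, i.e. c / N."""
    counts = Counter(draws)
    return counts[value] / len(draws)

# Hypothetical data: 3 cherries out of 5 draws
draws = ["cherry", "lime", "cherry", "cherry", "lime"]
print(empirical_estimate(draws))  # 0.6 = 3/5
```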

Page 14:

STRUCTURE LEARNING EXAMPLE: HISTOGRAM BUCKET SIZES

Histograms are used to estimate distributions of continuous variables (or discrete variables with many values)… but how fine should the buckets be?

[Figure: four histograms of the same data over the range 0 to 200, using progressively different bucket sizes]
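As a rough illustration (not from the slides), the sketch below bins the same synthetic sample with different bucket counts; numpy/matplotlib and the sample itself are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: a mixture of two Gaussians on roughly [0, 200]
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(60, 15, 300), rng.normal(140, 25, 200)])

# Same data, three different "structures" (bucket sizes)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [100, 20, 4]):
    ax.hist(data, bins=bins, range=(0, 200), density=True)
    ax.set_title(f"{bins} buckets")
plt.show()
```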

Page 15:

STRUCTURE LEARNING: INDEPENDENCE RELATIONSHIPS

Compare table P(A,B,C,D) vs P(A)P(B)P(C)P(D)

Case 1: 15 free parameters (16 entries minus the sum-to-1 constraint)
P(A,B,C,D) = p1
P(A,B,C,¬D) = p2
…
P(¬A,¬B,¬C,¬D) = 1 - p1 - … - p15

Case 2: 4 free parameters
P(A) = p1, P(¬A) = 1 - p1
…
P(D) = p4, P(¬D) = 1 - p4
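In general, for n binary variables (a remark I am adding for context, consistent with the slide's n = 4 case):

```latex
\underbrace{2^{n}-1}_{\text{full joint table}}
\quad\text{vs.}\quad
\underbrace{n}_{\text{fully independent model}},
\qquad n=4:\ 2^{4}-1 = 15 \ \text{vs.}\ 4 .
```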

Page 16:

STRUCTURE LEARNING: INDEPENDENCE RELATIONSHIPS

Compare table P(A,B,C,D) vs P(A)P(B)P(C)P(D)

P(A,B,C,D): able to fit ALL relationships in the data

P(A)P(B)P(C)P(D): inherently lacks the capability to accurately model correlations between variables (e.g., between A and B). This leads to biased estimates: it overestimates or underestimates the true probabilities.

Page 17:

[Figure: bar charts of an original joint distribution P(X,Y) and the distribution learned under the independence assumption P(X)P(Y), for X and Y each taking values 1, 2, 3]
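To see the effect numerically, here is a minimal sketch (an illustration I am adding, not the slide's example) that fits the independent model to samples from a correlated joint and compares the two:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated joint over X, Y in {0, 1, 2}
joint = np.array([[0.25, 0.02, 0.03],
                  [0.02, 0.25, 0.03],
                  [0.03, 0.03, 0.34]])

# Draw samples from the joint
flat = rng.choice(9, size=5000, p=joint.ravel())
samples = np.stack(np.unravel_index(flat, joint.shape), axis=1)

# Empirical joint vs. product of empirical marginals
emp_joint = np.zeros_like(joint)
for x, y in samples:
    emp_joint[x, y] += 1
emp_joint /= len(samples)

px = emp_joint.sum(axis=1)    # marginal P(X)
py = emp_joint.sum(axis=0)    # marginal P(Y)
indep_fit = np.outer(px, py)  # P(X)P(Y)

print("empirical joint:\n", emp_joint.round(2))
print("independent fit:\n", indep_fit.round(2))  # smears out the diagonal
```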

Page 18:

STRUCTURE LEARNING: EXPRESSIVE POWER

Making more independence assumptions always makes a probabilistic model less expressive

If the independence relationships assumed by structure model A are a superset of those in structure B, then B can express any probability distribution that A can

[Figure: three candidate structures over variables X, Y, Z making different independence assumptions]

Page 19:

[Figure: two alternative structures relating a class variable C to features F1, F2, …, Fk, with the arcs oriented in opposite directions. Or?]

Page 20:

ARCS DO NOT NECESSARILY ENCODE CAUSALITY!

[Figure: two three-node networks over A, B, C with arcs oriented in opposite directions]

Two BNs that can encode the same joint probability distribution

Page 21:

READING OFF INDEPENDENCE RELATIONSHIPS

Given B, does the value of A affect the probability of C? That is, does P(C|B,A) = P(C|B)?

No! C's parent (B) is given, and so C is independent of its non-descendants (A).

Independence is symmetric: C ⊥ A | B => A ⊥ C | B

[Figure: chain network A → B → C]
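As a sanity check (an illustration I am adding, with made-up CPT numbers), the sketch below builds the joint for a chain A → B → C and verifies that P(C|A,B) = P(C|B):

```python
import numpy as np

# Hypothetical CPTs for a chain A -> B -> C (all variables binary)
P_A = np.array([0.3, 0.7])                      # P(A)
P_B_given_A = np.array([[0.9, 0.1],             # P(B|A=0)
                        [0.2, 0.8]])            # P(B|A=1)
P_C_given_B = np.array([[0.6, 0.4],             # P(C|B=0)
                        [0.1, 0.9]])            # P(C|B=1)

# Joint P(A,B,C) = P(A) P(B|A) P(C|B), indexed as joint[a, b, c]
joint = (P_A[:, None, None] *
         P_B_given_A[:, :, None] *
         P_C_given_B[None, :, :])

# P(C|A,B) should not depend on A
P_AB = joint.sum(axis=2, keepdims=True)
P_C_given_AB = joint / P_AB
print(np.allclose(P_C_given_AB[0], P_C_given_AB[1]))  # True: A is irrelevant given B
```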

Page 22:

LEARNING IN THE FACE OF NOISY DATA

Ex: flip two independent coins. Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT

[Figure: Model 1 with X and Y as separate, unconnected nodes; Model 2 with an arc from X to Y]

Page 23:

LEARNING IN THE FACE OF NOISY DATA

Ex: flip two independent coins. Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT

Parameters estimated via the empirical distribution ("intuitive fit"):

Model 1 (X and Y independent): P(X=H) = 9/20, P(Y=H) = 8/20
Model 2 (arc X → Y): P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11
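A minimal sketch (added here, not from the slides) that recovers these numbers from the counts:

```python
# Counts from the slide: (X, Y) outcomes over 20 flips
counts = {("H", "H"): 3, ("H", "T"): 6, ("T", "H"): 5, ("T", "T"): 6}
N = sum(counts.values())

# Model 1: X and Y independent
p_x_h = sum(v for (x, _), v in counts.items() if x == "H") / N  # 9/20
p_y_h = sum(v for (_, y), v in counts.items() if y == "H") / N  # 8/20

# Model 2: arc X -> Y, so Y's table is split ("fragmented") by the value of X
n_x_h = sum(v for (x, _), v in counts.items() if x == "H")
p_y_h_given_x_h = counts[("H", "H")] / n_x_h                    # 3/9
p_y_h_given_x_t = counts[("T", "H")] / (N - n_x_h)              # 5/11

print(p_x_h, p_y_h, p_y_h_given_x_h, p_y_h_given_x_t)
```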

Page 24:

LEARNING IN THE FACE OF NOISY DATA

Ex: flip two independent coins. Dataset of 20 flips: 3 HH, 6 HT, 5 TH, 6 TT

Parameters estimated via the empirical distribution ("intuitive fit"):

Model 1 (X and Y independent): P(X=H) = 9/20, P(Y=H) = 8/20
Model 2 (arc X → Y): P(X=H) = 9/20, P(Y=H|X=H) = 3/9, P(Y=H|X=T) = 5/11

Errors in Model 2's conditional estimates are likely to be larger, since each is based on fewer datapoints!

Page 25:

STRUCTURE LEARNING: FIT VS COMPLEXITY

Must trade off fit to the data vs. complexity of the model

Complex models:
More parameters to learn
More expressive
More data fragmentation = greater sensitivity to noise

Page 26:

STRUCTURE LEARNING: FIT VS COMPLEXITY

Must trade off fit to the data vs. complexity of the model

Complex models:
More parameters to learn
More expressive
More data fragmentation = greater sensitivity to noise

Typical approaches explore multiple structures while optimizing the trade-off between fit and complexity.

This requires a way of measuring "complexity" (e.g., number of edges, number of parameters) and "fit".

Page 27:

FURTHER READING ON STRUCTURE LEARNING

Structure learning with statistical independence testing
Score-based methods (e.g., Bayesian Information Criterion)
Bayesian methods with structure priors
Cross-validated model selection (more on this later)

Page 28:

STATISTICAL PARAMETER LEARNING

Page 29:

LEARNING COIN FLIPS

Observe that c out of N draws are cherries (data).
Let the unknown fraction of cherries be q (hypothesis).
The probability of drawing a cherry is q.
Assumption: draws are independent and identically distributed (i.i.d.).

Page 30:

LEARNING COIN FLIPS

The probability of drawing a cherry is q.
Assumption: draws are independent and identically distributed (i.i.d.).
The probability of drawing 2 cherries is q * q = q^2.
The probability of drawing 2 limes is (1-q)^2.
The probability of drawing 1 cherry and 1 lime is q * (1-q).

Page 31:

LIKELIHOOD FUNCTION

Likelihood of data d = {d1, …, dN} given q:

P(d|q) = ∏_j P(dj|q) = q^c (1-q)^(N-c)

(i.i.d. assumption; gather the c cherry terms together, then the N-c lime terms)

Page 32:

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)

[Plot: P(data|q) as a function of q after observing 1/1 cherry]
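The following sketch (my own illustration, assuming numpy/matplotlib are available) reproduces the shape of these likelihood curves for any counts c and N:

```python
import numpy as np
import matplotlib.pyplot as plt

def likelihood(q, c, N):
    """P(data | q) for c cherries observed out of N i.i.d. draws."""
    return q**c * (1 - q)**(N - c)

q = np.linspace(0, 1, 200)
for c, N in [(1, 1), (2, 3), (2, 5), (10, 20), (50, 100)]:
    plt.plot(q, likelihood(q, c, N), label=f"{c}/{N} cherry")
plt.xlabel("q")
plt.ylabel("P(data | q)")
plt.legend()
plt.show()
```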

Page 33:

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)

[Plot: P(data|q) as a function of q after observing 2/2 cherries]

Page 34:

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)

[Plot: P(data|q) as a function of q after observing 2/3 cherries]

Page 35:

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)

[Plot: P(data|q) as a function of q after observing 2/4 cherries]

Page 36:

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)

[Plot: P(data|q) as a function of q after observing 2/5 cherries]

Page 37:

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)

[Plot: P(data|q) as a function of q after observing 10/20 cherries]

Page 38:

MAXIMUM LIKELIHOOD

Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)

[Plot: P(data|q) as a function of q after observing 50/100 cherries]

Page 39:

MAXIMUM LIKELIHOOD

Peaks of likelihood function seem to hover around the fraction of cherries…

Sharpness indicates some notion of certainty…

[Plot: P(data|q) as a function of q for 50/100 cherries]

Page 40:

MAXIMUM LIKELIHOOD

P(d|q) is the likelihood function.

The quantity argmax_q P(d|q) is known as the maximum likelihood estimate (MLE).

Page 41:

MAXIMUM LIKELIHOOD

l(q) = log P(d|q) = log [ q^c (1-q)^(N-c) ]

Page 42:

MAXIMUM LIKELIHOOD

l(q) = log P(d|q) = log [ q^c (1-q)^(N-c) ] = log [ q^c ] + log [ (1-q)^(N-c) ]

Page 43:

MAXIMUM LIKELIHOOD

l(q) = log P(d|q) = log [ q^c (1-q)^(N-c) ] = log [ q^c ] + log [ (1-q)^(N-c) ] = c log q + (N-c) log (1-q)

Page 44:

MAXIMUM LIKELIHOOD

l(q) = log P(d|q) = c log q + (N-c) log (1-q)

Setting dl/dq (q) = 0 gives the maximum likelihood estimate.

Page 45:

MAXIMUM LIKELIHOOD

dl/dq (q) = c/q - (N-c)/(1-q)

At the MLE, c/q - (N-c)/(1-q) = 0
=> q = c/N
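For a quick symbolic check of this derivation (my own addition, assuming sympy is available):

```python
import sympy as sp

q = sp.symbols("q", positive=True)
c, N = sp.symbols("c N", positive=True)

# Log-likelihood l(q) = c log q + (N - c) log(1 - q)
ll = c * sp.log(q) + (N - c) * sp.log(1 - q)

# Solve dl/dq = 0 for q
print(sp.solve(sp.diff(ll, q), q))  # [c/N]
```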

Page 46:

OTHER MLE RESULTS

Categorical distributions (non-binary discrete variables): take the fraction of counts for each value (a histogram)

Continuous Gaussian distributions:
Mean = average of the data
Standard deviation = standard deviation of the data
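A minimal numpy sketch of these two MLE results (my own illustration; the data is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Categorical MLE: fraction of counts for each value
draws = rng.choice(["cherry", "lime", "grape"], size=1000, p=[0.5, 0.3, 0.2])
values, counts = np.unique(draws, return_counts=True)
print(dict(zip(values, counts / len(draws))))  # close to the true p

# Gaussian MLE: sample mean and (biased, ddof=0) sample standard deviation
x = rng.normal(loc=10.0, scale=2.0, size=1000)
print(x.mean(), x.std())                       # close to 10 and 2
```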

Page 47:

AN ALTERNATIVE APPROACH: BAYESIAN ESTIMATION

P(q|d) = 1/Z P(d|q) P(q) is the posterior: the distribution over hypotheses given the data

P(d|q) is the likelihood
P(q) is the hypothesis prior

[Figure: network with parameter q as parent of the observations d[1], d[2], …, d[M]]

Page 48:

ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION

Assume P(q) is uniform. Then P(q|d) = 1/Z P(d|q) = 1/Z q^c (1-q)^(N-c)

What is P(Y|D), the probability of the next draw Y given the data?

[Figure: network with parameter q as parent of the observations d[1], d[2], …, d[M] and of the next draw Y]

Page 49:

ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION

(Same content as the previous slide.)

Page 50:

ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION

=> Z = c! (N-c)! / (N+1)!
=> P(Y) = (1/Z) (c+1)! (N-c)! / (N+2)! = (c+1) / (N+2)

[Figure: same network as on the previous slides]

Can think of this as a “correction” using “virtual counts”
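A tiny numeric sketch (added for illustration) comparing the maximum likelihood estimate c/N with this Bayesian prediction (c+1)/(N+2), the "virtual counts" correction:

```python
def mle(c, N):
    return c / N

def bayes_uniform_prior(c, N):
    # Posterior-predictive probability of a cherry under a uniform prior
    return (c + 1) / (N + 2)

for c, N in [(1, 1), (2, 3), (10, 20), (50, 100)]:
    print(f"c={c}, N={N}: MLE={mle(c, N):.3f}, Bayes={bayes_uniform_prior(c, N):.3f}")
```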

Page 51:

NONUNIFORM PRIORS

P(q|d) ∝ P(d|q) P(q) = q^c (1-q)^(N-c) P(q)

Define, for every q, the prior probability P(q) that I believe in hypothesis q.

[Plot: an example prior density P(q) over q in [0, 1]]

Page 52:

BETA DISTRIBUTION

Beta_{a,b}(q) = γ q^(a-1) (1-q)^(b-1)

a, b are hyperparameters > 0
γ is a normalization constant
a = b = 1 is the uniform distribution

Page 53:

POSTERIOR WITH BETA PRIOR

Posterior ∝ q^c (1-q)^(N-c) P(q) = γ q^(c+a-1) (1-q)^(N-c+b-1) = Beta_{a+c, b+N-c}(q)

Prediction = mean E[q] = (c+a)/(N+a+b)
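A short sketch of this update (my own illustration, assuming scipy is available; the counts and hyperparameters are made up):

```python
from scipy.stats import beta

# Hypothetical prior Beta(a, b) and observed counts
a, b = 2, 2   # prior "virtual counts" of a-1 = 1 cherry, b-1 = 1 lime
c, N = 7, 10  # 7 cherries in 10 draws

posterior = beta(a + c, b + N - c)  # Beta_{a+c, b+N-c}
print(posterior.mean())             # (c+a)/(N+a+b) = 9/14 ≈ 0.643
```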

Page 54:

POSTERIOR WITH BETA PRIOR

What does this mean? The prior specifies a "virtual count" of a-1 heads and b-1 tails.
See a head: increment a. See a tail: increment b.

The effect of the prior diminishes with more data.

Page 55:

CHOOSING A PRIOR

Part of the design process; must be chosen according to your intuition

Uninformed belief => a = b = 1; strong belief => a, b high

Page 56:

EXTENSIONS OF BETA PRIORS

For the parameters of multi-valued (categorical) distributions, e.g. histograms: the Dirichlet prior. The mathematical derivation is more complex, but in practice it still takes the form of "virtual counts".

[Figure: four histograms of the same data over the range 0 to 200, showing the smoothing effect of different amounts of Dirichlet virtual counts]
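A minimal sketch of Dirichlet-style smoothing for a categorical distribution (my own illustration; the counts and the virtual-count value are assumptions):

```python
import numpy as np

# Observed counts for a categorical variable with 5 values
counts = np.array([0, 3, 12, 4, 1])
alpha = 1.0  # "virtual count" added to every value

mle = counts / counts.sum()
smoothed = (counts + alpha) / (counts.sum() + alpha * len(counts))

print(mle.round(3))       # zero probability for the unseen value
print(smoothed.round(3))  # every value gets some probability mass
```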

Page 57:

RECAP

Learning probabilistic models
Parameter vs. structure learning
Single-parameter learning via coin flips
Maximum likelihood
Bayesian learning with a Beta prior

Page 58:

MAXIMUM LIKELIHOOD FOR BN

For any BN, the ML parameter for any CPT entry is the fraction of observed values in the data, among the datapoints whose parent values match.

[Figure: Bayes net with Earthquake and Burglar as parents of Alarm]

Counts from N = 1000 examples: E: 500, B: 200
P(E) = 0.5, P(B) = 0.2

A|E,B: 19/20
A|¬E,B: 188/200
A|E,¬B: 170/500
A|¬E,¬B: 1/380

E B P(A|E,B)
T T 0.95
F T 0.95
T F 0.34
F F 0.003
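A sketch of this counting procedure (my own illustration with made-up data; not the slide's numbers):

```python
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical complete data over (E, B, A), sampled from some "true" model
data = []
for _ in range(1000):
    e = rng.random() < 0.5
    b = rng.random() < 0.2
    a = rng.random() < (0.95 if (e or b) else 0.01)
    data.append((e, b, a))

# ML estimate of each CPT entry P(A=true | E, B):
# count datapoints matching each parent configuration and normalize
counts = Counter((e, b) for e, b, _ in data)
counts_a = Counter((e, b) for e, b, a in data if a)
for parents in [(True, True), (True, False), (False, True), (False, False)]:
    if counts[parents]:
        print(parents, counts_a[parents] / counts[parents])
```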

Page 59:

FITTING CPTS

Each ML entry P(xi | pa_Xi) is obtained by examining the counts of (xi, pa_Xi) in D and normalizing across rows of the CPT.

Note that for a large number of parents k = |Pa_Xi|, very few datapoints will share the same values of pa_Xi: on the order of |D|/2^k for binary parents, and some configurations may be even rarer. Large domains |Val(Xi)| can also be a problem. This is data fragmentation.