CS B351 STATISTICAL LEARNING
Feb 22, 2016
AGENDA
• Learning coin flips, learning Bayes net parameters
• Likelihood functions, maximum likelihood estimation (MLE)
• Priors, maximum a posteriori estimation (MAP)
• Bayesian estimation
LEARNING COIN FLIPS
• Observe that c out of N draws are cherries (data)
• Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag
• "Intuitive" parameter estimate: the empirical distribution, P(cherry) = c/N
• (Why is this reasonable? Perhaps we got a bad draw!)
LEARNING COIN FLIPS
• Observe that c out of N draws are cherries (data)
• Let the unknown fraction of cherries be q (hypothesis)
• Probability of drawing a cherry is q
• Assumption: draws are independent and identically distributed (i.i.d.)
LEARNING COIN FLIPS
• Probability of drawing a cherry is q
• Assumption: draws are independent and identically distributed (i.i.d.)
• Probability of drawing 2 cherries is q·q = q^2
• Probability of drawing 2 limes is (1-q)^2
• Probability of drawing 1 cherry and 1 lime: q·(1-q)
LIKELIHOOD FUNCTION
• Likelihood: the probability of the data d = {d1, …, dN} given the hypothesis q
• P(d|q) = ∏_j P(d_j|q)   (i.i.d. assumption)
LIKELIHOOD FUNCTION
• P(d|q) = ∏_j P(d_j|q), where P(d_j|q) = q if d_j = Cherry, (1-q) if d_j = Lime
• (Probability model, assuming q is given)
LIKELIHOOD FUNCTION
• P(d|q) = ∏_j P(d_j|q) = q^c (1-q)^(N-c)
• (Gather the c cherry terms together, then the N-c lime terms)
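As a quick illustration, here is a minimal sketch of this likelihood in Python (NumPy assumed; the function name and example draws are ours, not from the lecture):

```python
import numpy as np

def likelihood(q, draws):
    """Likelihood P(d|q) of i.i.d. draws, each 'C' (cherry) or 'L' (lime)."""
    c = sum(1 for d in draws if d == 'C')  # number of cherries
    n = len(draws)
    return q**c * (1 - q)**(n - c)         # q^c (1-q)^(N-c)

qs = np.linspace(0, 1, 101)
vals = likelihood(qs, ['C', 'C', 'L'])     # 2 cherries out of 3 draws
print(qs[np.argmax(vals)])                 # peak near 2/3
```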
MAXIMUM LIKELIHOOD
• Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)
[Plots: P(data|q) vs. q for 1/1, 2/2, 2/3, 2/4, 2/5, 10/20, and 50/100 cherry draws]
MAXIMUM LIKELIHOOD
• Peaks of the likelihood function seem to hover around the fraction of cherries…
• Sharpness indicates some notion of certainty…
[Plot: P(data|q) vs. q for 50/100 cherry]
MAXIMUM LIKELIHOOD
• P(d|q) is the likelihood function
• The quantity argmax_q P(d|q) is known as the maximum likelihood estimate (MLE)
[Plots: P(data|q) vs. q, with the MLE marked]
• 1/1 cherry: q = 1 is the MLE
• 2/2 cherry: q = 1 is the MLE
• 2/3 cherry: q = 2/3 is the MLE
• 2/4 cherry: q = 1/2 is the MLE
• 2/5 cherry: q = 2/5 is the MLE
PROOF: EMPIRICAL FREQUENCY IS THE MLE
• l(q) = log P(d|q) = log [q^c (1-q)^(N-c)]
       = log [q^c] + log [(1-q)^(N-c)]
       = c log q + (N-c) log (1-q)
• Setting dl/dq = 0 gives the maximum likelihood estimate
• dl/dq = c/q - (N-c)/(1-q), so at the MLE, c/q - (N-c)/(1-q) = 0
• Cross-multiplying: c(1-q) = (N-c)q => c = Nq => q = c/N
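A quick numerical check of this result (a sketch; NumPy assumed, names our own):

```python
import numpy as np

def log_likelihood(q, c, n):
    """l(q) = c log q + (n - c) log(1 - q)."""
    return c * np.log(q) + (n - c) * np.log(1 - q)

c, n = 2, 5                              # 2 cherries out of 5 draws
qs = np.linspace(0.001, 0.999, 999)      # avoid log(0) at the endpoints
q_mle = qs[np.argmax(log_likelihood(qs, c, n))]
print(q_mle, c / n)                      # both approximately 0.4
```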
MAXIMUM LIKELIHOOD FOR BN
• For any BN, the ML parameters of any CPT can be derived as the fraction of observed values in the data, conditioned on matched parent values
• Example network: Earthquake → Alarm ← Burglar
• Data: N = 1000 examples, with E: 500 and B: 200, so P(E) = 0.5 and P(B) = 0.2
• Alarm counts: A|E,B: 19/20; A|B only: 188/200; A|E only: 170/500; A|neither: 1/380

E B | P(A|E,B)
T T | 0.95
F T | 0.94
T F | 0.34
F F | 0.003
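A minimal sketch of reading these CPT entries off the counts (Python; the count values are the slide's, the code structure is ours):

```python
# Counts from the slide: (count of A=T, total) for each parent setting (E, B)
counts = {
    (True, True):   (19, 20),
    (False, True):  (188, 200),
    (True, False):  (170, 500),
    (False, False): (1, 380),
}

# MLE for each CPT entry: fraction of A=T among matching parent settings
cpt = {parents: a / total for parents, (a, total) in counts.items()}
for (e, b), p in cpt.items():
    print(f"P(A | E={e}, B={b}) = {p:.3f}")
```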
FITTING CPTS VIA MLE
• M examples D = (d[1], …, d[M])
• Each d[i] is a complete example of all variables in the Bayes net
• Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN
• Suppose the BN has a single variable X; estimate X's CPT, P(X) (just learning a coin flip as usual):
  P_MLE(X) = empirical distribution of D
  P_MLE(X=T) = Count(X=T) / M
  P_MLE(X=F) = Count(X=F) / M
• Now suppose the BN is X → Y; estimate P(X) and P(Y|X)
• Estimate P_MLE(X) as usual
• Estimate P_MLE(Y|X) with:

P(Y|X) | X=T                         | X=F
Y=T    | Count(Y=T,X=T) / Count(X=T) | Count(Y=T,X=F) / Count(X=F)
Y=F    | Count(Y=F,X=T) / Count(X=T) | Count(Y=F,X=F) / Count(X=F)

• In general, for P(Y|X1,…,Xk): for each setting of (y, x1,…,xk), compute Count(y, x1,…,xk) and Count(x1,…,xk), and set P_MLE(y|x1,…,xk) = Count(y, x1,…,xk) / Count(x1,…,xk), as in the sketch below
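A sketch of this counting procedure in Python (the data and helper names are hypothetical, not from the lecture):

```python
from collections import Counter

def fit_cpt(data, child, parents):
    """MLE CPT: P(child | parents) = Count(child, parents) / Count(parents).

    data: list of dicts mapping variable name -> value (complete examples).
    """
    joint = Counter()    # counts of (child value, parent values)
    parent = Counter()   # counts of parent values alone
    for d in data:
        pa = tuple(d[p] for p in parents)
        joint[(d[child], pa)] += 1
        parent[pa] += 1
    return {(y, pa): n / parent[pa] for (y, pa), n in joint.items()}

# Example: BN X -> Y, four complete examples
data = [{'X': True, 'Y': True}, {'X': True, 'Y': False},
        {'X': False, 'Y': False}, {'X': True, 'Y': True}]
print(fit_cpt(data, 'Y', ['X']))   # e.g. P(Y=T | X=T) = 2/3
```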
OTHER MLE RESULTS
• Categorical distributions (non-binary discrete variables): the empirical distribution is the MLE (make a histogram, divide by N)
• Continuous Gaussian distributions: mean = average of the data, standard deviation = standard deviation of the data (see the sketch below)
[Plots: a histogram and the Gaussian (normal) distribution fit to it]
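A sketch of the Gaussian case (NumPy assumed; note that the MLE variance uses the 1/N normalization, i.e. ddof=0):

```python
import numpy as np

data = np.random.default_rng(0).normal(loc=100.0, scale=15.0, size=1000)

mu_mle = np.mean(data)            # MLE mean: the sample average
sigma_mle = np.std(data, ddof=0)  # MLE std: 1/N normalization, not 1/(N-1)
print(mu_mle, sigma_mle)          # close to the true 100 and 15
```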
NICE PROPERTIES OF MLE
• Easy to compute (for certain probability models)
• With enough data, the q_MLE estimate will approach the true unknown value of q
PROBLEMS WITH MLE
• The MLE was easy to compute… but what happens when we don't have much data?
• Motivation: you hand me a coin from your pocket; 1 flip, turns up tails. What's the MLE?
• q_MLE has high variance with small sample sizes
VARIANCE OF AN ESTIMATOR: INTUITION
• The dataset D is just a sample of the underlying distribution; if we could "do over" the sample, we might get a new dataset D'
• With D', our MLE estimate q_MLE' might be different. How much? How often?
• Assume all values of q are equally likely. In the case of 1 draw, D would just as likely have been a lime; in that case, q_MLE = 0
• So with probability 0.5, q_MLE would be 1, and with the same probability, q_MLE would be 0
• High variance: typical "do overs" give drastically different results! (See the simulation sketch below)
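A small simulation of this "do over" intuition (a sketch; NumPy assumed). With N=1 the MLE jumps between 0 and 1; with N=100 it concentrates near the true q:

```python
import numpy as np

rng = np.random.default_rng(0)
q_true = 0.5

for n in (1, 100):
    # 1000 "do overs": draw a dataset of size n, record the MLE c/n each time
    mles = rng.binomial(n, q_true, size=1000) / n
    print(f"N={n}: std of q_MLE across do-overs = {mles.std():.3f}")
# N=1 gives std ~0.5 (estimates are all 0 or 1); N=100 gives std ~0.05
```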
IS THERE A BETTER WAY? BAYESIAN LEARNING
AN ALTERNATIVE APPROACH: BAYESIAN ESTIMATION
• P(D|q) is the likelihood
• P(q) is the hypothesis prior
• P(q|D) = 1/Z P(D|q) P(q) is the posterior: the distribution of hypotheses given the data
[Diagram: q with children d[1], d[2], …, d[M]]
BAYESIAN PREDICTION
• For a new draw Y: use the hypothesis posterior to predict P(Y|D)
[Diagram: q with children d[1], d[2], …, d[M], and Y]
CANDY EXAMPLE
• Candy comes in 2 flavors, cherry and lime, with identical wrappers
• Manufacturer makes 5 indistinguishable bags:
  h1: C 100%, L 0%
  h2: C 75%, L 25%
  h3: C 50%, L 50%
  h4: C 25%, L 75%
  h5: C 0%, L 100%
• Suppose we draw a sequence of limes
• Which bag are we holding? What flavor will we draw next?
BAYESIAN LEARNING
• Main idea: compute the probability of each hypothesis given the data
• Data D: the observed draws; Hypotheses: h1, …, h5 (the five bags above)
• We want P(hi|D)… but all we have is P(D|hi)!
USING BAYES' RULE
• P(hi|D) = α P(D|hi) P(hi) is the posterior
• (Recall, 1/α = P(D) = Σ_i P(D|hi) P(hi))
• P(D|hi) is the likelihood
• P(hi) is the hypothesis prior
COMPUTING THE POSTERIOR
• Assume draws are independent
• Let (P(h1), …, P(h5)) = (0.1, 0.2, 0.4, 0.2, 0.1)
• D = { 10 limes }
• Likelihoods: P(D|h1) = 0^10 = 0; P(D|h2) = 0.25^10; P(D|h3) = 0.5^10; P(D|h4) = 0.75^10; P(D|h5) = 1^10 = 1
• Products: P(D|h1)P(h1) = 0; P(D|h2)P(h2) ≈ 1.9e-7; P(D|h3)P(h3) ≈ 4e-4; P(D|h4)P(h4) ≈ 0.011; P(D|h5)P(h5) = 0.1
• Sum = 1/α ≈ 0.112
• Posteriors: P(h1|D) = 0; P(h2|D) ≈ 0.00; P(h3|D) ≈ 0.00; P(h4|D) ≈ 0.10; P(h5|D) ≈ 0.90
POSTERIOR HYPOTHESES
[Plot: posterior probabilities P(hi|D) for each hypothesis]
PREDICTING THE NEXT DRAW
• P(Y|D) = Σ_i P(Y|hi,D) P(hi|D) = Σ_i P(Y|hi) P(hi|D)   (Y is independent of D given hi)
• Posteriors: P(h1|D) = 0; P(h2|D) ≈ 0.00; P(h3|D) ≈ 0.00; P(h4|D) ≈ 0.10; P(h5|D) ≈ 0.90
• P(Y = lime|hi): 0, 0.25, 0.5, 0.75, 1
• Probability that the next candy drawn is a lime: P(Y|D) ≈ 0.975
[Diagram: H with children D and Y]
P(NEXT CANDY IS LIME | D)
[Plot: the predictive probability P(next candy is lime | D)]
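A sketch of the whole posterior-and-prediction computation (NumPy assumed; the numbers are from the slides, the code structure is ours):

```python
import numpy as np

priors = np.array([0.1, 0.2, 0.4, 0.2, 0.1])      # P(h1)..P(h5)
p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])    # P(lime | hi)

likelihood = p_lime ** 10                  # P(D | hi) for D = 10 limes
posterior = likelihood * priors
posterior /= posterior.sum()               # normalize by alpha = 1/P(D)
print(posterior.round(3))                  # ~[0, 0, 0.004, 0.101, 0.896]

p_next_lime = (p_lime * posterior).sum()   # P(Y = lime | D)
print(round(p_next_lime, 3))               # ~0.973 (the slides round to 0.975)
```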
BACK TO COIN FLIPS: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
• Assume P(q) is uniform
• P(q|D) = 1/Z P(D|q) = 1/Z q^c (1-q)^(N-c)
• What's P(Y|D)?
[Diagram: q with children d[1], d[2], …, d[M], and Y]
ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
• Z = ∫ q^c (1-q)^(N-c) dq = c! (N-c)! / (N+1)!
• P(Y|D) = 1/Z ∫ q · q^c (1-q)^(N-c) dq = 1/Z (c+1)! (N-c)! / (N+2)! = (c+1) / (N+2)
• Can think of this as a "correction" using "virtual counts"
NONUNIFORM PRIORS
• P(q|D) ∝ P(D|q) P(q) = q^c (1-q)^(N-c) P(q)
• Define, for all q, the probability that I believe in q
[Plot: a prior density P(q) over q in [0, 1]]
BETA DISTRIBUTION
• Beta_{a,b}(q) = γ q^(a-1) (1-q)^(b-1)
• a, b are hyperparameters > 0
• γ is a normalization constant
• a = b = 1 is the uniform distribution
POSTERIOR WITH BETA PRIOR
• Posterior ∝ q^c (1-q)^(N-c) P(q) = γ q^(c+a-1) (1-q)^(N-c+b-1) = Beta_{a+c, b+N-c}(q)
• Prediction = mean: E[q] = (c+a) / (N+a+b)
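A minimal sketch of this update in Python (closed form; names ours). With a = b = 1 (uniform prior) it reproduces the (c+1)/(N+2) rule above:

```python
def beta_update(a, b, c, n):
    """Posterior Beta parameters and predictive mean after seeing
    c cherries in n draws, starting from a Beta(a, b) prior."""
    a_post, b_post = a + c, b + (n - c)        # Beta(a+c, b+N-c)
    prediction = a_post / (a_post + b_post)    # E[q] = (c+a)/(N+a+b)
    return a_post, b_post, prediction

print(beta_update(1, 1, 3, 10))    # uniform prior: predicts (3+1)/(10+2) = 1/3
print(beta_update(10, 10, 3, 10))  # strong prior near 0.5: predicts 13/30
```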
POSTERIOR WITH BETA PRIOR
• What does this mean? The prior specifies a "virtual count" of a-1 heads and b-1 tails
• See heads: increment a. See tails: increment b
• The effect of the prior diminishes with more data
CHOOSING A PRIOR
• Part of the design process; must be chosen according to your intuition
• Uninformed belief => a = b = 1; strong belief => a, b high
FITTING CPTS VIA MAP
• M examples D = (d[1], …, d[M]), virtual counts a, b
• Estimate P(Y|X) by assuming we've also seen a examples of Y=T and b examples of Y=F (sketch below):

P(Y|X) | X=T                                   | X=F
Y=T    | (Count(Y=T,X=T)+a) / (Count(X=T)+a+b) | (Count(Y=T,X=F)+a) / (Count(X=F)+a+b)
Y=F    | (Count(Y=F,X=T)+b) / (Count(X=T)+a+b) | (Count(Y=F,X=F)+b) / (Count(X=F)+a+b)
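A sketch of one such table entry as code (hypothetical names; a, b are the virtual counts above):

```python
def map_estimate(count_yt, count_yf, a, b):
    """Virtual-count estimate of P(Y=T | parent setting) with counts a, b."""
    total = count_yt + count_yf
    return (count_yt + a) / (total + a + b)

# 1 flip, turns up tails (the problem case for the MLE):
print(map_estimate(0, 1, 1, 1))    # 1/3, instead of the MLE's 0
# Effect of the prior diminishes with more data:
print(map_estimate(40, 60, 1, 1))  # ~0.402, close to the MLE 0.4
```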
PROPERTIES OF MAP
• Approaches the MLE as the dataset grows large (effect of the prior diminishes in the face of evidence)
• More stable estimates than MLE with small sample sizes
• Needs a designer's judgment to set the prior
EXTENSIONS OF BETA PRIORS
• Parameters of multi-valued (categorical) distributions, e.g., histograms: Dirichlet prior
• The mathematical derivation is more complex, but in practice it still takes the form of "virtual counts" (see the sketch below)
[Plots: histogram fits with Dirichlet priors of increasing virtual counts: 0, 1, 5, 10]
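A sketch of the categorical case (NumPy assumed; names ours): adding a virtual count to each bin before normalizing gives the posterior-mean estimate under a symmetric Dirichlet prior.

```python
import numpy as np

def dirichlet_estimate(counts, virtual_count):
    """Posterior-mean categorical estimate: (count_k + alpha) / (N + K*alpha),
    with the same virtual count alpha for every bin."""
    smoothed = np.asarray(counts, dtype=float) + virtual_count
    return smoothed / smoothed.sum()

counts = [7, 2, 1, 0]                     # histogram over 4 categories
for vc in (0, 1, 5, 10):                  # virtual counts as in the plots
    print(vc, dirichlet_estimate(counts, vc).round(3))
# vc=0 is the MLE; larger vc pulls the estimate toward uniform
```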
RECAP
• Parameter learning via coin flips
• Maximum likelihood
• Bayesian learning with a Beta prior
• Learning Bayes net parameters
NEXT TIME
• Introduction to machine learning
• R&N 18.1-3