CS B351 STATISTICAL LEARNING
Feb 22, 2016
AGENDA
• Learning coin flips, learning Bayes net parameters
• Likelihood functions, maximum likelihood estimation (MLE)
• Priors, maximum a posteriori estimation (MAP)
• Bayesian estimation
LEARNING COIN FLIPS
• Observe that c out of N draws are cherries (data)
• Intuition: c/N might be a good hypothesis for the fraction of cherries in the bag
• "Intuitive" parameter estimate: the empirical distribution, P(cherry) = c/N
• (Why is this reasonable? Perhaps we got a bad draw!)
LEARNING COIN FLIPS
• Observe that c out of N draws are cherries (data)
• Let the unknown fraction of cherries be q (hypothesis)
• Probability of drawing a cherry is q
• Assumption: draws are independent and identically distributed (i.i.d.)
LEARNING COIN FLIPS
• Probability of drawing a cherry is q
• Assumption: draws are independent and identically distributed (i.i.d.)
• Probability of drawing 2 cherries is q·q = q^2
• Probability of drawing 2 limes is (1-q)^2
• Probability of drawing 1 cherry and 1 lime: q·(1-q)
LIKELIHOOD FUNCTION
• Likelihood: the probability of the data d = {d1, …, dN} given the hypothesis q
• P(d|q) = ∏_j P(d_j|q)   (i.i.d. assumption)
LIKELIHOOD FUNCTION
• P(d|q) = ∏_j P(d_j|q), where P(d_j|q) = q if d_j = Cherry, (1-q) if d_j = Lime
• (Probability model, assuming q is given)
LIKELIHOOD FUNCTION
• P(d|q) = ∏_j P(d_j|q) = q^c (1-q)^(N-c)
• (Gather the c cherry terms together, then the N-c lime terms)
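As a quick illustration, here is a minimal sketch of this likelihood in Python (NumPy assumed; the function name and example draws are ours, not from the lecture):

```python
import numpy as np

def likelihood(q, draws):
    """Likelihood P(d|q) of i.i.d. draws, each 'C' (cherry) or 'L' (lime)."""
    c = sum(1 for d in draws if d == 'C')  # number of cherries
    n = len(draws)
    return q**c * (1 - q)**(n - c)         # q^c (1-q)^(N-c)

qs = np.linspace(0, 1, 101)
vals = likelihood(qs, ['C', 'C', 'L'])     # 2 cherries out of 3 draws
print(qs[np.argmax(vals)])                 # peak near 2/3
```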
MAXIMUM LIKELIHOOD
• Likelihood of data d = {d1, …, dN} given q: P(d|q) = q^c (1-q)^(N-c)
[Plots: P(data|q) vs. q for 1/1, 2/2, 2/3, 2/4, 2/5, 10/20, and 50/100 cherry draws]
MAXIMUM LIKELIHOOD
• Peaks of the likelihood function seem to hover around the fraction of cherries…
• Sharpness indicates some notion of certainty…
[Plot: P(data|q) vs. q for 50/100 cherry]
MAXIMUM LIKELIHOOD
• P(d|q) is the likelihood function
• The quantity argmax_q P(d|q) is known as the maximum likelihood estimate (MLE)
[Plots: P(data|q) vs. q, with the MLE marked]
• 1/1 cherry: q = 1 is the MLE
• 2/2 cherry: q = 1 is the MLE
• 2/3 cherry: q = 2/3 is the MLE
• 2/4 cherry: q = 1/2 is the MLE
• 2/5 cherry: q = 2/5 is the MLE
PROOF: EMPIRICAL FREQUENCY IS THE MLE
• l(q) = log P(d|q) = log [q^c (1-q)^(N-c)]
       = log [q^c] + log [(1-q)^(N-c)]
       = c log q + (N-c) log (1-q)
• Setting dl/dq = 0 gives the maximum likelihood estimate
• dl/dq = c/q - (N-c)/(1-q), so at the MLE, c/q - (N-c)/(1-q) = 0
• Cross-multiplying: c(1-q) = (N-c)q => c = Nq => q = c/N
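A quick numerical check of this result (a sketch; NumPy assumed, names our own):

```python
import numpy as np

def log_likelihood(q, c, n):
    """l(q) = c log q + (n - c) log(1 - q)."""
    return c * np.log(q) + (n - c) * np.log(1 - q)

c, n = 2, 5                              # 2 cherries out of 5 draws
qs = np.linspace(0.001, 0.999, 999)      # avoid log(0) at the endpoints
q_mle = qs[np.argmax(log_likelihood(qs, c, n))]
print(q_mle, c / n)                      # both approximately 0.4
```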
MAXIMUM LIKELIHOOD FOR BN
• For any BN, the ML parameters of any CPT can be derived as the fraction of observed values in the data, conditioned on matched parent values
• Example network: Earthquake → Alarm ← Burglar
• Data: N = 1000 examples, with E: 500 and B: 200, so P(E) = 0.5 and P(B) = 0.2
• Alarm counts: A|E,B: 19/20; A|B only: 188/200; A|E only: 170/500; A|neither: 1/380

E B | P(A|E,B)
T T | 0.95
F T | 0.94
T F | 0.34
F F | 0.003
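A minimal sketch of reading these CPT entries off the counts (Python; the count values are the slide's, the code structure is ours):

```python
# Counts from the slide: (count of A=T, total) for each parent setting (E, B)
counts = {
    (True, True):   (19, 20),
    (False, True):  (188, 200),
    (True, False):  (170, 500),
    (False, False): (1, 380),
}

# MLE for each CPT entry: fraction of A=T among matching parent settings
cpt = {parents: a / total for parents, (a, total) in counts.items()}
for (e, b), p in cpt.items():
    print(f"P(A | E={e}, B={b}) = {p:.3f}")
```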
FITTING CPTS VIA MLE
• M examples D = (d[1], …, d[M])
• Each d[i] is a complete example of all variables in the Bayes net
• Assumption: each d[i] is sampled i.i.d. from the joint distribution of the BN
• Suppose the BN has a single variable X; estimate X's CPT, P(X) (just learning a coin flip as usual):
  P_MLE(X) = empirical distribution of D
  P_MLE(X=T) = Count(X=T) / M
  P_MLE(X=F) = Count(X=F) / M
• Now suppose the BN is X → Y; estimate P(X) and P(Y|X)
• Estimate P_MLE(X) as usual
• Estimate P_MLE(Y|X) with:

P(Y|X) | X=T                         | X=F
Y=T    | Count(Y=T,X=T) / Count(X=T) | Count(Y=T,X=F) / Count(X=F)
Y=F    | Count(Y=F,X=T) / Count(X=T) | Count(Y=F,X=F) / Count(X=F)

• In general, for P(Y|X1,…,Xk): for each setting of (y, x1,…,xk), compute Count(y, x1,…,xk) and Count(x1,…,xk), and set P_MLE(y|x1,…,xk) = Count(y, x1,…,xk) / Count(x1,…,xk), as in the sketch below
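A sketch of this counting procedure in Python (the data and helper names are hypothetical, not from the lecture):

```python
from collections import Counter

def fit_cpt(data, child, parents):
    """MLE CPT: P(child | parents) = Count(child, parents) / Count(parents).

    data: list of dicts mapping variable name -> value (complete examples).
    """
    joint = Counter()    # counts of (child value, parent values)
    parent = Counter()   # counts of parent values alone
    for d in data:
        pa = tuple(d[p] for p in parents)
        joint[(d[child], pa)] += 1
        parent[pa] += 1
    return {(y, pa): n / parent[pa] for (y, pa), n in joint.items()}

# Example: BN X -> Y, four complete examples
data = [{'X': True, 'Y': True}, {'X': True, 'Y': False},
        {'X': False, 'Y': False}, {'X': True, 'Y': True}]
print(fit_cpt(data, 'Y', ['X']))   # e.g. P(Y=T | X=T) = 2/3
```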
OTHER MLE RESULTS
• Categorical distributions (non-binary discrete variables): the empirical distribution is the MLE (make a histogram, divide by N)
• Continuous Gaussian distributions: mean = average of the data, standard deviation = standard deviation of the data (see the sketch below)
[Plots: a histogram and the Gaussian (normal) distribution fit to it]
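A sketch of the Gaussian case (NumPy assumed; note that the MLE variance uses the 1/N normalization, i.e. ddof=0):

```python
import numpy as np

data = np.random.default_rng(0).normal(loc=100.0, scale=15.0, size=1000)

mu_mle = np.mean(data)            # MLE mean: the sample average
sigma_mle = np.std(data, ddof=0)  # MLE std: 1/N normalization, not 1/(N-1)
print(mu_mle, sigma_mle)          # close to the true 100 and 15
```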
NICE PROPERTIES OF MLE
• Easy to compute (for certain probability models)
• With enough data, the q_MLE estimate will approach the true unknown value of q
PROBLEMS WITH MLE
• The MLE was easy to compute… but what happens when we don't have much data?
• Motivation: you hand me a coin from your pocket; 1 flip, turns up tails. What's the MLE?
• q_MLE has high variance with small sample sizes
VARIANCE OF AN ESTIMATOR: INTUITION
• The dataset D is just a sample of the underlying distribution; if we could "do over" the sample, we might get a new dataset D'
• With D', our MLE estimate q_MLE' might be different. How much? How often?
• Assume all values of q are equally likely. In the case of 1 draw, D would just as likely have been a lime; in that case, q_MLE = 0
• So with probability 0.5, q_MLE would be 1, and with the same probability, q_MLE would be 0
• High variance: typical "do overs" give drastically different results! (See the simulation sketch below)
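A small simulation of this "do over" intuition (a sketch; NumPy assumed). With N=1 the MLE jumps between 0 and 1; with N=100 it concentrates near the true q:

```python
import numpy as np

rng = np.random.default_rng(0)
q_true = 0.5

for n in (1, 100):
    # 1000 "do overs": draw a dataset of size n, record the MLE c/n each time
    mles = rng.binomial(n, q_true, size=1000) / n
    print(f"N={n}: std of q_MLE across do-overs = {mles.std():.3f}")
# N=1 gives std ~0.5 (estimates are all 0 or 1); N=100 gives std ~0.05
```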
IS THERE A BETTER WAY? BAYESIAN LEARNING
AN ALTERNATIVE APPROACH: BAYESIAN ESTIMATION
• P(D|q) is the likelihood
• P(q) is the hypothesis prior
• P(q|D) = 1/Z P(D|q) P(q) is the posterior: the distribution of hypotheses given the data
[Diagram: q with children d[1], d[2], …, d[M]]
BAYESIAN PREDICTION
• For a new draw Y: use the hypothesis posterior to predict P(Y|D)
[Diagram: q with children d[1], d[2], …, d[M], and Y]
CANDY EXAMPLE
• Candy comes in 2 flavors, cherry and lime, with identical wrappers
• Manufacturer makes 5 indistinguishable bags:
  h1: C 100%, L 0%
  h2: C 75%, L 25%
  h3: C 50%, L 50%
  h4: C 25%, L 75%
  h5: C 0%, L 100%
• Suppose we draw a sequence of limes
• Which bag are we holding? What flavor will we draw next?
BAYESIAN LEARNING
• Main idea: compute the probability of each hypothesis given the data
• Data D: the observed draws; Hypotheses: h1, …, h5 (the five bags above)
• We want P(hi|D)… but all we have is P(D|hi)!
USING BAYES' RULE
• P(hi|D) = α P(D|hi) P(hi) is the posterior
• (Recall, 1/α = P(D) = Σ_i P(D|hi) P(hi))
• P(D|hi) is the likelihood
• P(hi) is the hypothesis prior
COMPUTING THE POSTERIOR
• Assume draws are independent
• Let (P(h1), …, P(h5)) = (0.1, 0.2, 0.4, 0.2, 0.1)
• D = { 10 limes }
• Likelihoods: P(D|h1) = 0^10 = 0; P(D|h2) = 0.25^10; P(D|h3) = 0.5^10; P(D|h4) = 0.75^10; P(D|h5) = 1^10 = 1
• Products: P(D|h1)P(h1) = 0; P(D|h2)P(h2) ≈ 1.9e-7; P(D|h3)P(h3) ≈ 4e-4; P(D|h4)P(h4) ≈ 0.011; P(D|h5)P(h5) = 0.1
• Sum = 1/α ≈ 0.112
• Posteriors: P(h1|D) = 0; P(h2|D) ≈ 0.00; P(h3|D) ≈ 0.00; P(h4|D) ≈ 0.10; P(h5|D) ≈ 0.90
POSTERIOR HYPOTHESES
[Plot: posterior probabilities P(hi|D) for each hypothesis]
PREDICTING THE NEXT DRAW
• P(Y|D) = Σ_i P(Y|hi,D) P(hi|D) = Σ_i P(Y|hi) P(hi|D)   (Y is independent of D given hi)
• Posteriors: P(h1|D) = 0; P(h2|D) ≈ 0.00; P(h3|D) ≈ 0.00; P(h4|D) ≈ 0.10; P(h5|D) ≈ 0.90
• P(Y = lime|hi): 0, 0.25, 0.5, 0.75, 1
• Probability that the next candy drawn is a lime: P(Y|D) ≈ 0.975
[Diagram: H with children D and Y]
P(NEXT CANDY IS LIME | D)
[Plot: the predictive probability P(next candy is lime | D)]
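A sketch of the whole posterior-and-prediction computation (NumPy assumed; the numbers are from the slides, the code structure is ours):

```python
import numpy as np

priors = np.array([0.1, 0.2, 0.4, 0.2, 0.1])      # P(h1)..P(h5)
p_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])    # P(lime | hi)

likelihood = p_lime ** 10                  # P(D | hi) for D = 10 limes
posterior = likelihood * priors
posterior /= posterior.sum()               # normalize by alpha = 1/P(D)
print(posterior.round(3))                  # ~[0, 0, 0.004, 0.101, 0.896]

p_next_lime = (p_lime * posterior).sum()   # P(Y = lime | D)
print(round(p_next_lime, 3))               # ~0.973 (the slides round to 0.975)
```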
BACK TO COIN FLIPS: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
• Assume P(q) is uniform
• P(q|D) = 1/Z P(D|q) = 1/Z q^c (1-q)^(N-c)
• What's P(Y|D)?
[Diagram: q with children d[1], d[2], …, d[M], and Y]
ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
• Z = ∫ q^c (1-q)^(N-c) dq = c! (N-c)! / (N+1)!
• P(Y|D) = 1/Z ∫ q · q^c (1-q)^(N-c) dq = 1/Z (c+1)! (N-c)! / (N+2)! = (c+1) / (N+2)
• Can think of this as a "correction" using "virtual counts"
NONUNIFORM PRIORS
• P(q|D) ∝ P(D|q) P(q) = q^c (1-q)^(N-c) P(q)
• Define, for all q, the probability that I believe in q
[Plot: a prior density P(q) over q in [0, 1]]
BETA DISTRIBUTION
• Beta_{a,b}(q) = γ q^(a-1) (1-q)^(b-1)
• a, b are hyperparameters > 0
• γ is a normalization constant
• a = b = 1 is the uniform distribution
POSTERIOR WITH BETA PRIOR
• Posterior ∝ q^c (1-q)^(N-c) P(q) = γ q^(c+a-1) (1-q)^(N-c+b-1) = Beta_{a+c, b+N-c}(q)
• Prediction = mean: E[q] = (c+a) / (N+a+b)
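A minimal sketch of this update in Python (closed form; names ours). With a = b = 1 (uniform prior) it reproduces the (c+1)/(N+2) rule above:

```python
def beta_update(a, b, c, n):
    """Posterior Beta parameters and predictive mean after seeing
    c cherries in n draws, starting from a Beta(a, b) prior."""
    a_post, b_post = a + c, b + (n - c)        # Beta(a+c, b+N-c)
    prediction = a_post / (a_post + b_post)    # E[q] = (c+a)/(N+a+b)
    return a_post, b_post, prediction

print(beta_update(1, 1, 3, 10))    # uniform prior: predicts (3+1)/(10+2) = 1/3
print(beta_update(10, 10, 3, 10))  # strong prior near 0.5: predicts 13/30
```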
POSTERIOR WITH BETA PRIOR
• What does this mean? The prior specifies a "virtual count" of a-1 heads and b-1 tails
• See heads: increment a. See tails: increment b
• The effect of the prior diminishes with more data
CHOOSING A PRIOR
• Part of the design process; must be chosen according to your intuition
• Uninformed belief => a = b = 1; strong belief => a, b high
FITTING CPTS VIA MAP
• M examples D = (d[1], …, d[M]), virtual counts a, b
• Estimate P(Y|X) by assuming we've also seen a examples of Y=T and b examples of Y=F (sketch below):

P(Y|X) | X=T                                   | X=F
Y=T    | (Count(Y=T,X=T)+a) / (Count(X=T)+a+b) | (Count(Y=T,X=F)+a) / (Count(X=F)+a+b)
Y=F    | (Count(Y=F,X=T)+b) / (Count(X=T)+a+b) | (Count(Y=F,X=F)+b) / (Count(X=F)+a+b)
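A sketch of one such table entry as code (hypothetical names; a, b are the virtual counts above):

```python
def map_estimate(count_yt, count_yf, a, b):
    """Virtual-count estimate of P(Y=T | parent setting) with counts a, b."""
    total = count_yt + count_yf
    return (count_yt + a) / (total + a + b)

# 1 flip, turns up tails (the problem case for the MLE):
print(map_estimate(0, 1, 1, 1))    # 1/3, instead of the MLE's 0
# Effect of the prior diminishes with more data:
print(map_estimate(40, 60, 1, 1))  # ~0.402, close to the MLE 0.4
```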
PROPERTIES OF MAP
• Approaches the MLE as the dataset grows large (effect of the prior diminishes in the face of evidence)
• More stable estimates than MLE with small sample sizes
• Needs a designer's judgment to set the prior
EXTENSIONS OF BETA PRIORS
• Parameters of multi-valued (categorical) distributions, e.g., histograms: Dirichlet prior
• The mathematical derivation is more complex, but in practice it still takes the form of "virtual counts" (see the sketch below)
[Plots: histogram fits with Dirichlet priors of increasing virtual counts: 0, 1, 5, 10]
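A sketch of the categorical case (NumPy assumed; names ours): adding a virtual count to each bin before normalizing gives the posterior-mean estimate under a symmetric Dirichlet prior.

```python
import numpy as np

def dirichlet_estimate(counts, virtual_count):
    """Posterior-mean categorical estimate: (count_k + alpha) / (N + K*alpha),
    with the same virtual count alpha for every bin."""
    smoothed = np.asarray(counts, dtype=float) + virtual_count
    return smoothed / smoothed.sum()

counts = [7, 2, 1, 0]                     # histogram over 4 categories
for vc in (0, 1, 5, 10):                  # virtual counts as in the plots
    print(vc, dirichlet_estimate(counts, vc).round(3))
# vc=0 is the MLE; larger vc pulls the estimate toward uniform
```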
RECAP
• Parameter learning via coin flips
• Maximum likelihood
• Bayesian learning with a Beta prior
• Learning Bayes net parameters
NEXT TIME
• Introduction to machine learning
• R&N 18.1-3