ECE 8443 – Pattern Recognition
ECE 8527 – Introduction to Machine Learning and Pattern Recognition
LECTURE 04: MAXIMUM LIKELIHOOD ESTIMATION
• Objectives: Discrete Features, Maximum Likelihood, Bias in ML Estimates, Bayesian Estimation, Example
• For problems where features are discrete, densities are replaced by probabilities:
  p(x|ωj) dx  →  P(x|ωj)
• Bayes formula involves probabilities (not densities):
  P(ωj|x) = P(x|ωj) P(ωj) / P(x)
  where
  P(x) = ∑_{j=1}^{c} P(x|ωj) P(ωj)
• Bayes rule remains the same:
  α* = argmin_i R(αi|x)
• The maximum entropy distribution is a uniform distribution:
  P(x = xi) = 1/N
Discrete Features
ECE 8527: Lecture 04, Slide 3
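As a quick illustration of Bayes formula with discrete features, here is a minimal Python/NumPy sketch; the priors and the conditional probability table P(x|ωj) are hypothetical values chosen only for this example.

import numpy as np

# Hypothetical example: 3 classes and a feature x that takes one of 4 discrete values.
priors = np.array([0.5, 0.3, 0.2])                 # P(w_j)
likelihoods = np.array([[0.10, 0.20, 0.30, 0.40],  # P(x = v | w_j): rows = classes,
                        [0.40, 0.30, 0.20, 0.10],  # columns = discrete feature values
                        [0.25, 0.25, 0.25, 0.25]])

x = 2                                              # index of the observed discrete value
evidence = np.sum(likelihoods[:, x] * priors)      # P(x) = sum_j P(x|w_j) P(w_j)
posteriors = likelihoods[:, x] * priors / evidence # P(w_j|x)
print(posteriors, posteriors.sum())                # posteriors sum to 1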
• Consider independent binary features: x = (x1, ..., xd)^t, where each xi is 0 or 1, with
  pi = P(xi = 1 | ω1)  and  qi = P(xi = 1 | ω2).
• Assuming conditional independence:
  P(x|ω1) = ∏_{i=1}^{d} pi^xi (1 − pi)^(1−xi)
  P(x|ω2) = ∏_{i=1}^{d} qi^xi (1 − qi)^(1−xi)
• The likelihood ratio is:
  P(x|ω1) / P(x|ω2) = ∏_{i=1}^{d} [pi^xi (1 − pi)^(1−xi)] / [qi^xi (1 − qi)^(1−xi)]
• The discriminant function is:
  g(x) = ∑_{i=1}^{d} [ xi ln(pi/qi) + (1 − xi) ln((1 − pi)/(1 − qi)) ] + ln(P(ω1)/P(ω2))
Discriminant Functions For Discrete Features
ECE 8527: Lecture 04, Slide 4
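Since the resulting discriminant is linear in the binary features, it is easy to compute directly. A minimal sketch, assuming hypothetical values for pi, qi, and the priors:

import numpy as np

# Hypothetical parameters for d = 3 independent binary features:
# p_i = P(x_i = 1 | w1), q_i = P(x_i = 1 | w2).
p = np.array([0.8, 0.6, 0.7])
q = np.array([0.3, 0.4, 0.5])
P_w1, P_w2 = 0.5, 0.5

def g(x):
    # g(x) = sum_i [x_i ln(p_i/q_i) + (1 - x_i) ln((1-p_i)/(1-q_i))] + ln(P(w1)/P(w2));
    # decide w1 if g(x) > 0, otherwise w2.
    x = np.asarray(x, dtype=float)
    w = np.log(p / q) - np.log((1 - p) / (1 - q))            # weight on each feature
    w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P_w1 / P_w2)
    return float(x @ w + w0)

print(g([1, 1, 0]), g([0, 0, 1]))                            # positive favors w1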
• In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, P(ωi), and class-conditional densities, p(x|ωi).
• What can we do if we do not have this information? What limitations do we face?
• There are two common approaches to parameter estimation: maximum likelihood and Bayesian estimation.
• Maximum Likelihood: treat the parameters as quantities whose values are fixed but unknown.
• Bayes: treat the parameters as random variables having some known prior distribution. Observations of samples convert this prior to a posterior.
• Bayesian Learning: sharpen the a posteriori density, causing it to peak near the true value.
Introduction to Maximum Likelihood Estimation
ECE 8527: Lecture 04, Slide 5
• I.I.D.: c data sets, D1, ..., Dc, where Dj is drawn independently according to p(x|ωj).
• Assume p(x|ωj) has a known parametric form and is completely determined by the parameter vector θj (e.g., p(x|ωj) ~ N(μj, Σj), where θj = [μ1, ..., μd, σ11, σ12, ..., σdd] collects the components of μj and Σj).
• p(x|ωj) then has an explicit dependence on θj: p(x|ωj, θj).
• Use the training samples to estimate θ1, θ2, ..., θc.
• Functional independence: assume Di gives no useful information about θj for i ≠ j.
• This simplifies notation to a set D of training samples (x1, ..., xn) drawn independently from p(x|ω), which we use to estimate the parameter vector θ.
• Because the samples were drawn independently:
  p(D|θ) = ∏_{k=1}^{n} p(xk|θ)
General Principle
ECE 8527: Lecture 04, Slide 6
• p(D|θ) is called the likelihood of θ with respect to the data.
• Given several training points (see figure): Top: candidate source distributions are shown. Which distribution is the ML estimate? Middle: an estimate of the likelihood of the data as a function of θ (the mean). Bottom: the log likelihood.
• The value of θ that maximizes this likelihood, denoted θ̂, is the maximum likelihood estimate (ML) of θ.
Example of ML Estimation
ECE 8527: Lecture 04, Slide 7
• Let θ = (θ1, θ2, ..., θp)^t and let D = {x1, ..., xn}. Then:
  p(D|θ) = ∏_{k=1}^{n} p(xk|θ)
• Define the log likelihood and the ML estimate:
  l(θ) = ln p(D|θ) = ∑_{k=1}^{n} ln p(xk|θ)
  θ̂ = argmax_θ l(θ)
• The ML estimate is found by solving this equation:
  ∇θ l = ∑_{k=1}^{n} ∇θ ln p(xk|θ) = 0
• The solution to this equation can be a global maximum, a local maximum, or even an inflection point.
• Under what conditions is it a global maximum?
General Mathematics
ECE 8527: Lecture 04, Slide 8
• A class of estimators – maximum a posteriori (MAP) – maximize p(D|θ)p(θ), where p(θ) describes the prior probability of different parameter values:
  θ̂ = argmax_θ [ p(D|θ) p(θ) ]
• An ML estimator is a MAP estimator for uniform priors.
• A MAP estimator finds the peak, or mode, of a posterior density.
• MAP estimators are not transformation invariant (if we perform a nonlinear transformation of the input data, the estimator is no longer optimum in the new space). This observation will be useful later in the course.
Maximum A Posteriori Estimation
ECE 8527: Lecture 04, Slide 9
• Consider the case where only the mean, θ = μ, is unknown. The log likelihood of a single point is:
  ln p(xk|μ) = -(1/2) ln[(2π)^d |Σ|] - (1/2)(xk - μ)^t Σ⁻¹ (xk - μ)
• Setting the gradient of the log likelihood to zero:
  ∑_{k=1}^{n} ∇μ ln p(xk|μ) = 0
  which implies:
  ∇μ ln p(xk|μ) = Σ⁻¹ (xk - μ)
  because:
  ∇μ [ -(1/2) ln((2π)^d |Σ|) - (1/2)(xk - μ)^t Σ⁻¹ (xk - μ) ] = Σ⁻¹ (xk - μ)
Gaussian Case: Unknown Mean
ECE 8527: Lecture 04, Slide 10
• Substituting into the expression for the total log likelihood:
  ∇μ l = ∑_{k=1}^{n} Σ⁻¹ (xk - μ̂) = 0
• Rearranging terms:
  ∑_{k=1}^{n} (xk - μ̂) = 0
  ∑_{k=1}^{n} xk - n μ̂ = 0
  μ̂ = (1/n) ∑_{k=1}^{n} xk
• Significance??? The ML estimate of the mean is simply the sample mean of the training data.
Gaussian Case: Unknown Mean
ECE 8527: Lecture 04, Slide 11
• Let θ = [θ1, θ2]^t = [μ, σ²]^t. The log likelihood of a SINGLE point is:
  ln p(xk|θ) = -(1/2) ln(2π θ2) - (1/(2θ2))(xk - θ1)²
• The gradient is:
  ∇θ l = ∇θ ln p(xk|θ) = [ (1/θ2)(xk - θ1) ,  -1/(2θ2) + (xk - θ1)²/(2θ2²) ]^t
• The full likelihood leads to:
  ∑_{k=1}^{n} (1/θ̂2)(xk - θ̂1) = 0
  -∑_{k=1}^{n} 1/(2θ̂2) + ∑_{k=1}^{n} (xk - θ̂1)²/(2θ̂2²) = 0
Gaussian Case: Unknown Mean and Variance
ECE 8527: Lecture 04, Slide 12
• This leads to these equations:
  μ̂ = (1/n) ∑_{k=1}^{n} xk
  σ̂² = (1/n) ∑_{k=1}^{n} (xk - μ̂)²
• In the multivariate case:
  μ̂ = (1/n) ∑_{k=1}^{n} xk
  Σ̂ = (1/n) ∑_{k=1}^{n} (xk - μ̂)(xk - μ̂)^t
• The true covariance is the expected value of the matrix (xk - μ̂)(xk - μ̂)^t, which is a familiar result.
Gaussian Case: Unknown Mean and Variance
ECE 8527: Lecture 04, Slide 13
• Does the maximum likelihood estimate of the variance converge to the true value of the variance? Let’s start with a few simple results we will need later.
• Expected value of the ML estimate of the mean:
  E[μ̂] = E[(1/n) ∑_{i=1}^{n} xi] = (1/n) ∑_{i=1}^{n} E[xi] = μ
• Variance of the ML estimate of the mean:
  var[μ̂] = E[μ̂²] - (E[μ̂])²
         = E[((1/n) ∑_{i=1}^{n} xi)((1/n) ∑_{j=1}^{n} xj)] - μ²
         = (1/n²) ∑_{i=1}^{n} ∑_{j=1}^{n} E[xi xj] - μ²
Convergence of the Mean
ECE 8527: Lecture 04, Slide 14
• The expected value of xi xj, E[xi xj], will be μ² for i ≠ j (since the two random variables are independent) and μ² + σ² otherwise.
• The expected value of xi² will be μ² + σ².
• Hence, in the summation above, we have n² - n terms with expected value μ² and n terms with expected value μ² + σ².
• Thus,
  var[μ̂] = (1/n²)[(n² - n)μ² + n(μ² + σ²)] - μ² = σ²/n
  which implies:
  E[μ̂²] = var[μ̂] + (E[μ̂])² = σ²/n + μ²
• We see that the variance of the estimate goes to zero as n goes to infinity, and our estimate converges to the true value (the error goes to zero).
Variance of the ML Estimate of the Mean
ECE 8527: Lecture 04, Slide 15
• Now we can combine these results. Recall our expression for the ML estimate of the variance:
  E[σ̂²] = E[(1/n) ∑_{i=1}^{n} (xi - μ̂)²]
• We will need one more result:
  E[(∑_{i=1}^{n} xi)²] = ∑_{i=1}^{n} ∑_{j=1}^{n} E[xi xj] = (n² - n)μ² + n(μ² + σ²) = n²μ² + nσ²
  Note that this implies:
  E[μ̂²] = (1/n²) E[(∑_{i=1}^{n} xi)²] = μ² + σ²/n
Variance Relationships
ECE 8527: Lecture 04, Slide 16
• Expand the covariance and simplify:
  E[σ̂²] = E[(1/n) ∑_{i=1}^{n} (xi - μ̂)²]
        = (1/n) ∑_{i=1}^{n} E[xi² - 2 xi μ̂ + μ̂²]
        = (1/n) ∑_{i=1}^{n} (E[xi²] - 2 E[xi μ̂] + E[μ̂²])
        = (1/n) ∑_{i=1}^{n} ((μ² + σ²) - 2 E[xi μ̂] + (μ² + σ²/n))
• One more intermediate term to derive:
  E[xi μ̂] = E[xi (1/n) ∑_{j=1}^{n} xj] = (1/n) ∑_{j=1}^{n} E[xi xj]
          = (1/n)((n - 1)μ² + (μ² + σ²)) = μ² + σ²/n
Covariance Expansion
ECE 8527: Lecture 04, Slide 17
• Substitute our previously derived expression for the second term, E[xi μ̂] = μ² + σ²/n:
  E[σ̂²] = (1/n) ∑_{i=1}^{n} ((μ² + σ²) - 2(μ² + σ²/n) + (μ² + σ²/n))
        = (1/n) ∑_{i=1}^{n} (σ² - σ²/n)
        = (1/n)(n)(1 - 1/n) σ²
        = ((n - 1)/n) σ²
Biased Variance Estimate
ECE 8527: Lecture 04, Slide 18
• Therefore the ML estimate of the variance is biased:
  E[σ̂²] = E[(1/n) ∑_{i=1}^{n} (xi - μ̂)²] = ((n - 1)/n) σ² ≠ σ²
• An unbiased estimator is:
  C = (1/(n - 1)) ∑_{i=1}^{n} (xi - μ̂)(xi - μ̂)^t
• These are related by:
  Σ̂ = ((n - 1)/n) C
  which is asymptotically unbiased. See Burl, AJWills and AWM for excellent examples and explanations of the details of this derivation.
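A minimal sketch of the relationship between the biased ML estimate and the unbiased estimator, using NumPy's ddof convention; the data are hypothetical.

import numpy as np

rng = np.random.default_rng(7)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, 0.2], [0.2, 1.0]], size=30)
n = len(X)

Sigma_ml = np.cov(X, rowvar=False, bias=True)    # divides by n (biased ML estimate)
C = np.cov(X, rowvar=False)                      # divides by n - 1 (unbiased estimator)
print(np.allclose(Sigma_ml, (n - 1) / n * C))    # True: Sigma_hat = ((n - 1)/n) C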