ECE 8443 – Pattern Recognition
ECE 8527 – Introduction to Machine Learning and Pattern Recognition
LECTURE 04: MAXIMUM LIKELIHOOD ESTIMATION
• Objectives: Discrete Features, Maximum Likelihood, Bias in ML Estimates, Bayesian Estimation, Example
• For problems where features are discrete, densities are replaced by probabilities:
  p(x|ωj) dx  →  P(x|ωj)
• Bayes formula involves probabilities (not densities):
  P(ωj|x) = P(x|ωj) P(ωj) / P(x)
  where
  P(x) = ∑_{j=1}^{c} P(x|ωj) P(ωj)
• Bayes rule remains the same:
  α* = argmin_i R(αi|x)
• The maximum entropy distribution is a uniform distribution:
  P(x = xi) = 1/N
Discrete Features
ECE 8527: Lecture 04, Slide 3
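As a quick illustration of Bayes formula with discrete features, here is a minimal Python/NumPy sketch; the priors and the conditional probability table P(x|ωj) are hypothetical values chosen only for this example.

import numpy as np

# Hypothetical example: 3 classes and a feature x that takes one of 4 discrete values.
priors = np.array([0.5, 0.3, 0.2])                 # P(w_j)
likelihoods = np.array([[0.10, 0.20, 0.30, 0.40],  # P(x = v | w_j): rows = classes,
                        [0.40, 0.30, 0.20, 0.10],  # columns = discrete feature values
                        [0.25, 0.25, 0.25, 0.25]])

x = 2                                              # index of the observed discrete value
evidence = np.sum(likelihoods[:, x] * priors)      # P(x) = sum_j P(x|w_j) P(w_j)
posteriors = likelihoods[:, x] * priors / evidence # P(w_j|x)
print(posteriors, posteriors.sum())                # posteriors sum to 1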
• Consider independent binary features: x = (x1, ..., xd)^t, where each xi is 0 or 1, with
  pi = P(xi = 1 | ω1)  and  qi = P(xi = 1 | ω2).
• Assuming conditional independence:
  P(x|ω1) = ∏_{i=1}^{d} pi^xi (1 − pi)^(1−xi)
  P(x|ω2) = ∏_{i=1}^{d} qi^xi (1 − qi)^(1−xi)
• The likelihood ratio is:
  P(x|ω1) / P(x|ω2) = ∏_{i=1}^{d} [pi^xi (1 − pi)^(1−xi)] / [qi^xi (1 − qi)^(1−xi)]
• The discriminant function is:
  g(x) = ∑_{i=1}^{d} [ xi ln(pi/qi) + (1 − xi) ln((1 − pi)/(1 − qi)) ] + ln(P(ω1)/P(ω2))
Discriminant Functions For Discrete Features
ECE 8527: Lecture 04, Slide 4
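Since the resulting discriminant is linear in the binary features, it is easy to compute directly. A minimal sketch, assuming hypothetical values for pi, qi, and the priors:

import numpy as np

# Hypothetical parameters for d = 3 independent binary features:
# p_i = P(x_i = 1 | w1), q_i = P(x_i = 1 | w2).
p = np.array([0.8, 0.6, 0.7])
q = np.array([0.3, 0.4, 0.5])
P_w1, P_w2 = 0.5, 0.5

def g(x):
    # g(x) = sum_i [x_i ln(p_i/q_i) + (1 - x_i) ln((1-p_i)/(1-q_i))] + ln(P(w1)/P(w2));
    # decide w1 if g(x) > 0, otherwise w2.
    x = np.asarray(x, dtype=float)
    w = np.log(p / q) - np.log((1 - p) / (1 - q))            # weight on each feature
    w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P_w1 / P_w2)
    return float(x @ w + w0)

print(g([1, 1, 0]), g([0, 0, 1]))                            # positive favors w1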
• In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, P(ωi), and class-conditional densities, p(x|ωi).
• What can we do if we do not have this information? What limitations do we face?
• There are two common approaches to parameter estimation: maximum likelihood and Bayesian estimation.
• Maximum Likelihood: treat the parameters as quantities whose values are fixed but unknown.
• Bayes: treat the parameters as random variables having some known prior distribution. Observations of samples convert this prior to a posterior.
• Bayesian Learning: sharpen the a posteriori density, causing it to peak near the true value.
Introduction to Maximum Likelihood Estimation
ECE 8527: Lecture 04, Slide 5
• I.I.D.: c data sets, D1, ..., Dc, where Dj is drawn independently according to p(x|ωj).
• Assume p(x|ωj) has a known parametric form and is completely determined by the parameter vector θj (e.g., p(x|ωj) ~ N(μj, Σj), where θj = [μ1, ..., μd, σ11, σ12, ..., σdd] collects the components of μj and Σj).
• p(x|ωj) then has an explicit dependence on θj: p(x|ωj, θj).
• Use the training samples to estimate θ1, θ2, ..., θc.
• Functional independence: assume Di gives no useful information about θj for i ≠ j.
• This simplifies notation to a set D of training samples (x1, ..., xn) drawn independently from p(x|ω), which we use to estimate the parameter vector θ.
• Because the samples were drawn independently:
  p(D|θ) = ∏_{k=1}^{n} p(xk|θ)
General Principle
ECE 8527: Lecture 04, Slide 6
• p(D|θ) is called the likelihood of θ with respect to the data.
• Given several training points (see figure): Top: candidate source distributions are shown. Which distribution is the ML estimate? Middle: an estimate of the likelihood of the data as a function of θ (the mean). Bottom: the log likelihood.
• The value of θ that maximizes this likelihood, denoted θ̂, is the maximum likelihood estimate (ML) of θ.
Example of ML Estimation
ECE 8527: Lecture 04, Slide 7
• Let θ = (θ1, θ2, ..., θp)^t and let D = {x1, ..., xn}. Then:
  p(D|θ) = ∏_{k=1}^{n} p(xk|θ)
• Define the log likelihood and the ML estimate:
  l(θ) = ln p(D|θ) = ∑_{k=1}^{n} ln p(xk|θ)
  θ̂ = argmax_θ l(θ)
• The ML estimate is found by solving this equation:
  ∇θ l = ∑_{k=1}^{n} ∇θ ln p(xk|θ) = 0
• The solution to this equation can be a global maximum, a local maximum, or even an inflection point.
• Under what conditions is it a global maximum?
General Mathematics
ECE 8527: Lecture 04, Slide 8
• A class of estimators – maximum a posteriori (MAP) – maximize p(D|θ)p(θ), where p(θ) describes the prior probability of different parameter values:
  θ̂ = argmax_θ [ p(D|θ) p(θ) ]
• An ML estimator is a MAP estimator for uniform priors.
• A MAP estimator finds the peak, or mode, of a posterior density.
• MAP estimators are not transformation invariant (if we perform a nonlinear transformation of the input data, the estimator is no longer optimum in the new space). This observation will be useful later in the course.
Maximum A Posteriori Estimation
ECE 8527: Lecture 04, Slide 9
• Consider the case where only the mean, θ = μ, is unknown. The log likelihood of a single point is:
  ln p(xk|μ) = -(1/2) ln[(2π)^d |Σ|] - (1/2)(xk - μ)^t Σ⁻¹ (xk - μ)
• Setting the gradient of the log likelihood to zero:
  ∑_{k=1}^{n} ∇μ ln p(xk|μ) = 0
  which implies:
  ∇μ ln p(xk|μ) = Σ⁻¹ (xk - μ)
  because:
  ∇μ [ -(1/2) ln((2π)^d |Σ|) - (1/2)(xk - μ)^t Σ⁻¹ (xk - μ) ] = Σ⁻¹ (xk - μ)
Gaussian Case: Unknown Mean
ECE 8527: Lecture 04, Slide 10
• Substituting into the expression for the total log likelihood:
  ∇μ l = ∑_{k=1}^{n} Σ⁻¹ (xk - μ̂) = 0
• Rearranging terms:
  ∑_{k=1}^{n} (xk - μ̂) = 0
  ∑_{k=1}^{n} xk - n μ̂ = 0
  μ̂ = (1/n) ∑_{k=1}^{n} xk
• Significance??? The ML estimate of the mean is simply the sample mean of the training data.
Gaussian Case: Unknown Mean
ECE 8527: Lecture 04, Slide 11
• Let θ = [θ1, θ2]^t = [μ, σ²]^t. The log likelihood of a SINGLE point is:
  ln p(xk|θ) = -(1/2) ln(2π θ2) - (1/(2θ2))(xk - θ1)²
• The gradient is:
  ∇θ l = ∇θ ln p(xk|θ) = [ (1/θ2)(xk - θ1) ,  -1/(2θ2) + (xk - θ1)²/(2θ2²) ]^t
• The full likelihood leads to:
  ∑_{k=1}^{n} (1/θ̂2)(xk - θ̂1) = 0
  -∑_{k=1}^{n} 1/(2θ̂2) + ∑_{k=1}^{n} (xk - θ̂1)²/(2θ̂2²) = 0
Gaussian Case: Unknown Mean and Variance
ECE 8527: Lecture 04, Slide 12
• This leads to these equations:
  μ̂ = (1/n) ∑_{k=1}^{n} xk
  σ̂² = (1/n) ∑_{k=1}^{n} (xk - μ̂)²
• In the multivariate case:
  μ̂ = (1/n) ∑_{k=1}^{n} xk
  Σ̂ = (1/n) ∑_{k=1}^{n} (xk - μ̂)(xk - μ̂)^t
• The true covariance is the expected value of the matrix (xk - μ̂)(xk - μ̂)^t, which is a familiar result.
Gaussian Case: Unknown Mean and Variance
ECE 8527: Lecture 04, Slide 13
• Does the maximum likelihood estimate of the variance converge to the true value of the variance? Let’s start with a few simple results we will need later.
• Expected value of the ML estimate of the mean:
  E[μ̂] = E[(1/n) ∑_{i=1}^{n} xi] = (1/n) ∑_{i=1}^{n} E[xi] = μ
• Variance of the ML estimate of the mean:
  var[μ̂] = E[μ̂²] - (E[μ̂])²
         = E[((1/n) ∑_{i=1}^{n} xi)((1/n) ∑_{j=1}^{n} xj)] - μ²
         = (1/n²) ∑_{i=1}^{n} ∑_{j=1}^{n} E[xi xj] - μ²
Convergence of the Mean
ECE 8527: Lecture 04, Slide 14
• The expected value of xi xj, E[xi xj], will be μ² for i ≠ j (since the two random variables are independent) and μ² + σ² otherwise.
• The expected value of xi² will be μ² + σ².
• Hence, in the summation above, we have n² - n terms with expected value μ² and n terms with expected value μ² + σ².
• Thus,
  var[μ̂] = (1/n²)[(n² - n)μ² + n(μ² + σ²)] - μ² = σ²/n
  which implies:
  E[μ̂²] = var[μ̂] + (E[μ̂])² = σ²/n + μ²
• We see that the variance of the estimate goes to zero as n goes to infinity, and our estimate converges to the true value (the error goes to zero).
Variance of the ML Estimate of the Mean
ECE 8527: Lecture 04, Slide 15
• Now we can combine these results. Recall our expression for the ML estimate of the variance:
  E[σ̂²] = E[(1/n) ∑_{i=1}^{n} (xi - μ̂)²]
• We will need one more result:
  E[(∑_{i=1}^{n} xi)²] = ∑_{i=1}^{n} ∑_{j=1}^{n} E[xi xj] = (n² - n)μ² + n(μ² + σ²) = n²μ² + nσ²
  Note that this implies:
  E[μ̂²] = (1/n²) E[(∑_{i=1}^{n} xi)²] = μ² + σ²/n
Variance Relationships
ECE 8527: Lecture 04, Slide 16
• Expand the covariance and simplify:
  E[σ̂²] = E[(1/n) ∑_{i=1}^{n} (xi - μ̂)²]
        = (1/n) ∑_{i=1}^{n} E[xi² - 2 xi μ̂ + μ̂²]
        = (1/n) ∑_{i=1}^{n} (E[xi²] - 2 E[xi μ̂] + E[μ̂²])
        = (1/n) ∑_{i=1}^{n} ((μ² + σ²) - 2 E[xi μ̂] + (μ² + σ²/n))
• One more intermediate term to derive:
  E[xi μ̂] = E[xi (1/n) ∑_{j=1}^{n} xj] = (1/n) ∑_{j=1}^{n} E[xi xj]
          = (1/n)((n - 1)μ² + (μ² + σ²)) = μ² + σ²/n
Covariance Expansion
ECE 8527: Lecture 04, Slide 17
• Substitute our previously derived expression for the second term, E[xi μ̂] = μ² + σ²/n:
  E[σ̂²] = (1/n) ∑_{i=1}^{n} ((μ² + σ²) - 2(μ² + σ²/n) + (μ² + σ²/n))
        = (1/n) ∑_{i=1}^{n} (σ² - σ²/n)
        = (1/n)(n)(1 - 1/n) σ²
        = ((n - 1)/n) σ²
Biased Variance Estimate
ECE 8527: Lecture 04, Slide 18
• Therefore the ML estimate of the variance is biased:
  E[σ̂²] = E[(1/n) ∑_{i=1}^{n} (xi - μ̂)²] = ((n - 1)/n) σ² ≠ σ²
• An unbiased estimator is:
  C = (1/(n - 1)) ∑_{i=1}^{n} (xi - μ̂)(xi - μ̂)^t
• These are related by:
  Σ̂ = ((n - 1)/n) C
  which is asymptotically unbiased. See Burl, AJWills and AWM for excellent examples and explanations of the details of this derivation.
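A minimal sketch of the relationship between the biased ML estimate and the unbiased estimator, using NumPy's ddof convention; the data are hypothetical.

import numpy as np

rng = np.random.default_rng(7)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, 0.2], [0.2, 1.0]], size=30)
n = len(X)

Sigma_ml = np.cov(X, rowvar=False, bias=True)    # divides by n (biased ML estimate)
C = np.cov(X, rowvar=False)                      # divides by n - 1 (unbiased estimator)
print(np.allclose(Sigma_ml, (n - 1) / n * C))    # True: Sigma_hat = ((n - 1)/n) C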