
Stat 451 Lecture Notes 04: EM Algorithm

Ryan Martin, UIC
www.math.uic.edu/~rgmartin

Based on Ch. 4 in Givens & Hoeting and Ch. 13 in Lange.
Updated: March 9, 2016.


Outline

1 Problem and motivation

2 Definition of the EM algorithm

3 Properties of EM

4 Examples

5 Estimating standard errors

6 Different versions of EM

7 Summary


Notion of “missing data”

Let X denote the observable data and θ the parameter to be estimated.

The EM algorithm is particularly suited for problems in which there is a notion of "missing data".

The missing data can be actual data that is missing, or some "imaginary" data that exists only in our minds (and so is necessarily missing).

The point is that IF the missing data were available, then finding the MLE for θ would be relatively straightforward.


Notation

Again, X is the observable data.

Let Y denote the complete data.³

Usually we think of Y as being composed of observable data X and missing data Z, that is, Y = (X, Z).

Perhaps more generally, we think of the observable data X as a sort of projection of the complete data, i.e., "X = M(Y)".

This suggests a notion of marginalization...

The basic idea behind the EM algorithm is to iteratively impute the missing data.

³This is the notation used in G&H which, as they admit, is not standard in the EM literature.


Example – mixture model

Here is an example where the “missing data” is not real.

Suppose X = (X1, . . . , Xn) consists of iid samples from the mixture

αN(µ1, 1) + (1 − α)N(µ2, 1),

where θ = (α, µ1, µ2) is to be estimated.

IF we knew which of the two groups Xi was from, then it would be straightforward to get the MLE for θ, i.e., just calculate the group means.

The missing part Z = (Z1, . . . , Zn) is the group label, i.e.,

Zi = 1 if Xi ∼ N(µ1, 1), and Zi = 0 if Xi ∼ N(µ2, 1),   i = 1, . . . , n.

2. Definition of the EM algorithm

More notation

Complete data Y = (X, Z) splits into the observed data X and missing data Z.

The complete data likelihood θ ↦ LY(θ) is the joint distribution of (X, Z), viewed as a function of θ.

The observed likelihood θ ↦ LX(θ) is obtained by marginalizing the joint distribution of (X, Z).

The conditional distribution of Z, given X, is an essential piece: θ ↦ LZ|X(θ).

Though the same notation "L" is used for all the likelihoods, it should be clear that these are all distinct functions of θ.


Example – mixture model (cont.)

Complete data Y = (Y1, . . . , Yn), where each Yi consists of the observed data Xi with the missing group label Zi.

Observed data likelihood is

LX(θ) = ∏_{i=1}^n {αN(Xi | µ1, 1) + (1 − α)N(Xi | µ2, 1)};

not a nice function—the sum is inside the product.

Complete data likelihood is much nicer—write it out!

The conditional distribution of Z, given X, is determined by the conditional probabilities

Pθ(Zi = 1 | Xi) = αN(Xi | µ1, 1) / {αN(Xi | µ1, 1) + (1 − α)N(Xi | µ2, 1)}.
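The slides don't spell out the full EM for this example, but a minimal R sketch of it, with the E-step given by the conditional probabilities above and the M-step by a weighted mixing proportion and weighted group means, might look as follows (the function name em_mix, the starting values, and the stopping rule are mine, not from the course materials):

em_mix <- function(x, theta0 = c(0.5, min(x), max(x)), maxit = 500, tol = 1e-8) {
  alpha <- theta0[1]; mu1 <- theta0[2]; mu2 <- theta0[3]
  for (t in 1:maxit) {
    # E-step: P(Zi = 1 | Xi) under the current guess (the display above)
    p1 <- alpha * dnorm(x, mu1, 1)
    p  <- p1 / (p1 + (1 - alpha) * dnorm(x, mu2, 1))
    # M-step: mixing proportion and weighted group means
    alpha_new <- mean(p)
    mu1_new   <- sum(p * x) / sum(p)
    mu2_new   <- sum((1 - p) * x) / sum(1 - p)
    if (max(abs(c(alpha_new - alpha, mu1_new - mu1, mu2_new - mu2))) < tol) break
    alpha <- alpha_new; mu1 <- mu1_new; mu2 <- mu2_new
  }
  c(alpha = alpha_new, mu1 = mu1_new, mu2 = mu2_new)
}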


EM formulation

The EM algorithm works with a new function:

Q(θ′ | θ) = Eθ{log LY(θ′) | X},

the conditional expectation of the complete data log-likelihood, at θ′, given X and the particular value θ.

Implicit in this expression is that, given X, the only "random" part of Y is the missing data Z.

So, in this expression, the expectation is actually with respect to Z, given X, i.e.,

Q(θ′ | θ) = ∫ log{L(X,z)(θ′)} Lz|X(θ) dz.


EM formulation (cont.)

The EM algorithm alternates between computing Q(θ′ | θ), which involves an expectation, and maximizing it.

Start with a fixed θ(0).

At iteration t ≥ 1 do:

E-step. Evaluate Qt(θ) := Q(θ | θ(t−1)).
M-step. Update θ(t) = arg maxθ Qt(θ).

Repeat these steps until practical convergence is reached.


A super-simple example

Goal is to maximize the observed data likelihood.

But EM iteratively maximizes some other function, so it's not clear that we are doing something reasonable.

Before we get to theory, it helps to consider a simple example to see that EM is doing the right thing.

Y = (X, Z), where X, Z iid ∼ N(θ, 1), but Z is missing.

Observed data MLE: θ̂ = X.

The Q function in the E-step is

Q(θ | θ(t)) = −½{(θ − X)² + (θ − θ(t))²}.

Find the M-step update—what should happen as t → ∞?

3. Properties of EM

Ascent property

The claimed ascent property of EM is as follows:

LX (θ(t+1)) ≥ LX (θ(t)), ∀ t.

To prove this, we first need a simple identity involving joint, conditional, and marginal densities:

log fV(v) = log fU,V(u, v) − log fU|V(u | v).

The next general fact is the non-negativity of relative entropy or Kullback–Leibler divergence:

∫ log{p(x)/q(x)} p(x) dx ≥ 0, with equality iff p = q.

Follows from Jensen's inequality, since y ↦ −log y is convex.


Ascent property (cont.)

Using the density identity, we can write

log LX (θ) = log LY (θ)− log LZ |X (θ).

Taking expectations wrt Z, given X and θ(t), gives

log LX(θ) = Q(θ | θ(t)) − H(θ | θ(t)),

where H(θ | θ(t)) = Eθ(t){log LZ|X(θ) | X}.

It follows from non-negativity of KL that

H(θ(t) | θ(t))− H(θ | θ(t)) ≥ 0, ∀ θ.


Ascent property (cont.)

Key observation: picking θ(t+1) such that

Q(θ(t+1) | θ(t)) ≥ Q(θ(t) | θ(t))

will increase both terms in the expression for LX (·).

So maximizing Q(· | θ(t)) in the M-step will result in updates with the desired ascent property:

LX (θ(t+1)) ≥ LX (θ(t)), ∀ t.
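Spelled out, using the decomposition log LX(θ) = Q(θ | θ(t)) − H(θ | θ(t)) from the previous slide,

log LX(θ(t+1)) − log LX(θ(t)) = {Q(θ(t+1) | θ(t)) − Q(θ(t) | θ(t))} + {H(θ(t) | θ(t)) − H(θ(t+1) | θ(t))},

and both braced terms are ≥ 0: the first by the choice of θ(t+1) in the M-step, the second by the Kullback–Leibler inequality above.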

This does not imply that the EM updates will necessarily converge to the MLE, just that they are surely moving in the right direction.


Further properties

One can express the EM updates through an abstract mapping Ψ, i.e., θ(t+1) = Ψ(θ(t)).

If EM converges to θ̂, then θ̂ must be a fixed point of Ψ.

Do a Taylor approximation of Ψ(θ(t)) near θ̂:

θ(t+1) − θ̂ = Ψ(θ(t)) − Ψ(θ̂) ≈ Ψ′(θ(t))(θ(t) − θ̂).

If the parameter is one-dimensional, then the order of convergence can be seen to be Ψ′(θ̂), provided that θ̂ is a (local) maximum.


EM for exponential family models

Recall that a model/joint distribution Pθ for data Y is a natural exponential family if the log-likelihood is of the form

log LY(θ) = const + log a(θ) + θ⊤s(y),

where s(y) is the "sufficient statistic."

For problems where the complete data Y is modeled as an exponential family, EM takes a relatively simple form.

This is an important case since many examples involve exponential families.


EM for exponential family models (cont.)

For exponential families, the Q function looks like

Q(θ | θ(t)) = const + log a(θ) + ∫ θ⊤s(y) Lz|X(θ(t)) dz.

To maximize this, take derivative wrt θ and set to zero:

−a′(θ)/a(θ) = ∫ s(y) Lz|X(θ(t)) dz.

From Stat 411, you know that the left-hand side is Eθ{s(Y)}. Let s(t) be the right-hand side.

M-step updates θ(t) → θ(t+1) by solving the equation

Eθ{s(Y)} = s(t).


EM for exponential family models (cont.)

E-step. Compute s(t) based on guess θ(t).

M-step. Update guess to θ(t+1) by solving the equation

Eθ{s(Y )} = s(t).

4. Examples

Example 1 – censored exponential model

Complete data Y1, . . . , Yn iid ∼ Exp(θ), rate parameterization.

Complete data log-likelihood:

log LY(θ) = n log θ − θ Σ_{i=1}^n Yi,   with sufficient statistic s(Y) = Σ_{i=1}^n Yi.

Suppose some observations are right-censored, i.e., only a lower bound is observed.

Write observed data as pairs (Xi, δi), where

Xi = min(Yi, ci)   (the ci's are non-random),
δi = I{Xi = Yi}.

Missing data Z consists of the actual event times for the censored observations.


Example 1 – censored exponential model (cont.)

For EM, we first need to compute s(t)...

Only censored cases are of concern.

If an observation Yi is right-censored at ci, then we know that ci is a lower bound.

Recall that the exponential distribution has the memoryless property.

So, the E-step of the EM requires

s(t) = Σ_{i=1}^n [δi Xi + (1 − δi) Eθ(t){Yi | censored}]
     = Σ_{i=1}^n [δi Xi + (1 − δi)(Xi + 1/θ(t))]
     = n X̄ + (1/θ(t)) Σ_{i=1}^n (1 − δi).


Example 1 – censored exponential model (cont.)

Clearly, Eθ{s(Y )} = n/θ.

So, the M-step requires that we solve for θ in

n X̄ + (1/θ(t)) Σ_{i=1}^n (1 − δi) = n/θ.

In particular, the EM update in this case is

θ(t+1) = {X̄ + (1/θ(t)) · (1/n) Σ_{i=1}^n (1 − δi)}⁻¹.

Iterate this update till convergence.
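A minimal R sketch of this update (not the course's posted code; the function name em_cens_exp, the seed, and the stopping rule are mine, with simulation settings matching the example on the next slide):

em_cens_exp <- function(x, d, theta0, maxit = 100, tol = 1e-10) {
  # x = observed values X_i, d = indicators delta_i (1 = uncensored)
  theta <- theta0
  for (t in 1:maxit) {
    theta_new <- 1 / (mean(x) + mean(1 - d) / theta)   # the EM update above
    if (abs(theta_new - theta) < tol) { theta <- theta_new; break }
    theta <- theta_new
  }
  theta
}

# e.g., data as on the next slide: n = 30, theta = 3, censoring at 0.632
set.seed(451)                                  # arbitrary seed
y <- rexp(30, rate = 3); cens <- 0.632
x <- pmin(y, cens); d <- as.numeric(y <= cens)
em_cens_exp(x, d, theta0 = 7)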


Example 1 – censored exponential model (cont.)

Simulated data: n = 30, θ = 3, censored at 0.632. Picture below shows the observed data likelihood and the EM steps starting at θ(0) = 7.

[Figure: observed data likelihood LX(θ) plotted against θ, with the EM steps from θ(0) = 7 marked.]

Example 2 – probit regression

Recall the probit regression model: Xi ∼ Ber(Φ(ui⊤θ)).

We can use the EM algorithm to easily get the MLE of θ.

Write the complete data as Y = (Y1, . . . , Yn), where Yi ∼ N(ui⊤θ, 1), and

Xi = 1 if Yi > 0, and Xi = 0 if Yi ≤ 0.

Exercise: Check that Xi defined in this way has the same distribution as that given by the probit model...

Basically, we observe the sign of the complete data, but the actual values are missing.


Example 2 – probit regression (cont.)

The complete-data problem is easy, just a normal linear regression with known variance, i.e., an exponential family.

s(Y) = U⊤Y, where U is the design matrix.

Observed data tells us the sign of Yi, so the conditional expectation is that of a truncated normal distribution:⁴

Eθ(t)(Yi | Xi) = µi(t) + wi φ(µi(t)) / Φ(wi µi(t)),   where wi = 2Xi − 1 and µi(t) = ui⊤θ(t);

write vi(t) for the correction term wi φ(µi(t)) / Φ(wi µi(t)).

This completes the E-step; the M-step requires solving

U⊤Uθ = U⊤Uθ(t) + U⊤v(t),

where the left-hand side is Eθ{s(Y)} and the right-hand side is s(t).

⁴http://en.wikipedia.org/wiki/Truncated_normal_distribution
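Solving the M-step equation above gives θ(t+1) = θ(t) + (U⊤U)⁻¹U⊤v(t). A minimal R sketch of the resulting iteration (this is not the course's posted code; the name em_probit and the stopping rule are mine):

em_probit <- function(X, U, theta0, maxit = 200, tol = 1e-8) {
  # X = 0/1 responses, U = n x p design matrix
  theta <- theta0
  for (t in 1:maxit) {
    mu <- drop(U %*% theta)                    # mu_i^(t) = u_i' theta^(t)
    w  <- 2 * X - 1                            # w_i = 2 X_i - 1
    v  <- w * dnorm(mu) / pnorm(w * mu)        # E-step: truncated-normal correction
    theta_new <- drop(theta + solve(crossprod(U), crossprod(U, v)))   # M-step
    if (max(abs(theta_new - theta)) < tol) { theta <- theta_new; break }
    theta <- theta_new
  }
  theta
}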


Example 2 – probit regression (cont.)

Simulated data: n = 50; intercept θ1 = 0, slope θ2 = 1; predictor variables iid N(0, 4²). Plot below shows data and EM fitted probit regression line.

[Figure: simulated data with the EM-fitted probit regression curve.]

Example 3 – robust regression

Consider the linear model yi = xi⊤β + σεi.

Least-squares estimators, based on normal errors, are sensitive (not robust) to "outlier" observations.

Remedy: fit a model with heavier-than-normal tails.

One approach to robust regression is to model ε with a Student-t distribution with small degrees of freedom.

This model can be fit with standard optimization tools, but a clever approach makes application of EM quite simple.

Key observation: Student-t is a scale mixture of normals,

f(ε) = ∫ N(ε | 0, ν/z) ChiSq(z | ν) dz.


Example 3 – robust regression (cont.)

For simplicity, we assume that ν = df is known.

Think of the Zi values implicitly attached to the Student-t error distribution for εi as "missing data."

If we knew Z = (Z1, . . . , Zn), then the problem would be just a simple modification of the basic normal model...

For θ = (β, σ), the complete data log-likelihood is

log LY(θ) = Σ_{i=1}^n log N(yi − xi⊤β | 0, νσ²/Zi).

E-step requires expectation wrt the conditional distribution of Z, given data and a guess θ(t)...


Example 3 – robust regression (cont.)

It can be shown (HW?) that the conditional distribution of Zi, given observed data and a guess θ(t), is

{(1/ν)((yi − xi⊤β(t))/σ(t))² + 1}⁻¹ × ChiSq(ν + 1),   i = 1, . . . , n.

Then the Q function in the E-step is obtained by plugging in the conditional expectation of Zi, i.e.,

Q(θ | θ(t)) = −(n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n wi(t) (yi − xi⊤β)²,

where

wi(t) = (ν + 1) {((yi − xi⊤β(t))/σ(t))² + ν}⁻¹,   i = 1, . . . , n.

M-step is equivalent to a weighted least squares problem...
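A minimal R sketch of the resulting iteration, for a single predictor x and known df ν (not the course's posted code; the name em_treg, the starting values, and the stopping rule are mine):

em_treg <- function(y, x, nu = 4, maxit = 200, tol = 1e-8) {
  X <- cbind(1, x)                              # design matrix with intercept
  beta  <- coef(lm(y ~ x))                      # start from least squares
  sigma <- sd(residuals(lm(y ~ x)))
  for (t in 1:maxit) {
    r <- drop(y - X %*% beta)
    w <- (nu + 1) / ((r / sigma)^2 + nu)        # E-step: weights w_i^(t)
    fit <- lm.wfit(X, y, w)                     # M-step: weighted least squares for beta
    beta_new  <- fit$coefficients
    sigma_new <- sqrt(sum(w * fit$residuals^2) / length(y))   # M-step for sigma
    if (max(abs(c(beta_new - beta, sigma_new - sigma))) < tol) {
      beta <- beta_new; sigma <- sigma_new; break
    }
    beta <- beta_new; sigma <- sigma_new
  }
  list(beta = beta, sigma = sigma)
}

For the Belgian phone data on the next slide, the call would be along the lines of data(phones, package = "MASS") followed by em_treg(phones$calls, phones$year).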


Example 3 – robust regression (cont.)

Belgian phone call data – in R’s MASS library.

Compare fit of LS versus Student-t via EM (df=4).

[Figure: Belgian phone call data, calls (in millions) versus year, with the LS and EM (Student-t) fitted lines.]

5. Estimating standard errors

Challenge

The EM algorithm is designed to return the MLE θ̂.

It does not, however, say anything about standard errors.

Recall that if we run, say, "BFGS" via the function optim in R, then we can request that the Hessian at the MLE be returned, which can be used to approximate the standard errors of θ̂.

The challenge is that the EM doesn't work directly with the observed data log-likelihood.

Question: how to compute standard errors within EM?


Analytical calculation

Of course, if we can write down a formula for the negative second derivative of log LX(θ) at θ = θ̂, or the Fisher information I(θ̂), then we have an estimator of the standard errors.

For the probit regression model (Example 2 above), we have a formula for the Fisher information:

In(θ) = Σ_{i=1}^n [φ(ui⊤θ)² / (Φ(ui⊤θ){1 − Φ(ui⊤θ)})] ui ui⊤.

So, we can just plug our MLE θ̂ from the EM into this formula and (numerically) invert the matrix.

Can also numerically differentiate − log LX (θ)...
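A minimal R sketch of this plug-in calculation for the probit example (the function name probit_se is mine):

probit_se <- function(theta_hat, U) {
  eta <- drop(U %*% theta_hat)
  w   <- dnorm(eta)^2 / (pnorm(eta) * (1 - pnorm(eta)))
  I_n <- crossprod(U * w, U)          # sum_i w_i u_i u_i' = Fisher information at theta_hat
  sqrt(diag(solve(I_n)))              # approximate standard errors
}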


Bootstrap

We will talk in more detail about the bootstrap later, but here is a little taste of the main idea.

We want to estimate the variance of θ̂, as computed by EM, but it's hard because we only have one sample of θ̂.

If we had many samples/copies of θ̂, then the variance would be easy to calculate.

How to get multiple copies of θ̂?

The bootstrap principle is to resample (with replacement) from the observed data X = (X1, . . . , Xn), many times.


Bootstrap (cont.)

Fix a large number B.

For b = 1, . . . ,B, compute an estimate θ̂b as follows:

Sample X*b = (X*b1, . . . , X*bn) with replacement from the observed data X = (X1, . . . , Xn).
Compute θ̂b by applying the EM algorithm to X*b.

Estimate the variance of the MLE θ̂ with just the sample variance (covariance) of θ̂1, . . . , θ̂B.

Motivation for this idea comes from the fact that the empirical distribution for X ought to be similar to the true sampling model, at least for large n.

May be expensive in the EM context because it requires B separate EM runs...
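A minimal R sketch of this scheme, where em_fit is a placeholder for whatever EM routine returns θ̂ from a data vector (the name boot_se is mine):

boot_se <- function(x, em_fit, B = 200) {
  est <- replicate(B, {
    xb <- sample(x, replace = TRUE)   # resample the observed data
    em_fit(xb)                        # re-run EM on the bootstrap sample
  })
  if (is.matrix(est)) apply(est, 1, sd) else sd(est)   # per-parameter standard errors
}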


Other methods

Numerical differentiation of the score function ∂/∂θ log LX(θ) at θ = θ̂, where θ̂ is the EM solution.

Supplemented EM (SEM) uses multiple EM runs, apparently more stable than numerical differentiation.

Louis's method looks interesting (based on a missing information principle: "iX(θ) = iY(θ) − iZ|X(θ)") but involves some specialized computations.

Using the empirical information seems attractive: based on the idea that Fisher information is the variance of the score.

6. Different versions of EM

Considerations

In each of the two main steps of the EM algorithm, there are potentially some non-trivial computations involved.

In the E-step, an expectation is required and, in general, this cannot be done analytically.

Similarly, in the M-step, optimization is required and, in general, this cannot be done analytically.

We know that both integration and optimization can be done numerically, but there are concerns about efficiency, i.e., nested loops.

So, in general, there are questions about how to efficiently design EM algorithms.


Modifying the E-step

In the E-step, we need to compute an expectation with respect to the conditional distribution of Z, given X.

In some cases, this boils down to several one-dimensional integrals, which we could possibly do with quadrature.

An alternative is to replace numerical integration with Monte Carlo (more on this general approach later).

This is attractive but may also be expensive, due to having to run Monte Carlo at every E-step in the EM. It gives EM some kind of Bayesian flavor...

If you haven't noticed, EM folks like acronyms, so the Monte Carlo EM is called MCEM.


Modifying the M-step

In the M-step, we need to maximize Q(θ | θ(t)) wrt θ.

If not doable analytically, then we can consider any one of those numerical optimization routines considered previously.

The concern is that many numerical optimizations may be expensive.

Other ideas:

Maximize Q one component at a time: the ECM algorithm.
Do only one iteration of Newton at each M-step: the EM gradient algorithm.


One specific extension – PX-EM

Ordinary EM is based on the idea that computations can be simplified IF some "missing data" were known.

The EM is a very powerful tool, but often suffers from slow convergence.

A counter-intuitive idea is to consider introducing more parameters to speed up convergence.

This is called PX-EM, where PX = "parameter expansion".

The PX-EM enjoys the same ascent property as EM, but its rate of convergence is no slower.


PX-EM (cont.)

Treat the parameter θ as a (not one-to-one) function of (ψ, α).

The intuition is that the original model corresponds to the case where α is held fixed at some specified value α0, i.e., θ = f(ψ, α0).

Start with the complete-data log-likelihood for (ψ, α), which we write as log LY(ψ, α).

For exponential families, this will be a linear function of the sufficient statistics for the expanded (ψ, α)-model.

Then we can proceed to iterate between conditional expectation and maximization.

There is a slight difference in the PX E-step, however.


PX-EM (cont.)

At iteration t, suppose we have (ψ(t), α(t)), which defines θ(t).

PX E-step. Set Q(ψ, α | ψ(t), α0), the conditional expectation of the complete-data log-likelihood, but note that we use α0 instead of the current guess α(t).

PX M-step. Maximize Q to get (ψ(t+1), α(t+1)), and compute θ(t+1) = f(ψ(t+1), α(t+1)).

The advantage of this PX version of EM is that it improves the M-step by using extra information in the enlarged model.


Example 2 – probit regression with PX-EM

Recall the probit regression model: Xi ∼ Ber(Φ(ui⊤θ)).

The complete data in this model is Yi ∼ N(ui⊤θ, 1).

Expand the parameter θ by introducing a variance parameter,

Yi ∼ N(ui⊤ψ, α²),   with α0 = 1.

Now, the sufficient statistics for the complete-data model are

s(Y) = (U⊤Y, Y⊤Y).

Properties (mean and variance) of the truncated normal distribution help us to carry out the PX E-step.

PX M-step is pretty straightforward, like for the ordinary EM.

See R code for the details.
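That code isn't reproduced here, but a hypothetical sketch of one PX-EM sweep, using E(Yi | Xi) from the ordinary EM and the truncated-normal second moment E(Yi² | Xi) = 1 + µi E(Yi | Xi), might look like the following (the name pxem_probit_step and the M-step algebra are my own reconstruction, not the posted code):

pxem_probit_step <- function(X, U, theta) {
  mu  <- drop(U %*% theta)
  w   <- 2 * X - 1
  ey  <- mu + w * dnorm(mu) / pnorm(w * mu)     # E(Y_i | X_i) at the current theta
  ey2 <- 1 + mu * ey                            # E(Y_i^2 | X_i)
  # PX M-step in the expanded model Y_i ~ N(u_i' psi, alpha^2)
  psi    <- solve(crossprod(U), crossprod(U, ey))
  alpha2 <- (sum(ey2) - crossprod(ey, U %*% psi)) / length(X)
  drop(psi) / sqrt(drop(alpha2))                # reduce back to theta = psi / alpha
}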

7. Summary

Remarks

The EM algorithm is a nice tool for maximizing non-standard likelihood functions, particularly in cases where there is some notion of "missing data."

Requires some effort to derive the E- and M-steps.

There is a huge literature⁵ on EM, and mixture models (HW?) and censored-data problems are important applications.

The idea of sort of "randomly imputing" missing values is clever and has some Bayesian flavor: data augmentation...

The main challenge with EM is that its convergence may be slow, but there are some remedies available.

Interesting question: can EM be parallelized?

⁵As of today, the original EM paper (Dempster, Laird, and Rubin, JRSS-B 1977) has been cited over 44,000 times!
