EM (Expectation-Maximization)

The EM algorithm seems at first to solve a very specific problem, but it turns out to be quite useful in general.

Example Let's return to the mixture model we considered earlier: Y1~Beta(a,1), Y2~Beta(1,b), Z~Ber(π) and X=ZY1+(1-Z)Y2. Let's assume for the moment that in addition to the X1, .., Xn we also observe Z1, .., Zn, where Zi=1 if the observation is from group 1 and Zi=0 if it is from group 2. Let y=∑zi be the number of observations from group 1, then the complete-data log-likelihood is

log L(a,b,π) = y log(a) + (a-1)∑{zi=1} log(xi) + (n-y) log(b) + (b-1)∑{zi=0} log(1-xi) + y log(π) + (n-y) log(1-π)

(here ∑{zi=1} means the sum over those i with zi=1). Setting the derivatives with respect to a, b and π equal to 0 gives

y/a + ∑{zi=1} log(xi) = 0,   (n-y)/b + ∑{zi=0} log(1-xi) = 0,   y/π - (n-y)/(1-π) = 0

so a = -y/∑{zi=1} log(xi), b = -(n-y)/∑{zi=0} log(1-xi) and π = y/n

but this is the same as the maximum likelihood equation for X~Beta(a,1) where we only count those xi's where zi=1. In other words, if we knew the zi's, estimating a and b would be an easy problem: just find the mle's for groups 1 and 2 separately.
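As a quick illustration, here is a small sketch in R with simulated data (the variable names are made up, not part of the course routines):

# if the group labels z were observed, the mle's could be computed directly
set.seed(111)
z <- rbinom(1000, 1, 0.5)                                  # zi=1: group 1, zi=0: group 2
x <- ifelse(z == 1, rbeta(1000, 2, 1), rbeta(1000, 1, 2))  # mixture with a=2, b=2
a.hat <- -sum(z) / sum(log(x[z == 1]))                     # should be close to 2
b.hat <- -sum(1 - z) / sum(log(1 - x[z == 0]))             # should be close to 2
pi.hat <- mean(z)                                          # should be close to 0.5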

On the other hand,

E[Zi|Xi=xi] = P(Zi=1|Xi=xi) = πf1(xi)/(πf1(xi)+(1-π)f2(xi))

where f1 is the Beta(a,1) density and f2 is the Beta(1,b) density,

so if we knew the parameters a and b we could estimate each of the zi's.

Note that if X~Beta(a,1) we have f(x|a) = ax^(a-1) for 0<x<1, so log L(a) = n log(a) + (a-1)∑log(xi) and the mle is a = -n/∑log(xi); similarly if X~Beta(1,b) we have f(x|b) = b(1-x)^(b-1) and the mle is b = -n/∑log(1-xi).

Of course πf1(x)/(πf1(x)+(1-π)f2(x)) is in (0,1), not exactly 0 or 1, so we need to replace the zi's with weights wi and the formulas become

wi = πf1(xi)/(πf1(xi)+(1-π)f2(xi))
a = -∑wi / ∑wi log(xi)
b = -∑(1-wi) / ∑(1-wi) log(1-xi)
π = (1/n)∑wi

This is then the basic idea of the EM algorithm:

• in the M(aximization) step assume you know the weights w1, .., wn and estimate the parameters via their mle's
• in the E(xpectation) step use these parameters to estimate the weights w1, .., wn.

See

em1()

for an implementation. Notice a nice feature of the EM algorithm: in each iteration it finds parameter values whose likelihood is at least as high as that of the previous ones, so the algorithm can never get worse.
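em1() itself is not shown in these notes, so here is a minimal sketch of what such an EM routine might look like for this mixture (the function name, arguments and defaults are assumptions, not the actual course code):

beta.mix.em <- function(x, start = c(0.5, 2, 2), maxit = 100, tol = 1e-6) {
  p <- start[1]; a <- start[2]; b <- start[3]
  n <- length(x)
  for (k in 1:maxit) {
    # E step: weights wi = P(Zi=1 | xi) = p*f1 / (p*f1 + (1-p)*f2)
    f1 <- dbeta(x, a, 1)
    f2 <- dbeta(x, 1, b)
    w <- p * f1 / (p * f1 + (1 - p) * f2)
    # M step: weighted mle's
    a.new <- -sum(w) / sum(w * log(x))
    b.new <- -sum(1 - w) / sum((1 - w) * log(1 - x))
    p.new <- mean(w)
    if (max(abs(c(p.new - p, a.new - a, b.new - b))) < tol) break
    p <- p.new; a <- a.new; b <- b.new
  }
  c(p = p.new, a = a.new, b = b.new)
}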

How well does this work? Let's do a little checking: generate some data, use em1 to find the mle, and find the log-likelihood for some points around the mle:

x=c(rbeta(500,2,1),rbeta(500,1,2))
p=em1(x=x,start=c(0.5,2,2),Show=F)
p
em1.test(x,p[1:3])

The EM algorithm was formally introduced by Dempster, Laird and Rubin in 1977 to deal with a common problem in Statistics called censoring: say we are doing a study on the survival of patients after cancer surgery. Any such study will have a time limit after which we have to start the data analysis, but hopefully there will still be some patients who are alive. We don't know their survival times, but we do know that these survival times are greater than the time that has passed so far. We say the data is censored at time T. The number of patients with survival times >T is important information and should be used in the analysis. If we order the observations into (x1, .., xn), the uncensored observations (the survival times of those patients that have died), and (xn+1, .., xn+m), the censored data, the likelihood function can be written as

L(θ) = ∏{i=1..n} f(xi|θ) · [1-F(T|θ)]^m
because all we know of the censored data is that P(Xi>T)=1-F(T|θ)
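As a small sketch, this censored-data log-likelihood can be written as an R function (the function and argument names here are made up; the density and cdf of the model are passed in as arguments):

# log-likelihood for data censored at T: sum(log f(xi|theta)) + m*log(1 - F(T|theta))
cens.loglik <- function(theta, x, m, T, pdf, cdf) {
  sum(log(pdf(x, theta))) + m * log(1 - cdf(T, theta))
}
# for example, with an Exp(theta) model:
# cens.loglik(0.5, x, m, T, pdf = dexp, cdf = pexp)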

If we had also observed the survival times of the censored patients, say z=(zn+1, .., zn+m), we could have written the complete-data likelihood

L(θ) = ∏{i=1..n} f(xi|θ) · ∏{i=n+1..n+m} f(zi|θ)
and again we can use the EM algorithm to estimate θ:

• in the M step assume you know the zn+1, .., zn+m and estimate θ.
• in the E step use θ to estimate the zn+1, .., zn+m.

Example Say Xi~Exp(θ), we have data X1, .., Xn and we know that m observations were censored at T. Now the complete-data log-likelihood is

log L(θ) = (n+m) log(θ) - θ(∑{i=1..n} xi + ∑{i=n+1..n+m} zi)

so the mle is θ = (n+m)/(∑xi + ∑zi) = 1/mean({x1, .., xn, zn+1, .., zn+m}), and for the E step

E[Z|Z>T] = ∫{T..∞} zθe^(-θz) dz / P(Z>T) = e^(-θT)(T+1/θ) / e^(-θT) = T + 1/θ
Of course E[Z|Z>T]=T+1/θ also follows immediately from the memoryless property of the exponential!
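A quick Monte Carlo check of this formula (θ and T chosen arbitrarily):

# check E[Z|Z>T] = T + 1/theta by simulation
theta <- 0.5; T <- 2
z <- rexp(1e6, rate = theta)
mean(z[z > T])        # should be close to T + 1/theta = 4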

So the EM algorithm proceeds as follows:

• in the M step assume you know the zn+1, .., zn+m and estimate θ = 1/mean({x1, .., xn, zn+1, .., zn+m}).
• in the E step use θ to estimate the zn+1, .., zn+m = T + 1/θ.

See

em2()

for an implementation.
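em2() is also not reproduced here; a minimal sketch of what such a routine might do for this example (again the names and defaults are assumptions, not the actual course code):

cens.exp.em <- function(x, m, T, maxit = 100, tol = 1e-8) {
  # x: uncensored observations, m: number of observations censored at T
  n <- length(x)
  theta <- 1 / mean(x)                      # starting value, ignoring the censored data
  for (k in 1:maxit) {
    # E step: replace every censored observation by E[Z|Z>T] = T + 1/theta
    z <- rep(T + 1 / theta, m)
    # M step: theta = 1/mean of the completed data
    theta.new <- (n + m) / (sum(x) + sum(z))
    if (abs(theta.new - theta) < tol) break
    theta <- theta.new
  }
  theta.new
}
# example: generate Exp(0.5) data censored at T=3, then run the algorithm
# set.seed(1); y <- rexp(1000, 0.5); T <- 3
# cens.exp.em(x = y[y < T], m = sum(y >= T), T = T)   # should be close to 0.5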

In the mixture problem above the EM algorithm is applied to the case where every observation is "censored", because we don't know any of the zi's!