Parameter Estimation

Maximum Likelihood

In this chapter we will study the problem of parameter estimation. In its most general form this is as follows: we have a sample \(X_1,..,X_n\) from some probability density \(f(x;\theta)\). Here both \(x\) and \(\theta\) might be vectors. Also, we will use the term density for both the discrete and the continuous case.

The problem is to find an estimate of \(\theta\) based on the data \(X_1,..,X_n\), that is, a function (called a statistic) \(T(X_1,..,X_n)\) such that in some sense \(T(X_1,..,X_n) \approx \theta\).

Generally one also wants some idea of the accuracy of this estimate, that is, one wants to calculate its standard error. Most commonly this is then expressed as a confidence interval.

There are many ways to approach this problem; here we will only discuss the method of maximum likelihood. It works as follows. If the observations are independent, the joint density is given by

\[ f(x_1,..,x_n; \theta) = \prod_{i=1}^n f(x_i, \theta) \] and the log-likelihood function is defined by

\[ l(\theta) = \sum_{i=1}^n \log f(x_i, \theta) \] the estimate of \(\theta\) is then found by maximizing the function \(l\). Let’s call this \(\hat \theta\).
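As a minimal illustration of this recipe, here is a small sketch assuming exponentially distributed data (the sample size, seed and rate are arbitrary choices). For the exponential distribution the maximizer can also be found analytically (it is \(1/\bar{x}\)), so we can check the numerical answer:

set.seed(1)
xx <- rexp(100, rate = 2)                    # simulated data
ll <- function(lam) sum(log(dexp(xx, lam)))  # log-likelihood
optimize(ll, c(0.01, 10), maximum = TRUE)$maximum
1/mean(xx)                                   # analytic mle, should agree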

Large Sample Theory of MLEs and Confidence Intervals

One major reason for the popularity of this method is the following celebrated theorem, due to Sir R.A. Fisher: under some regularity conditions

\[ \sqrt{n} (\hat \theta -\theta) \sim N(0, \sqrt{I^{-1}}) \] where \(N(\mu, \sigma)\) is the normal distribution and \(I\) is the Fisher Information, given by

\[ I(\theta)_{ij} = -E \left[ \frac{\partial^2}{\partial \theta_i\partial \theta_j} \log f(x;\theta) \right] \]

and so it is very easy to find a \((1-\alpha)100\%\) confidence interval for (say) \(\theta_i\) as

\[ \hat \theta_i \pm z_{\alpha/2} \sqrt{I^{-1}_{ii}/n} \] where \(z_{\alpha}\) is the \(1-\alpha\) quantile of the standard normal distribution. In R this is found with

qnorm(1-0.05/2)
## [1] 1.96

if \(\alpha=0.05\).

Example: Inference for the mean of a normal distribution

\(X_1,..,X_n \sim N(\mu, \sigma)\), \(\sigma\) known.

\[ \begin{aligned} &f(x; \mu) = \frac1{\sqrt{2 \pi \sigma^2}} \exp \left\{-\frac1{2\sigma^2} (x-\mu)^2 \right\}\\ &l(\mu) = \sum \log f(x_i, \mu) = \\ & n \log(\frac1{\sqrt{2 \pi \sigma^2}}) - \frac1{2\sigma^2} \sum (x_i-\mu)^2 \\ &\frac{d}{d \mu} l(\mu) = \frac1{\sigma^2}\sum (x_i-\mu) = 0 \\ &\hat \mu = \frac1n \sum x_i \end{aligned} \]

Here we have only one parameter (\(\mu\)), so the Fisher Information is given by

\[ I(\mu) = -E\left[ \frac{d^2}{d \mu^2} \log f(x;\mu) \right] \]

and so we find

\[ \begin{aligned} & \frac{d}{d \mu} \log f(x; \mu)= \frac1{\sigma^2}(x-\mu)\\ & \frac{d^2}{d \mu^2}\log f(x;\mu)=-\frac1{\sigma^2}\\ &-E\left[ \frac{d^2}{d \mu^2}\log f(x;\mu) \right] = -E\left[-\frac1{\sigma^2} \right] =\frac1{\sigma^2}\\ &\sqrt{I(\mu)^{-1}} = \sqrt{\frac1{1/\sigma^2} } = \sigma\\ & \sqrt{n}(\hat \mu -\mu) \sim N(0, \sigma) \\ &\hat \mu \sim N(\mu, \sigma/\sqrt{n}) \end{aligned} \]

and we find the \((1-\alpha)100\%\) confidence interval to be

\[ \hat \mu \pm z_{\alpha/2}\sigma/\sqrt{n} \] this is of course the standard answer (for known \(\sigma\)).
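As a quick numerical sanity check (a sketch only; the sample size of 50 is an arbitrary choice), we can compare this analytic answer to what nlm gives, used the same way as in the examples below:

xx <- rnorm(50, 10, 1)                     # sigma = 1 known
ll <- function(mu) -sum(dnorm(xx, mu, 1, log = TRUE))
tmp <- nlm(ll, 10, hessian = TRUE)
c(tmp$estimate, mean(xx))                  # numerical vs analytic mle
c(1/sqrt(tmp$hessian[1, 1]), 1/sqrt(50))   # estimated vs exact standard error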

Example: Beta distribution

Say we have \(X_1,..,X_n \sim \text{Beta}(\alpha, \alpha)\).

Now

\[ f(x; \alpha) = \frac{\Gamma(2\alpha)}{\Gamma(\alpha)^2}[x(1-x)]^{\alpha-1} \]

where \(\Gamma(x)\) is the gamma function, defined by

\[ \Gamma(x)=\int_0^\infty t^{x-1}e^{-t}dt \]
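In R the gamma function is available as gamma(); as a quick check against the integral definition (the value 2.5 is just an arbitrary test point):

c(gamma(2.5), integrate(function(t) t^(2.5-1)*exp(-t), 0, Inf)$value)
# both should be about 1.329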

Now finding the MLE analytically would require the derivative of \(\log \Gamma (\alpha)\) (the so-called digamma function), and the resulting likelihood equation has no closed-form solution. We will have to do this numerically.

Let’s start by creating an example data set:

set.seed(111)
n <- 150
x <- rbeta(n, 2.5, 2.5)
hist(x, 10)

Now we find the MLE numerically with nlm. Since nlm minimizes a function, we work with the negative log-likelihood:

ll <- function(a) {
   -sum(log(dbeta(x, a, a)))
} 
tmp <- nlm(ll, 2.5)
mle <- tmp$estimate  
a <- seq(1, 5, length=250)
y <- rep(0, 250)
for(i in seq_along(a))
  y[i] <- sum(log(dbeta(x, a[i], a[i])))
plot(a, y, type="l")
abline(v=mle)

mle
## [1] 2.314

How about the Fisher information? Here we can’t even find the first derivative analytically, let alone the second one. We can, however, estimate it! In fact, we already have all we need.

Notice that the Fisher information is the (negative of the) expected value of the Hessian matrix of the log-density, and by the law of large numbers \(\frac1{n} \sum H \rightarrow I\). So if we replace the expected information with the observed information, that is the Hessian of the negative log-likelihood evaluated at the MLE, a \(95\%\) confidence interval is given by

hessian <- nlm(ll, 2.5, hessian = TRUE)$hessian
mle + c(-1, 1)*qnorm(1-0.05/2)/sqrt(hessian[1, 1])
## [1] 1.838 2.791

Let’s put all of this together and write a “find a confidence interval” routine:

mle.est <- 
  function(f, # density
           param, # starting value for nlm
           alpha=0.05, # desired confidence level
           rg, # range for plotting log-likelihood function
           do.graph=FALSE # TRUE if we want to look at the
                          # log-likelihood function
           ) 
{
  ll <- function(a) { # negative log-likelihood (nlm minimizes)
     -sum(log(f(a)))
  } 
  tmp <- nlm(ll, param, hessian = TRUE)
  if(do.graph) {  # if you want to see the loglikelihood curve
     a <- seq(rg[1], rg[2], length=250)
     y <- rep(0, 250)
     for(i in seq_along(a))
        y[i] <- sum(log(f(a[i])))
     plot(a, y, type="l")
     abline(v=tmp$estimate)  
  }
  if(length(param)==1) {
    ci <- tmp$estimate + c(-1, 1) *
      qnorm(1-alpha/2)/sqrt(tmp$hessian[1, 1])
    names(ci) <- c("Lower", "Upper")
  }
  else {
    I.inv <- solve(tmp$hessian) # find matrix inverse
    ci <- matrix(0, length(param), 2)
    colnames(ci) <- c("Lower", "Upper")
    if(!is.null(names(param)))
      rownames(ci) <- names(param)
    for(i in seq_along(param)) 
      ci[i, ] <- tmp$estimate[i] + 
         c(-1, 1)*qnorm(1-alpha/2)*sqrt(I.inv[i, i])
  }
  list(mle=tmp$estimate, ci=ci)
}
mle.est(f = function(a) {dbeta(x, a, a)}, 
        param = 2.5, 
        rg = c(1, 5), 
        do.graph = TRUE)

## $mle
## [1] 2.314
## 
## $ci
## Lower Upper 
## 1.838 2.791

How about the normal case, where we know the correct answer? Let’s compare them:

x <- rnorm(25, 10, 1)
c(mean(x), mean(x) + c(-1, 1)*qnorm(1-0.05/2)/sqrt(25))
## [1] 10.027  9.635 10.419
mle.est(f = function(a) {dnorm(x, a)},  
        param = 10,
        rg = c(5, 15), 
        do.graph = TRUE)

## $mle
## [1] 10.03
## 
## $ci
##  Lower  Upper 
##  9.635 10.419

And how about the multi-dimensional parameter case? First, again the check against the normal distribution:

x <- rnorm(200, 5.5, 1.8)
param <- c(5.5, 1.8)
names(param) <- c("mu", "sigma")
mle.est(function(a) {dnorm(x, a[1], a[2])}, param=param)
## $mle
## [1] 5.587 1.676
## 
## $ci
##       Lower Upper
## mu    5.355 5.819
## sigma 1.512 1.840

and now for the Beta:

x <- rbeta(200, 2.5, 3.8)
param <- c(2.5, 3.8)
names(param) <- c("alpha", "beta")
mle.est(function(a) {dbeta(x, a[1], a[2])}, param=param)
## $mle
## [1] 2.349 3.731
## 
## $ci
##       Lower Upper
## alpha 1.913 2.785
## beta  3.011 4.451

Example: Old Faithful geyser

The lengths of an eruption and the waiting time until the next eruption of the Old Faithful geyser in Yellowstone National Park have been studied many times. Let’s focus on the waiting times:

Time <- faithful$Waiting.Time
hist(Time, main="")

How can we model this data? It seems we might have a mixture of two normal distributions. Notice

c(mean(Time[Time<65]), mean(Time[Time>65]))
## [1] 54.05 80.05
c(sd(Time[Time<65]), sd(Time[Time>65]))
## [1] 5.365 5.867
sum(Time<65)/length(Time)
## [1] 0.3456

so maybe a model like this would work:

\[ 0.35N(54, 5.4) + 0.65N(80, 5.9) \]

Let’s see:

hist(Time, main="", freq=FALSE, ylim=c(0, 0.05))
curve(0.35*dnorm(x, 54, 5.4) + 0.65*dnorm(x, 80, 5.9), 40, 100, add=TRUE)

Not too bad!

Can we do better? How about fitting for the parameters?

x <- Time
f <- function(a)  
  a[1]*dnorm(x, a[2], a[3]) + (1-a[1])*dnorm(x, a[4], a[5])
res <- mle.est(f, param=c(0.35, 54, 5.4, 80, 5.9))
res
## $mle
## [1]  0.3609 54.6149  5.8712 80.0911  5.8677
## 
## $ci
##        Lower  Upper
## [1,]  0.2998  0.422
## [2,] 53.2433 55.986
## [3,]  4.8176  6.925
## [4,] 79.1024 81.080
## [5,]  5.0819  6.654

and this looks like

hist(Time, main="", freq=FALSE, ylim=c(0, 0.05))
curve(res$mle[1] * dnorm(x, res$mle[2], res$mle[3]) + 
      (1-res$mle[1])*dnorm(x, res$mle[4], res$mle[5]), 40, 100,
      add=TRUE)
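As a sanity check, one could also fit this mixture with the EM algorithm, a different algorithm for maximizing the same likelihood, for example via the normalmixEM function in the mixtools package (assuming that package is installed). The estimates should come out close to the ones we just found:

library(mixtools)
em.fit <- normalmixEM(Time, k = 2)
em.fit$lambda  # mixing weights
em.fit$mu      # component means
em.fit$sigma   # component standard deviations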

Now this sounds good, and it is. However, it relies on the sample being large enough for the large-sample theory to apply. In order to be sure ours is large enough, one usually has to do some kind of coverage study.
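Here is a minimal sketch of what such a coverage study could look like, done for the simpler one-parameter Beta(\(\alpha, \alpha\)) example from above rather than for the mixture (the number of runs B is an arbitrary choice, and this takes a little while to run):

B <- 1000; n <- 150; alpha0 <- 2.5
covered <- logical(B)
for(k in 1:B) {
  xx <- rbeta(n, alpha0, alpha0)    # simulate data with known alpha
  ll <- function(a) -sum(log(dbeta(xx, a, a)))
  tmp <- nlm(ll, 2.5, hessian = TRUE)
  ci <- tmp$estimate + c(-1, 1)*qnorm(1-0.05/2)/sqrt(tmp$hessian[1, 1])
  covered[k] <- ci[1] < alpha0 & alpha0 < ci[2]
}
mean(covered)  # should be close to the nominal 0.95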

Example: Hurricane Maria

How many people died due to Hurricane Maria when it struck Puerto Rico on September 20, 2017? Dr. Roberto Rivera and I tried to answer this question. We got the following information from the Department of Health: during the time period September 1st to September 19 there were 1582 deaths. During the period September 20 to October 31 there were 4319.

Now this means that during the time before the hurricane roughly \(1582/19\) = 83.3 people died per day, whereas in the \(42\) days after the storm it was \(4319/42\) = 102.8, a difference of about \(19.6\) per day. This would mean a total of roughly \(42\times 19.6 \approx 822\) deaths caused by Maria in this time period.
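Here is the same back-of-the-envelope calculation done in R:

before <- 1582/19  # deaths per day before the storm
after <- 4319/42   # deaths per day in the 42 days after
round(c(before, after, 42*(after-before)), 1)
# roughly 83.3, 102.8 and 822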

Can we find a 95% confidence interval? To start, the number of people who die on any one day is a Binomial random variable with n=3500000 (the population of Puerto Rico) and success(!!!) parameter \(\pi\). Apparently before the storm we had \(\pi = 83.3/3500000\). If we denote the probability to die due to Maria by \(\mu\), we find the probability model

\[ L(\pi, \mu) = \text{dbinom}(1582, 19\times3500000, \pi) \times \\ \text{dbinom}(4319, 42\times3500000, \pi+\mu) \] Let’s see:

N <- 3500000
f <- function(a) -log(dbinom(1582, 19*N, a[1])) - 
    log(dbinom(4319, 42*N, a[1]+a[2]))
nlm(f, c(1582/19/3500000, (4319/42-1582/19)/3350000), hessian = TRUE)
## $minimum
## [1] 9.862
## 
## $estimate
## [1] 2.379e-05 5.842e-06
## 
## $gradient
## [1] 0 0
## 
## $hessian
##      [,1] [,2]
## [1,] -Inf -Inf
## [2,] -Inf -Inf
## 
## $code
## [1] 1
## 
## $iterations
## [1] 1

Oops, that didn’t work. The problem is that the probabilities involved are so small that the numerical calculation of the Hessian matrix fails (note the -Inf entries).

What to do? First we can try to use the usual Poisson approximation to the Binomial:

f <- function(a) 
  -log(dpois(1582, 19*a[1])) - log(dpois(4319, 42*(a[1]+a[2])))
res <- nlm(f, c(80, 20), hessian = TRUE)
res
## $minimum
## [1] 9.707
## 
## $estimate
## [1] 83.26 19.57
## 
## $gradient
## [1] -3.840e-12 -7.261e-12
## 
## $hessian
##        [,1]   [,2]
## [1,] 0.6365 0.4084
## [2,] 0.4084 0.4084
## 
## $code
## [1] 1
## 
## $iterations
## [1] 10

and now

round(42*(res$estimate[2] + 
  c(-1, 1)*qnorm(1-0.05/2)*sqrt(solve(res$hessian)[2, 2])))
## [1]  607 1037

An even better solution is to do a bit of math:

\[ \begin{aligned} &\log \left\{ \text{dpois} (x, \lambda) \right\} = \\ &\log \left\{ \frac{\lambda^x}{x!}e^{-\lambda} \right\} = \\ &x \log(\lambda) - \log(x!) - \lambda \end{aligned} \]

The term \(\log(x!)\) does not depend on the parameters, so we can drop it from the log-likelihood:

f <- function(a) 
  -1582*log(19*a[1]) + 19*a[1] -
  4319*log(42*(a[1]+a[2])) + 42*(a[1]+a[2])
res <- nlm(f, c(80, 20), hessian = TRUE)
round(42*(res$estimate[2] + 
  c(-1, 1)*qnorm(1-0.05/2)*sqrt(solve(res$hessian)[2, 2])))
## [1]  607 1037

By the way, in the paper we used a somewhat different solution based on the profile likelihood. In this case the answers are quite similar.

The paper is here
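For illustration, here is a rough sketch of what a profile-likelihood interval for the excess death rate could look like, using the same Poisson log-likelihood as above. The grid for \(\mu\) and the search interval for the baseline rate are ad-hoc choices, and this is not necessarily the exact procedure used in the paper:

nll <- function(lam, mu)       # negative Poisson log-likelihood as above
  -1582*log(19*lam) + 19*lam - 4319*log(42*(lam+mu)) + 42*(lam+mu)
prof <- function(mu)           # profile out the baseline rate lambda
  optimize(nll, c(50, 150), mu = mu)$objective
mus <- seq(10, 30, length=500)
ps <- sapply(mus, prof)
inside <- ps - min(ps) < qchisq(0.95, 1)/2
round(42*range(mus[inside]))   # interval for total excess deaths,
                               # should be roughly comparable to the one above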

UPDATE: After a long legal fight, the Department of Health finally updated the numbers on June 1st, 2018.

Notice how in general the number of deaths is much higher in the winter than in the summer (see the plot of the monthly death counts below). So it may be best to just use the data from February to November:

deaths.before <- 2315+2494+2392+2390+2369+2367+2321+2928-1317 # Feb-Sep totals minus the 1317 deaths from Sep 20-30
deaths.after <- 1317+3040+2671 # Sep 20-30, October and November
deaths.before/231 # Daily Deaths before Maria
## [1] 79.04
deaths.after/72 # Daily Deaths after Maria
## [1] 97.61
round(72*(deaths.after/72 - deaths.before/231)) # point estimate for total deaths due to Maria
## [1] 1337
f <- function(a) 
  -deaths.before*log(231*a[1]) + 231*a[1] -
  deaths.after*log(72*(a[1]+a[2])) + 72*(a[1]+a[2])
res <- nlm(f, c(80, 20), hessian = TRUE)
round(72*(res$estimate[2] + 
  c(-1, 1)*qnorm(1-0.05/2)*sqrt(solve(res$hessian)[2, 2])))
## [1] 1153 1521
Here is a plot of the monthly death counts:

library(ggplot2)
Deaths <- c(2894, 2315, 2494, 2392, 2390, 2369, 2367, 
            2321, 2928, 3040, 2671, 2820)
ggplot(data=data.frame(x=1:12, y=Deaths), aes(x, y)) +
  geom_point()

Packages for estimation

There are a number of packages available for maximum likelihood fitting:

library(maxLik)
x <- c(1582, 4319)
f <- function(param) {
  x[1]*log(19*param[1]) - 19*param[1] +
  x[2]*log(42*(param[1]+param[2])) - 42*(param[1]+param[2])
}
maxLik(logLik=f, start=c(20, 80))
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 21 iterations
## Return code 2: successive function values within tolerance limit
## Log-Likelihood: 41906 (2 free parameter(s))
## Estimate(s): 83.18 19.65

In general these just provide wrappers for the routines mentioned above.
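For instance, base R ships the stats4 package, whose mle() function is essentially a wrapper around optim() and whose confint() method computes profile-likelihood intervals. A minimal sketch for the Beta(\(\alpha, \alpha\)) example from above (sample size and starting value as before):

library(stats4)
x <- rbeta(150, 2.5, 2.5)
minus.ll <- function(a = 2.5) -sum(dbeta(x, a, a, log = TRUE))
fit <- mle(minus.ll, start = list(a = 2.5))
summary(fit)
confint(fit)  # profile-likelihood confidence interval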