Confidence Intervals

In the last section we studied the problem of point estimation. Generally one also wants to have some idea of the accuracy of this estimate, that is, one wants to quantify its uncertainty (for example via the standard error). Most commonly this is done by finding a confidence interval.

In essence this provides bounds for the likely values of the parameter. One can control the level of “likely”.

Coverage

The defining property of a confidence interval is its coverage: the probability that, over many repeated experiments, the interval contains the true parameter. A method has correct coverage if this probability equals the nominal level \(1-\alpha\).

Sometimes this can be shown analytically:

Example: Mean of a normal distribution

A \((1-\alpha)100\%\) confidence interval for the mean \(\mu\) of a normal distribution with known standard deviation \(\sigma\) is given by

\[ \bar{X} \pm z_{\alpha/2}\sigma/\sqrt{n} \]

Here \(z_\alpha\) is the \((1-\alpha)100\%\) quantile of a standard normal distribution, that is \(\Phi(z_\alpha)=1-\alpha\). It can be found with

alpha <- 0.05
qnorm(1 - alpha)
## [1] 1.644854

Now let’s denote the density and the cumulative distribution function of a standard normal random variable by \(\phi\) and \(\Phi\), respectively. Then

\[ \begin{aligned} &P(\mu \in I)= \\ &P(\bar{X} - z_{\alpha/2}\sigma/\sqrt{n} < \mu < \bar{X} + z_{\alpha/2}\sigma/\sqrt{n})=\\ &P(\mu - z_{\alpha/2}\sigma/\sqrt{n} < \bar{X} < \mu + z_{\alpha/2}\sigma/\sqrt{n}) = \\ &P(- z_{\alpha/2}\sigma/\sqrt{n} < \bar{X}-\mu < z_{\alpha/2}\sigma/\sqrt{n}) = \\ &P(- z_{\alpha/2} < \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} < z_{\alpha/2}) = \\ &1-2\Phi(-z_{\alpha/2})=1-2(\alpha/2)=1-\alpha \end{aligned} \]
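Even though we have just verified the coverage analytically, a quick simulation provides a sanity check. This is a minimal sketch; the values of \(\mu\), \(\sigma\) and \(n\) are arbitrary choices:

B <- 10000; n <- 50; mu <- 10; sigma <- 2; alpha <- 0.05
xbar <- replicate(B, mean(rnorm(n, mu, sigma)))  # B simulated sample means
L <- xbar - qnorm(1-alpha/2)*sigma/sqrt(n)
U <- xbar + qnorm(1-alpha/2)*sigma/sqrt(n)
sum(L < mu & mu < U)/B  # should be close to 1 - alpha = 0.95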


Sometimes one has to use simulation to check coverage:

Example: Binomial proportion

Let \(X_1,..,X_n \sim Ber(\pi)\); then a \((1-\alpha)100\%\) confidence interval for \(\pi\) can be found with the binom.test command. Let’s check that this method has correct coverage. (This is actually not necessary here: binom.test implements the method of Clopper and Pearson (1934), which is exact and is guaranteed to have at least the nominal coverage by construction.)

B <- 10000  # number of simulation runs
p <- 0.5; n <- 10
ci <- matrix(0, B, 2)
x <- rbinom(B, n, p)  # simulate B binomial experiments
for(i in 1:B) {
  ci[i, ] <- binom.test(x[i], n)$conf.int
}
sum(ci[, 1]<p & p<ci[, 2])/B  # proportion of intervals that contain the true p
## [1] 0.977

Notice that \(0.977>0.95\), so this method has some over-coverage. This is not ideal, but it is acceptable, and in fact, as we will see shortly, in this case it is unavoidable.

Of course this should now be done for all (or at least many) values of \(n\) and \(\pi\), and this can take a while. In the case of a discrete random variable there is a quicker way, though. Notice that in the case above (n=10), x can take at most 11 values:

table(x)
## x
##    0    1    2    3    4    5    6    7    8    9   10 
##   10   95  463 1186 2032 2493 2032 1113  451  112   13

and in fact we even know their probabilities:

round(dbinom(0:10, n, p), 4)
##  [1] 0.0010 0.0098 0.0439 0.1172 0.2051 0.2461 0.2051 0.1172 0.0439 0.0098
## [11] 0.0010

So in the simulation the same 11 intervals are calculated over and over, 10000 times in total.

Now, if we denote the interval computed when \(x\) is observed by \((L(x), U(x))\), we have

\[ \begin{aligned} &\text{Cov}(\pi, n) = P( L(X)<\pi<U(X)) =\\ &\sum_{x=0}^n I_{(L(x),U(x))}(\pi)\text{dbinom}(x,n,\pi) \end{aligned} \]

and so we have a much faster way to find the coverage:

cov.bernoulli <- function(n, p) {
  tmp <- 0
  for(i in 0:n) {  # loop over all possible observations
    ci <- binom.test(i, n)$conf.int
    if(ci[1]<p & p<ci[2]) tmp <- tmp + dbinom(i, n, p)
  }
  tmp  # total probability of those x whose interval contains p
}
round(cov.bernoulli(10, 0.5), 3)
## [1] 0.979

and this is in fact no longer a simulation but an exact calculation!

Often we draw a coverage graph:

p <- seq(0.1, 0.9, length=100)
y <- p
for(i in 1:100)
  y[i] <- cov.bernoulli(10, p[i])
ggplot(data.frame(p=p, y=y), aes(p, y)) +
  geom_line() +
  geom_hline(yintercept = 0.95) +
  ylim(0.9, 1) 

Notice the ragged appearance of the graph, which is quite typical for coverage graphs of discrete random variables.

Notice also that the actual coverage is always a bit higher than the nominal one. In fact, it is well known that the Clopper-Pearson limits are somewhat conservative (that is, a bit wide). They are still generally a good choice because they do not rely on any approximations.
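It is also instructive to repeat this for several sample sizes. Here is a sketch that overlays the coverage graphs for a few (arbitrarily chosen) values of \(n\), reusing cov.bernoulli:

p <- seq(0.1, 0.9, length=100)
df <- NULL
for(n in c(10, 20, 50)) {
  y <- sapply(p, function(pp) cov.bernoulli(n, pp))
  df <- rbind(df, data.frame(p=p, y=y, n=n))
}
ggplot(df, aes(p, y, color=factor(n))) +
  geom_line() +
  geom_hline(yintercept = 0.95) +
  ylim(0.9, 1) +
  labs(color="n")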

Finding Confidence Intervals

There are many ways to approach this problem; here we will discuss confidence intervals based on the method of maximum likelihood.

One major reason for the popularity of this method is the following celebrated theorem, due to Sir R.A. Fisher: under some regularity conditions, asymptotically as \(n \rightarrow \infty\)

\[ \sqrt{n} (\hat \theta -\theta) \sim N(0, \sqrt{I^{-1}}) \] where \(N(\mu, \sigma)\) denotes the normal distribution parametrized by its mean and standard deviation, and \(I\) is the Fisher information of a single observation, given by

\[ I(\theta)_{ij} = -E \left[ \frac{\partial^2}{\partial \theta_i \partial \theta_j} \log f(X;\theta) \right] \]

and so it is very easy to find a \((1-\alpha)100\%\) confidence interval for (say) \(\theta_i\) as

\[ \hat \theta_i \pm z_{\alpha/2} \sqrt{I^{-1}_{ii}/n} \]

Example: Mean of a normal distribution, \(\sigma\) known.

We have \[ f(x; \mu) = \frac1{\sqrt{2 \pi \sigma^2}} \exp \left\{-\frac1{2\sigma^2} (x-\mu)^2 \right\} \]

Here we have only one parameter (\(\mu\)), so the Fisher Information is given by

\[ I(\mu) = -E\left[ \frac{d^2 \log f(X;\mu)}{d \mu^2} \right] \]

and so we find

\[ \begin{aligned} & \frac{d}{d \mu} \log f(x; \mu)= \frac1{\sigma^2}(x-\mu)\\ & \frac{d^2}{d \mu^2}\log f(x;\mu)=-\frac1{\sigma^2}\\ &-E\left[ \frac{d^2 \log f(X;\mu)}{d \mu^2} \right] = -E\left[-\frac1{\sigma^2} \right] =\frac1{\sigma^2}\\ &\sqrt{I(\mu)^{-1}} = \sqrt{\frac1{1/\sigma^2} } = \sigma\\ & \sqrt{n}(\hat \mu -\mu) \sim N(0, \sigma) \\ &\hat \mu \sim N(\mu, \sigma/\sqrt{n}) \end{aligned} \]

and we find the \((1-\alpha)100\%\) confidence interval to be

\[ \hat \mu \pm z_{\alpha/2}\sigma/\sqrt{n} \] This is of course the standard answer (for known \(\sigma\)).

Example: Binomial proportion

\[ \begin{aligned} &\log f(x;\pi)= x\log \pi +(1-x)\log (1-\pi) \\ &\frac{d \log f}{d\pi} = \frac{x}{\pi} - \frac{1-x}{1-\pi} \\ &\frac{d^2 \log f}{d\pi^2} = -\frac{x}{\pi^2} - \frac{1-x}{(1-\pi)^2} \\ &I(\pi) = - E \left[ -\frac{X}{\pi^2} - \frac{1-X}{(1-\pi)^2} \right] = \\ &\frac{EX}{\pi^2} + \frac{1-EX}{(1-\pi)^2} =\\ &\frac{\pi}{\pi^2} + \frac{1-\pi}{(1-\pi)^2} =\\ &\frac{1}{\pi} + \frac{1}{1-\pi} =\\ &\frac{1}{\pi(1-\pi)} \\ &\sqrt{I(\pi)^{-1}} = \sqrt{\pi(1-\pi)}\\ &\hat{\pi} \sim N(\pi, \sqrt{\frac{\pi(1-\pi)}{n}}) \end{aligned} \]

and so a \((1-\alpha)100\%\) confidence interval would be given by

\[ \hat{\pi} \pm z_{\alpha/2} \sqrt{\frac{\pi(1-\pi)}{n}} \] But this does not help: we don’t know \(\pi\)! The usual solution is to use a plug-in estimate:

\[ \hat{\pi} \pm z_{\alpha/2} \sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}} \] and this is the standard textbook (Wald) interval. Recall that we previously mentioned that this is NOT a good solution when \(n\) is small and \(\pi\) is close to either 0 or 1. So the method gives us a way to find an interval, but it does not guarantee that the interval is any good.

If we apply it to our previous example we find a 95% confidence interval of

phat <- 235/567
round(phat + c(-1, 1)*qnorm(0.975)*sqrt(phat*(1-phat)/567), 3)
## [1] 0.374 0.455
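To see just how badly the plug-in (Wald) interval can fail, we can compute its exact coverage the same way we did for the Clopper-Pearson interval above. Here is a sketch of a Wald version of cov.bernoulli; the case \(n=10\), \(\pi=0.9\) is just one illustrative choice:

cov.wald <- function(n, p, alpha=0.05) {
  tmp <- 0
  for(i in 0:n) {  # loop over all possible observations
    phat <- i/n
    ci <- phat + c(-1, 1)*qnorm(1-alpha/2)*sqrt(phat*(1-phat)/n)
    if(ci[1]<p & p<ci[2]) tmp <- tmp + dbinom(i, n, p)
  }
  tmp
}
round(cov.wald(10, 0.9), 3)  # about 0.65, far below the nominal 0.95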



Let’s say for the moment that we couldn’t do the math above, that is, we have a case where we can’t find the derivatives analytically. We can, however, estimate the Fisher information from the data!

Notice that the Fisher information is the (negative of the) expected value of the Hessian matrix of the log-likelihood function, so by the law of large numbers

\[ \frac1{n} \sum_{i=1}^n H(x_i;\theta) \rightarrow I(\theta) \] Now if we replace the expected information by the observed information, that is the Hessian of the negative log-likelihood of the whole sample evaluated at the MLE, we get:

binom.ll <- function(pi, y, n) {-log(dbinom(y, n, pi))}  # negative log-likelihood
fit <- nlm(binom.ll, 0.5, hessian = TRUE, y=235, n=567)
fit
## $minimum
## [1] 3.381578
## 
## $estimate
## [1] 0.4144616
## 
## $gradient
## [1] -2.178258e-06
## 
## $hessian
##          [,1]
## [1,] 2336.051
## 
## $code
## [1] 1
## 
## $iterations
## [1] 4

A \(95\%\) confidence interval is then given by

round(fit$estimate + 
        c(-1, 1)*qnorm(1-0.05/2)/sqrt(fit$hessian[1, 1]), 3)
## [1] 0.374 0.455
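As a sanity check, for the binomial the observed information should be close to the theoretical information of the whole sample evaluated at the MLE, \(n/(\hat\pi(1-\hat\pi))\):

567/(fit$estimate*(1-fit$estimate))  # about 2336, matching fit$hessian above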

Let’s put all of this together and write a “find a confidence interval” routine:

ci.mle <- 
  function(f, # density
           param, # starting value for nlm
           dta, # data 
           alpha=0.05, # desired confidence level
           rg, # range for plotting log-likelihood function
           do.graph=FALSE # TRUE if we want to look at the
                          # log-likelihood function
           ) 
{
  ll <- function(a, dta) { # log-likelihood function
     -sum(log(f(dta, a)))
  } 
  tmp <- nlm(ll, param, hessian = TRUE, dta=dta)
  if(do.graph) {  # if you want to see the loglikelihood curve
     a <- seq(rg[1], rg[2], length=250)
     y <- rep(0, 250)
     for(i in seq_along(a))
        y[i] <- (-ll(a[i], dta))
     print(ggplot(data.frame(a=a, y=y), aes(a, y)) +
       geom_line(color="blue") +
       geom_vline(xintercept = tmp$estimate))  
  }
  if(length(param)==1) {
    ci <- tmp$estimate + c(-1, 1) *
       qnorm(1-alpha/2)/sqrt(tmp$hessian[1, 1])
    names(ci) <- c("Lower", "Upper")
  }
  else {
    I.inv <- solve(tmp$hessian) # find matrix inverse
    ci <- matrix(0, length(param), 2)
    colnames(ci) <- c("Lower", "Upper")
    if(!is.null(names(param)))
      rownames(ci) <- names(param)
    for(i in seq_along(param)) 
      ci[i, ] <- tmp$estimate[i] + 
         c(-1, 1)*qnorm(1-alpha/2)*sqrt(I.inv[i, i])
  }
  list(mle=tmp$estimate, ci=ci)
}

Example: Beta\((\alpha, \alpha)\)

x <- rbeta(100, 2.5, 2.5)
ci.mle(f = function(dta, a) {dbeta(dta, a, a)}, 
        param = 2.5, 
        dta = x,
        rg = c(1, 5), 
        do.graph = TRUE)

## $mle
## [1] 2.223114
## 
## $ci
##    Lower    Upper 
## 1.664000 2.782228

Example: Normal mean

Here we know the correct answer, so we can compare them:

x <- rnorm(25, 10, 1)
round(mean(x) + c(-1, 1)*qnorm(0.975)/sqrt(25), 2)  # known sigma=1, so se = 1/sqrt(25)
## [1]  9.74 10.52
tmp <- ci.mle(f = function(dta, a) {dnorm(dta, a)},  
        param = 10,
        dta=x)
round(tmp$ci, 2)
## Lower Upper 
##  9.74 10.52

More than one parameter

How about the case of a multidimensional parameter?

Example: Normal, mean and standard deviation

x <- rnorm(200, 5.5, 1.8)
param <- c(5.5, 1.8)
names(param) <- c("mu", "sigma")
ci.mle(function(dta, a) {dnorm(dta, a[1], a[2])}, 
        param=param,
        dta=x)
## $mle
## [1] 5.442038 1.959368
## 
## $ci
##          Lower    Upper
## mu    5.170489 5.713588
## sigma 1.767306 2.151430

Example: Beta\((\alpha, \beta)\)

x <- rbeta(200, 2.5, 3.8)
param <- c(2.5, 3.8)
names(param) <- c("alpha", "beta")
ci.mle(function(dta, a) {dbeta(dta, a[1], a[2])}, 
        param=param, 
        dta=x)
## $mle
## [1] 2.835470 4.407163
## 
## $ci
##          Lower    Upper
## alpha 2.304746 3.366194
## beta  3.556065 5.258261

Example: Old Faithful geyser

f <- function(dta, a)  
  a[1]*dnorm(dta, a[2], a[3]) + (1-a[1])*dnorm(dta, a[4], a[5])
tmp <- ci.mle(f, 
               param=c(0.35, 54, 5.4, 80, 5.9),
               dta=faithful$Waiting.Time)
tmp
## $mle
## [1]  0.360886 54.614854  5.871218 80.091067  5.867735
## 
## $ci
##           Lower      Upper
## [1,]  0.2997968  0.4219752
## [2,] 53.2433113 55.9863961
## [3,]  4.8175843  6.9248510
## [4,] 79.1023581 81.0797757
## [5,]  5.0818840  6.6535865

and here is what this looks like:

bw <- diff(range(faithful$Waiting.Time))/50 
ggplot(faithful, aes(x=Waiting.Time)) +
  geom_histogram(aes(y = ..density..),
    color = "black", 
    fill = "white", 
    binwidth = bw) + 
    labs(x = "x", y = "Density") +
  stat_function(fun = f, 
                colour = "blue", 
                args=list(a=tmp$mle))



Now this sounds good, and it is; however, it relies on the sample being large enough. To be sure that ours is, one usually has to do some kind of coverage study.
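Here is a minimal sketch of such a coverage study for the Beta\((\alpha, \alpha)\) example above. The true value, the sample size and the number of runs are arbitrary choices, and with 1000 runs this takes a little while:

B <- 1000; n <- 100; a0 <- 2.5
good <- 0
for(i in 1:B) {
  x <- rbeta(n, a0, a0)
  ci <- ci.mle(f = function(dta, a) {dbeta(dta, a, a)},
               param = 2.5, dta = x)$ci
  if(ci[1] < a0 & a0 < ci[2]) good <- good + 1
}
good/B  # should be close to 0.95 if n=100 is large enough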

Example: Hurricane Maria

How many people died due to Hurricane Maria when it struck Puerto Rico on September 20, 2017? Dr. Roberto Rivera and I tried to answer this question. In late November 2017 we got the following information from the Department of Health: during the time period September 1-19 there were 1582 deaths, and during the period September 20 to October 31 there were 4319.

Now this means that during the time before the hurricane roughly \(1582/19 \approx 83.3\) people died per day, whereas in the \(42\) days after the storm it was \(4319/42 \approx 102.8\), or about \(19.6\) more per day. This would mean a total of roughly \(42\times 19.6 \approx 822\) deaths caused by Maria in this time period.

Can we find a 95% confidence interval? To start, the number of people who die on any one day is a Binomial random variable with \(n=3500000\) (the population of Puerto Rico) and success(!!!) parameter \(\pi\). Apparently before the storm we had \(\pi = 83.3/3500000\). If we denote the probability to die due to Maria by \(\mu\), we find the probability model

\[ f(x, y) = \text{dbinom}(x, 19\times3500000, \pi) \times \\ \text{dbinom}(y, 42\times3500000, \pi+\mu) \] evaluated at the observed counts \(x=1582\) and \(y=4319\).

Let’s see:

N <- 3500000
f <- function(a) -log(dbinom(1582, 19*N, a[1])) - 
    log(dbinom(4319, 42*N, a[1]+a[2]))
nlm(f, c(1582/19/3500000, (4319/42-1582/19)/3350000), hessian = TRUE)
## $minimum
## [1] 9.862462
## 
## $estimate
## [1] 2.378947e-05 5.841843e-06
## 
## $gradient
## [1] 0 0
## 
## $hessian
##      [,1] [,2]
## [1,] -Inf -Inf
## [2,] -Inf -Inf
## 
## $code
## [1] 1
## 
## $iterations
## [1] 1

Oops, that didn’t work. The problem is that the probabilities involved in the numerical calculation of the Hessian matrix become so small that they underflow, and the calculation fails.
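To see the issue (a rough illustration; the step sizes nlm actually uses internally differ), note that even a small absolute perturbation of the success probability moves the mean of the binomial by thousands of deaths, and the probability then underflows to 0, so its logarithm becomes -Inf:

dbinom(1582, 19*N, 2.4e-5)          # of moderate size, about 0.01
dbinom(1582, 19*N, 2.4e-5 + 1e-4)   # underflows to 0, so its log is -Inf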

What to do? First we can try the usual Poisson approximation to the Binomial: if \(X \sim \text{Bin}(n,\pi)\) with \(n\) “large” and \(\pi\) “small”, so that \(\lambda=n \pi\) is of moderate size, then \(X \approx \text{Pois}(\lambda)\). So our model becomes

f <- function(a) 
  -log(dpois(1582, 19*a[1])) - log(dpois(4319, 42*(a[1]+a[2])))
res <- nlm(f, c(80, 20), hessian = TRUE)
res
## $minimum
## [1] 9.706561
## 
## $estimate
## [1] 83.26316 19.57017
## 
## $gradient
## [1] -3.41348e-12 -6.35380e-12
## 
## $hessian
##           [,1]      [,2]
## [1,] 0.6365083 0.4083870
## [2,] 0.4083870 0.4084123
## 
## $code
## [1] 1
## 
## $iterations
## [1] 10

and now

round(42*(res$estimate[2] + 
  c(-1, 1)*qnorm(1-0.05/2)*sqrt(solve(res$hessian)[2, 2])))
## [1]  607 1037

An even better solution is to do a bit of math:

\[ \begin{aligned} &\log \left\{ \text{dpois} (x, \lambda) \right\} = \\ &\log \left\{ \frac{\lambda^x}{x!}e^{-\lambda} \right\} = \\ &x \log(\lambda) - \log(x!) - \lambda \\ \end{aligned} \]

f <- function(a) 
  -1582*log(19*a[1]) + 19*a[1] -
  4319*log(42*(a[1]+a[2])) + 42*(a[1]+a[2])
res <- nlm(f, c(20, 80), hessian = TRUE)
round(42*(res$estimate[2] + 
  c(-1, 1)*qnorm(1-0.05/2)*sqrt(solve(res$hessian)[2, 2])))
## [1]  607 1037

By the way, in the paper we used a somewhat different solution based on the profile likelihood. In this case the answers are quite similar.

The paper is here
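For completeness, here is a sketch of how a profile likelihood interval for the excess daily death rate could be found in this simplified Poisson model. This only illustrates the general idea and is not the analysis from the paper; the function names and the search ranges are ad-hoc choices:

# negative log-likelihood as a function of the baseline daily rate lambda,
# for a fixed excess daily rate delta (constants dropped as above)
nll <- function(lambda, delta) 
  -1582*log(19*lambda) + 19*lambda -
  4319*log(42*(lambda+delta)) + 42*(lambda+delta)
# profile log-likelihood of delta: maximize over lambda
profile.ll <- function(delta) 
  -optimize(nll, c(1, 200), delta=delta)$objective
delta.hat <- 4319/42 - 1582/19  # MLE of the excess daily rate
cutoff <- profile.ll(delta.hat) - qchisq(0.95, 1)/2
# end points of the 95% profile likelihood interval for delta
lower <- uniroot(function(d) profile.ll(d) - cutoff, c(5, delta.hat))$root
upper <- uniroot(function(d) profile.ll(d) - cutoff, c(delta.hat, 40))$root
round(42*c(lower, upper))  # interval for the total excess deaths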


Notice the term \(\pi+\mu\) above. This indicates that during the time after the storm there were two sources of deaths, the “usual” one and Maria, and that they were additive. One can often consider different parametrizations for such problems. Maybe we should have used \((1+\epsilon)\pi\), indicating an increase in the base mortality rate after the storm. What parametrization is best is generally a question that should be answered by a subject expert.
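As an illustration, here is what the \((1+\epsilon)\) parametrization would look like in the Poisson model above; f.eps is just an illustrative name and the starting values are rough guesses:

f.eps <- function(a)  # a[1] = lambda (baseline daily rate), a[2] = epsilon (relative increase)
  -1582*log(19*a[1]) + 19*a[1] -
  4319*log(42*a[1]*(1+a[2])) + 42*a[1]*(1+a[2])
res.eps <- nlm(f.eps, c(80, 0.2), hessian = TRUE)
round(res.eps$estimate, 3)  # roughly 83.3 and 0.235
# 95% confidence interval for epsilon, the relative increase in mortality
round(res.eps$estimate[2] + 
  c(-1, 1)*qnorm(0.975)*sqrt(solve(res.eps$hessian)[2, 2]), 3)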


UPDATE: After a long legal fight, on June 1st 2018 the Department of Health finally released updated monthly death counts (plotted at the end of this example).

Notice that in general the number of deaths is much higher in the winter than in the summer, so it may be best to use only the data from February to November:

deaths.before <- 2315+2494+2392+2390+2369+2367+2321+2928-1317
deaths.after <- 1317+3040+2671
round(deaths.before/231, 1) # Daily Deaths before Maria
## [1] 79
round(deaths.after/72, 1) # Daily Deaths after Maria
## [1] 97.6
round(72*(deaths.after/72 - deaths.before/231)) # point estimate for total deaths due to Maria
## [1] 1337
f <- function(a) 
  -deaths.before*log(231*a[1]) + 231*a[1] -
  deaths.after*log(72*(a[1]+a[2])) + 72*(a[1]+a[2])
res <- nlm(f, c(20, 80), hessian = TRUE)
round(72*(res$estimate[2] + 
  c(-1, 1)*qnorm(1-0.05/2)*sqrt(solve(res$hessian)[2, 2])))
## [1] 1153 1521
Here are the monthly death counts:

Deaths <- c(2894, 2315, 2494, 2392, 2390, 2369, 2367, 
            2321, 2928, 3040, 2671, 2820)
ggplot(data=data.frame(x=1:12, y=Deaths), aes(x, y)) +
  geom_point() +
  scale_x_continuous(breaks=1:12, labels=month.abb) +
  labs(x="Month", y="Deaths")

Issues with confidence intervals

It turns out that confidence intervals are not as simple a concept as it appears at first glance. They are routinely misunderstood and misinterpreted. The story of the deaths from Maria is a typical example.

In July 2018 a group of researchers from Harvard University published a study in the New England Journal of Medicine that gave a \(95\%\) confidence interval for the deaths caused by Maria of (793, 8498) (see www.nejm.org/doi/full/10.1056/NEJMsa1803972).

When this paper was published it led people to conclude that most likely about 4645 = (793+8498)/2 people died due to the storm. In essence they picked the midpoint of the interval as the most likely single number. However

  • there is no mathematical justification for choosing any one point in a confidence interval as any more likely than any other.

Some time after it was published I received a phone call from a journalist at the Washington Post. He had seen my publication and wanted to know what I (as an expert) thought of the Harvard study. I explained to him this issue of misinterpreting a confidence interval, and in his article he correctly states

The widely reported number of 4,645 is simply the midpoint and is no more or less valid than any other number in the range.

However, a little later the article shows the following graph:

and this graph does precisely what that sentence warns against: it makes the middle appear more likely than the edges!

Of course it seems entirely reasonable that the true value is more likely to be near the middle, but that is not what a confidence interval tells us. For some this is an intrinsic flaw of the whole concept of a confidence interval, and it is certainly not a good thing. Moreover, as we will soon see, there is a way to find intervals that do have this interpretation, using Bayesian analysis. However, they also come at a price.


R Libraries

There are a number of packages available for maximum likelihood fitting:

library(maxLik)
x <- c(1582, 4319)
f <- function(param) {
  x[1]*log(19*param[1]) - 19*param[1] +
  x[2]*log(42*(param[1]+param[2])) - 42*(param[1]+param[2])
}
maxLik(logLik=f, start=c(20, 80))
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 16 iterations
## Return code 2: successive function values within tolerance limit
## Log-Likelihood: 41906.11 (2 free parameter(s))
## Estimate(s): 83.26343 19.54267
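A confidence interval for the total excess deaths can then be computed from the fitted object. This assumes the package's standard accessors coef() and vcov(), which maxLik provides; fit.ml is just an illustrative name:

fit.ml <- maxLik(logLik=f, start=c(20, 80))
round(42*(coef(fit.ml)[2] + 
  c(-1, 1)*qnorm(0.975)*sqrt(vcov(fit.ml)[2, 2])))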

In general these just provide wrappers for the routines mentioned above.