In real life, point estimates are rarely enough; we usually also need some estimate of the error in our estimate.
A census of all the students at the Colegio 10 years ago showed a mean GPA of 2.75. In our survey of 150 students today we find a mean GPA of 2.53. How much (if at all) has the GPA changed?
The problem, of course, is that the sample mean GPA depends on the sample: if we repeated our survey tomorrow with a different sample of 150 students, their mean GPA would not again be 2.53. But how far away from 2.53 might it be? Could it actually be higher than 2.75?
One way to answer such questions is to find an interval estimate rather than a point estimate.
Say we have \(X_1, ..., X_n\) iid \(f(x|\theta)\). Then \(\left( L(\pmb{X}), U(\pmb{X}) \right)\) is a \(100(1-\alpha)\%\) confidence interval for \(\theta\) iff
\[ P\left(L(\pmb{X})<\theta< U(\pmb{X}) \right) \ge 1-\alpha \] for all \(\theta\).
Note: in a confidence interval it is the endpoints that are random variables!
Say \(X_1, .., X_n \sim N(\mu, \sigma)\); then a \(100(1-\alpha)\%\) confidence interval for the population mean is given by
\[ \bar{x} \pm t_{n-1,\alpha/2}\frac{s}{\sqrt{n}} \]
Here \(t_{n-1,\alpha}\) is the \(1-\alpha\) critical value of a t distribution with \(n-1\) degrees of freedom, qt(1-alpha, n-1) in R.
Note that this interval is given in the form point estimate \(\pm\) error, which is quite often true in Statistics, although not always.
We know that \(\sqrt{n}\frac{\bar{X}-\mu}{s}\sim t(n-1)\). Denote the cdf of a t(n-1) distribution by F; then
\[ \begin{aligned} &P\left(\bar{X} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}<\mu< \bar{X} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}\right) = \\ &P\left(\mu- t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}<\bar{X}< \mu + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}\right) = \\ &P\left(-t_{n-1,\alpha/2}<\sqrt{n}\frac{\bar{X}-\mu}{s}< t_{n-1,\alpha/2}\right) = \\ &2P\left(\sqrt{n}\frac{\bar{X}-\mu}{s}< t_{n-1,\alpha/2}\right) -1 = \\ &2F(t_{n-1,\alpha/2})-1 = \\ &2F(F^{-1}(1-\alpha/2))-1 = \\ &2(1-\alpha/2)-1 =1-\alpha \\ \end{aligned} \]
As a numerical example consider the GPA survey described above:
n <- 150; xbar <- 2.53; s <- 0.65; alpha <- 0.05
round(xbar+c(-1,1)*qt(1-alpha/2, n-1)*s/sqrt(n), 3)
## [1] 2.425 2.635
What does that mean: a 95% confidence interval for the mean is (2.425, 2.635)? The interpretation is this: suppose that over the next year statisticians (and other people using statistics) all over the world compute 100,000 95% confidence intervals, many for means, others maybe for medians or standard deviations or …, then about 95% or about 95,000 of those intervals will actually contain the parameter that is supposed to be estimated; the other 5,000 or so will not.
It is tempting to interpret the confidence interval as follows: having found our 95% confidence interval of (2.425, 2.635), we are now 95% sure that the true mean GPA (the one for all the students at the Colegio) is somewhere between 2.425 and 2.635.
Strictly speaking this interpretation is not correct because once we have computed the interval (2.425, 2.635) the true mean GPA is either in it or not. There is now no longer a frequentist interpretation of probability.
The main property of confidence intervals is their coverage, which is just the probability statement in the definition above.
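For instance, we can check the claimed coverage of the t interval by simulation. This is just a sketch; the true values of \(\mu\) and \(\sigma\) below are picked arbitrarily for illustration:

B <- 1e4; n <- 150; mu <- 2.5; sigma <- 0.65
tcrit <- qt(0.975, n-1)
covered <- logical(B)
for (i in 1:B) {
  x <- rnorm(n, mu, sigma)  # draw a sample from the assumed normal model
  # the 95% interval covers mu iff |xbar - mu| < tcrit*s/sqrt(n)
  covered[i] <- abs(mean(x)-mu) < tcrit*sd(x)/sqrt(n)
}
mean(covered)  # should be close to 0.95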
Say \(X_1, .., X_n \sim Ber(p)\); then by the CLT, approximately,
\[ \bar{X} \sim N\left(p, \sqrt{p(1-p)/n}\right) \] so we have a candidate for a \(100(1-\alpha)\%\) CI:
\[ \bar{x} \pm z_{1-\alpha/2}\sqrt{\bar{x}(1-\bar{x})/n} \]
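For a single data set this interval is easy to compute. As a quick illustration (the counts here are made up, not from any of our examples), say we observe 40 successes in 150 trials:

n <- 150; x <- 40; alpha <- 0.05  # hypothetical counts, for illustration only
phat <- x/n
round(phat + c(-1, 1)*qnorm(1-alpha/2)*sqrt(phat*(1-phat)/n), 3)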
But this is based on the CLT, so there is the question of how large n needs to be for this to work. Let’s see:
n <- 10; alpha <- 0.05; B <- 1000
p <- seq(0.1, 0.9, length = 100)
cov <- 0*p
crit <- qnorm(1 - alpha/2)
for (i in 1:100) {
xbar <- rbinom(B, size = n, prob = p[i])/n
L <- xbar-crit*sqrt(xbar*(1-xbar)/n)
U <- xbar+crit*sqrt(xbar*(1-xbar)/n)
cov[i] <- sum(L < p[i] & U > p[i])/B
}
ggplot(data.frame(p=p, Coverage=cov), aes(p, Coverage)) +
geom_line() +
geom_hline(yintercept=0.95)
As we can see, it does not work very well at all.
Notice the ragged appearance of the coverage graph. This is typical for discrete rv’s like the Bernoulli.
So far we found the coverage using simulation, which is fine and very much in the spirit of confidence intervals.
Because this is a discrete random variable we can, however, also calculate the coverage exactly, using R. Say we want to find the coverage for the case n=10, p=0.43. Here x is one of 0, 1, 2, .., 10, so \(\bar{x}\) can only take the values 0/10, 1/10, …, 10/10, and so we can find all the possible intervals:
n <- 10; xbar <- 0:n/n
LU <- matrix(0, n+1, 2)
LU[, 1] <- xbar-crit*sqrt(xbar*(1-xbar)/n)
LU[, 2] <- xbar+crit*sqrt(xbar*(1-xbar)/n)
df <- round(data.frame(x=0:n, xbar=xbar, L=LU[, 1], U=LU[, 2]), 3)
kable.nice(df, do.row.names = FALSE)
x | xbar | L | U |
---|---|---|---|
0 | 0.0 | 0.000 | 0.000 |
1 | 0.1 | -0.086 | 0.286 |
2 | 0.2 | -0.048 | 0.448 |
3 | 0.3 | 0.016 | 0.584 |
4 | 0.4 | 0.096 | 0.704 |
5 | 0.5 | 0.190 | 0.810 |
6 | 0.6 | 0.296 | 0.904 |
7 | 0.7 | 0.416 | 0.984 |
8 | 0.8 | 0.552 | 1.048 |
9 | 0.9 | 0.714 | 1.086 |
10 | 1.0 | 1.000 | 1.000 |
Notice that for x=2 to x=7 we have intervals that contain p=0.43, so the true coverage is
\[ \begin{aligned} &\text{Coverage} = P(L(X)<p<U(X)\vert p=0.43) =\\ &P(2\le X\le 7\vert p=0.43) = \\ &\sum_{i=2}^7 {{10}\choose {i}} 0.43^i(1-0.43)^{10-i} \end{aligned} \]
round(sum(dbinom(2:7, n, 0.43)), 4)
## [1] 0.9489
Let’s redo the coverage graph above, now using exact calculations:
n <- 10; alpha <- 0.05
xbar <- 0:n/n
p <- seq(0.01, 0.99, length = 250)
cov <- 0*p
crit <- qnorm(1 - alpha/2)
LU <- matrix(0, n+1, 2)
LU[, 1] <- xbar-crit*sqrt(xbar*(1-xbar)/n)
LU[, 2] <- xbar+crit*sqrt(xbar*(1-xbar)/n)
for (i in 1:250) {
m <- (0:n)[LU[, 1]<p[i] & p[i]<LU[, 2]]
cov[i] <- sum(dbinom(m, n, p[i]))
}
ggplot(data.frame(p=p, Coverage=cov), aes(p, Coverage)) +
geom_line()+
geom_hline(yintercept=0.95)
A much better method was invented by Clopper and Pearson in 1934. It is implemented in R in the binom.test command. Let’s check its coverage:
n <- 10; alpha <- 0.05
p <- seq(0.01, 0.99, length = 250)
cov <- 0*p
LU <- matrix(0, n+1, 2)
for(x in 0:n) LU[x+1, ] <- binom.test(x, n)$conf.int
for (i in 1:250) {
m <- (0:n)[LU[, 1]<p[i] & p[i]<LU[, 2]]
cov[i] <- sum(dbinom(m, n, p[i]))
}
ggplot(data.frame(p=p, Coverage=cov), aes(p, Coverage)) +
geom_line()+
geom_hline(yintercept=0.95)
Now this method has some over-coverage. It is said to be conservative; that is, its intervals are a bit larger than they need to be. This is not ideal but often acceptable, whereas under-coverage is not.
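For an individual data set the Clopper-Pearson interval is found directly with binom.test; using made-up counts again (40 successes in 150 trials):

round(as.numeric(binom.test(40, 150)$conf.int), 3)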
Say \(X \sim N(\mu, 1)\) and we are told that the routine norm1 calculates \(95\%\) confidence intervals for \(\mu\). Let’s check its coverage.
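The routine norm1 itself is not listed in this section. So that the code below can be run, here is a minimal stand-in that is consistent with the intervals printed further down (since \(\sigma=1\) is known, each interval is simply \(x \pm z_{1-\alpha/2}\)); this is an assumed version, not necessarily the actual routine:

# assumed stand-in for norm1: for each observation x it returns
# the interval x +- qnorm(1-alpha/2)*1, since sigma=1 is known
norm1 <- function(x, alpha=0.05) {
  crit <- qnorm(1 - alpha/2)
  cbind(x - crit, x + crit)
}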
Again we can use simulation:
mu <- 0.76; B <- 1e4
x <- rnorm(B, mu)
LU <- norm1(x)
head(round(cbind(x, LU), 3))
## x
## [1,] 0.898 -1.062 2.858
## [2,] 0.089 -1.871 2.049
## [3,] 1.955 -0.005 3.915
## [4,] 0.169 -1.791 2.129
## [5,] 0.577 -1.383 2.537
## [6,] 0.100 -1.860 2.059
sum(LU[, 1]<mu & mu<LU[, 2])/B
## [1] 0.9498
Can we do this here as well without simulation?
x <- seq(-3, 3, 0.01)
LU <- norm1(x)
df <- data.frame(x=c(x, x),
y=c(LU[, 1], LU[, 2]),
Limit=rep(c("L", "U"), each=length(x)))
ggplot(data=df, aes(x, y, color=Limit)) +
geom_line()
The graph shows that the limits are strictly increasing. (This need not be the case in general!) So say we want to check the coverage for some \(\mu\). Then there exists \(x_1\) such that \(L(x_1)=\mu\) and \(x_2\) such that \(U(x_2)=\mu\). For example, if \(\mu=0.5\) we find
x1 <- x[abs(LU[, 1]-0.5)==min(abs(LU[, 1]-0.5))]
x2 <- x[abs(LU[, 2]-0.5)==min(abs(LU[, 2]-0.5))]
ggplot(data=df, aes(x, y, color=Limit)) +
geom_line() +
geom_hline(yintercept = 0.5) +
geom_vline(xintercept = c(x1, x2))
and so, because the interval covers \(\mu\) exactly when \(x_2<x<x_1\),
\[ \text{cov}(\mu) = \int_{x_2}^{x_1}f(x;\mu)dx \]
round(diff(pnorm(c(x2, x1), 0.5)), 4)
## [1] 0.95
Say we have \(X_1, .., X_n\) iid \(f(x|\theta)\). Then \(\left(L(\pmb{x}), U(\pmb{x})\right)\) is a \(100(1-\alpha)\%\) credible interval for \(\theta\) iff
\[ P\left(L(\pmb{x})<\theta< U(\pmb{x}) | \pmb{x}\right)=1-\alpha \]
Notice that now the data appears in the conditioning part, so this is a probability based on the posterior distribution of \(\theta|\pmb{x}\).
Also note that now the upper and lower limits are not random variables.
Say \(X_1, .., X_n \sim N(\mu, \sigma)\).
To keep things simple we will assume that \(\sigma\) is known, so we just need a prior on \(\mu\). Let’s say we use \(\mu \sim N(\mu_0, \tau)\), then we have previously seen that \(\mu|\pmb{x} \sim N(a, b)\) where
\[ \begin{aligned} &c^2 = \frac{n}{\sigma^2} + \frac1{\tau^2}\\ &a = \frac{(\sum x_i)/\sigma^2+\mu_0/\tau^2}{c^2} \\ &b = 1/c\\ \end{aligned} \]
so the posterior is again a normal distribution.
How can we get a credible interval from this? The definition above does not determine a unique interval; essentially we have one equation for two unknowns, so we need an additional condition. Here are some standard solutions:
equal tail probability intervals: choose L and U such that \[ P\left(\theta<L(\pmb{x})| \pmb{x}\right)=\alpha/2\\ P\left(\theta>U(\pmb{x})| \pmb{x}\right)=\alpha/2 \]
highest posterior density intervals (HPD). Here we choose the limits in such a way that the density has the same value. That is, we have the solution to the system of equations
\[ \begin{aligned} &P\left(L(\pmb{x})<\theta< U(\pmb{x}) | \pmb{x}\right)=1-\alpha \\ &f(L(\pmb{x})| \pmb{x}) = f(U(\pmb{x})| \pmb{x}) \end{aligned} \]
Let’s consider as a numerical example the GPA data again:
First we need to choose \(\sigma, \mu_0\) and \(\tau\). Let’s use \(\sigma=0.65, \mu_0=3.0, \tau=1.0\), then
n <- 150; mu0 <- 3.0; sigma <- 0.65; tau <- 1.0
c2 <- n/sigma^2+1/tau^2
a <- (150*2.53/sigma^2+mu0/tau^2)/c2
b <- 1/sqrt(c2)
round(qnorm(c(0.025, 0.975), a, b), 2)
## [1] 2.43 2.64
The highest posterior density interval yields the same interval as above in this case, because the normal posterior density is symmetric around its mean.
Alternatively, we can use quantiles from simulated data:
round(quantile(rnorm(1e4, a, b), c(0.025, 0.975) ), 2)
## 2.5% 97.5%
## 2.43 2.64
The main property of credible intervals is just the equation that defines them.
Say \(X_1, .., X_n\sim Ber(p)\), and let’s use as the prior on p the uniform U[0,1]. Then we saw before that the posterior distribution is \(\text{Beta}(\sum x_i +1, n+1-\sum x_i)\). An equal tail probability credible interval therefore has the limits
\[L(\pmb{x}) = qbeta(\alpha/2, 1+\sum x_i, n+1-\sum x_i)\]
\[U(\pmb{x}) = qbeta(1-\alpha/2, 1+\sum x_i, n+1-\sum x_i)\]
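As a quick numerical illustration (again with made-up counts): with 40 successes in n=150 trials the 95% equal tail probability interval is

n <- 150; y <- 40; alpha <- 0.05  # hypothetical counts, for illustration only
round(qbeta(c(alpha/2, 1-alpha/2), y+1, n-y+1), 3)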
For the highest posterior density interval, let \(y=\sum x_i\); then we need to solve the system of equations
\[ \begin{aligned} &\int_a^b \frac{(n+1)!}{y!(n-y)!}t^{y}(1-t)^{n-y} dt = 1-\alpha\\ &a^{y}(1-a)^{n-y} = b^{y}(1-b)^{n-y} \end{aligned} \] and this has to be done numerically.
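One way to do this numerically in R uses the fact that, for a unimodal posterior, the HPD interval is the shortest interval with posterior probability \(1-\alpha\), so we can search over the lower tail probability. A minimal sketch (the function name beta.hpd and the counts are made up for illustration):

# assumed helper: HPD interval for a Beta(y+1, n-y+1) posterior, found as the
# shortest interval with posterior probability 1-alpha
beta.hpd <- function(y, n, alpha=0.05) {
  len <- function(cc) {  # length of the interval (qbeta(cc), qbeta(cc+1-alpha))
    qbeta(cc+1-alpha, y+1, n-y+1) - qbeta(cc, y+1, n-y+1)
  }
  cstar <- optimize(len, c(0, alpha))$minimum
  qbeta(c(cstar, cstar+1-alpha), y+1, n-y+1)
}
round(beta.hpd(y=40, n=150), 3)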
Say \(X_1, .., X_n \sim N(\mu, \sigma)\) and we are interested in estimating \(\mu\) and \(\sigma\) simultaneously. So we want to find a region \(A(\pmb{x}) \subset \mathbb{R}\times\mathbb{R}\) such that
\[ P\left((\mu, \sigma)\in A(\pmb{X})\right)=1-\alpha \]
We already know that \(\bar{x}\) and \(s\) are good point estimators of \(\mu\) and \(\sigma\). Moreover it can be shown that \(\bar{X}\) and \(s\) are independent. The first event below involves only \(\bar{X}\) and the second only \(s\), so the joint probability factors:
\[ \begin{aligned} &P\left(\bar{X} - z_{1-w/2}\frac{\sigma}{\sqrt{n}}<\mu< \bar{X} + z_{1-w/2}\frac{\sigma}{\sqrt{n}},\ \frac{(n-1)s^2}{\chi^2_{n-1}(1-w/2)} <\sigma^2< \frac{(n-1)s^2}{\chi^2_{n-1}(w/2)}\right)=\\ &P\left(\bar{X} - z_{1-w/2}\frac{\sigma}{\sqrt{n}}<\mu< \bar{X} + z_{1-w/2}\frac{\sigma}{\sqrt{n}}\right)\cdot \\ &P\left(\frac{(n-1)s^2}{\chi^2_{n-1}(1-w/2)} <\sigma^2< \frac{(n-1)s^2}{\chi^2_{n-1}(w/2)}\right)=\\ &(1-w)(1-w)=(1-w)^2 \end{aligned} \]
Here \(z_{1-w/2}\) is the \(1-w/2\) quantile of the standard normal and \(\chi^2_{n-1}(p)\) is the p quantile of a chi-square distribution with n-1 degrees of freedom.
So if we want a \(100(1-\alpha)\%\) confidence region we need \((1-w)^2=1-\alpha\), that is
\[w=1-\sqrt{1-\alpha}\] What does this region look like?
n <- 100; mu0 <- 0; sigma <- 1
x <- rnorm(100, mu0, sigma)
xbar <- mean(x)
s <- sd(x)
w <- 1 - sqrt(1 - 0.05)
a <- qnorm(1 - w/2)/sqrt(n)
cv1 <- xbar + c(-1, 1)*qnorm(1-w/2)*sigma/sqrt(n)
cv2 <- c((n-1)*s^2/qchisq(1-w/2, n-1),
(n-1)*s^2/qchisq(w/2, n-1))
mu <- seq(cv1[1]-0.5, cv1[2]+0.5, length = 250)
sigma <- seq(0.9*cv2[1]^0.5, 1.1*cv2[2]^0.5, length = 250)
xy <- expand.grid(mu, sigma)
I <- rep(FALSE, dim(xy)[1])
for(i in 1:dim(xy)[1]) {
if(xbar-a*xy[i, 2] < xy[i, 1] &
xy[i, 1] < xbar+a*xy[i, 2] &
cv2[1] < xy[i, 2]^2 &
xy[i, 2]^2 < cv2[2]) I[i] <- TRUE
}
df <- data.frame(x=xy[I,1], y=xy[I, 2])
ggplot(data=df, aes(x, y)) +
geom_point(alpha=0.3)
Say we have some density f(.;a,b) and a 95% confidence region for (a, b) that looks like this:
But actually we want a 95% confidence interval for a alone. So how about simply projecting the region onto the a axis?
That gives an interval of about (1.8, 6.3). But this interval will clearly overcover badly, because it is the same one-dimensional interval we would get if the confidence region were a square:
but of course the area of the square is much larger. There is in fact no known general way to get confidence intervals from confidence regions!