The Bootstrap

In the previous section we used simulation from the true distribution to derive statistical methods. But what do we do if we don’t know the distribution?

The idea of the Bootstrap is rather strange: say we have some data from some distribution and we want to use it to estimate some parameter \(\theta\). We have a formula (a statistic) \(T(x_1, .. , x_n)\). What is the standard error in this estimate? That is, what is \(sd[T(X_1, .. , X_n)]\)?

Sometimes we can do this mathematically: Let’s assume that the \(X_i\) are iid (independent and identically distributed) and we are interested in the mean. Let’s write \({\bf X}=(X_1, .. , X_n)\), then

\[ \begin{aligned} &\theta = E[X_1]\\ &T({\bf X}) = \frac1{n} \sum_{i=1}^n X_i\\ &E[T({\bf X})] = E\left[\frac1{n} \sum_{i=1}^n X_i\right] = \frac1{n} \sum_{i=1}^n E[X_i] = \frac1{n} n \theta = \theta \\ &Var[T({\bf X})] = E \left[ \left( \frac1{n}\sum_{i=1}^n X_i - \theta \right)^2 \right] = \\ &\frac1{n^2} E \left[ \left( \sum_{i=1}^n X_i - n\theta \right)^2 \right] = \\ &\frac1{n^2} E \left[ \left( \sum_{i=1}^n (X_i - \theta) \right)^2 \right] = \\ &\frac1{n^2} E \left[ \sum_{i,j=1}^n \left( X_i - \theta \right)\left( X_j - \theta \right) \right] = \\ &\frac1{n^2} \left[ \sum_{i=1}^n E(X_i - \theta )^2 + \sum_{i,j=1,i \ne j}^n E(X_i - \theta )(X_j - \theta ) \right] =\\ &\frac1{n^2} \left[ n E(X_1 - \theta )^2 + 0 \right] = \frac1{n}Var[X_1]\\ \end{aligned} \]

because

\[ E(X_i - \theta )^2 =E(X_1 - \theta )^2 \] (identically distributed) and

\[E(X_i - \theta )(X_j - \theta )=E(X_i - \theta )E(X_j - \theta )=0\] because of independence.

But let’s say that instead of the mean we want to estimate \(\theta\) with the median. Now what is the \(sd[\text{median}(X_1,.., X_n)]\)? This can still be done analytically, but is already much more complicated.
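For the mean we just found \(sd[\overline X]=\sigma/\sqrt n\). For the median there is a standard large-sample result (stated here without derivation):

\[ sd[\text{median}(X_1,.., X_n)] \approx \frac1{2f(m)\sqrt n} \]

where \(f\) is the density and \(m\) the true median. For the \(N(\mu, \sigma)\) this works out to \(\sigma \sqrt{\pi/(2n)}\), about 0.25 for \(\sigma=1\) and \(n=25\), in good agreement with the simulation below.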

It would of course be easy if we could simulate from the distribution:

# simulate B samples of size n from the N(mu, sig) distribution and compare
# the theoretical standard error of the mean with the simulated standard
# errors of the mean and the median
sim.theta <- function(B=1e4, n, mu=0, sig=1) {
  x <- matrix(rnorm(B*n, mu, sig), B, n)
  xbar <- apply(x, 1, mean)
  med <- apply(x, 1, median)
  round(c(sig/sqrt(n), sd(xbar), sd(med)), 3)
}
sim.theta(n=25)
## [1] 0.200 0.198 0.247

But what do we do if we don’t know that the data comes from the normal distribution? Then we can’t simulate from \(F\). We can, however, simulate from the next best thing, namely the empirical distribution function (edf) \(\hat F\). It is defined as:

\[ \hat F (x) = \frac1{n} \sum_{i=1}^n I_{(-\infty, x]} (X_i) = \frac{\# X_i \le x}{n} \]
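As a quick illustration of the definition (the sample and the evaluation point t below are made up), the edf at a point is just the proportion of observations at or below that point, which is also what R’s built-in ecdf computes:

x <- rnorm(25)
t <- 0.5
mean(x <= t)   # proportion of observations <= t, this is Fhat(t)
ecdf(x)(t)     # same value from R's ecdf object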

Here are two examples:

# edf of a sample of 25 observations
x <- rnorm(25)
plt <- ggplot(data.frame(x=x), aes(x)) + 
          stat_ecdf(geom = "step", size=1.2) 
ggcurve(plt, fun=function(x) pnorm(x), A=-3, B=3)  

# edf of a sample of 250 observations
x <- rnorm(250)
plt <- ggplot(data.frame(x=x), aes(x)) + 
          stat_ecdf(geom = "step", size=1.2) 
ggcurve(plt, fun=function(x) pnorm(x), A=-3, B=3)  

There is a famous theorem in probability theory (Glivenko-Cantelli) that says that the empirical distribution function converges to the true distribution function uniformly.
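In symbols, if \(\hat F_n\) is the edf based on \(n\) observations from \(F\), then

\[ \sup_x \left| \hat F_n(x) - F(x) \right| \rightarrow 0 \text{ almost surely as } n \rightarrow \infty \]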

How does one simulate from \(\hat F\)? It means to resample from the data, that is, to randomly select values from x with replacement so that each observation has an equal chance of being picked:

x <- sort(round(rnorm(10, 10, 3), 1))
x
##  [1]  1.2  9.7 10.3 11.9 12.0 13.4 13.5 13.9 14.0 14.2
sort(sample(x, size=10, replace=TRUE))
##  [1]  1.2  1.2  9.7 10.3 11.9 12.0 12.0 13.5 14.0 14.2
sort(sample(x, size=10, replace=TRUE))
##  [1]  1.2  1.2  9.7  9.7 11.9 11.9 12.0 13.4 14.0 14.2
sort(sample(x, size=10, replace=TRUE))
##  [1] 10.3 10.3 13.4 13.4 13.4 13.9 14.0 14.2 14.2 14.2

Now the Bootstrap estimate of standard error is simply the sample standard deviation of the estimates from B such bootstrap samples.
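That is, if \(\theta^*_1, .., \theta^*_B\) denote the estimates computed from the \(B\) bootstrap samples,

\[ \widehat{se}_{boot} = \sqrt{ \frac1{B-1} \sum_{i=1}^B \left( \theta^*_i - \overline{\theta^*} \right)^2 } \]

In R: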

x <- rnorm(250)
B <- 1000
z <- matrix(0, B, 2)
# draw B bootstrap samples; for each one record the mean and the median
for(i in 1:B) {
  x.boot <- sample(x, size=length(x), replace=TRUE)
  z[i, 1]<- mean(x.boot)
  z[i, 2]<- median(x.boot)
}
round(c(1/sqrt(length(x)), apply(z, 2, sd)), 3)
## [1] 0.063 0.063 0.077

There is also a package that we can use:

library(bootstrap)
sd(bootstrap(x, 1000, mean)$thetastar)
## [1] 0.06633883
sd(bootstrap(x, 1000, median)$thetastar)
## [1] 0.08329715

Example: Skewness

The skewness of a distribution is a measure of its lack of symmetry. It is defined by

\[ \gamma_1=E\left[ \left( \frac{X-\mu}{\sigma}\right)^3\right] \]

and for a symmetric distribution we should have \(\gamma_1=0\).

A standard estimator of \(\gamma_1\) is

\[ \hat \gamma_1 = \frac{\frac1{n}\sum(x_i- \overline x)^3}{[\frac1{n-1}\sum(x_i- \overline x)^2]^{3/2}} = \frac{\frac1{n}\sum(x_i- \overline x)^3}{[\text{sd(x)}]^{3}} \]

What is the standard error in this estimate? Doing this analytically would be quite an exercise, but:

# densities of the N(5, 2) and the Gamma(2.5, 1/2) for comparison
x <- seq(0, 15, length=250)
y1 <- dnorm(x, 5, 2)
y2 <- dgamma(x, 2.5, 1/2)
df <- data.frame(
  x=c(x, x),
  y=c(y1, y2),
  which=rep(c("N(5, 2)",  "Gamma(2.5, 1/2)"), each=250))
ggplot(df, aes(x, y, color=which)) +
  geom_line()

So the normal density is symmetric but the Gamma density is not.

T.fun <- function(x) mean( (x-mean(x))^3 )/sd(x)^3 
x <- rnorm(250, 5, 2)
x.boot <- bootstrap(x, 500, T.fun)$thetastar
round(c(mean(x.boot), sd(x.boot)), 3)
## [1] -0.042  0.134
x <- rgamma(250, 2.5, 1/2)
x.boot <- bootstrap(x, 500, T.fun)$thetastar
round(c(mean(x.boot), sd(x.boot)), 3)
## [1] 1.320 0.253
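As a sanity check, the skewness of a Gamma distribution with shape parameter \(\alpha\) is known in closed form:

\[ \gamma_1 = \frac2{\sqrt \alpha} = \frac2{\sqrt{2.5}} \approx 1.26 \]

which is consistent with the bootstrap mean above, while for any normal distribution \(\gamma_1=0\).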

Bootstrap Confidence Intervals

There are two standard techniques for using the Bootstrap to find confidence intervals:

  • Normal Theory Intervals

Let’s continue the skewness example and put a 95% confidence interval on the estimates:

x.normal <- rnorm(250, 5, 2)
T.fun <- function(x) mean( (x-mean(x))^3 )/sd(x)^3
thetastar.normal <- bootstrap(x.normal, 2000, T.fun)$thetastar
df <- data.frame(x = thetastar.normal)
bw <- diff(range(thetastar.normal))/50
ggplot(df, aes(x)) +
  geom_histogram(color = "black", fill = "white", binwidth = bw) + 
  labs(x="x", y="Counts")

x.gamma <- rgamma(250, 2.5, 1/2)
thetastar.gamma <- bootstrap(x.gamma, 2000, T.fun)$thetastar
df <- data.frame(x = thetastar.gamma)
bw <- diff(range(thetastar.gamma))/50
ggplot(df, aes(x)) +
  geom_histogram(color = "black", 
                 fill = "white", 
                 binwidth = bw) + 
  labs(x="x", y="Counts")

Note that I increased the number of Bootstrap samples to 2000, which is standard when calculating confidence intervals.

We can see that the bootstrap estimates are reasonably normally distributed, so we can find the confidence interval with
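\[ \hat \theta \pm z_{1-\alpha/2} \cdot sd(\theta^*) \]

where \(\hat \theta\) is the estimate computed from the original data and \(sd(\theta^*)\) is the standard deviation of the bootstrap replicates. In R: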

round(T.fun(x.normal) + 
        c(-1, 1)*qnorm(0.975)*sd(thetastar.normal), 2)
## [1] -0.38  0.10
round(T.fun(x.gamma) + 
        c(-1, 1)*qnorm(0.975)*sd(thetastar.gamma), 2)
## [1] 0.74 1.23

So in the normal case 0 is in the interval, indicating that this data set might well come from a symmetric distribution, whereas in the gamma case 0 is ruled out.

Notice that there is no \(/\sqrt{n}\), because sd(thetastar) is already the standard deviation of the estimator, not of an individual observation.

  • Percentile Intervals

An alternative way to find confidence intervals is to estimate the quantiles of the sampling distribution of the estimator with the sample quantiles of the bootstrap replicates:

2000*c(0.025, 0.975)
## [1]   50 1950
round(sort(thetastar.normal)[2000*c(0.025, 0.975)], 2)
## [1] -0.40  0.08
round(sort(thetastar.gamma)[2000*c(0.025, 0.975)], 2)
## [1] 0.72 1.22
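Essentially the same intervals can be found with R’s quantile function (the values may differ slightly because quantile interpolates between order statistics); shown here without output:

round(quantile(thetastar.normal, c(0.025, 0.975)), 2)
round(quantile(thetastar.gamma, c(0.025, 0.975)), 2)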

In our examples the two methods yield similar intervals.

  • More Advanced Intervals

There are a number of ways to improve the performance of bootstrap-based confidence intervals. One of the more popular ones is the nonparametric bias-corrected and accelerated (BCa) interval. The package bootstrap has the routine bcanon. The intervals are then found via the percentile method, but the percentiles are computed as

\[ \begin{aligned} &\alpha_1 = \Phi \left( \widehat {z_0} + \frac{\widehat {z_0}+ z_\alpha}{1-\hat a (\widehat {z_0}+ z_\alpha)} \right) \\ &\alpha_2 = \Phi \left( \widehat {z_0} + \frac{\widehat {z_0}+ z_{1-\alpha}}{1-\hat a (\widehat {z_0}+ z_{1-\alpha})} \right) \end{aligned} \] where

  • \(\Phi\) is the standard normal cdf
  • \(\alpha\) and \(1-\alpha\) are the desired percentiles (0.025 and 0.975 for a 95% interval)
  • \(\widehat {z_0}\) is a bias-correction factor
  • \(\hat a\) is called the acceleration
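For reference, in the standard BCa construction these are usually estimated by

\[ \widehat {z_0} = \Phi^{-1} \left( \frac{\# \{ \theta^*_i < \hat \theta \}}{B} \right) \quad \text{and} \quad \hat a = \frac{\sum_{i=1}^n \left( \hat \theta_{(\cdot)} - \hat \theta_{(i)} \right)^3}{6 \left[ \sum_{i=1}^n \left( \hat \theta_{(\cdot)} - \hat \theta_{(i)} \right)^2 \right]^{3/2}} \]

where \(\hat \theta_{(i)}\) is the estimate with the \(i\)th observation left out (the jackknife values) and \(\hat \theta_{(\cdot)}\) is their mean.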
bcanon(x.normal, 2000, T.fun, alpha=c(0.025, 0.975))$conf
##      alpha   bca point
## [1,] 0.025 -0.39388296
## [2,] 0.975  0.07641379
bcanon(x.gamma, 2000, T.fun, alpha=c(0.025, 0.975))$conf
##      alpha bca point
## [1,] 0.025 0.7479726
## [2,] 0.975 1.2434498

Example

Let’s consider the relationship between the waiting times and the lengths of the eruptions in the Old Faithful data:

ggplot(data=faithful, aes(Eruptions, Waiting.Time)) +
  geom_point() 

The correlation is

round(cor(faithful)[1, 2], 3)
## [1] 0.901

Say we want to find a \(95\%\) confidence interval for the correlation. Now the standard solution is based on the assumption that the data comes from a bivariate normal distribution, which is not correct here because of the “hole” in the middle. But

B <- 1e4
n <- dim(faithful)[1]
cor.boot <- rep(0, B)
# resample the rows of the data frame and recompute the correlation each time
for(i in 1:B) 
  cor.boot[i] <- cor(faithful[sample(1:n, 
                                     size=n,
                                     replace=TRUE), ])[1, 2]
round(quantile(cor.boot, c(0.025, 0.975)), 3)
##  2.5% 97.5% 
## 0.883 0.917
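For comparison, the textbook interval that relies on the bivariate normal assumption (via Fisher’s z transform) could be computed with cor.test; a sketch using the column names from the plots above, shown without output:

cor.test(faithful$Eruptions, faithful$Waiting.Time)$conf.int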

Here is another interesting use of the Bootstrap: say we are interested in the least squares regression line:

ggplot(data=faithful, aes(Eruptions, Waiting.Time)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)

How much variation is there in this line, that is, how much does it depend on the exact sample?

x <- seq(1.5, 5.25, length=250)
y <- matrix(0, 250, 100)
# refit the least squares line on 100 bootstrap samples (resampling rows of
# the data frame) and evaluate each fitted line on the grid x
for(i in 1:100) {
  fit <- lm(Waiting.Time~Eruptions,
            data=faithful[sample(1:n, size=n,
                         replace=TRUE), ])
  y[, i] <- coef(fit)[1]+coef(fit)[2]*x
}
L <- apply(y, 1, quantile, probs=0.025)
U <- apply(y, 1, quantile, probs=0.975)
ggplot(data=faithful, aes(Eruptions, Waiting.Time)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  geom_line(data=data.frame(x=x, y=L), aes(x, y), 
            color="red") +
  geom_line(data=data.frame(x=x, y=U), aes(x, y), 
            color="red")

This is called a confidence band. Such bands are quite popular in many fields; there are, however, serious questions about how to interpret them because of the issue of simultaneous inference.