The Bootstrap - Introduction

Example (8.1.1)

Say we have random variables \(X_1, ..., X_n\) where \(X_i \sim N(\mu, \sigma)\) (\(\sigma\) known), and we are interested in estimating \(\mu\). We can use the sample mean \(\bar{X}\). What is the standard error of this estimate? Of course it is \(\sigma/\sqrt{n}\).

Let’s instead say we want to use the median. Now what is the standard error? To find it we would first have to find the distribution of the sample median and then its variance.

Instead we can of course use simulation:

B <- 10000; n <- 50; mu <- 2.5; sigma <- 1
x <- matrix(rnorm(B*n, mu, sigma), ncol=n)
xbar <- apply(x, 1, mean)
M <- apply(x, 1, median)
round(c(sigma/sqrt(n), sd(xbar), sd(M)), 3)
## [1] 0.141 0.141 0.174

Now let’s say that the \(x_i\)’s come from some unknown distribution \(F(.;\theta)\). We can no longer do the simulation because we do not know what to simulate from. Instead we can use a method called the bootstrap.

It starts off very strangely: instead of sampling from a distribution as in a standard Monte Carlo study, we now resample the data itself. That is, if the data consists of n observations \(x_1, .., x_n\), then a bootstrap sample is also n numbers, drawn with replacement from \(x_1, .., x_n\); so \(x^*_1\) is any one of the original \(x_1, .., x_n\), each with probability 1/n.

In any one bootstrap sample an original observation, say x1, may appear once, several times, or not at all.

Example (8.1.2)

Say the data is (5.1, 2.3, 6.4, 7.8, 4.6), then one possible bootstrap sample is (6.4, 4.6, 2.3, 6.4, 5.1).
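In R a bootstrap sample is drawn with `sample` and `replace=TRUE`. A minimal sketch:

```r
x <- c(5.1, 2.3, 6.4, 7.8, 4.6)
set.seed(1)  # only so the result is reproducible
xstar <- sample(x, size=length(x), replace=TRUE)
xstar
```

Every element of `xstar` is one of the original data values, and some may repeat while others are missing.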

Say we have a sample x1, .., xn from some unknown distribution F and we wish to estimate some parameter \(\theta = t(F)\). For this we find some estimate

\[\hat{\theta}=s(\mathbf{x})\]

How accurate is \(\hat{\theta}\)?

Example (8.1.3)

\(X_1, ..., X_n\sim F; \theta = E(X_1)\) so

\[t(F)= \int xf(x)dx\]

and

\[s(\mathbf{x})=\bar{X}\]

Note \(t\) is called a functional.

Example (8.1.4)

\(X_1, ..., X_n\sim F;\theta = var(X_1)\), so

\[t(F)= \int (x-\mu)^2 f(x)dx\]

and

\[s(\mathbf{x})=\frac{1}{n-1}\sum (x_i-\bar{x})^2\]
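To make the distinction between the functional \(t(F)\) and the estimate \(s(\mathbf{x})\) concrete, here is a small sketch of my own (using an exponential distribution as a stand-in for a known F): \(t(F)\) is computed from the density by numerical integration, while \(s(\mathbf{x})\) is computed from a sample.

```r
# t(F): the mean functional applied to F = Exp(rate = 1/2), so t(F) = 2
tF <- integrate(function(x) x*dexp(x, rate=1/2), 0, Inf)$value
# s(x): the plug-in estimate, computed from a sample of size 1000
set.seed(111)
x <- rexp(1000, rate=1/2)
sx <- mean(x)
round(c(tF, sx), 3)
```

`tF` is (essentially) exact, `sx` varies from sample to sample; the bootstrap below is about quantifying that variation when F is unknown.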

Here is the algorithm to find the bootstrap estimate of the standard error in \(\hat{\theta}\):

  1. Select B independent bootstrap samples \(\mathbf{x}^*_1, .., \mathbf{x}^*_B\), each consisting of n data values drawn with replacement from \(\mathbf{x}\). Here B is usually on the order of 2000.

  2. Evaluate the bootstrap replication corresponding to each bootstrap sample, \(\hat{\theta}^{*}_b = s(\mathbf{x}^*_b)\), b=1,..,B

  3. Estimate the standard error \(se_F(\hat{\theta})\) by the sample standard deviation of the bootstrap replications.
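The three steps translate directly into a short general-purpose function (a sketch; the `bootstrap` package used later in this section does essentially the same work):

```r
boot.se <- function(x, s, B=2000) {
  # steps 1 and 2: draw B bootstrap samples, evaluate s on each
  thetastar <- replicate(B, s(sample(x, size=length(x), replace=TRUE)))
  # step 3: sample standard deviation of the bootstrap replications
  sd(thetastar)
}
x <- c(5.1, 2.3, 6.4, 7.8, 4.6)
boot.se(x, mean)
```

Any statistic can be passed as `s`: `boot.se(x, median)` works just as well.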

Example (8.1.5)

Say the data is (5.1, 2.3, 6.4, 7.8, 4.6) and we want to estimate the mean \(\mu\), then

x <- c(5.1, 2.3, 6.4, 7.8, 4.6)
B <- 2000
thetastar <- rep(0, B)
for(i in 1:B) {
  xstar <- sample(x, size=length(x), replace = TRUE) 
                          # with replacement!
  thetastar[i] <- mean(xstar)
}
sd(thetastar)
## [1] 0.846476

R has a library to do most of the work for us:

library(bootstrap)
sd(bootstrap(x, 2000, mean)$thetastar)
## [1] 0.8366545

Example (8.1.6)

say the following is a sample from some distribution F:

cat(x[c(1:10, 491:500)])
## 0.19 0.24 0.5 0.51 0.59 0.62 0.64 0.65 0.68 0.68 8.04 8.55 8.82 8.83 9.17 9.23 9.39 10.67 11.51 13.99

and we want to find \(95\%\) confidence intervals for \(\theta=E[X]\).

Now this distribution is not normal:

bw <- diff(range(x))/50 
ggplot(data.frame(x=x), aes(x)) +
  geom_histogram(aes(y = ..density..),
    color = "black", 
    fill = "white", 
    binwidth = bw) + 
    labs(x = "x", y = "Density") +
    stat_function(fun = dnorm, colour = "blue", 
                  args=list(mean=mean(x),sd=sd(x)))

but we can use the bootstrap instead:

thetastar <- bootstrap(x, 2000, mean)$thetastar

Notice that the thetastar’s do in fact have (approximately) a normal distribution:

bw <- diff(range(thetastar))/50 
ggplot(data.frame(x=thetastar), aes(x)) +
  geom_histogram(aes(y = ..density..),
    color = "black", 
    fill = "white", 
    binwidth = bw) + 
    labs(x = "x", y = "Density") +
    stat_function(fun = dnorm, colour = "blue", 
                  args=list(mean=mean(thetastar),sd=sd(thetastar)))

and so we can find the confidence interval:

round(mean(x) +c(-1, 1)*qnorm(0.975)*sd(thetastar), 2)
## [1] 2.97 3.30

Note there is no \(\sqrt n\) because sd(thetastar) is already the standard error of the estimate, not of the original data.

Here the bootstrap estimates are normal, which is quite often the case. If they are not we could use a CI based on the percentiles:

round(quantile(thetastar, c(0.025, 0.975)), 2)
##  2.5% 97.5% 
##  2.97  3.31

This idea of the bootstrap is very strange: at first it seems we are getting more out of the data than we should. It is also a fairly new idea, introduced by Bradley Efron in 1979.

Here is some justification why it works:

Let’s say we have \(X_1, ..., X_n\sim F\) for some cdf F, and we want to investigate the properties of some parameter \(\theta\) of F, for example its mean or its median. We have an estimator of \(\theta\), say \(s(x_1, .., x_n)\), for example \(\bar{x}\) in the case of the mean.

What is the error in \(s(x_1, .., x_n)\)? In the case of the mean this is very easy and we already know that the answer is \(\text{sd}(X_1)/\sqrt{n}\).

But what if we don’t know it and we want to use Monte Carlo simulation to find out? Formally what this means is the following:

  1. generate \(X'_1, ..., X'_n\sim F\)
  2. compute \(\theta'=s(x'_1, .., x'_n)\)
  3. repeat steps 1 and 2 many times (say 1000 times)
  4. study the MC estimates of \(\theta\), for example find their standard deviation.
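For the median in example (8.1.1), these four steps read directly as a loop (a sketch, equivalent to the matrix/apply version shown earlier):

```r
B <- 1000; n <- 50; mu <- 2.5; sigma <- 1
thetaprime <- rep(0, B)
for (i in 1:B) {
  xprime <- rnorm(n, mu, sigma)    # step 1: generate X'_1,...,X'_n from F
  thetaprime[i] <- median(xprime)  # step 2: evaluate s on the new sample
}                                  # step 3: the loop repeats steps 1 and 2
sd(thetaprime)                     # step 4: standard deviation of the estimates
```

This requires knowing F, which is exactly what the bootstrap replaces.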

But what do we do if we don’t know that our sample came from F?

A simple idea then is to replace sampling from the actual distribution function by sampling from the next best thing, the empirical distribution function. So the idea of the bootstrap is simple: replace F in the simulation above with \(\hat{F}\):

  1. generate \(X'_1, ..., X'_n\sim \hat{F}\)
  2. compute \(\theta'=s(x'_1, .., x'_n)\)
  3. repeat steps 1 and 2 many times (say 1000 times)
  4. study the MC estimates of \(\theta\), for example find their standard deviation.

What does it mean to generate \(X'_1, .., X'_n\) from the empirical distribution function of \(x_1, .., x_n\)? It means exactly drawing a bootstrap sample as described above.
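That these two descriptions agree can be checked directly: the inverse of the empirical cdf, evaluated at uniform random numbers, returns each data point with probability 1/n, exactly as `sample(x, replace=TRUE)` does. A quick sketch, using `quantile` with `type=1` (the left-continuous inverse of the empirical cdf):

```r
x <- c(5.1, 2.3, 6.4, 7.8, 4.6)
n <- length(x)
# inverse-transform sampling from the empirical cdf of x:
u <- runif(n)
xstar <- as.numeric(quantile(x, probs=u, type=1))
all(xstar %in% x)
## [1] TRUE
```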

Example (8.1.7)

Let’s return to the experiment on mice we discussed at the beginning of the class. Below we have the results of a small experiment, in which 7 out of 16 mice were randomly selected to receive a new medical treatment, while the remaining 9 mice were assigned to the control group. The treatment was intended to prolong survival after surgery:

Treatment   Control
       94        52
      197       104
       16       146
       38        10
       99        50
      141        31
       23        40
                 27
                 46

How can we answer the question on whether this new treatment is effective? First of course we can find the within group means and standard deviations:

round(unlist(lapply(mice, mean)), 1)
## treatment   control 
##      86.9      56.2
round(unlist(lapply(mice, sd))/sqrt(unlist(lapply(mice, length))), 2)
## treatment   control 
##     25.24     14.14

so we see that the mice who received the treatment lived on average 30.63 days longer. But unfortunately the standard error of the difference is \(\sqrt{25.24^2+14.14^2} = 28.93\), so the observed difference 30.63 is only 30.63/28.93 = 1.06 standard deviations above 0.

Let’s say next that instead of using the mean we wish to use the median to measure average survival. We find the following:

round(unlist(lapply(mice, median)), 1)
## treatment   control 
##        94        46

Now we get a difference in median survival time of 48 days, but what is the standard error of this estimate? Of course there is a formula for the standard error of the median, but it is not simple and just finding it in a textbook would be some work. On the other hand we can use the bootstrap method to find it very easily:

treatstar <- bootstrap(mice$treatment, 1000, median)$thetastar
contstar <- bootstrap(mice$control, 1000, median)$thetastar
sds <- round(c(sd(treatstar), sd(contstar)), 2)
sds
## [1] 37.71 12.71
sqrt(sum(sds^2))
## [1] 39.79432
round(48/sqrt(sum(sds^2)), 3)
## [1] 1.206

So the difference is 1.21 standard deviations larger than 0.

This is larger than the corresponding value for the mean, but still not statistically significant.
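An alternative (a sketch, not part of the original analysis) is to bootstrap the difference in medians directly, resampling each group separately within every replication; its standard deviation estimates the standard error of the difference in one step:

```r
treatment <- c(94, 197, 16, 38, 99, 141, 23)
control <- c(52, 104, 146, 10, 50, 31, 40, 27, 46)
B <- 1000
diffstar <- replicate(B,
  median(sample(treatment, replace=TRUE)) -
    median(sample(control, replace=TRUE)))
sd(diffstar)  # standard error of the difference in medians
```

The result should be close to the \(\sqrt{37.71^2+12.71^2}\) combination above, since the two groups are resampled independently.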