Some Applications of the Bootstrap

library(bootstrap)
library(ggplot2)
library(grid)
B <- 2000

Let’s revisit some of the examples we discussed earlier, and analyze them using the bootstrap:

Example (8.2.1)

In a survey of 1000 likely voters 523 said they will vote for party A, the other 477 for party B. Find a 95% CI for the lead of one party over the other.

dta <- c(rep(1, 523), rep(0, 477))
s <- function(x) abs(sum(x) - sum(1 - x))/length(x)
lead <- s(dta)
thetastar <- bootstrap(dta, B, s)$thetastar
cat("Bootstrap Estimate")
## Bootstrap Estimate
as.numeric(quantile(thetastar, c(0.025, 0.975)))
## [1] 0.00400 0.10605
x=523; n=1000; alpha=0.05
cat("MLE + Normal Theory")
## MLE + Normal Theory
round(2*x/n-1 + 
        c(-1, 1)*2*qnorm(1-alpha/2)*sqrt(x/n*(1-x/n)/n), 3)
## [1] -0.016  0.108

Notice that the answers are close, but the bootstrap solution has one advantage: the interval does not include negative values!
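As an aside, the same percentile interval can be found without the bootstrap package by resampling by hand with base R. This sketch does essentially what bootstrap() does for us, using the dta and s defined above:

# resample the data with replacement, recompute the statistic B times,
# and take the 2.5% and 97.5% quantiles
thetastar.base <- replicate(B, s(sample(dta, replace = TRUE)))
as.numeric(quantile(thetastar.base, c(0.025, 0.975)))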

Example (8.2.2)

Recall the pet ownership and survival data:

                     Alive  Dead
Owns a Pet              50     3
Does not own a Pet      28    11

Question: is there a statistically significant association between Ownership and Survival?

First we need a measure of association, that is, a number we can calculate from the data that tells us something about the relationship (or lack thereof) between our variables. Previously we used the chisquare statistic, so let’s use it again. Only now we don’t need the chisquare approximation; we can simply study the distribution of the chisquare values of the bootstrap samples.

z <- c(50, 28, 3, 11)  # the four counts from the table
# recreate the data, one row per person:
# column 1: dead (1) or alive (0), column 2: does not own a pet (1) or does (0)
A <- matrix(0, 92, 2)
A[79:92, 1] <- 1
A[c(51:78, 82:92), 2] <- 1
# observed counts and expected counts under independence
O <- c(table(A[, 1], A[, 2]))
E <- c((O[1]+O[2])*(O[1]+O[3]), 
       (O[1]+O[2])*(O[2]+O[4]), 
       (O[3]+O[4])*(O[1]+O[3]), 
       (O[2]+O[4])*(O[3]+O[4]))/92
chi <- sum((O-E)^2/E)
chistar <- rep(0, B)
for (i in 1:B) {
  # resample the two variables independently of each other, so the
  # bootstrap data satisfies the null hypothesis of no association
  a1 <- A[sample(1:92, size=92, replace=TRUE), 1]
  a2 <- A[sample(1:92, size=92, replace=TRUE), 2]
  O <- c(table(a1, a2)) 
  E <- c((O[1]+O[2])*(O[1]+O[3]), 
       (O[1]+O[2])*(O[2]+O[4]), 
       (O[3]+O[4])*(O[1]+O[3]), 
       (O[2]+O[4])*(O[3]+O[4]))/92
  chistar[i] <- sum((O-E)^2/E)    
}
bw <- diff(range(chistar))/50 
ggplot(data.frame(x=chistar), aes(x)) +
  geom_histogram(aes(y = ..density..),
    color = "black", 
    fill = "white", 
    binwidth = bw) + 
    labs(x = "x", y = "Density") +
  geom_vline(xintercept = chi)

length(chistar[chistar > chi])/B
## [1] 0.0045

Here we resample the ownership and survival variables independently of each other, which enforces the null hypothesis of no association, calculate the corresponding chisquare statistic, and repeat that B times. Finally we find the proportion of bootstrap runs with a statistic larger than the observed chisquare, which is essentially the p-value of our test.
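As an alternative sketch (not the code used above), the statistic could also be computed with R’s built-in chisq.test, with correct=FALSE so that no continuity correction is applied and the statistic matches the formula above; the warnings about small expected counts are suppressed:

# same null distribution, but letting chisq.test compute the statistic
chi.alt <- chisq.test(table(A[, 1], A[, 2]), correct = FALSE)$statistic
chistar.alt <- rep(0, B)
for (i in 1:B) {
  a1 <- sample(A[, 1], replace = TRUE)
  a2 <- sample(A[, 2], replace = TRUE)
  chistar.alt[i] <- suppressWarnings(
      chisq.test(table(a1, a2), correct = FALSE)$statistic)
}
length(chistar.alt[chistar.alt > chi.alt])/B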

Example (8.2.3)

Recall the data on the effect of the mother’s cocaine use on the length of the newborn.

Let’s use the bootstrap to find 95% confidence intervals for the means and medians of the lengths.

A <- matrix(0, 3, 4)
Status <- c("Drug Free", "First Trimester", "Throughout")
dimnames(A) = list(Status, 
      c("Low Mean", "High Mean", "Low Median", "High Median"))
for (i in 1:3) {
  x <- mothers$Length[mothers$Status==Status[i]]
  thetastar <- bootstrap(x, B, mean)$thetastar
  A[i, 1:2] = round(quantile(thetastar, c(0.025, 0.975)), 2)
  thetastar <- bootstrap(x, B, median)$thetastar  
  A[i, 3:4] = round(quantile(thetastar, c(0.025, 0.975)), 2)
}
A <- round(A, 1)
kable.nice(A)
                 Low Mean  High Mean  Low Median  High Median
Drug Free            50.2       52.0        50.5         51.7
First Trimester      48.2       50.4        48.7         51.0
Throughout           46.9       49.1        47.2         49.2
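A comparison between groups works the same way. For illustration (this is not part of the original example), here is a sketch of a 95% percentile interval for the difference in mean length between the Drug Free and the Throughout group, resampling each group separately:

x1 <- mothers$Length[mothers$Status=="Drug Free"]
x2 <- mothers$Length[mothers$Status=="Throughout"]
diffstar <- rep(0, B)
for (i in 1:B) {
  # resample each group on its own, then take the difference of means
  diffstar[i] <- mean(sample(x1, replace=TRUE)) - mean(sample(x2, replace=TRUE))
}
round(as.numeric(quantile(diffstar, c(0.025, 0.975))), 2)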

Example (8.2.4)

For the hubble data, let’s find a 95% confidence interval for the slope of the least squares regression line, fitting a no-intercept model. How do we do bootstrapping in a regression problem? The important thing is to resample the points, that is, whole rows of the data frame:

head(hubble[, 1:2])
##   Velocity Distance
## 1      170    0.032
## 2      290    0.034
## 3     -130    0.214
## 4      -70    0.263
## 5     -185    0.275
## 6     -220    0.275
B <- 1000; alpha = 0.05
coef <- lm(Velocity~Distance-1, data=hubble)$coef
print("Slope for data")
## [1] "Slope for data"
print(round(as.numeric(coef), 1))
## [1] 423.9
slope.boot=rep(0, B)
for (i in 1:B) {
  Index <- sample(1:24, size=24, replace = TRUE)
  slope.boot[i]=lm(Velocity~Distance-1, data=hubble[Index, ])$coef
}
print("CI for Slope:")
## [1] "CI for Slope:"
print(round(as.numeric(quantile(slope.boot, c(0.025, 0.975))), 1))
## [1] 342.7 505.4
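Resampling the cases (whole rows of the data frame) is what we did above. Another common variant, shown here only as a sketch, is to resample the residuals of the fitted model and refit:

# alternative: bootstrap the residuals of the no-intercept fit
fit <- lm(Velocity~Distance-1, data=hubble)
slope.boot.res <- rep(0, B)
for (i in 1:B) {
  ystar <- fitted(fit) + sample(resid(fit), replace=TRUE)
  slope.boot.res[i] <- lm(ystar~Distance-1, data=hubble)$coef
}
round(as.numeric(quantile(slope.boot.res, c(0.025, 0.975))), 1)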

Example (8.2.5)

Consider Gregor Mendel’s pea experiment:

O <- c(315, 101, 108, 32)
B <- 1000
E <- sum(O)*c(9, 3, 3, 1)/16
chi <- sum((O-E)^2/E)
print("Chisquare Statistic of Data")
## [1] "Chisquare Statistic of Data"
print(round(chi, 3))
## [1] 0.47
chi.boot <- rep(0, B)
for (i in 1:B) {
  # draw a sample of size 556 directly from the null hypothesis,
  # the 9:3:3:1 proportions
  xstar <- sample(1:4, size = 556, replace = TRUE, prob = c(9, 3, 3, 1)/16)
  chi.boot[i] <- sum((table(factor(xstar, levels = 1:4)) - E)^2/E)
}
print("% of Bootstrap runs > Data")
## [1] "% of Bootstrap runs > Data"
length(chi.boot[chi.boot > chi])/B
## [1] 0.921
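For comparison, R’s built-in chisq.test can carry out essentially the same Monte Carlo calculation directly:

# goodness-of-fit test against the 9:3:3:1 ratios, p-value by simulation
chisq.test(O, p = c(9, 3, 3, 1)/16, simulate.p.value = TRUE, B = 10000)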

Example (8.2.6)

Hidalgo stamps

A well known data set in statistics has the thicknesses (espesor) in millimeters of 485 Mexican stamps (sello) printed in 1872-1874, from the 1872 Hidalgo issue.

It is thought that the stamps from this issue are a “mixture” of different types of paper, of different thicknesses. Can we determine from the data how many different types of paper were used?

stamps[1:50]
##  [1] 0.060 0.064 0.064 0.065 0.066 0.068 0.069 0.069 0.069 0.069 0.069 0.069
## [13] 0.069 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070
## [25] 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070
## [37] 0.070 0.070 0.070 0.071 0.071 0.071 0.071 0.071 0.071 0.071 0.071 0.071
## [49] 0.071 0.071

Let’s start with a look at the data:

bw <- diff(range(stamps))/50 
df <- data.frame(Thickness=stamps)
ggplot(df, aes(Thickness)) +
geom_histogram(color = "black", 
    fill = "white", 
    binwidth = bw) + 
    labs(x = "x", y = "Counts")

which seems to have at least two modes. This judgment, however, is tricky because it depends on the number of bins we use.

An alternative is to use a frequency polygon

ggplot(df, aes(Thickness)) +
  geom_freqpoly()

which seems to suggest a much larger number of modes.

Let’s instead draw the graph using a nonparametric density estimate:

ggplot(df, aes(Thickness)) +
  stat_density(geom="line")

Here it seems again like there are two modes, but this depends largely on the chosen bandwidth:

pushViewport(viewport(layout = grid.layout(1, 2)))
print(ggplot(df, aes(Thickness)) +
   stat_density(geom="line", bw=0.01)  ,
  vp=viewport(layout.pos.row=1, layout.pos.col=1))        
print(ggplot(df, aes(Thickness)) +
   stat_density(geom="line", bw=0.001) +ylab("")  ,
  vp=viewport(layout.pos.row=1, layout.pos.col=2))

stat_density implements a kernel density estimator. In what follows we will need to calculate these estimates explicitly, and so we will use the density routine.
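density returns the estimate as a grid of x values together with the estimated density y at those points; a quick look, with a bandwidth chosen arbitrarily just for illustration:

d <- density(stamps, bw = 0.005)
str(d[c("x", "y")])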

From the above it is clear that the number of modes depends on the choice of bandwidth h; in fact the number of modes is a non-increasing function of h. At one extreme (h large) the estimate is essentially a normal density with a single mode, at the other (h tiny) there is a sharply peaked mode at each observation.

Let’s say we want to test \[ H_0: \text{number of modes}=1 \text{ vs. } H_1: \text{number of modes}>1 \] Because the number of modes is a non-increasing function of h there exists an \(h_1\) such that the density estimator has one mode for \(h>h_1\) and two or more modes for \(h<h_1\). Let’s do a little trial and error:

fhat <- function(h, t, dta=stamps) {
  # kernel density estimate of dta with bandwidth h
  tmp <- density(dta, bw=h)
  df <- data.frame(x=tmp$x, y=tmp$y)
  if(missing(t)) return(df)
  # if t is given, evaluate the estimate at t by linear interpolation
  out <- approx(df, xout=t)$y   
  out[!is.na(out)]
}
draw.fhat <- function(h)
  ggplot(fhat(h), aes(x, y)) + geom_line()
pushViewport(viewport(layout = grid.layout(2, 2)))
print(draw.fhat(0.01) ,
  vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(draw.fhat(0.005) ,
  vp=viewport(layout.pos.row=1, layout.pos.col=2))        
print(draw.fhat(0.0075) ,
  vp=viewport(layout.pos.row=2, layout.pos.col=1))
print(draw.fhat(0.0068) ,
  vp=viewport(layout.pos.row=2, layout.pos.col=2))        

and we find \(h_1 \approx 0.0068\).

Is there a way to calculate the number of modes for a given h? Here is one:

  • calculate \(y_i=\hat{f}(t_i;h)\) on a grid \(t_1,\dots,t_k\)
  • calculate \(z_i=y_i-y_{i-1}\) and note that at a mode z changes from positive to negative
  • number of modes = \(\sum_i I[z_i>0 \text{ and } z_{i+1}<0]\)

Let’s write a simple routine that automates the process. It uses a bisection algorithm.

x.points <- seq(min(stamps), max(stamps), length = 250)
calc.num.mode <- function(y) {
  # count the local maxima of y: points where the successive
  # differences change from nonnegative to negative
  m <- length(y) - 1
  z <- diff(y)
  sum(ifelse(z[-m] >= 0 & z[-1] < 0, 1, 0))
}
find.h <- function(num.modes, h=0.007, Show=FALSE) {
  # first decrease h in steps of 0.001 until the estimate
  # has at least num.modes modes
  repeat {
    h <- h-0.001
    if(Show)
      cat("h =", h, " modes=", 
          calc.num.mode(fhat(h, x.points)), "\n")
    if(calc.num.mode(fhat(h, x.points)) >= num.modes) break
  }
  # then use bisection to narrow down the boundary bandwidth
  low <- h
  high <- h + 0.001
  repeat {
    h <- (low+high)/2
    if(Show)
      cat("h =", h, " modes=", 
          calc.num.mode(fhat(h, x.points)), "\n")
    if(calc.num.mode(fhat(h, x.points)) < num.modes) 
      high <- h
    else 
      low <- h
    if(high-low<10^-7) 
      break
  }
  h
}
h1 <- find.h(1, Show = TRUE)
## h = 0.006  modes= 2 
## h = 0.0065  modes= 2 
## h = 0.00675  modes= 1 
## h = 0.006875  modes= 1 
## h = 0.0069375  modes= 1 
## h = 0.00696875  modes= 1 
## h = 0.006984375  modes= 1 
## h = 0.006992188  modes= 1 
## h = 0.006996094  modes= 1 
## h = 0.006998047  modes= 1 
## h = 0.006999023  modes= 1 
## h = 0.006999512  modes= 1 
## h = 0.006999756  modes= 1 
## h = 0.006999878  modes= 1 
## h = 0.006999939  modes= 1
h5 <- find.h(5)
pushViewport(viewport(layout = grid.layout(1, 2)))
print(draw.fhat(h1) ,
  vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(draw.fhat(h5) ,
  vp=viewport(layout.pos.row=1, layout.pos.col=2))        

So, how can we test

\[ H_0: \text{number of modes}=1 \text{ vs. } H_1: \text{number of modes}>1 \]

Here is the idea:

  • draw B bootstrap samples of size n from \(\hat{f}(h_1)\)
  • for each find \(h_1^{*}\), the smallest h for which this bootstrap sample has just 1 mode
  • the approximate p-value of the test is the proportion of bootstrap samples with \(h_1^{*} > h_1\).

The idea is this: if there is indeed just one mode, then in the bootstrap samples \(h_1^{*}\) should be around \(h_1\), and so this proportion shouldn’t be too small.

Notice we don’t actually need \(h_1^{*}\), we just need to check whether \(h_1^{*}>h_1\), which is the case exactly when \(\hat{f}(x^{*}; h_1)\) has at least two modes.

Note that we are not drawing bootstrap samples from “stamps” but from a density estimate, \(\hat{f}\). So this is an example of what is called the smooth bootstrap.

How do we draw from \(\hat{f}\)? It can be shown that if \(y_1^{*},\dots,y_n^{*}\) is a bootstrap sample from the data, then a smooth bootstrap sample is given by

\[ x_i^{*}=\bar{y^{*}} +(1+h_1^2/s^2)^{-1/2}(y_i^{*}-\bar{y^{*}}+h_1\epsilon_i) \] where \(\epsilon_i \sim N(0,1)\) and \(s^2\) is the sample variance of the data.
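The factor \((1+h_1^2/s^2)^{-1/2}\) rescales the sample so that its variance matches that of the data (adding the noise \(h_1\epsilon_i\) alone would inflate it). A quick numerical sketch of this, using a bandwidth of roughly the \(h_1\) found above:

# one smooth bootstrap sample; its variance should be close to var(stamps)
h <- 0.0068
q <- 1/sqrt(1 + h^2/var(stamps))
ystar <- sample(stamps, size = 485, replace = TRUE)
xstar <- mean(ystar) + q*(ystar - mean(ystar) + h*rnorm(485))
round(c(var(stamps), var(xstar)), 6)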

test.modes <- function(k) {
  # (approximately) the critical bandwidth: the smallest h giving at most k modes
  h <- find.h(k+1)
  # rescaling factor for the smooth bootstrap
  q <- 1/sqrt(1 + h^2/var(stamps))
  B <- 1000
  hstar <- rep(0, B)
  for (i in 1:B) {
    ystar <- sample(stamps, size = 485, replace = TRUE)
    # smooth bootstrap sample drawn from fhat with bandwidth h
    xstar <- mean(ystar) + q*(ystar-mean(ystar) + 
                h*rnorm(485))
    y <- fhat(h, x.points, dta=xstar)
    # the bootstrap critical bandwidth exceeds h exactly when this
    # estimate still has more than k modes
    if (calc.num.mode(y) > k) 
        hstar[i] <- 1
  }
  # proportion of bootstrap samples with more than k modes, the p-value
  sum(hstar)/B  
}
test.modes(1)
## [1] 0.001

and so we find strong evidence against the null hypothesis; there is more than one mode.

The same method works for testing

\[ H_0: \text{number of modes}=k \text{ vs. } H_1: \text{number of modes}>k \]

and we find

for(k in 2:9) 
  cat("k =", k, ", p =", test.modes(k),"\n")
## k = 2 , p = 0.309 
## k = 3 , p = 0.065 
## k = 4 , p = 0.006 
## k = 5 , p = 0 
## k = 6 , p = 0 
## k = 7 , p = 0.351 
## k = 8 , p = 0.777 
## k = 9 , p = 0.538

So there is certainly more than one mode, with a chance of as many as 7.