library(bootstrap)
library(ggplot2) # used for the graphs below
library(grid) # for the pushViewport/viewport layouts
B <- 2000 # number of bootstrap samples
Let’s revisit some of the examples we discussed earlier, and analyze them using the bootstrap:
In a survey of 1000 likely voters, 523 said they will vote for party A and the other 477 for party B. Find a 95% CI for the lead of one party over the other.
dta <- c(rep(1, 523), rep(0, 477)) # 1 = votes for A, 0 = votes for B
s <- function(x) abs(sum(x) - sum(1 - x))/length(x) # lead: |proportion for A - proportion for B|
lead <- s(dta)
thetastar <- bootstrap(dta, B, s)$thetastar # bootstrap replicates of the lead
cat("Bootstrap Estimate")
## Bootstrap Estimate
as.numeric(quantile(thetastar, c(0.025, 0.975)))
## [1] 0.00400 0.10605
x=523; n=1000; alpha=0.05
cat("MLE + Normal Theory")
## MLE + Normal Theory
round(2*x/n-1 +
c(-1, 1)*2*qnorm(1-alpha/2)*sqrt(x/n*(1-x/n)/n), 3)
## [1] -0.016 0.108
Notice that the answers are close, but the bootstrap solution has one advantage: the interval does not include negative values!
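For yet another comparison, an exact Clopper-Pearson interval for the proportion p, from base R's binom.test, can be transformed to the scale of the (signed) lead 2p - 1. This is just a quick sketch; note that this interval can also dip below zero:
ci.p <- as.numeric(binom.test(523, 1000)$conf.int)
round(2*ci.p - 1, 3) # CI for the lead 2p - 1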
Recall the pet ownership and survival data:
| Status | Alive | Dead |
|---|---|---|
| Owns a Pet | 50 | 3 |
| Does not own a Pet | 28 | 11 |
Question: is there a statistically significant association between Ownership and Survival?
First we need a measure of association, that is, a number we can calculate from the data that tells us something about the relationship (or lack thereof) between our variables. Previously we used the chi-square statistic, so let's use it again. Only now we don't need the chi-square approximation; we can study the distribution of the chi-square values of the bootstrap samples directly.
z <- c(50, 28, 3, 11) # the cell counts from the table above
A <- matrix(0, 92, 2) # one row per person; column 1: survival, column 2: ownership
A[79:92, 1] <- 1 # 14 people died (3 owners, 11 non-owners)
A[c(51:78, 82:92), 2] <- 1 # 39 people do not own a pet
O <- c(table(A[, 1], A[, 2])) # observed counts
E <- c((O[1]+O[2])*(O[1]+O[3]),
       (O[1]+O[2])*(O[2]+O[4]),
       (O[1]+O[3])*(O[3]+O[4]),
       (O[2]+O[4])*(O[3]+O[4]))/92 # expected counts under independence
chi <- sum((O-E)^2/E)
chistar <- rep(0, B)
for (i in 1:B) {
  # resample survival and ownership independently, i.e. under the null hypothesis
  a1 <- A[sample(1:92, size=92, replace=TRUE), 1]
  a2 <- A[sample(1:92, size=92, replace=TRUE), 2]
  O <- c(table(a1, a2))
  E <- c((O[1]+O[2])*(O[1]+O[3]),
         (O[1]+O[2])*(O[2]+O[4]),
         (O[1]+O[3])*(O[3]+O[4]),
         (O[2]+O[4])*(O[3]+O[4]))/92
  chistar[i] <- sum((O-E)^2/E)
}
bw <- diff(range(chistar))/50
ggplot(data.frame(x=chistar), aes(x)) +
geom_histogram(aes(y = ..density..),
color = "black",
fill = "white",
binwidth = bw) +
labs(x = "x", y = "Density") +
geom_vline(xintercept = chi)
length(chistar[chistar > chi])/B
## [1] 0.0045
Here we take bootstrap samples of the ownership and the survival indicators independently, so that the null hypothesis of no association holds, calculate the corresponding chi-square statistic, and repeat this B times. Finally we find the proportion of bootstrap runs with a statistic larger than the observed one, which is essentially the p-value of our test.
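For comparison, base R's chisq.test can also compute a Monte Carlo p-value. Note that it simulates tables with the margins held fixed, a somewhat different null scheme than the independent resampling above, but it addresses the same question:
chisq.test(matrix(c(50, 28, 3, 11), 2, 2), # the 2x2 table, filled column by column
           simulate.p.value = TRUE, B = 2000)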
Recall the mothers data: the effect of the mother's cocaine use on the length of the newborn.
Let’s use the bootstrap to find 95% confidence intervals for the means and medians of the lengths.
A <- matrix(0, 3, 4) # one row per group, CI endpoints for mean and median
Status <- c("Drug Free", "First Trimester", "Throughout")
dimnames(A) = list(Status,
    c("Low Mean", "High Mean", "Low Median", "High Median"))
for (i in 1:3) {
  x <- mothers$Length[mothers$Status==Status[i]]
  thetastar <- bootstrap(x, B, mean)$thetastar # bootstrap the mean
  A[i, 1:2] = quantile(thetastar, c(0.025, 0.975))
  thetastar <- bootstrap(x, B, median)$thetastar # bootstrap the median
  A[i, 3:4] = quantile(thetastar, c(0.025, 0.975))
}
A <- round(A, 1)
kable.nice(A)
| Status | Low Mean | High Mean | Low Median | High Median |
|---|---|---|---|---|
| Drug Free | 50.2 | 52.0 | 50.5 | 51.7 |
| First Trimester | 48.2 | 50.4 | 48.7 | 51.0 |
| Throughout | 46.9 | 49.1 | 47.2 | 49.2 |
Recall the hubble data. Let's find a 95% confidence interval for the slope of the least squares regression line, fitting a no-intercept model. How do we do bootstrapping in a regression problem? The important thing is to resample the points, that is, whole (x, y) pairs:
head(hubble[, 1:2])
## Velocity Distance
## 1 170 0.032
## 2 290 0.034
## 3 -130 0.214
## 4 -70 0.263
## 5 -185 0.275
## 6 -220 0.275
B <- 1000; alpha = 0.05
coef <- lm(Velocity~Distance-1, data=hubble)$coef
print("Slope for data")
## [1] "Slope for data"
print(round(as.numeric(coef), 1))
## [1] 423.9
slope.boot <- rep(0, B)
for (i in 1:B) {
  Index <- sample(1:24, size=24, replace = TRUE) # resample rows (cases)
  slope.boot[i] <- lm(Velocity~Distance-1, data=hubble[Index, ])$coef
}
print("CI for Slope:")
## [1] "CI for Slope:"
print(round(as.numeric(quantile(slope.boot, c(0.025, 0.975))), 1))
## [1] 342.7 505.4
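Resampling the points is not the only possibility in regression. An alternative is to resample the residuals of the fit; here is a minimal sketch of that variant for the same no-intercept model. It assumes the model is correctly specified, which is one reason case resampling is often the safer default:
fit <- lm(Velocity~Distance-1, data=hubble)
res <- resid(fit)
slope.boot2 <- rep(0, B)
for (i in 1:B) {
  # new responses: fitted values plus resampled residuals
  ystar <- fitted(fit) + sample(res, size=24, replace=TRUE)
  slope.boot2[i] <- lm(ystar~hubble$Distance-1)$coef
}
round(as.numeric(quantile(slope.boot2, c(0.025, 0.975))), 1)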
Consider Gregor Mendel’s pea experiment:
O <- c(315, 101, 108, 32) # observed counts of the four phenotypes
B <- 1000
E <- sum(O)*c(9, 3, 3, 1)/16 # expected counts under the 9:3:3:1 theory
chi <- sum((O-E)^2/E)
print("Chisquare Statistic of Data")
## [1] "Chisquare Statistic of Data"
print(round(chi, 3))
## [1] 0.47
chi.boot <- rep(0, B)
x <- rep(1:4, round(E)) # a population in (essentially) the null proportions; E is not integer-valued
for (i in 1:B) {
  xstar <- sample(x, size = 556, replace = TRUE)
  chi.boot[i] = sum((table(xstar) - E)^2/E)
}
print("% of Bootstrap runs > Data")
## [1] "% of Bootstrap runs > Data"
length(chi.boot[chi.boot > chi])/B
## [1] 0.921
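Since the null hypothesis completely specifies the cell probabilities, we could just as well draw the counts directly from the null multinomial distribution with rmultinom. This sketch avoids rounding E into a finite population, and there is no table call that might drop an empty category:
chi.boot2 <- rep(0, B)
for (i in 1:B) {
  Ostar <- rmultinom(1, size = 556, prob = c(9, 3, 3, 1)/16) # counts under the 9:3:3:1 null
  chi.boot2[i] <- sum((Ostar - E)^2/E)
}
sum(chi.boot2 > chi)/B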
Hidalgo stamps
A well-known data set in statistics consists of the thicknesses (espesor) in millimeters of 485 Mexican stamps (sello) printed in 1872-1874, from the 1872 Hidalgo issue.
It is thought that the stamps from this issue are a “mixture” of different types of paper, of different thicknesses. Can we determine from the data how many different types of paper were used?
stamps[1:50]
## [1] 0.060 0.064 0.064 0.065 0.066 0.068 0.069 0.069 0.069 0.069 0.069 0.069
## [13] 0.069 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070
## [25] 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070 0.070
## [37] 0.070 0.070 0.070 0.071 0.071 0.071 0.071 0.071 0.071 0.071 0.071 0.071
## [49] 0.071 0.071
Let’s start with a look at the data:
bw <- diff(range(stamps))/50
df <- data.frame(Thickness=stamps)
ggplot(df, aes(Thickness)) +
geom_histogram(color = "black",
fill = "white",
binwidth = bw) +
labs(x = "x", y = "Counts")
which seems to have at least two modes. This judgment, however, is tricky because it depends on the number of bins we use.
An alternative is to use a frequency polygon
ggplot(df, aes(Thickness)) +
geom_freqpoly()
which seems to suggest a much larger number of modes.
Let’s instead draw the graph using a nonparametric density estimate:
ggplot(df, aes(Thickness)) +
stat_density(geom="line")
here it seems again like there are two modes, but this depends largely on the chosen bandwidth:
pushViewport(viewport(layout = grid.layout(1, 2)))
print(ggplot(df, aes(Thickness)) +
stat_density(geom="line", bw=0.01) ,
vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(df, aes(Thickness)) +
stat_density(geom="line", bw=0.001) +ylab("") ,
vp=viewport(layout.pos.row=1, layout.pos.col=2))
stat_density implements a kernel density estimator. In what follows we will need to explicitly calculate these estimates and so we use the density routine.
From the above it is clear that the number of modes depends on the choice of bandwidth h; in fact, the number of modes is a non-increasing function of h. At one extreme (h large) the estimate is a single normal-shaped curve with one mode, and at the other (h tiny) there is a sharply peaked mode at each observation.
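We can check this quickly by counting the local maxima of density(stamps, bw = h) on its evaluation grid for a few bandwidths. This is a throwaway sketch (the ad hoc helper count.modes is formalized as calc.num.mode below):
count.modes <- function(h) {
  y <- density(stamps, bw = h)$y
  sum(diff(sign(diff(y))) == -2) # count switches from increasing to decreasing
}
sapply(c(0.01, 0.005, 0.002, 0.001), count.modes)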
Let's say we want to test \[ H_0: \text{number of modes}=1 \text{ vs. } H_1: \text{number of modes}>1 \] Because the number of modes is a non-increasing function of h, there exists an \(h_1\) such that the density estimator has one mode for \(h>h_1\) and two or more modes for \(h<h_1\). Let's do a little trial and error:
fhat <- function(h, t, dta=stamps) {
  # kernel density estimate with bandwidth h; if t is given,
  # return the estimate interpolated at the points in t
  tmp <- density(dta, bw=h)
  df <- data.frame(x=tmp$x, y=tmp$y)
  if(missing(t)) return(df)
  out <- approx(df, xout=t)$y
  out[!is.na(out)]
}
draw.fhat <- function(h)
ggplot(fhat(h), aes(x, y)) + geom_line()
pushViewport(viewport(layout = grid.layout(2, 2)))
print(draw.fhat(0.01) ,
vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(draw.fhat(0.005) ,
vp=viewport(layout.pos.row=1, layout.pos.col=2))
print(draw.fhat(0.0075) ,
vp=viewport(layout.pos.row=2, layout.pos.col=1))
print(draw.fhat(0.0068) ,
vp=viewport(layout.pos.row=2, layout.pos.col=2))
and we find \(h_1 \approx 0.0068\).
Is there a way to calculate the number of modes for a given h? Here is one: evaluate the density estimate on a fine grid and count the points where it switches from increasing to decreasing. With that we can write a simple routine that automates the search for the critical bandwidth; it uses a bisection algorithm. find.h(m) returns (to within \(10^{-7}\)) the bandwidth at which the number of modes drops below m, so the critical bandwidth \(h_k\) is found with find.h(k+1).
x.points <- seq(min(stamps), max(stamps), length = 250) # grid for the density estimates
calc.num.mode = function(y) {
  # count local maxima: places where the differences change sign from + to -
  m <- length(y) - 1
  z <- diff(y)
  sum(ifelse(z[-m] >= 0 & z[-1] < 0, 1, 0))
}
find.h <- function(num.modes, h=0.007, Show=FALSE) {
  # step the bandwidth down until the estimate has at least num.modes modes
  repeat {
    h <- h-0.001
    if(Show)
      cat("h =", h, " modes=",
          calc.num.mode(fhat(h, x.points)), "\n")
    if(calc.num.mode(fhat(h, x.points)) >= num.modes) break
  }
  # bisect between low (at least num.modes modes) and high (fewer modes)
  low <- h
  high <- h + 0.001
  repeat {
    h <- (low+high)/2
    if(Show)
      cat("h =", h, " modes=",
          calc.num.mode(fhat(h, x.points)), "\n")
    if(calc.num.mode(fhat(h, x.points)) < num.modes)
      high <- h
    else
      low <- h
    if(high-low<10^-7)
      break
  }
  h
}
h1 <- find.h(2) # critical bandwidth for one mode
h5 <- find.h(5)
pushViewport(viewport(layout = grid.layout(1, 2)))
print(draw.fhat(h1) ,
vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(draw.fhat(h5) ,
vp=viewport(layout.pos.row=1, layout.pos.col=2))
So, how can we test
\[ H_0: \text{number of modes}=1 \text{ vs. } H_1: \text{number of modes}>1 \]
Here is the idea: we draw bootstrap samples from the density estimate with the critical bandwidth, \(\hat{f}(\cdot\,; h_1)\), and for each sample find its own critical bandwidth \(h_1^{*}\). If there is indeed just one mode, then \(h_1^{*}\) should be around \(h_1\), and so the proportion of bootstrap samples with \(h_1^{*}>h_1\) shouldn't be too small.
Notice we don't actually need to compute \(h_1^{*}\); we just need to check whether \(h_1^{*}>h_1\), which is the case exactly if the density estimate of the bootstrap sample with bandwidth \(h_1\) has at least two modes.
Note that we are not drawing bootstrap samples from stamps but from a density estimate \(\hat{f}\), so this is an example of what is called the smooth bootstrap.
How do we draw from \(\hat{f}\)? It can be shown that if \(y_1^{*},\dots,y_n^{*}\) is a bootstrap sample from the data, then a smooth bootstrap sample is given by
\[ x_i^{*}=\bar{y}^{*} +\left(1+h_1^2/s^2\right)^{-1/2}\left(y_i^{*}-\bar{y}^{*}+h_1\epsilon_i\right) \]
where \(\epsilon_i \sim N(0,1)\) and \(s^2\) is the sample variance of the data.
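The factor \((1+h_1^2/s^2)^{-1/2}\) is needed because adding \(h_1\epsilon_i\) alone would inflate the variance from about \(s^2\) to about \(s^2+h_1^2\); the rescaling brings it back to that of the data. A quick sanity check, as a sketch using the h1 found above:
ystar <- sample(stamps, size = length(stamps), replace = TRUE)
q <- 1/sqrt(1 + h1^2/var(stamps))
xstar <- mean(ystar) + q*(ystar - mean(ystar) + h1*rnorm(length(stamps)))
round(c(var(stamps), var(xstar)), 6) # should be close to each other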
test.modes <- function(k) {
  h <- find.h(k+1) # critical bandwidth h_k
  q <- 1/sqrt((1 + h^2/var(stamps)))
  B <- 1000
  hstar <- rep(0, B)
  for (i in 1:B) {
    ystar <- sample(stamps, size = 485, replace = TRUE)
    # smooth bootstrap sample, rescaled to the variance of the data
    xstar <- mean(ystar) + q*(ystar-mean(ystar) +
                h*rnorm(485))
    y <- fhat(h, x.points, dta=xstar)
    if (calc.num.mode(y) > k) # equivalent to h_k^* > h_k
      hstar[i] <- 1
  }
  sum(hstar)/B # proportion of runs with more than k modes
}
test.modes(1)
## [1] 0.001
and so we find strong evidence against the null hypothesis; there is more than one mode.
The same method works for testing
\[ H_0: \text{number of modes}=k \text{ vs. } H_1: \text{number of modes}>k \]
and we find
for(k in 2:9)
cat("k =", k, ", p =", test.modes(k),"\n")
## k = 2 , p = 0.309
## k = 3 , p = 0.065
## k = 4 , p = 0.006
## k = 5 , p = 0
## k = 6 , p = 0
## k = 7 , p = 0.351
## k = 8 , p = 0.777
## k = 9 , p = 0.538
So there is certainly more than one mode, with a chance of as many as 7.