We will discuss this topic in the context of regression analysis; it is, however, much more general and arises in almost all areas of Statistics. The solutions discussed here will generally also work in other areas.
Let’s test for all predictors in the houseprice data. In fact, let’s write an R routine that does this in general:
test.all.predictors=function(y, X, alpha=0.05) {
  n=nrow(X)
  k=ncol(X)
  X=cbind(1, X) # add a column of 1's for the intercept
  y=cbind(y) # make sure y is a column vector
  G=solve(t(X)%*%X)
  betahat=G%*%t(X)%*%y # least squares estimates
  s2=c(t(y)%*%(diag(n)-X%*%G%*%t(X))%*%y)/(n-k-1) # s^2 = SSE/(n-k-1)
  TS=c(abs(betahat)/sqrt(s2*diag(G)))[-1] # |t| statistics, intercept dropped
  names(TS)=colnames(X)[-1]
  list(TS=TS, crit=qt(1-alpha/2, n-k-1))
}
test.all.predictors(houseprice[, 1], as.matrix(houseprice[, -1]))
## $TS
## Sqfeet Floors Bedrooms Baths
## 7.965553 2.791823 1.359653 3.047909
##
## $crit
## [1] 2.068658
and so we reject all the null hypotheses except the one for Bedrooms; all the other predictors appear to be useful.
There is however a problem. Let’s consider a situation where in fact none of the predictors is useful, so all the null hypotheses are true. Let’s do a little simulation to see what would happen:
B=1000
n=20
TS=matrix(0, B, 4)
x=seq(0, 1, length=n)
X=matrix(runif(4*n), n, 4) # a fixed design matrix with four useless predictors
for(i in 1:B) {
  y=rnorm(n) # y does not depend on any of the predictors
  TS[i, ]=test.all.predictors(y, X)$TS
}
crit=qt(0.975, n-4-1)
for(i in 1:4) print(sum(abs(TS[, i])> crit)/B)
## [1] 0.055
## [1] 0.053
## [1] 0.05
## [1] 0.046
and so this seems to work fine. However, consider the following question: how often do we reject at least one of the four hypotheses?
TSmax=apply(TS, 1, max)
sum((TSmax>crit))/B
## [1] 0.172
In other words, if we applied this idea in practice, we would declare at least one predictor to be useful 17.2% of the time, even though none of them is.
The problem is a simple one: if you pick a card from a standard deck, the probability of picking the Ace of Hearts is small (1/52), but if you pick a card, look at it, put it back, shuffle, pick another card, look at it, and so on, sooner or later you will find the Ace of Hearts!
Say we carry out k hypothesis tests, each at the \(\alpha\) level of significance. Then the probability of rejecting at least one null hypothesis even though all of them are true is called the familywise error rate, denoted by \(\alpha_f\). \(\alpha\) is then called the comparisonwise error rate, denoted by \(\alpha_c\).
What can be said about \(\alpha_f\)? Here is one case:
If the k tests are independent we have \(\alpha_f=1-(1-\alpha_c)^k\).
proof
Let \(T_i\) denote the event that the ith test rejects its null hypothesis. Then
\[ \begin{aligned} &\alpha_f=P(\text{at least one test rejects null | all nulls are true}) = \\ &1-P(\text{none of the tests rejects null | all nulls are true}) = \\ &1-P(\cap_{i=1}^k\{T_i^c\}) = \text{ \{by independence\}}\\ &1-\prod_{i=1}^kP(T_i^c) = \text{ \{all tests are done at the same level\}}\\ &1-P(T_1^c)^k = \\ &1-(1-P(T_1))^k = \\ &1-(1-\alpha_c)^k \end{aligned} \]
So in this case the solution would be easy: if we want a familywise error rate of \(\alpha_f\), we should carry out each test at the comparisonwise level \(\alpha_c=1-(1-\alpha_f)^{1/k}\).
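As a quick illustration of these two formulas (under the independence assumption), here is the computation in R for k = 4 tests; note that \(1-0.95^4\approx 0.185\) is not far from the 17.2% observed in the simulation above, where the tests are not quite independent.
k=4
1-(1-0.05)^k # alpha_f if each test is done at alpha_c = 0.05, about 0.185
1-(1-0.05)^(1/k) # alpha_c needed for alpha_f = 0.05, about 0.0127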
The problem is that in our case (and in almost all real-life cases) the tests are not independent! In fact, they all use the same data set. But we do have:
\[\alpha_f\le k\alpha_c\]
proof
\[ \begin{aligned} &\alpha_f=P(\text{at least one test rejects null | all nulls are true}) = \\ &P(\cup_{i=1}^k\{T_i\}) \le \text{\{Bonferroni Inequality\}}\\ &\sum_{i=1}^k P(T_i) = k\alpha_c\\ \end{aligned} \]
So even if the tests are not independent, doing each test at an \(\alpha/k\) level ensures that \(\alpha_f\) is at most \(\alpha\). This is called the Bonferroni approach.
On the other hand, if the tests are highly dependent this method is quite conservative, that is, the true \(\alpha_f\) can be much smaller than \(\alpha\). This in turn leads to tests with low power.
There are a number of modifications of the Bonferroni approach that have been proposed and that are less conservative. The best known of these is the Holm-Bonferroni method.
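Both adjustments are also available in base R via the p.adjust routine. Here is a small sketch using made-up p-values (for illustration only, not taken from the houseprice data); an adjusted p-value below \(\alpha\) then corresponds to a rejection with familywise error rate at most \(\alpha\).
p=c(0.001, 0.012, 0.034, 0.19) # made-up p-values, for illustration only
p.adjust(p, method="bonferroni") # each p-value multiplied by k (capped at 1)
p.adjust(p, method="holm") # step-down Holm adjustment, less conservative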
A second approach to the multiple testing problem is due to Scheffe. Say we want to test
\[H_0: \pmb{a'\beta}=0\]
As we saw in (6.6.14) the test is based on the test statistic
\[F=\frac{(\pmb{a'\hat{\beta}})^2}{s^2\pmb{a}'(\pmb{X'X})^{-1}\pmb{a}}\]
The test rejects the null hypothesis if F is greater than some critical value. The idea now is to find critical values that work for all possible vectors \(\pmb{a}\). It is therefore necessary to find the distribution of \(\max_{\pmb{a}}F\).
\[F_M=\max_{\pmb{a}} \left[\frac{(\pmb{a'\hat{\beta}})^2}{s^2\pmb{a}'(\pmb{X'X})^{-1}\pmb{a}}\right]=\frac{\pmb{\hat{\beta}'X'X\hat{\beta}}}{s^2}\]
proof
\[ \begin{aligned} &\frac{\partial }{\partial a} \frac{(\pmb{a'\hat{\beta}})^2}{\pmb{a}'(\pmb{X'X})^{-1}\pmb{a}} = \\ &\frac{\frac{\partial }{\partial a}(\pmb{a'\hat{\beta}})^2(\pmb{a}'(\pmb{X'X})^{-1}\pmb{a})-(\pmb{a'\hat{\beta}})^2\frac{\partial }{\partial a}(\pmb{a}'(\pmb{X'X})^{-1}\pmb{a}) }{(\pmb{a}'(\pmb{X'X})^{-1}\pmb{a})^2} =\pmb{0}\\ \end{aligned} \]
now from (4.3.20) we have
\[\frac{\partial }{\partial a}(\pmb{a'\hat{\beta}})^2 =2(\pmb{a'\hat{\beta}})\frac{\partial }{\partial a} (\pmb{a'\hat{\beta}}) =2(\pmb{a'\hat{\beta}} )\pmb{\hat{\beta}}\]
and from (4.3.21) we have
\[\frac{\partial }{\partial a}(\pmb{a}'(\pmb{X'X})^{-1}\pmb{a})=2(\pmb{X'X})^{-1}\pmb{a}\]
and so we have
\[(\pmb{a'\hat{\beta}} )\pmb{\hat{\beta}}(\pmb{a}'(\pmb{X'X})^{-1}\pmb{a})-(\pmb{a'\hat{\beta}})^2(\pmb{X'X})^{-1}\pmb{a}=\pmb{0}\]
\(\pmb{a'\hat{\beta}}\) is a number, so this is equivalent to
\[[\pmb{a}'(\pmb{X'X})^{-1}\pmb{a}]\pmb{\hat{\beta}}-\pmb{a'\hat{\beta}}(\pmb{X'X})^{-1}\pmb{a}=\pmb{0}\]
\[\pmb{a'\hat{\beta}}(\pmb{X'X})^{-1}\pmb{a}=[\pmb{a}'(\pmb{X'X})^{-1}\pmb{a}]\pmb{\hat{\beta}}\]
\[(\pmb{X'X})^{-1}\pmb{a}=\frac{\pmb{a}'(\pmb{X'X})^{-1}\pmb{a}}{\pmb{a'\hat{\beta}}}\pmb{\hat{\beta}}\]
and finally
\[\pmb{a}= \frac{\pmb{a}'(\pmb{X'X})^{-1}\pmb{a}}{\pmb{a'\hat{\beta}}}\pmb{X'X}\pmb{\hat{\beta}}=: c\pmb{X'X}\pmb{\hat{\beta}}\]
Substituting \(\pmb{a}=c\pmb{X'X\hat{\beta}}\) back into F we find \(\pmb{a'\hat{\beta}}=c\pmb{\hat{\beta}'X'X\hat{\beta}}\) and \(\pmb{a}'(\pmb{X'X})^{-1}\pmb{a}=c^2\pmb{\hat{\beta}'X'X\hat{\beta}}\), and so
\[F_M=\frac{c^2(\pmb{\hat{\beta}'X'X\hat{\beta}})^2}{s^2c^2\pmb{\hat{\beta}'X'X\hat{\beta}}}=\frac{\pmb{\hat{\beta}'X'X\hat{\beta}}}{s^2}\]
which is the result.
So, to test \(H_0: \pmb{a'\beta}=0\) for any \(\pmb{a}\) with \(\alpha_f\le \alpha\) find
\[F=\frac{(\pmb{a'\hat{\beta}})^2}{s^2\pmb{a}'(\pmb{X'X})^{-1}\pmb{a}}\]
and reject the null hypothesis if \(F\ge (k+1)F_{\alpha, k+1, n-k-1}\).
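Here is a minimal sketch of how this test could be coded (the function name scheffe.test is ours); it assumes that betahat, G = \((\pmb{X'X})^{-1}\), s2 = \(s^2\), n and k have already been computed as inside test.all.predictors above, and that a has length k+1, with the first entry corresponding to the intercept.
scheffe.test=function(a, betahat, G, s2, n, k, alpha=0.05) {
  Fstat=c(t(a)%*%betahat)^2/(s2*c(t(a)%*%G%*%a)) # F statistic for H0: a'beta = 0
  crit=(k+1)*qf(1-alpha, k+1, n-k-1) # Scheffe critical value
  list(F=Fstat, crit=crit, reject=Fstat>=crit)
}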
To test for an individual coefficient \(\beta_j\) using Scheffe's method we use \(\pmb{a}=(0,\dots,0, 1, 0,\dots,0)'\), with the 1 in the position corresponding to \(\beta_j\). Then as before the test reduces to the t test
\[t=\frac{\hat{\beta}_j}{s\sqrt{g_{jj}}}\]
and we reject the null hypothesis if \(|t|\ge\sqrt{(k+1)F_{\alpha, k+1, n-k-1}}\).
If k is small compared to n we have
\[t_{\alpha/2k, n-k-1}<\sqrt{(k+1)F_{\alpha, k+1, n-k-1}}\]
and so the Bonferroni test for individual predictors is more powerful than Scheffe’s test.
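This is easy to check numerically. As a quick illustration, here are the two critical values for n = 28 (as in the houseprice data) and several values of k; the Bonferroni value is the smaller one in each case.
for(k in 1:6)
  cat(k, qt(1-0.05/(2*k), 28-k-1), sqrt((k+1)*qf(1-0.05, k+1, 28-k-1)), "\n")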
Let’s apply both methods to the houseprice data. We already found the test statistics:
TS.house=test.all.predictors(houseprice[, 1], as.matrix(houseprice[, -1]))$TS
Now for Bonferroni the critical value would be
qt(1-0.05/(2*4), 28-4-1)
## [1] 2.70972
TS.house>qt(1-0.05/(2*4), 28-4-1)
## Sqfeet Floors Bedrooms Baths
## TRUE TRUE FALSE TRUE
and so we find that Bedrooms is not statistically significant.
With Scheffe’s method:
sqrt((4+1)*qf(1-0.05, 4+1, 28-4-1))
## [1] 3.63318
TS.house>sqrt((4+1)*qf(1-0.05, 4+1, 28-4-1))
## Sqfeet Floors Bedrooms Baths
## TRUE FALSE FALSE FALSE
and only Sqfeet is found to be significant.
Let's return for a moment to example (6.7.1). There we ran a simulation and collected all the test statistics. For each simulation run we then found the largest of them (in absolute value), that is, the test statistic most likely to exceed the critical value and lead to the rejection of a null hypothesis. This is what they look like:
df=data.frame(x=TSmax)
bw <- diff(range(TSmax))/50
ggplot(df, aes(x)) +
  geom_histogram(aes(y = ..density..),
    color = "black",
    fill = "white",
    binwidth = bw) +
  labs(x = "x", y = "Density")
But since these test statistics were simulated under the assumption that all null hypotheses are true, we can find a critical value for the largest statistic directly from them:
crit=quantile(TSmax, 0.95)
TS.house>crit
## Sqfeet Floors Bedrooms Baths
## TRUE TRUE FALSE TRUE
and using this critical value the test has the correct familywise error rate by design.
As discussed before, both the Bonferroni and the Scheffe method are conservative, that is, they usually lead to an \(\alpha_f\) smaller than desired, and therefore also to lower power. The Monte Carlo (MC) method, however, achieves \(\alpha_f\) exactly.
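As a quick sanity check (a sketch that reuses the objects B, n, X and crit from the simulation above), we can generate a fresh set of simulation runs and see how often the largest test statistic exceeds the MC critical value; the proportion should be close to 0.05, up to simulation error.
TS.new=matrix(0, B, 4) # fresh simulation runs, independent of those used to find crit
for(i in 1:B) TS.new[i, ]=test.all.predictors(rnorm(n), X)$TS
sum(apply(TS.new, 1, max)>crit)/B # should be close to 0.05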
This approach has another advantage: neither the Bonferroni method (as described above) nor Scheffe's method gives us a p-value, which is what is generally quoted when carrying out a hypothesis test. The MC method can be used to obtain one.
To see how, let's return to the case where all the tests are independent. Let's denote by \(P_i\) the p-value of the ith test and by \(P_{min}=\min \left\{P_i \right\}\) the smallest p-value, which corresponds to the test most likely to be rejected. As always, p-values are random variables, and therefore so is \(P_{min}\). What is its distribution? First recall that for any correct hypothesis test (with a continuously distributed test statistic) the p-value has a U[0,1] distribution under the null hypothesis, so we have \(P(P_i<x)=x\) for \(0<x<1\).
\[ \begin{aligned} &F_{P_{min}}(x)=P(P_{min}<x|\text{all nulls are true}) = \\ &1-P(P_{min}>x ) = \\ &1-P(\cap_{i=1}^k\{P_i>x\}) = \\ &1-\prod_{i=1}^kP(P_i>x) = \text{ \{by independence\}}\\ &1-P(P_1>x)^k = \\ &1-(1-P(P_1<x))^k = \\ &1-(1-x)^k \end{aligned} \]
Next we can make use of the following theorem from probability theory
(Probability Integral Transform)
Let \(X\sim F\) be a continuous random variable, then \(F(X)\sim U[0,1]\).
proof see any probability theory textbook
So we find that
\[P^{*}_{min} = 1-(1-P_{min})^k\sim U[0,1]\]
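As a quick check of this result we can simulate the independent case directly (this uses simulated uniform p-values, not our regression setup):
k=4; Bsim=10000
pmin.ind=apply(matrix(runif(Bsim*k), Bsim, k), 1, min) # smallest of k independent p-values
pstar=1-(1-pmin.ind)^k
quantile(pstar, c(0.25, 0.5, 0.75)) # should be close to 0.25, 0.50, 0.75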
But in our case we do not have independence, so we do not know the distribution of \(P_{min}\). However, we do have the data from our simulation, and we can use it to first generate observations of \(P_{min}\) and then estimate its distribution:
pmin=rep(0, B)
tmp=rep(0, 4)
for(i in 1:B) {
  xsim=TS[sample(1:B, 1), ] # draw one simulated set of four test statistics
  for(k in 1:4) tmp[k] <- sum(xsim[k] < TS[ ,k])/B # their simulated p-values
  pmin[i] <- min(tmp) # smallest of the four p-values
}
Here is a histogram of the simulated \(P_{min}\) values:
df=data.frame(x=pmin)
bw <- 1/50
ggplot(df, aes(x)) +
  geom_histogram(aes(y = ..density..),
    color = "black",
    fill = "white",
    binwidth = bw) +
  labs(x = "x", y = "Density")
The distribution of \(P_{min}\) can now be estimated by the empirical distribution function, defined by
\[\hat{F}(x)=\frac{\#\{X_i\le x\}}{n}\]
where \(X_1,\dots,X_n\) denote the observations (here the simulated \(P_{min}\) values).
x=seq(0, 1, length=1000)
y=x
for(i in 1:1000) {
  y[i]=sum(pmin<=x[i])/B # empirical cdf of the simulated P_min values
}
Here is a graph that shows this distribution function as well as the Bonferroni one and the distribution for completely dependent tests, which is just F(x)=x:
df=data.frame(x=c(x, x, x),
y=c(1-(1-x)^4, y, x),
Method=rep(c("Bonferroni", "MC", "Completely Dep."), each=B))
ggplot(data=df, aes(x, y, color=Method)) +
geom_point()
In this example the Bonferroni curve is the same as the MC curve because we did the simulation under the assumption of independence. In general the MC curve will be between the other two.
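Finally, here is a sketch of how this yields the p-value promised above: if p.obs were the smallest of the individual p-values observed in an actual analysis of this kind, its familywise (adjusted) p-value could be estimated by evaluating \(\hat{F}\) at p.obs, that is, by the proportion of simulated \(P_{min}\) values at or below it. The value of p.obs below is a placeholder, for illustration only.
p.obs=0.01 # placeholder: smallest observed individual p-value
sum(pmin<=p.obs)/B # MC estimate of the familywise p-value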
More details of this method can be found in Buja and Rolke, Calibration for Simultaneity: (Re)Sampling Methods for Simultaneous Inference with Applications to Function Estimation and Functional Data.