After the basic ANOVA F test rejects the null hypothesis of no differences, one generally wants to go a step further and investigate what the differences are.

Example

We rejected the null hypothesis of equal lengths of the babies in the mother and cocaine use experiment. But

  • is there a stat. signif. difference between drug free and 1st trimester?
  • is there a stat. signif. difference between 1st trimester and throughout?

Notice that in this case we have a natural ordering of the levels, so there are two pairs of interest. In general, if a factor has k levels there are \(k\choose 2\) pairs.

Comparing two groups could be done with the two-sample t test, but doing \(k\choose 2\) tests we again face the issue of simultaneous inference, as discussed in section 6.7. The same general solutions proposed there (Bonferroni’s method, Scheffé’s method and p-value adjustment via simulation) can be applied here as well. In addition we have some methods specifically designed for pairwise comparisons in ANOVA.
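For example, the mothers and cocaine use factor has k = 3 levels (Drug Free, First Trimester, Throughout), giving \(\binom{3}{2}=3\) possible pairs, which is easily checked in R:

```r
# number of distinct pairs among k factor levels
k <- 3
choose(k, 2)  # 3 pairs
```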

Note that an often-made suggestion is to do the pairwise comparisons at the \(\alpha=10\%\) level rather than \(5\%\).

Bonferroni’s Method

This method uses the basic two-sample t test but with the adjusted significance level \(\alpha/k\) if k comparisons are done.
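Equivalently, one can multiply each raw p value by the number of comparisons k (capped at 1) and compare to \(\alpha\); R’s p.adjust does this directly. As an illustration, here are two hypothetical unadjusted p values (matching the two Welch t test p values computed for this data set later in this section):

```r
# Bonferroni correction: carry out each of the k tests at level alpha/k,
# or equivalently multiply each p value by k (capped at 1)
p.raw <- c(0.019, 0.124)  # unadjusted pairwise p values
k <- length(p.raw)
p.adjust(p.raw, method = "bonferroni")  # same as pmin(1, k * p.raw)
```

At \(\alpha=5\%\) only the first comparison remains significant after the correction.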

Scheffé’s Method

We already discussed this method in the context of regression in (6.5.2). There we saw that all hypotheses of the form \(H_0: \pmb{a'\beta}=0\) can be tested simultaneously with

\[F_M=\frac{(\pmb{a'\hat{\beta}})^2}{s^2\pmb{a}'(\pmb{X'X})^{-1}\pmb{a}}\]

using \(F_M/(k+1)\sim F(k+1,n-k-1)\) if k tests are done.

A pairwise comparison is a contrast of the form \(\pmb{a} = \begin{pmatrix} 0 & ... & 0 & 1 & 0 &... & 0 & -1 & 0 &... &0 \end{pmatrix}\), so \(\pmb{a'\hat{\beta}=\bar{y}_i-\bar{y}_j}\) and

\[ \pmb{a}'(\pmb{X'X})^{-}\pmb{a} =\\ \pmb{a}' \begin{pmatrix} 0 & 0 & 0 & 0 & \cdots & 0\\ 0 & 1/n_1 & 0 & 0 & \cdots & 0\\ 0 & 0 & 1/n_2 & 0 & \cdots & 0\\ \vdots&\vdots&\vdots&\vdots&\ddots&\vdots \\ 0 & 0 & 0 &0 &\cdots & 1/n_k \end{pmatrix} \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \\ -1 \\ 0 \\ \vdots \\0 \end{pmatrix}=\\ \begin{pmatrix} 0 & \cdots & 0 & 1 & 0 &\cdots & 0 & -1 & 0 &\cdots &0 \end{pmatrix} \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1/n_i \\ 0 \\ \vdots \\ 0 \\ -1/n_j \\ 0 \\ \vdots \\0\end{pmatrix}= 1/n_i+1/n_j \] and so

\[F_M=\frac{(\bar{y}_i-\bar{y}_j)^2}{s^2\left(1/n_i+1/n_j\right)}\]

Example

In the mothers and cocaine use data we find:

y=mothers$Length
x=mothers$Status
k=3 # number of levels
ni=table(x) # group sample sizes
n=sum(ni)
y..=sum(y) # grand total
yi.=tapply(y, x, sum) # group totals
sse=sum(y^2)-sum(yi.^2/ni) # error sum of squares
s2=sse/(n-k) # pooled variance estimate
F12= (yi.[2]/ni[2]-yi.[1]/ni[1])^2/s2/((1/ni[2]+1/ni[1]))
F23= (yi.[3]/ni[3]-yi.[2]/ni[2])^2/s2/((1/ni[3]+1/ni[2]))
FTS=c(F12, F23)
names(FTS)=c("Drug Free - First Trimester", "First Trimester - Throughout")
round(1-pf(FTS, 1, n-k), 3)
##  Drug Free - First Trimester First Trimester - Throughout 
##                        0.042                        0.145

and so we find a difference between Drug Free and First Trimester but not between First Trimester and Throughout, although the latter is almost certainly due to the small sample sizes.

Adjusted p value

The idea is to find the distribution of the smallest p value via simulation and then adjust the actual p values accordingly:

B=10000 # number of simulation runs
mu=mean(y)
sig=sd(y)
TS=function(x,y) t.test(x,y)$p.value
pvals = matrix(0, B, 3)
for(i in 1:B) {
   # generate both samples under the null hypothesis of equal means
   pvals[i, 1]=TS(rnorm(ni[1], mu, sig),rnorm(ni[2], mu, sig))
   pvals[i, 2]=TS(rnorm(ni[2], mu, sig),rnorm(ni[3], mu, sig))
}
pvals[, 3]=apply(pvals[, 1:2], 1, min) # smallest of the two p values
par(mfrow=c(2, 2))
for(i in 1:3) hist(pvals[,i], 100, freq=FALSE, main="")

so we see that while the p values of the individual tests have a uniform [0,1] distribution, the minimum p-value does not. We can now use the empirical distribution function to adjust the p-values:

data.pvals=c(TS(y[x=="Drug Free"], y[x=="First Trimester"]),
             TS(y[x=="Throughout"], y[x=="First Trimester"]))
data.pvals
## [1] 0.01917369 0.12397785
adj.pvals=ecdf(pvals[, 3])(data.pvals)
names(adj.pvals)=c("Drug Free - First Trimester", "First Trimester - Throughout")
adj.pvals
##  Drug Free - First Trimester First Trimester - Throughout 
##                       0.0388                       0.2407

and so again we find a difference between Drug Free and First Trimester but not between First Trimester and Throughout.

Fisher’s LSD Method

LSD stands for least significant difference. Say we wish to compare groups i and j. Let \(\bar{y}_{i.}\) and \(\bar{y}_{j.}\) be the respective group means and \(s^2=\text{SSE}/(n_i+n_j-2)\); then the test statistic

\[T= \frac{|\bar{y}_{i.}-\bar{y}_{j.}|}{s\sqrt{1/n_i+1/n_j}}\]

is compared to the critical value \(t(1-\alpha/2, n_i+n_j-2)\). This of course is just the standard two-sample t test for the difference in means, assuming equal variance. Fisher’s LSD method therefore does not provide an experiment-wise error rate, but it guards against too many false positives because it is run only after the F test rejects the null hypothesis. It is sometimes referred to as the protected LSD method.

Example

We did the F test for the mothers and cocaine use, so now we can do Fisher’s LSD:

round(t.test(mothers$Length[mothers$Status=="Drug Free"],
       mothers$Length[mothers$Status=="First Trimester"])$p.value, 3)
## [1] 0.019
round(t.test(mothers$Length[mothers$Status=="First Trimester"],
       mothers$Length[mothers$Status=="Throughout"])$p.value, 3)
## [1] 0.124

so again we find a difference between Drug Free and First Trimester but not between First Trimester and Throughout.

pairwise.t.test(mothers$Length, mothers$Status, p.adjust.method="none") 
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  mothers$Length and mothers$Status 
## 
##                 Drug Free First Trimester
## First Trimester 0.042     -              
## Throughout      4.3e-05   0.145          
## 
## P value adjustment method: none

One disadvantage of the pairwise.t.test command is that it carries out all pairwise comparisons, whereas in our case, because the levels have a natural ordering, only two are of interest.

Holm’s Method

Holm, S. (1979), “A simple sequentially rejective multiple test procedure”, Scandinavian Journal of Statistics 6(2): 65–70, describes a method to adjust p values specifically for the case of pairwise comparisons. It is implemented in R in

pairwise.t.test(mothers$Length, mothers$Status) 
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  mothers$Length and mothers$Status 
## 
##                 Drug Free First Trimester
## First Trimester 0.08404   -              
## Throughout      0.00013   0.14512        
## 
## P value adjustment method: holm
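Holm’s step-down adjustment can also be reproduced from the unadjusted p values in the earlier table with p.adjust: the ordered p values are multiplied by m, m-1, ..., 1 (m being the number of tests) and monotonicity is enforced. Up to the rounding of the unadjusted p values, this recovers the table above:

```r
# Holm adjustment of the three unadjusted pairwise p values
# (ordered p values are multiplied by 3, 2, 1, then made monotone)
p.raw <- c(0.042, 4.3e-05, 0.145)
round(p.adjust(p.raw, method = "holm"), 5)
```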

Tukey’s HSD Method

Tukey’s HSD (honest significant difference) test is based on the studentized range:

\[q=\frac{\max_{i,j}\{|\bar{y}_{i.}-\bar{y}_{j.}|\}}{s/\sqrt{n}}\]

where n is the common group size in a balanced design. Under the normality assumption it is possible to derive the distribution of q. This is implemented in

round(TukeyHSD(aov(Length~Status, data=mothers))$Status[,4], 3)
##  First Trimester-Drug Free       Throughout-Drug Free 
##                      0.103                      0.000 
## Throughout-First Trimester 
##                      0.310

and unlike with the other methods neither the difference between Drug Free and First Trimester nor the one between First Trimester and Throughout is statistically significant, although Drug Free vs First Trimester is close.
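The studentized range distribution itself is available in R through ptukey and qtukey. As a minimal sketch, assuming a hypothetical balanced design with k = 3 groups of n = 10 observations each (so k(n-1) = 27 degrees of freedom for the pooled error), the 5% critical value would be found as follows:

```r
# critical value of the studentized range q for k means
# and the pooled-error degrees of freedom
k <- 3; n <- 10; df <- k * (n - 1)
q.crit <- qtukey(0.95, nmeans = k, df = df)
# Tukey HSD declares means i and j different at alpha = 0.05 if
# |ybar_i - ybar_j| > q.crit * s / sqrt(n)
q.crit
```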

Other Methods

There are many other methods that have been developed, such as the Newman-Keuls multiple range test, Duncan’s multiple range test and others.

Suggested Method

Many studies have been done to investigate these methods. The general conclusion is that Tukey’s HSD test performs well in most cases and is therefore the recommended method.