Comparing the Means of Several Populations - ANOVA

Basic Test

Example: Mothers Cocain Use and Babies Health

Are the mean lengths of the babies different depending on the drug use of the mother?

ggplot(mothers, aes(Status, Length)) +
         geom_boxplot()

out <- matrix(0, 3, 3)
colnames(out) <- c("Size", "Mean", "SD")
rownames(out) <- unique(mothers$Status)
out[, 1] <- tapply(mothers$Length, 
                   mothers$Status, length)
out[, 2] <- round(tapply(mothers$Length, 
                         mothers$Status, mean), 2)
out[, 3] <- round(tapply(mothers$Length, 
                         mothers$Status, sd), 2)
kable.nice(out)
Size Mean SD
Drug Free 39 51.1 2.9
First Trimester 19 49.3 2.5
Throughout 36 48.0 3.6

The standard method for this problem is called ANOVA (Analysis of Variance) and is run with the aov command.

fit <- aov(Length~Status, data=mothers)
summary(fit)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## Status       2  181.4   90.69   9.319 0.000208
## Residuals   91  885.6    9.73
  1. Parameters of interest: group means
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. Type I error probability \(\alpha\)=0.05
  5. Null hypothesis H0: \(\mu_1 = \mu_2 = \mu_3\) (groups have the same means)
  6. Alternative hypothesis Ha: \(\mu_i \ne \mu_j\) (at least two groups have different means)
  7. p=0.0002
  8. 0.0002 < 0.05, there is some evidence that the group means are not the same, the babies whose mothers used cocaine tend to be a little shorter (less healthy?)

In step 3 we have the assumptions

  1. residuals have a normal distribution. We can check that with the normal plot. The residuals are simply the observations minus their group means and are part of the aov object.
df <- data.frame(x=resid(fit))
ggplot(df, aes(sample=x)) +
  stat_qq() + stat_qq_line()

looks fine

  1. groups have equal variance

Here one uses the rule of thumb: if the largest sample standard deviation is not more than three times the smallest, it is ok.

Here: 3*2.5 = 7.5 > 3.6, ok

One can also do a formal test for homogeneity (aka equal variance). There are a number of them, one that works quite well is Levene’s test:

library(car)
leveneTest(Length~Status, data=mothers)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  2  1.0114 0.3678
##       91

What to do if assumptions fail

We will discuss what to do when there are problems with the normal assumption soon. If the issue is unequal variances there is simple way to go:

oneway.test(Length~Status, data=mothers)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  Length and Status
## F = 8.6797, num df = 2.000, denom df = 51.749, p-value = 0.0005616

The advantage of aov is a slightly higher power.

Why Analysis of Variance?

Why is this called Analysis of Variance when it is about the differences in means? Consider the two data sets shown here:

In Data 1 the means seem quite similar whereas in Data 2 they are not. Notice so:

# Data 1
round(c(Overall=var(df$x), tapply(df$x, df$y, var)), 2)
## Overall       A       B 
##    8.90    9.48    8.43
# Data 2
round(c(Overall=var(df$z), tapply(df$z, df$y, var)), 2)
## Overall       A       B 
##   14.73    9.48    8.35

In Data 1 the overall variance is about the same as the in-group variances, whereas in Data 2 it is much larger. So by comparing the within-group variances with the overall variance we can test whether the group means are the same.

Multiple Comparison

Often if the null of no difference is rejected, one wants to go a step further and do a pairwise comparison:

  • is Drug Free different from First Trimester?
  • is First Trimester different from Throughout?

There are a number of methods known for this problem, a popular one is by Tukey:

tuk <- TukeyHSD(fit)
plot(tuk)

this draws confidence intervals for the difference in means of all pairs. If an interval does not contain 0, the corresponding pair is statistically significantly different.

Here that is the case only for Drug Free - Throughout, so the other two pairs are not statistically significantly different. Remember, however that failing to reject H0 is NOT the same as accepting H0. The fact that those pairs are not statistically significantly different is almost certainly due to a lack of sample size.

Example: Cuckoo Eggs

That cuckoo eggs were peculiar to the locality where found was already known in 1892. A study by E.B. Chance in 1940 called The Truth About the Cuckoo demonstrated that cuckoos return year after year to the same territory and lay their eggs in the nests of a particular host species. Further, cuckoos appear to mate only within their territory. Therefore, geographical sub-species are developed, each with a dominant foster-parent species, and natural selection has ensured the survival of cuckoos most fitted to lay eggs that would be adopted by a particular foster-parent species. The data has the lengths of cuckoo eggs found in the nests of six other bird species (drawn from the work of O.M. Latter in 1902).

Cuckoo Birds

Basic question: is there a difference between the lengths of the cuckoo eggs of different Foster species?

kable.nice(head(cuckoo))
Bird Length
Meadow Pipit 19.65
Meadow Pipit 20.05
Meadow Pipit 20.65
Meadow Pipit 20.85
Meadow Pipit 21.65
Meadow Pipit 21.65

How many of each birds nests do we have?

df <- data.frame(
  Birds=names(table(cuckoo$Bird)),
  Counts=as.numeric(table(cuckoo$Bird)))
kable.nice(df)
Birds Counts
Hedge Sparrow 14
Meadow Pipit 45
Pied Wagtail 15
Robin 16
Tree Pipit 15
Wren 15

and we see that they differ. Whether or not this is a problem depends on the way the data is analyzed, in what follows it is ok.

The variable Bird has no obvious ordering. In this case the usual thing to do is to sort by the group means:

mn <- sort(tapply(cuckoo$Length, cuckoo$Bird, mean))
cuckoo$Bird <- factor(cuckoo$Bird,
                      levels = unique(names(mn)),
                      ordered = TRUE)
ggplot(data=cuckoo, aes(Bird, Length)) +
  geom_boxplot()

we have some outliers in the Meadow Pipit species, but not to bad and we will ignore that.

Let’s look at the table of summary statistics.

out <- matrix(0, 6, 3)
colnames(out) <- c("n", "Mean", "Sd")
rownames(out) <- as.character(levels(cuckoo$Bird))
out[, 1] <- tapply(cuckoo$Length, 
                   cuckoo$Bird, length)
out[, 2] <- round(tapply(cuckoo$Length, 
                   cuckoo$Bird, mean), 2)
out[, 3] <- round(tapply(cuckoo$Length, 
                   cuckoo$Bird, sd), 2)
kable.nice(out)
n Mean Sd
Wren 15 21.13 0.74
Meadow Pipit 45 22.30 0.92
Robin 16 22.57 0.68
Pied Wagtail 15 22.90 1.07
Tree Pipit 15 23.09 0.90
Hedge Sparrow 14 23.12 1.07

Both the graph and the table make it clear that there are some differences in the length, so the following is not really necessary:

fit <- aov(Length~Bird, data=cuckoo)
summary(fit)
##              Df Sum Sq Mean Sq F value   Pr(>F)
## Bird          5  42.94   8.588   10.39 3.15e-08
## Residuals   114  94.25   0.827
  1. Parameters of interest: group means
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. \(\alpha = 0.05\)
  5. Null hypothesis H0: \(\mu_1 = ... = \mu_6\) (groups have the same means)
  6. Alternative hypothesis Ha: \(\mu_i \ne \mu_j\) (at least two groups have different means)
  7. p value = 0.000
  8. 0.000 < 0.05, there is some evidence that the group means are not the same, the length are different for different foster species.

Assumptions of the method:

  1. residuals have a normal distribution, plot looks (mostly) ok
df <- data.frame(Residuals=resid(fit), 
            Fits = fitted(fit))
ggplot(data=df, aes(sample=Residuals)) +
           geom_qq() + geom_qq_line()       

  1. groups have equal variance

smallest stdev=0.7, largest stdev=1.1, 3*0.7=2.1>1.1, ok

So, how exactly do they differ?

tuk <- TukeyHSD(fit)
print(tuk)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Length ~ Bird, data = cuckoo)
## 
## $Bird
##                                  diff          lwr      upr     p adj
## Meadow Pipit-Wren          1.16888889  0.383069115 1.954709 0.0004861
## Robin-Wren                 1.44500000  0.497728567 2.392271 0.0003183
## Pied Wagtail-Wren          1.77333333  0.810904595 2.735762 0.0000070
## Tree Pipit-Wren            1.96000000  0.997571262 2.922429 0.0000006
## Hedge Sparrow-Wren         1.99142857  1.011964373 2.970893 0.0000006
## Robin-Meadow Pipit         0.27611111 -0.491069969 1.043292 0.9021876
## Pied Wagtail-Meadow Pipit  0.60444444 -0.181375330 1.390264 0.2324603
## Tree Pipit-Meadow Pipit    0.79111111  0.005291337 1.576931 0.0474619
## Hedge Sparrow-Meadow Pipit 0.82253968  0.015945760 1.629134 0.0428621
## Pied Wagtail-Robin         0.32833333 -0.618938100 1.275605 0.9155004
## Tree Pipit-Robin           0.51500000 -0.432271433 1.462271 0.6159630
## Hedge Sparrow-Robin        0.54642857 -0.418146053 1.511003 0.5726153
## Tree Pipit-Pied Wagtail    0.18666667 -0.775762072 1.149095 0.9932186
## Hedge Sparrow-Pied Wagtail 0.21809524 -0.761368960 1.197559 0.9872190
## Hedge Sparrow-Tree Pipit   0.03142857 -0.948035627 1.010893 0.9999990

so the eggs of Wrens are the smallest, and they are stat. significantly smaller than the eggs of all other birds.

Meadow Pipits are next, and they are stat. significantly smaller than the eggs of Tree Pipits and Hedge Sparrows.

no other differences are stat. significant!

This can get a bit hard to read, and it might be better to concentrate on those pairs that are stat. signif,. different at (say) the \(5\%\) level:

names(tuk[[1]][tuk[[1]][,4]<0.05, 4])
## [1] "Meadow Pipit-Wren"          "Robin-Wren"                
## [3] "Pied Wagtail-Wren"          "Tree Pipit-Wren"           
## [5] "Hedge Sparrow-Wren"         "Tree Pipit-Meadow Pipit"   
## [7] "Hedge Sparrow-Meadow Pipit"