Are the mean lengths of the babies different depending on the drug use of the mother?
ggplot(mothers, aes(Status, Length)) +
geom_boxplot()
out <- matrix(0, 3, 3)
colnames(out) <- c("Size", "Mean", "SD")
rownames(out) <- unique(mothers$Status)
out[, 1] <- tapply(mothers$Length,
mothers$Status, length)
out[, 2] <- round(tapply(mothers$Length,
mothers$Status, mean), 2)
out[, 3] <- round(tapply(mothers$Length,
mothers$Status, sd), 2)
kable.nice(out)
Size | Mean | SD | |
---|---|---|---|
Drug Free | 39 | 51.1 | 2.9 |
First Trimester | 19 | 49.3 | 2.5 |
Throughout | 36 | 48.0 | 3.6 |
The standard method for this problem is called ANOVA (Analysis of Variance) and is run with the aov command.
fit <- aov(Length~Status, data=mothers)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## Status 2 181.4 90.69 9.319 0.000208
## Residuals 91 885.6 9.73
In step 3 we have the assumptions
df <- data.frame(x=resid(fit))
ggplot(df, aes(sample=x)) +
stat_qq() + stat_qq_line()
looks fine
Here one uses the rule of thumb: if the largest sample standard deviation is not more than three times the smallest, it is ok.
Here: 3*2.5 = 7.5 > 3.6, ok
One can also do a formal test for homogeneity (aka equal variance). There are a number of them, one that works quite well is Levene’s test:
library(car)
leveneTest(Length~Status, data=mothers)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 1.0114 0.3678
## 91
We will discuss what to do when there are problems with the normal assumption soon. If the issue is unequal variances there is simple way to go:
oneway.test(Length~Status, data=mothers)
##
## One-way analysis of means (not assuming equal variances)
##
## data: Length and Status
## F = 8.6797, num df = 2.000, denom df = 51.749, p-value = 0.0005616
The advantage of aov is a slightly higher power.
Why is this called Analysis of Variance when it is about the differences in means? Consider the two data sets shown here:
In Data 1 the means seem quite similar whereas in Data 2 they are not. Notice so:
# Data 1
round(c(Overall=var(df$x), tapply(df$x, df$y, var)), 2)
## Overall A B
## 8.90 9.48 8.43
# Data 2
round(c(Overall=var(df$z), tapply(df$z, df$y, var)), 2)
## Overall A B
## 14.73 9.48 8.35
In Data 1 the overall variance is about the same as the in-group variances, whereas in Data 2 it is much larger. So by comparing the within-group variances with the overall variance we can test whether the group means are the same.
Often if the null of no difference is rejected, one wants to go a step further and do a pairwise comparison:
There are a number of methods known for this problem, a popular one is by Tukey:
tuk <- TukeyHSD(fit)
plot(tuk)
this draws confidence intervals for the difference in means of all pairs. If an interval does not contain 0, the corresponding pair is statistically significantly different.
Here that is the case only for Drug Free - Throughout, so the other two pairs are not statistically significantly different. Remember, however that failing to reject H0 is NOT the same as accepting H0. The fact that those pairs are not statistically significantly different is almost certainly due to a lack of sample size.
That cuckoo eggs were peculiar to the locality where found was already known in 1892. A study by E.B. Chance in 1940 called The Truth About the Cuckoo demonstrated that cuckoos return year after year to the same territory and lay their eggs in the nests of a particular host species. Further, cuckoos appear to mate only within their territory. Therefore, geographical sub-species are developed, each with a dominant foster-parent species, and natural selection has ensured the survival of cuckoos most fitted to lay eggs that would be adopted by a particular foster-parent species. The data has the lengths of cuckoo eggs found in the nests of six other bird species (drawn from the work of O.M. Latter in 1902).
Basic question: is there a difference between the lengths of the cuckoo eggs of different Foster species?
kable.nice(head(cuckoo))
Bird | Length |
---|---|
Meadow Pipit | 19.65 |
Meadow Pipit | 20.05 |
Meadow Pipit | 20.65 |
Meadow Pipit | 20.85 |
Meadow Pipit | 21.65 |
Meadow Pipit | 21.65 |
How many of each birds nests do we have?
df <- data.frame(
Birds=names(table(cuckoo$Bird)),
Counts=as.numeric(table(cuckoo$Bird)))
kable.nice(df)
Birds | Counts |
---|---|
Hedge Sparrow | 14 |
Meadow Pipit | 45 |
Pied Wagtail | 15 |
Robin | 16 |
Tree Pipit | 15 |
Wren | 15 |
and we see that they differ. Whether or not this is a problem depends on the way the data is analyzed, in what follows it is ok.
The variable Bird has no obvious ordering. In this case the usual thing to do is to sort by the group means:
mn <- sort(tapply(cuckoo$Length, cuckoo$Bird, mean))
cuckoo$Bird <- factor(cuckoo$Bird,
levels = unique(names(mn)),
ordered = TRUE)
ggplot(data=cuckoo, aes(Bird, Length)) +
geom_boxplot()
we have some outliers in the Meadow Pipit species, but not to bad and we will ignore that.
Let’s look at the table of summary statistics.
out <- matrix(0, 6, 3)
colnames(out) <- c("n", "Mean", "Sd")
rownames(out) <- as.character(levels(cuckoo$Bird))
out[, 1] <- tapply(cuckoo$Length,
cuckoo$Bird, length)
out[, 2] <- round(tapply(cuckoo$Length,
cuckoo$Bird, mean), 2)
out[, 3] <- round(tapply(cuckoo$Length,
cuckoo$Bird, sd), 2)
kable.nice(out)
n | Mean | Sd | |
---|---|---|---|
Wren | 15 | 21.13 | 0.74 |
Meadow Pipit | 45 | 22.30 | 0.92 |
Robin | 16 | 22.57 | 0.68 |
Pied Wagtail | 15 | 22.90 | 1.07 |
Tree Pipit | 15 | 23.09 | 0.90 |
Hedge Sparrow | 14 | 23.12 | 1.07 |
Both the graph and the table make it clear that there are some differences in the length, so the following is not really necessary:
fit <- aov(Length~Bird, data=cuckoo)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## Bird 5 42.94 8.588 10.39 3.15e-08
## Residuals 114 94.25 0.827
Assumptions of the method:
df <- data.frame(Residuals=resid(fit),
Fits = fitted(fit))
ggplot(data=df, aes(sample=Residuals)) +
geom_qq() + geom_qq_line()
smallest stdev=0.7, largest stdev=1.1, 3*0.7=2.1>1.1, ok
So, how exactly do they differ?
tuk <- TukeyHSD(fit)
print(tuk)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Length ~ Bird, data = cuckoo)
##
## $Bird
## diff lwr upr p adj
## Meadow Pipit-Wren 1.16888889 0.383069115 1.954709 0.0004861
## Robin-Wren 1.44500000 0.497728567 2.392271 0.0003183
## Pied Wagtail-Wren 1.77333333 0.810904595 2.735762 0.0000070
## Tree Pipit-Wren 1.96000000 0.997571262 2.922429 0.0000006
## Hedge Sparrow-Wren 1.99142857 1.011964373 2.970893 0.0000006
## Robin-Meadow Pipit 0.27611111 -0.491069969 1.043292 0.9021876
## Pied Wagtail-Meadow Pipit 0.60444444 -0.181375330 1.390264 0.2324603
## Tree Pipit-Meadow Pipit 0.79111111 0.005291337 1.576931 0.0474619
## Hedge Sparrow-Meadow Pipit 0.82253968 0.015945760 1.629134 0.0428621
## Pied Wagtail-Robin 0.32833333 -0.618938100 1.275605 0.9155004
## Tree Pipit-Robin 0.51500000 -0.432271433 1.462271 0.6159630
## Hedge Sparrow-Robin 0.54642857 -0.418146053 1.511003 0.5726153
## Tree Pipit-Pied Wagtail 0.18666667 -0.775762072 1.149095 0.9932186
## Hedge Sparrow-Pied Wagtail 0.21809524 -0.761368960 1.197559 0.9872190
## Hedge Sparrow-Tree Pipit 0.03142857 -0.948035627 1.010893 0.9999990
so the eggs of Wrens are the smallest, and they are stat. significantly smaller than the eggs of all other birds.
Meadow Pipits are next, and they are stat. significantly smaller than the eggs of Tree Pipits and Hedge Sparrows.
no other differences are stat. significant!
This can get a bit hard to read, and it might be better to concentrate on those pairs that are stat. signif,. different at (say) the \(5\%\) level:
names(tuk[[1]][tuk[[1]][,4]<0.05, 4])
## [1] "Meadow Pipit-Wren" "Robin-Wren"
## [3] "Pied Wagtail-Wren" "Tree Pipit-Wren"
## [5] "Hedge Sparrow-Wren" "Tree Pipit-Meadow Pipit"
## [7] "Hedge Sparrow-Meadow Pipit"