In the case of two categorical variables, knowing that they are somehow related is usually enough; beyond that one simply considers the percentages.
The case of a categorical predictor with 2 groups and a quantitative response is done: the two groups are different. The only other thing one might do is find a confidence interval for the difference in means; see the 2-sample t method.
We have previously run the ANOVA and found that there are differences between the lengths of the babies of different groups. We can go a step further, though, and ask the following questions:
is there a difference between the Drug Free and the First Trimester group?
is there a difference between the First Trimester and the Throughout group?
In other words, we can try to study the pairwise differences, which is an example of a multiple comparison study.
As we said before, we could do this by running the 2-sample t test on each pair, but then we would be doing simultaneous inference. What we need is a method that does this but in such a way that the overall type I error probability is the desired \(\alpha\), no matter how many tests are done. R has a number of such methods implemented. We will use the one due to John Tukey, one of the founders of modern Statistics.
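To get a feeling for why separate tests are a problem, here is a minimal sketch of how the chance of at least one false rejection grows with the number of tests. It assumes the tests are independent, which is only approximately true for pairwise comparisons that share groups, so the numbers are a rough guide. The names familywise_error and n_pairs are made up for this illustration.

```r
# Chance of at least one type I error among k tests, each run at level alpha.
# ASSUMPTION: the k tests are independent (only approximately true for
# pairwise comparisons, because they share groups).
familywise_error <- function(k, alpha = 0.05) {
  1 - (1 - alpha)^k
}

# Number of pairwise comparisons among g groups
n_pairs <- function(g) choose(g, 2)

n_pairs(3)                     # 3 pairs for three groups
familywise_error(n_pairs(3))   # about 0.14 rather than 0.05
n_pairs(6)                     # 15 pairs for six groups
familywise_error(n_pairs(6))   # about 0.54
```

So with three groups the overall error probability is already almost three times the nominal \(\alpha\), and it gets worse quickly as the number of groups grows.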
attach(mothers)
tukey(Length, Status)
## Groups that are statistically significantly different:
## Groups p.value
## 1 Drug Free-Throughout 0
What does this tell us? To find out we first need to see the groups in the order of their means. We already know this here but in general a nice command to get that is
stat.table(Length, Status, Sort=TRUE)
## Sample Size Mean Standard Deviation
## Throughout 36 48.0 3.6
## First Trimester 19 49.3 2.5
## Drug Free 39 51.1 2.9
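The stat.table command is one of the routines used in this course. As a rough stand-in, here is a minimal base R sketch of the same idea: a table of sample size, mean and standard deviation, sorted by the mean. The helper name group_summary is hypothetical, and the built-in PlantGrowth data set is used only so that the example runs on its own.

```r
# Group summary sorted by mean, using only base R.
# group_summary is a hypothetical helper, not a course command.
group_summary <- function(y, x) {
  x <- factor(x)
  out <- data.frame(
    n    = as.vector(tapply(y, x, length)),
    mean = as.vector(tapply(y, x, mean)),
    sd   = as.vector(tapply(y, x, sd)),
    row.names = levels(x)
  )
  out[order(out$mean), ]   # group with the smallest mean comes first
}

# PlantGrowth ships with R: plant weights for three treatment groups
group_summary(PlantGrowth$weight, PlantGrowth$group)
```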
Now we are told that the only stat. significant difference is between Drug Free and Throughout, so of course
the difference between Drug Free and First Trimester is NOT stat. significant
the difference between First Trimester and Throughout is NOT stat. significant
BUT: most importantly, we need to remember the difference between failing to reject H0 and accepting H0, so this does NOT say that there is no difference between (say) Drug Free and First Trimester (why not?)
So now we have the following interpretation:
There is a stat. signif. difference between the mean lengths of the babies of Drug Free mothers and those who took cocaine throughout the pregnancy. Other differences are not stat. signif., at least not at these sample sizes.
Note: It is theoretically possible that the oneway command finds a statistically significant difference but Tukey does not, and vice versa! What you want to do is this: run the oneway command.
If it DOES NOT reject the null of no differences, DO NOTHING.
If it DOES reject the null, run tukey.
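The two-step rule above can be sketched in base R as follows. This uses aov and TukeyHSD from the stats package rather than the course commands oneway and tukey, and we have not checked that tukey computes exactly the same thing, so treat it as an illustration of the workflow only. The built-in PlantGrowth data set stands in for the course data.

```r
# Step 1: ANOVA. Step 2: Tukey, but only if the ANOVA rejects equal means.
# Base R functions aov() and TukeyHSD(); PlantGrowth ships with R.
fit <- aov(weight ~ group, data = PlantGrowth)
p_anova <- summary(fit)[[1]][["Pr(>F)"]][1]   # p value of the F test

alpha <- 0.05
if (p_anova < alpha) {
  tuk <- TukeyHSD(fit, conf.level = 1 - alpha)$group
  # keep only the pairs whose adjusted p value is below alpha
  significant <- tuk[tuk[, "p adj"] < alpha, , drop = FALSE]
  print(round(significant, 4))
} else {
  significant <- NULL
  message("ANOVA does not reject equal means; no pairwise comparisons run.")
}
```

The matrix returned by TukeyHSD also contains the estimated differences and simultaneous confidence intervals (columns diff, lwr and upr), which can be useful for the interpretation.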
That cuckoo eggs were peculiar to the locality where found was already known in 1892. A study by E.B. Chance in 1940 called The Truth About the Cuckoo demonstrated that cuckoos return year after year to the same territory and lay their eggs in the nests of a particular host species. Further, cuckoos appear to mate only within their territory. Therefore, geographical sub-species are developed, each with a dominant foster-parent species, and natural selection has ensured the survival of cuckoos most fitted to lay eggs that would be adopted by a particular foster-parent species. The data has the lengths of cuckoo eggs found in the nests of six other bird species (drawn from the work of O.M. Latter in 1902).
Basic question: is there a difference between the lengths of the cuckoo eggs found in the nests of different foster species?
attach(cuckoo)
head(cuckoo)
## Bird Length
## 1 Meadow Pipit 19.65
## 2 Meadow Pipit 20.05
## 3 Meadow Pipit 20.65
## 4 Meadow Pipit 20.85
## 5 Meadow Pipit 21.65
## 6 Meadow Pipit 21.65
table(Bird)
## Bird
## Hedge Sparrow Meadow Pipit Pied Wagtail Robin Tree Pipit
## 14 45 15 16 15
## Wren
## 15
bplot(Length, Bird, new_order = "Size")
where we ordered the boxes by size because the categorical variable here has no obvious ordering.
We have some outliers in the Meadow Pipit species, but they are not too bad and we will ignore them.
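For readers without the bplot command, a similar ordered boxplot can be sketched in base R. This sketch orders the groups by their mean, which is an assumption: the rule behind bplot's "Size" option may be different. The built-in PlantGrowth data set is used so the example is self-contained.

```r
# Order the groups by their mean before plotting, using base R only.
# ASSUMPTION: ordering by mean; bplot's "Size" option may use another rule.
ordered_group <- reorder(PlantGrowth$group, PlantGrowth$weight, FUN = mean)
levels(ordered_group)   # groups from smallest to largest mean

# Boxes now appear in the order of the group means
boxplot(PlantGrowth$weight ~ ordered_group,
        xlab = "Group", ylab = "Weight")
```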
Let’s look at the table of summary statistics.
stat.table(Length, Bird, Sort=TRUE)
## Sample Size Mean Standard Deviation
## Wren 15 21.1 0.7
## Meadow Pipit 45 22.3 0.9
## Robin 16 22.6 0.7
## Pied Wagtail 15 22.9 1.1
## Tree Pipit 15 23.1 0.9
## Hedge Sparrow 14 23.1 1.1
Both the graph and the table make it clear that there are some differences in the length, so the following is not really necessary:
oneway(Length, Bird)
## p value of test of equal means: p = 0.000
## Smallest sd: 0.7 Largest sd : 1.1
Assumptions of the method:
residuals have a normal distribution, plot looks ok
groups have equal variance
smallest stdev=0.7, largest stdev=1.1, 3*0.7=2.1>1.1, ok
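The equal variance check used here is a simple rule of thumb and can be written as a one-line function. This is only a sketch; the name equal_variance_ok is made up, and the default ratio of 3 is the one used in the check above.

```r
# Rule of thumb: equal variance is acceptable if the largest group
# standard deviation is less than 3 times the smallest.
# equal_variance_ok is a hypothetical helper, not a course command.
equal_variance_ok <- function(sds, ratio = 3) {
  max(sds) < ratio * min(sds)
}

# Smallest and largest standard deviation from the oneway output
equal_variance_ok(c(0.7, 1.1))   # TRUE, because 1.1 < 3 * 0.7 = 2.1
```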
So, how exactly do they differ?
tukey(Length, Bird)
## Groups that are statistically significantly different:
## Groups p.value
## 1 Meadow Pipit-Wren 0.0000
## 2 Robin-Wren 0.0000
## 3 Pied Wagtail-Wren 0.0000
## 4 Tree Pipit-Wren 0.0000
## 5 Hedge Sparrow-Wren 0.0000
## 6 Tree Pipit-Meadow Pipit 0.0475
## 7 Hedge Sparrow-Meadow Pipit 0.0429
So the cuckoo eggs found in Wren nests are the smallest, and they are stat. significantly smaller than the eggs found in the nests of all other birds.
The eggs in Meadow Pipit nests are next, and they are stat. significantly smaller than the eggs in the nests of Tree Pipits and Hedge Sparrows.
No other differences are stat. significant!
On occasion one might want to see the p values of all the pairwise comparisons, for example if one wants to use an \(\alpha\) different from \(0.05\):
tukey(Length, Bird, show.all = TRUE)
## Groups p.value
## 1 Meadow Pipit-Wren 0.0000
## 2 Robin-Wren 0.0000
## 3 Pied Wagtail-Wren 0.0000
## 4 Tree Pipit-Wren 0.0000
## 5 Hedge Sparrow-Wren 0.0000
## 6 Robin-Meadow Pipit 0.9022
## 7 Pied Wagtail-Meadow Pipit 0.2325
## 8 Tree Pipit-Meadow Pipit 0.0475
## 9 Hedge Sparrow-Meadow Pipit 0.0429
## 10 Pied Wagtail-Robin 0.9155
## 11 Tree Pipit-Robin 0.6160
## 12 Hedge Sparrow-Robin 0.5726
## 13 Tree Pipit-Pied Wagtail 0.9932
## 14 Hedge Sparrow-Pied Wagtail 0.9872
## 15 Hedge Sparrow-Tree Pipit 1.0000
Notice that the pairs in tukey are also ordered from smallest to largest: first comes Meadow Pipit - Wren, the two birds with the smallest mean lengths.
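Once all the p values are available, changing \(\alpha\) is just a matter of filtering the table. Here is a small sketch that types in the p values from the output above and picks out the significant pairs for a given \(\alpha\). The names pvals and significant_at are made up for this illustration.

```r
# p values copied from the tukey(Length, Bird, show.all = TRUE) output
pvals <- data.frame(
  Groups = c("Meadow Pipit-Wren", "Robin-Wren", "Pied Wagtail-Wren",
             "Tree Pipit-Wren", "Hedge Sparrow-Wren",
             "Robin-Meadow Pipit", "Pied Wagtail-Meadow Pipit",
             "Tree Pipit-Meadow Pipit", "Hedge Sparrow-Meadow Pipit",
             "Pied Wagtail-Robin", "Tree Pipit-Robin",
             "Hedge Sparrow-Robin", "Tree Pipit-Pied Wagtail",
             "Hedge Sparrow-Pied Wagtail", "Hedge Sparrow-Tree Pipit"),
  p.value = c(0, 0, 0, 0, 0, 0.9022, 0.2325, 0.0475, 0.0429,
              0.9155, 0.6160, 0.5726, 0.9932, 0.9872, 1)
)

# Pairs that are statistically significant at level alpha
significant_at <- function(tab, alpha) tab[tab$p.value < alpha, ]

nrow(significant_at(pvals, 0.05))   # 7 pairs, as in the default output
significant_at(pvals, 0.01)         # only the five comparisons with Wren
```

At \(\alpha=0.01\) the two differences involving Meadow Pipit (p values 0.0475 and 0.0429) are no longer stat. significant.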