In many experiments one of a (relatively few) treatments is applied to each subject, and the interest is in the effects of the treatments. For example, in an experiment in agriculture one might treat each plant with one of four different fertilizers. In a medical trial a new vaccince might be compared to an old vaccince as well as to a placebo and maybe even no treatment.
In an Analysis of Variance (ANOVA) the focus is on the mean responses per treatment.
Chasnoff and others obtained several measures and responses for newborn babies whose mothers were classified by degree of cocaine use.
The study was conducted in the Perinatal Center for Chemical Dependence at Northwestern University Medical School. The measurement given here is the length of the newborn.
Source: Cocaine abuse during pregnancy: correlation between prenatal care and perinatal outcome Authors: SN MacGregor, LG Keith, JA Bachicha, and IJ Chasnoff Obstetrics and Gynecology 1989;74:882-885
Here is a look at the data
kable.nice(mothers[1:10, ], do.row.names = FALSE)
Status | Length |
---|---|
Drug Free | 44.3 |
Drug Free | 45.3 |
Drug Free | 46.9 |
Drug Free | 47.0 |
Drug Free | 47.2 |
Drug Free | 47.8 |
Drug Free | 47.8 |
Drug Free | 48.5 |
Drug Free | 48.8 |
Drug Free | 49.6 |
nrow(mothers)
## [1] 94
table(mothers$Status)
##
## Drug Free First Trimester Throughout
## 39 19 36
so there were a total of 94 babies. 39 had mothers with no drug use, 19 had mothers who stopped the cocaine use early during pregnancy and 36 had mothers who continued to take cocaine until birth.
Basic Question: does the cocaine use of the mother influence the health of the baby?
A good way to start is with a multiple boxplot:
ggplot(data=mothers , aes(Status, Length)) +
geom_boxplot()
which does look like that more drug use leads to smaller (and likely less healthy) babies.
Here is another way to summarize the data:
out <- matrix(0, 3, 3)
colnames(out) <- c("Size", "Mean", "SD")
rownames(out) <- unique(mothers$Status)
out[, 1] <- tapply(mothers$Length,
mothers$Status, length)
out[, 2] <- round(tapply(mothers$Length,
mothers$Status, mean), 2)
out[, 3] <- round(tapply(mothers$Length,
mothers$Status, sd), 2)
kable.nice(out)
Size | Mean | SD | |
---|---|---|---|
Drug Free | 39 | 51.1 | 2.9 |
First Trimester | 19 | 49.3 | 2.5 |
Throughout | 36 | 48.0 | 3.6 |
and again it seems there is an effect.
How can we write this as a model? Typically this is done with
\[y_{ij}=\mu+\alpha_i+\epsilon_{ij}\]
with i=1,2,3 and j1=39, j2=19, j3=36
Here \(\mu\) is the overall mean (disregarding the treatments), \(\alpha_i\) is the mean effect due to treatment i and \(\epsilon_{ij}\) is the random effect.
We can again use matrix notation to write down this model: Let
\[\pmb{y}= \begin{pmatrix} y_{1:1} \\ \vdots \\ y_{1:39}\\y_{2:1}\\\vdots\\ y_{3:36} \end{pmatrix}\]
\[ \pmb{X} = \begin{pmatrix} 1 & 1& 0 & 0\\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & 0\\ 1 & 0 & 1 & 0\\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 1 & 0\\ 1 & 0 & 0 & 1\\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 0 & 1\\ \end{pmatrix} \]
\[\pmb{\beta} = \begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3\end{pmatrix}\]
\[\pmb{\epsilon}= \begin{pmatrix} \epsilon_{1:1} \\ \vdots \\ \epsilon_{1:39}\\\epsilon_{2:1}\\\vdots\\ \epsilon_{3:36} \end{pmatrix}\]
then the model is
\[\pmb{y}=\pmb{X\beta}+\pmb{\epsilon}\]
and we see a model of the same form as in the regression problems before. The main difference is that now the design matrix \(\pmb{X}\) is always of the form as above, that is with 0’s and 1’s.
Here \(\pmb{X}\) is \(96\times 4\) matrix of rank 3 because the first column is equal to sum of the others, which are clearly linearly independent. Because \(\pmb{X}\) is not of full rank \(\pmb{X'X}\) is singular, \((\pmb{X'X})^{-1}\) does not exist and therefore \((\pmb{X'X})^{-1}\pmb{X'y}\) can not be used to estimate \(\pmb{\beta}\).
One reason for this is that the model as written does not determine the parameters uniquely. For example,
\[y_{11} = 10+2+\epsilon_{11} = 8+4+\epsilon_{11}\]
We say the model is overparametrized. Note that this is not an issue of the sample size, it will not go away if n is increased.
There are a number of ways to overcome this problem:
Here are examples of these technics:
\[y_{ij}=\mu_i+\epsilon_{ij}\]
Add the condition \(\alpha_1+\alpha_2+\alpha_3=0\) to the model.
rewrite the model in terms of \(\alpha_1-\alpha_2\) and \(\alpha_2-\alpha_3\), that is the changes from one treatment to the next. This makes most sense if (as is the case for our data) there is an ordering of the treatments.
Reference: Loven, Faith. (1981). A Study of the Interlist Equivalency of the CID W-22 Word List Presented in Quiet and in Noise. Unpublished MS Thesis, University of Iowa.
Description: Percent of a Standard 50-word list heard correctly in the presence of background noise. 24 subjects with normal hearing listened to standard audiology tapes of English words at low volume with a noisy background. They repeated the words and were scored correct or incorrect in their perception of the words. The order of list presentation was randomized.
The word lists are standard audiology tools for assessing hearing. They are calibrated to be equally difficult to perceive. However, the original calibration was performed with normal-hearing subjects and no noise background. The experimenter wished to determine whether the lists were still equally difficult to understand in the presence of a noisy background.
kable.nice(hearingaid[1:10, ], do.row.names = FALSE)
Subject | List | Score |
---|---|---|
1 | 1 | 28 |
2 | 1 | 24 |
3 | 1 | 32 |
4 | 1 | 30 |
5 | 1 | 34 |
6 | 1 | 30 |
7 | 1 | 36 |
8 | 1 | 32 |
9 | 1 | 48 |
10 | 1 | 32 |
so here the response variable is Score and there are two predictor variables, Subject and List. The interest is specifically in List.
Both Subject and List contain numbers, but these are really labels, so in R we need to turn them into factors:
tmp <- tapply(hearingaid$Score, hearingaid$List, mean)
hearingaid$List <- factor(hearingaid$List,
levels=order(tmp),
ordered = TRUE)
tmp <- tapply(hearingaid$Score, hearingaid$Subject, mean)
hearingaid$Subject <- factor(hearingaid$Subject,
levels=order(tmp),
ordered = TRUE)
ggplot(data=hearingaid, aes(List, Score)) +
geom_boxplot()
Also the summary table for List:
sum.tbl <-
data.frame(
List=c(3, 4, 2, 1),
n=as.integer(tapply(hearingaid$Score,
hearingaid$List,length)),
Mean=round(tapply(hearingaid$Score,
hearingaid$List, mean), 1),
Sd=round(tapply(hearingaid$Score,
hearingaid$List, sd), 2)
)
rownames(sum.tbl) <- NULL
kable.nice(sum.tbl)
List | n | Mean | Sd | |
---|---|---|---|---|
1 | 3 | 24 | 25.2 | 8.32 |
2 | 4 | 24 | 25.6 | 7.78 |
3 | 2 | 24 | 29.7 | 8.06 |
4 | 1 | 24 | 32.8 | 7.41 |
Notice the ordering by mean response.
Analogously to the previous discussion we can write the model for this data as
\[y_{ij}=\mu+\alpha_i+\beta_j+\epsilon_{ij}\]
for i=1,..,4 and j=1,..,24
In matrix form we have
\[\pmb{y}= \begin{pmatrix} y_{1:1} \\ \vdots \\ y_{4:24} \end{pmatrix}\]
\[ \pmb{X} = \begin{pmatrix} 1 & 1& 0 & 0 & 0 & 1 & 0 & ... & 0\\ 1 & 1& 0 & 0 & 0 & 0 & 1 & ... & 0\\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1& 0 & 0 & 0 & 0 & 0 & ... & 1\\ 1 & 0& 1 & 0 & 0 & 1 & 0 & ... & 0\\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0& 0 & 0 & 1 & 0 & 0 & ... & 1\\ \end{pmatrix} \]
\[\pmb{\beta} = \begin{pmatrix} \mu & \alpha_1 & ... & \alpha_4 & \beta_1 & ... &\beta_{24}\end{pmatrix}'\]
\[\pmb{\epsilon}= \begin{pmatrix} \epsilon_{11} \\ \vdots \\ \epsilon_{4:24} \end{pmatrix}\]
then the model is
\[\pmb{y}=\pmb{X\beta}+\pmb{\epsilon}\]
How can we create such a matrix \(\pmb{X}\) in R? Note for any block with 1’s in columns 2 to 5 the following column form a diagonal matrix. Let’s illustrate this with i=3,j=4 instead of 24, so it is easier to see:
make.X=function(I, J) {
X=NULL
for(i in 1:I) {
tmp=matrix(0, J, I)
tmp[, i]=1
X=rbind(X, cbind(tmp, diag(J)))
}
X=cbind(1, X)
X
}
make.X(3,4)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 1 0 0 1 0 0 0
## [2,] 1 1 0 0 0 1 0 0
## [3,] 1 1 0 0 0 0 1 0
## [4,] 1 1 0 0 0 0 0 1
## [5,] 1 0 1 0 1 0 0 0
## [6,] 1 0 1 0 0 1 0 0
## [7,] 1 0 1 0 0 0 1 0
## [8,] 1 0 1 0 0 0 0 1
## [9,] 1 0 0 1 1 0 0 0
## [10,] 1 0 0 1 0 1 0 0
## [11,] 1 0 0 1 0 0 1 0
## [12,] 1 0 0 1 0 0 0 1
The rank of our matrix \(\pmb{X}\) is
qr(make.X(4,24))$rank
## [1] 27
but there are 29 (=1+4+28) parameters, so again not all can be estimated.