Chasnoff and others obtained several measures and responses for newborn babies whose mothers were classified by degree of cocaine use. The study was conducted in the Perinatal Center for Chemical Dependence at Northwestern University Medical School. The measurement given here is the length of the newborn. Each baby was classified by the cocaine use of the mother: Drug Free - the mother used no drugs of any kind, First Trimester - the mother used cocaine but stopped during the first trimester (the first three months of pregnancy), and Throughout - the mother used cocaine until birth.
Is there a statistically significant difference between the groups?
Source: Cocaine abuse during pregnancy: correlation between prenatal care and perinatal outcome. Authors: SN MacGregor, LG Keith, JA Bachicha, and IJ Chasnoff. Obstetrics & Gynecology 1989; 74: 882-885.
kable.nice(mothers[c(1:2, 40:41, 80:81), ])
 | Status | Length |
---|---|---|
1 | Drug Free | 44.3 |
2 | Drug Free | 45.3 |
40 | First Trimester | 45.1 |
41 | First Trimester | 45.7 |
80 | Throughout | 48.5 |
81 | Throughout | 49.0 |
What is a probability model here? In its most general form it is as follows: we have observations \[ X_{ij} \sim F_i\text{, } i=1,..,k\text{, } j=1,..,n_i \] that is, each group has its own distribution. A look at the boxplot and the normal probability plots makes it appear, though, as if each of the distributions were actually normal:
ggplot(mothers, aes(Status, Length)) +
geom_boxplot()
ggplot(data=mothers, aes(sample=Length)) +
geom_qq() + geom_qq_line()
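The normal plot above pools all three groups; to check normality within each group one can draw one panel per group (a small variation on the code above, not part of the original analysis):

ggplot(data=mothers, aes(sample=Length)) +
  geom_qq() + geom_qq_line() +
  facet_wrap(~Status)   # one normal probability plot per group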
So we can write the probability model:
\[ X_{ij} \sim N(\mu_i, \sigma_i)\text{, } i=1,..,k\text{, } j=1,..,n_i \] The boxplot in particular suggests a further simplification of the model, namely that the standard deviations are the same, so we have
\[ X_{ij} \sim N(\mu_i, \sigma)\text{, } i=1,..,k\text{, } j=1,..,n_i \]
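A quick numerical check of this simplification (not shown in the original analysis) is to compare the sample standard deviations of the three groups; they should come out roughly equal:

# sample standard deviation of Length in each group
round(tapply(mothers$Length, mothers$Status, sd), 2)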
Standard ANOVA terminology would write this model as follows:
\[ X_{ij} = \mu_i + \epsilon_{ij} \\ \epsilon_{ij} \sim N(0, \sigma) \] where the \(\epsilon_{ij}\) are called the residuals.
The basic ANOVA test is then
\[H_0: \mu_1=\mu_2=\mu_3 \text{ vs } H_a: \mu_i \ne \mu_j\]
for some \(i\) and \(j\).
Let’s derive the likelihood ratio test for this problem (here and in what follows \(n=\sum_{i=1}^k n_i\) denotes the total number of observations):
\[
\begin{aligned}
&f(x_{11},..,x_{kn_k}\vert \pmb{\mu},\sigma^2) =\prod_{i=1}^k\prod_{j=1}^{n_i} \frac1{\sqrt{2\pi\sigma^2}}\exp \left\{-\frac1{2\sigma^2}\left(x_{ij}-\mu_i\right)^2 \right\} =\\
&(2\pi\sigma^2)^{-n/2} \exp \left\{-\frac1{2\sigma^2}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\mu_i\right)^2 \right\}\\
&l(\pmb{\mu},\sigma^2\vert \pmb{x}) =-\frac{n}2\log(2\pi\sigma^2) -\frac1{2\sigma^2}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\mu_i\right)^2\\
&\frac{dl(\pmb{\mu},\sigma^2\vert \pmb{x})}{d\mu_i} = \frac1{\sigma^2}\sum_{j=1}^{n_i}\left(x_{ij}-\mu_i\right)=0\\
&\hat{\mu}_i=\frac1{n_i}\sum_{j=1}^{n_i}x_{ij}=:\bar{x}_{i.}\\
&\frac{dl(\pmb{\mu},\sigma^2\vert \pmb{x})}{d(\sigma^2)} = -\frac{n}{2\sigma^2}+\frac1{2(\sigma^2)^2}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\mu_i\right)^2=0\\
&\widehat{\sigma^2}=\frac1n \sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{i.}\right)^2
\end{aligned}
\]
under the null hypothesis
\[H_0: \mu_1= ... =\mu_k=\mu\] and so
\[
\begin{aligned}
&L(\mu,\sigma^2\vert\pmb{x}) = f(x_{11},..,x_{kn_k}\vert \mu,\sigma^2) =\\
&(2\pi\sigma^2)^{-n/2} \exp \left\{-\frac1{2\sigma^2}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\mu\right)^2 \right\}\\
&l(\mu,\sigma^2\vert \pmb{x}) =-\frac{n}2\log(2\pi\sigma^2) -\frac1{2\sigma^2}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\mu\right)^2\\
&\frac{dl(\mu,\sigma^2\vert \pmb{x})}{d\mu} = \frac1{\sigma^2}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\mu\right)=0\\
&\hat{\hat{\mu}}=\frac1{n}\sum_{i=1}^k\sum_{j=1}^{n_i}x_{ij}=:\bar{x}_{..}\\
&\frac{dl(\mu,\sigma^2\vert \pmb{x})}{d(\sigma^2)} = -\frac{n}{2\sigma^2}+\frac1{2(\sigma^2)^2}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\mu\right)^2=0\\
&\widehat{\widehat{\sigma^2}}=\frac1n \sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)^2
\end{aligned}
\]
Now we find the likelihood ratio test statistic:
\[ \begin{aligned} &\lambda(\pmb{x}) = \frac{L(\widehat{\widehat{\mu}},\widehat{\widehat{\sigma^2}}\vert\pmb{x})}{L(\hat{\pmb{\mu}},\widehat{\sigma^2}\vert\pmb{x})}=\\ &\frac{(2\pi\widehat{\widehat{\sigma^2}})^{-n/2} \exp \left\{-\frac1{2\widehat{\widehat{\sigma^2}}}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)^2 \right\}}{(2\pi\widehat{\sigma^2})^{-n/2} \exp \left\{-\frac1{2\widehat{\sigma^2}}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{i.}\right)^2 \right\}} = \\ &\frac{(\widehat{\widehat{\sigma^2}})^{-n/2} \exp \left\{-\frac1{2\widehat{\widehat{\sigma^2}}}n\widehat{\widehat{\sigma^2}} \right\}}{(\widehat{\sigma^2})^{-n/2} \exp \left\{-\frac1{2\widehat{\sigma^2}}n\widehat{\sigma^2} \right\}} = \\ &\left(\frac{\widehat{\sigma^2}}{\widehat{\widehat{\sigma^2}}}\right)^{n/2} \end{aligned} \]
You can now see why this is called the analysis of variance, although it really is a method concerned with means. Now, as always,
\[
\begin{aligned}
&\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{i.}\right)^2 = \\
&\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}+\bar{x}_{..}-\bar{x}_{i.}\right)^2 = \\
&\sum_{j=1}^{n_i}\left[\left(x_{ij}-\bar{x}_{..}\right)^2+2\left(x_{ij}-\bar{x}_{..}\right)\left(\bar{x}_{..}-\bar{x}_{i.}\right)+\left(\bar{x}_{..}-\bar{x}_{i.}\right)^2 \right] = \\
&\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)^2+2\left(\bar{x}_{..}-\bar{x}_{i.}\right)\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)+n_i\left(\bar{x}_{..}-\bar{x}_{i.}\right)^2 = \\
&\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)^2+2\left(\bar{x}_{..}-\bar{x}_{i.}\right)\left(\sum_{j=1}^{n_i}x_{ij}-n_i\bar{x}_{..}\right)+n_i\left(\bar{x}_{..}-\bar{x}_{i.}\right)^2 = \\
&\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)^2+2\left(\bar{x}_{..}-\bar{x}_{i.}\right)\left(n_i\bar{x}_{i.}-n_i\bar{x}_{..}\right)+n_i\left(\bar{x}_{..}-\bar{x}_{i.}\right)^2 = \\
&\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)^2-2n_i\left(\bar{x}_{..}-\bar{x}_{i.}\right)^2+n_i\left(\bar{x}_{..}-\bar{x}_{i.}\right)^2 = \\
&\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)^2-n_i\left(\bar{x}_{..}-\bar{x}_{i.}\right)^2
\end{aligned}
\]
Summing this identity over \(i=1,..,k\) and dividing by \(n\) yields
\[\widehat{\sigma^2} = \widehat{\widehat{\sigma^2}}-\frac1n\sum_{i=1}^kn_i\left(\bar{x}_{i.}-\bar{x}_{..}\right)^2\]
and so
\[
\begin{aligned}
&\lambda(\pmb{x}) = \left(\frac{\widehat{\sigma^2}}{\widehat{\widehat{\sigma^2}}}\right)^{n/2} = \\
& \left(\frac{\widehat{\sigma^2}}{\widehat{\sigma^2}+\frac1n\sum_{i=1}^kn_i\left(\bar{x}_{i.}-\bar{x}_{..}\right)^2}\right)^{n/2} = \\
&\left(\frac{1}{1+\frac{\frac1n\sum_{i=1}^kn_i\left(\bar{x}_{i.}-\bar{x}_{..}\right)^2}{\widehat{\sigma^2}}}\right)^{n/2} = \\
&\left(\frac{1}{1+\frac{\sum_{i=1}^kn_i\left(\bar{x}_{i.}-\bar{x}_{..}\right)^2}{ \sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{i.}\right)^2}}\right)^{n/2}
\end{aligned}
\]
This is a decreasing function of the ratio below, and the likelihood ratio test rejects the null hypothesis for small values of \(\lambda(\pmb{x})\), that is if
\[\frac{\sum_{i=1}^kn_i\left(\bar{x}_{i.}-\bar{x}_{..}\right)^2}{ \sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{i.}\right)^2}\] is large.
In the numerator we have a measure of the variation of the group means around the overall mean, and in the denominator a measure of the variation within the groups. It can be shown that this test statistic, when scaled by \((n-k)/(k-1)\), has an F distribution with \(k-1\) and \(n-k\) degrees of freedom under the null hypothesis.
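To make this claim plausible without a proof, here is a small simulation (a sketch, not part of the original notes): we generate data under the null hypothesis, with the same group sizes as in our study, and compare the simulated statistics to the quantiles of an F(2, 91) distribution:

# simulate the (scaled) test statistic under H0 and compare it to F(k-1, n-k)
B <- 10000
n.i <- c(39, 19, 36)      # the group sizes in the mothers data
k <- length(n.i); n.tot <- sum(n.i)
g <- rep(1:k, times=n.i)  # group membership
Fsim <- numeric(B)
for(b in 1:B) {
  x <- rnorm(n.tot)       # all groups have the same mean and sd, so H0 is true
  xbar.i <- tapply(x, g, mean)
  ms.between <- sum(n.i*(xbar.i - mean(x))^2)/(k-1)
  ms.within <- sum((x - xbar.i[g])^2)/(n.tot - k)
  Fsim[b] <- ms.between/ms.within
}
# simulated quantiles should match the theoretical F quantiles closely
round(rbind(simulated = quantile(Fsim, c(0.5, 0.9, 0.95, 0.99)),
            theoretical = qf(c(0.5, 0.9, 0.95, 0.99), k-1, n.tot-k)), 2)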
Let’s find all the relevant numbers for our data set:
x.. <- mean(mothers$Length)
sum((mothers$Length-x..)^2)
## [1] 1066.955
round(x.., 2)
## [1] 49.55
n <- tapply(mothers$Length, mothers$Status, length)
n
## Drug Free First Trimester Throughout
## 39 19 36
x. <- tapply(mothers$Length, mothers$Status, mean)
round(x., 2)
## Drug Free First Trimester Throughout
## 51.1 49.3 48.0
num <- sum(n*(x.-x..)^2)
num
## [1] 181.3749
# residuals within each group: x_ij - xbar_i.
residuals <- 0*mothers$Length
st <- names(x.)   # group labels, in the same order as the means in x.
for(i in 1:3)
  residuals[mothers$Status==st[i]] <-
    mothers$Length[mothers$Status==st[i]] - x.[i]
denom <- sum(residuals^2)
denom
## [1] 885.58
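As the decomposition derived above shows, the between-group and within-group sums of squares have to add up to the total sum of squares computed at the start:

# check: between SS + within SS equals the total sum of squares
num + denom
## [1] 1066.955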
mean.square <- c(num/2, denom/91)   # divide by k-1 = 2 and n-k = 94-3 = 91
round(mean.square, 2)
## [1] 90.69 9.73
Fs <- mean.square[1]/mean.square[2]
round(Fs, 3)
## [1] 9.319
1-pf(Fs, 2, 91)
## [1] 0.0002080747
This information is usually summarized in an ANOVA table:
df <- data.frame(DF=c(2, 91),
                 SS=c(181.375, 885.58),
                 Mean=c(90.69, 9.73),
                 F=c(9.319, ""),
                 p.value=c("0.0002", ""))
rownames(df) <- c("Status", "Residuals")
kable.nice(df)
 | DF | SS | Mean | F | p.value |
---|---|---|---|---|---|
Status | 2 | 181.375 | 90.69 | 9.319 | 0.0002 |
Residuals | 91 | 885.580 | 9.73 | | |
or of course we can use R:
summary(aov(Length~Status, data=mothers))
## Df Sum Sq Mean Sq F value Pr(>F)
## Status 2 181.4 90.69 9.319 0.000208
## Residuals 91 885.6 9.73
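The same numbers can also be obtained from the linear-model routine, because a one-way ANOVA is just a regression with a categorical predictor (an equivalent alternative, not shown in the original notes):

# equivalent fit via lm; anova() reproduces the table above
anova(lm(Length~Status, data=mothers))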
The basic F test is rarely of great interest; ANOVA becomes more interesting when we test more specific hypotheses.
A contrast is an expression of the form
\(\sum a_i \mu_i\)
where \((a_1,..,a_k)\) is such that \(\sum a_i=0\).
With this notation we can write many interesting hypotheses:

- \(H_0: \sum a_i\mu_i =0\) for all \((a_1,..,a_k)\) with \(\sum a_i=0\) is the hypothesis of the F test above, namely that all the means are equal.
- \(H_0: \sum a_i \mu_i =0\) for \((1,-1,0)\), \((1,0,-1)\) and \((0,1,-1)\) tests the pairwise equalities \(\mu_1=\mu_2\), \(\mu_1=\mu_3\) and \(\mu_2=\mu_3\).
- \(H_0: \sum a_i \mu_i =0\) for \((1/2, 1/2, -1)\) tests whether \((\mu_1+ \mu_2)/2= \mu_3\) (see the quick calculation below).
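To make this last contrast concrete, its estimate \(\sum a_i \bar{x}_{i.}\) with the (rounded) group means computed above is about \((51.1+49.3)/2-48.0=2.2\):

# estimate of the contrast (mu1+mu2)/2 - mu3 from the sample means
a <- c(1/2, 1/2, -1)
sum(a*x.)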
To test \(H_0: \sum a_i \mu_i=0\) vs \(H_a: \sum a_i \mu_i \ne 0\) use the test statistic
\[T=\frac{\vert \sum_{i=1}^k a_i\bar{x}_{i.}\vert}{s_p\sqrt{\sum_{i=1}^k a_i^2/n_i}}\]
where
\[s_p^2=\frac1{n-k}\sum_{i=1}^k\sum_{j=1}^{n_i} (x_{ij}-\bar{x}_{i.})^2\] is the pooled estimate of the variance. Under \(H_0\) the signed version of \(T\) has a \(t(n-k)\) distribution, so the p-value of the two-sided test is \(2P(t(n-k)>T)\).
We test
\[H_0: \sum a_i \mu _i =0\text{ vs }H_a: \sum a_i \mu_i \ne 0\]
for \(a=(0,1,-1)\), that is, whether the First Trimester and Throughout groups have the same mean length:
a <- c(0, 1, -1)
Ts <- abs(sum(x.*a))/sqrt(mean.square[2]*sum(a^2/n))
out <- round(c(Ts, 2*(1-pt(Ts, sum(n)-3))), 3)
names(out) <- c("T", "p-value")
out
## T p-value
## 1.470 0.145
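If several contrasts are of interest, the calculation can be wrapped into a small helper function (a sketch, not part of the original analysis; the name contrast.test is made up):

# test H0: sum(a*mu) = 0 against the two-sided alternative
contrast.test <- function(a, xbar, ni, sp2) {
  Ts <- abs(sum(a*xbar))/sqrt(sp2*sum(a^2/ni))
  df <- sum(ni) - length(ni)          # n - k
  round(c(T=Ts, p.value=2*(1-pt(Ts, df))), 3)
}
contrast.test(c(0, 1, -1), x., n, mean.square[2])      # reproduces the test above
contrast.test(c(1/2, 1/2, -1), x., n, mean.square[2])  # tests (mu1+mu2)/2 = mu3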
One important point to remember here is that the classical ANOVA method is just a straightforward application of the likelihood ratio test.
A Bayesian analysis of this model will of course depend on the priors chosen for \((\mu_1,.., \mu_k, \sigma)\). If we choose noninformative priors, for example proportional to \(1/\sigma^2\), then we recover essentially the classical ANOVA above.
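Here is a minimal sketch of what that can look like, assuming the prior \(\pi(\mu_1,..,\mu_k,\sigma^2)\propto 1/\sigma^2\) (none of this code is in the original notes). Under this prior the posterior can be sampled directly: given the data, \(\sum_{ij}(x_{ij}-\bar{x}_{i.})^2/\sigma^2\sim\chi^2(n-k)\), and given \(\sigma^2\) the \(\mu_i\) are independent \(N(\bar{x}_{i.},\sigma^2/n_i)\):

# sketch of a Bayesian one-way ANOVA under the prior proportional to 1/sigma^2
B <- 10000
# sigma^2 | data:  SS_within / sigma^2 ~ chi^2(n-k)
sigma2 <- denom/rchisq(B, df=sum(n)-3)
# mu_i | sigma^2, data ~ N(xbar_i., sigma^2/n_i)
mu <- sapply(1:3, function(i) rnorm(B, x.[i], sqrt(sigma2/n[i])))
colnames(mu) <- names(x.)
# posterior probability that babies of drug-free mothers are longer on average
mean(mu[, "Drug Free"] > mu[, "Throughout"])
# 95% credible intervals for the group means
round(apply(mu, 2, quantile, c(0.025, 0.975)), 2)

In particular the posterior means of the \(\mu_i\) are just the sample means \(\bar{x}_{i.}\), which is the sense in which the classical analysis is recovered.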