Analysis of Variance (ANOVA)

Example (7.4.1)

Chasnoff and others obtained several measures and responses for newborn babies whose mothers were classified by degree of cocaine use. The study was conducted in the Perinatal Center for Chemical Dependence at Northwestern University Medical School. The measurement given here is the length of the newborn. Each baby was classified by the cocaine use of the mother: Drug Free - no drugs of any kind, First Trimester - the mother used cocaine but stopped during the first trimester (the first three months of pregnancy), and Throughout - the mother used cocaine until birth.

Is there a statistically significant difference between the groups?

Source: MacGregor SN, Keith LG, Bachicha JA, Chasnoff IJ. Cocaine abuse during pregnancy: correlation between prenatal care and perinatal outcome. Obstetrics & Gynecology 1989;74:882-885.

kable.nice(mothers[c(1:2, 40:41, 80:81), ])
Status Length
1 Drug Free 44.3
2 Drug Free 45.3
40 First Trimester 45.1
41 First Trimester 45.7
80 Throughout 48.5
81 Throughout 49.0

What is a probability model here? In its most general form it is as follows: we have observations \[ X_{ij} \sim F_i\text{, } i=1,..,k\text{, } j=1,..,n_i \] that is, each group has its own distribution. A look at the boxplot and the normal probability plots suggests, though, that each of the distributions is actually normal:

ggplot(mothers, aes(Status, Length)) +
  geom_boxplot()

ggplot(data=mothers, aes(sample=Length)) +
  geom_qq() + geom_qq_line() +
  facet_wrap(~Status)   # one normal probability plot per group

So we can write the probability model:

\[ X_{ij} \sim N(\mu_i, \sigma_i)\text{, } i=1,..,k\text{, } j=1,..,n_i \] The boxplot in particular strongly suggests a further simplification of the model, namely that the standard deviations are all the same, so we have

\[ X_{ij} \sim N(\mu_i, \sigma)\text{, } i=1,..,k\text{, } j=1,..,n_i \]

Standard ANOVA terminology would write this model as follows:

\[ X_{ij} = \mu_i + \epsilon_{ij} \\ \epsilon_{ij} \sim N(0, \sigma) \] where the \(\epsilon_{ij}\) are called the residuals.
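To make the model concrete, here is a minimal sketch that generates one data set from it; the parameter values are made up, chosen only to be roughly in the range of the mothers data.

# simulate from X_ij = mu_i + e_ij, e_ij ~ N(0, sigma)
set.seed(1066)
mu <- c(51, 49, 48)     # made-up group means
sigma <- 3              # made-up common standard deviation
ni <- c(39, 19, 36)     # group sample sizes as in the mothers data
sim <- data.frame(
  Status=rep(c("Drug Free", "First Trimester", "Throughout"), times=ni),
  Length=rep(mu, times=ni) + rnorm(sum(ni), 0, sigma))
head(sim, 3)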

The basic ANOVA test is then

\[H_0: \mu_1=\mu_2=\mu_3\text{ vs }H_a: \mu_i \ne \mu_j\]

for some pair \(i \ne j\).

Frequentist Solution

Let’s derive the likelihood ratio test for this problem:

\[ \begin{aligned} &f(x_{11},..,x_{kn_k}\vert \pmb{\mu},\sigma^2) =\prod_{i=1}^k\prod_{j=1}^{n_i} \frac1{\sqrt{2\pi\sigma^2}}\exp \left\{-\frac1{2\sigma^2}\left(x_{ij}-\mu_i\right)^2 \right\} =\\ &(2\pi\sigma^2)^{-n/2} \exp \left\{-\frac1{2\sigma^2}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\mu_i\right)^2 \right\}\\ &l(\pmb{\mu},\sigma^2\vert \pmb{x}) =-\frac{n}2\log(2\pi\sigma^2) -\frac1{2\sigma^2}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\mu_i\right)^2\\ &\frac{dl(\pmb{\mu},\sigma^2\vert \pmb{x})}{d\mu_i} = \frac1{\sigma^2}\sum_{j=1}^{n_i}\left(x_{ij}-\mu_i\right)=0\\ &\hat{\mu}_i=\frac1{n_i}\sum_{j=1}^{n_i}x_{ij}=:\bar{x}_{i.}\\ &\frac{dl(\pmb{\mu},\sigma^2\vert \pmb{x})}{d(\sigma^2)} = -\frac{n}{2\sigma^2}+\frac1{2(\sigma^2)^2}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\mu_i\right)^2=0\\ &\widehat{\sigma^2}=\frac1n \sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{i.}\right)^2 \end{aligned} \] where \(n=\sum_{i=1}^k n_i\).
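As a quick sanity check of these closed forms (a sketch, not part of the original derivation), we can maximize the log-likelihood numerically for a small simulated data set with made-up parameter values and compare with the formulas:

set.seed(111)
k <- 3
ni <- c(10, 15, 12)
x <- lapply(1:k, function(i) rnorm(ni[i], mean=c(50, 49, 48)[i], sd=2))
# closed forms: group means and the (biased) pooled variance estimate
mu.hat <- sapply(x, mean)
sigma2.hat <- sum(unlist(lapply(1:k,
                 function(i) (x[[i]]-mu.hat[i])^2)))/sum(ni)
# numerical maximization; the last parameter is log(sigma^2)
negll <- function(par) {
  s <- sqrt(exp(par[k+1]))
  -sum(unlist(lapply(1:k, function(i) dnorm(x[[i]], par[i], s, log=TRUE))))
}
fit <- optim(c(49, 49, 49, 0), negll)
round(rbind(closed.form=c(mu.hat, sigma2.hat),
            numerical=c(fit$par[1:k], exp(fit$par[k+1]))), 3)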

Under the null hypothesis we have

\[H_0: \mu_1= ... =\mu_k=\mu\] and so

\[ \begin{aligned} &f(x_{11},..,x_{kn_k}\vert \mu,\sigma^2) = L(\mu,\sigma^2\vert\pmb{x})=\\ &(2\pi\sigma^2)^{-n/2} \exp \left\{-\frac1{2\sigma^2}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\mu\right)^2 \right\}\\ &l(\mu,\sigma^2\vert \pmb{x}) =-\frac{n}2\log(2\pi\sigma^2) -\frac1{2\sigma^2}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\mu\right)^2\\ &\frac{dl(\mu,\sigma^2\vert \pmb{x})}{d\mu} = \frac1{\sigma^2}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\mu\right)=0\\ &\hat{\hat{\mu}}=\frac1{n}\sum_{i=1}^k\sum_{j=1}^{n_i}x_{ij}=:\bar{x}_{..}\\ &\frac{dl(\mu,\sigma^2\vert \pmb{x})}{d(\sigma^2)} = -\frac{n}{2\sigma^2}+\frac1{2(\sigma^2)^2}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\mu\right)^2=0\\ &\widehat{\widehat{\sigma^2}}=\frac1n \sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)^2 \end{aligned} \] Now we find the likelihood ratio test statistic:

\[ \begin{aligned} &\lambda(\pmb{x}) = \frac{L(\widehat{\widehat{\mu}},\widehat{\widehat{\sigma^2}}\vert\pmb{x})}{L(\hat{\pmb{\mu}},\widehat{\sigma^2}\vert\pmb{x})}=\\ &\frac{(2\pi\widehat{\widehat{\sigma^2}})^{-n/2} \exp \left\{-\frac1{2\widehat{\widehat{\sigma^2}}}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)^2 \right\}}{(2\pi\widehat{\sigma^2})^{-n/2} \exp \left\{-\frac1{2\widehat{\sigma^2}}\sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{i.}\right)^2 \right\}} = \\ &\frac{(\widehat{\widehat{\sigma^2}})^{-n/2} \exp \left\{-\frac1{2\widehat{\widehat{\sigma^2}}}n\widehat{\widehat{\sigma^2}} \right\}}{(\widehat{\sigma^2})^{-n/2} \exp \left\{-\frac1{2\widehat{\sigma^2}}n\widehat{\sigma^2} \right\}} = \\ &\left(\frac{\widehat{\sigma^2}}{\widehat{\widehat{\sigma^2}}}\right)^{n/2} \end{aligned} \]

You can now see why this is called the analysis of variance, even though it really is a method concerned with means. Now, as always,

\[ \begin{aligned} &\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{i.}\right)^2 = \\ &\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}+\bar{x}_{..}-\bar{x}_{i.}\right)^2 = \\ &\sum_{j=1}^{n_i}\left[\left(x_{ij}-\bar{x}_{..}\right)^2+2\left(x_{ij}-\bar{x}_{..}\right)\left(\bar{x}_{..}-\bar{x}_{i.}\right)+\left(\bar{x}_{..}-\bar{x}_{i.}\right)^2 \right] = \\ &\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)^2+2\left(\bar{x}_{..}-\bar{x}_{i.}\right)\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)+n_i\left(\bar{x}_{..}-\bar{x}_{i.}\right)^2 = \\ &\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)^2+2\left(\bar{x}_{..}-\bar{x}_{i.}\right)\left(\sum_{j=1}^{n_i}x_{ij}-n_i\bar{x}_{..}\right)+n_i\left(\bar{x}_{..}-\bar{x}_{i.}\right)^2 = \\ &\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)^2+2\left(\bar{x}_{..}-\bar{x}_{i.}\right)\left(n_i\bar{x}_{i.}-n_i\bar{x}_{..}\right)+n_i\left(\bar{x}_{..}-\bar{x}_{i.}\right)^2 = \\ &\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)^2-2n_i\left(\bar{x}_{..}-\bar{x}_{i.}\right)^2+n_i\left(\bar{x}_{..}-\bar{x}_{i.}\right)^2 = \\ &\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{..}\right)^2-n_i\left(\bar{x}_{..}-\bar{x}_{i.}\right)^2\\ \end{aligned} \] Summing over \(i=1,..,k\) and dividing by \(n\) yields

\[\widehat{\sigma^2} = \widehat{\widehat{\sigma^2}}-\frac1n\sum_{i=1}^kn_i\left(\bar{x}_{i.}-\bar{x}_{..}\right)^2\]

and so

\[ \begin{aligned} &\lambda(\pmb{x}) = \left(\frac{\widehat{\sigma^2}}{\widehat{\widehat{\sigma^2}}}\right)^{n/2} = \\ & \left(\frac{\widehat{\sigma^2}}{\widehat{\sigma^2}+\frac1n\sum_{i=1}^kn_i\left(\bar{x}_{i.}-\bar{x}_{..}\right)^2}\right)^{n/2} = \\ &\left(\frac{1}{1+\frac{\frac1n\sum_{i=1}^kn_i\left(\bar{x}_{i.}-\bar{x}_{..}\right)^2}{\widehat{\sigma^2}}}\right)^{n/2} = \\ &\left(\frac{1}{1+\frac{\sum_{i=1}^kn_i\left(\bar{x}_{i.}-\bar{x}_{..}\right)^2}{ \sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{i.}\right)^2}}\right)^{n/2} \end{aligned} \] The likelihood ratio test rejects \(H_0\) for small values of \(\lambda(\pmb{x})\), that is, if and only if

\[\frac{\sum_{i=1}^kn_i\left(\bar{x}_{i.}-\bar{x}_{..}\right)^2}{ \sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{i.}\right)^2}\] is large.

In the numerator we have a measure of the variation of the group means around the overall mean, and in the denominator a measure of the variation within the groups. It can be shown that the properly scaled version of this test statistic,

\[F=\frac{\sum_{i=1}^kn_i\left(\bar{x}_{i.}-\bar{x}_{..}\right)^2/(k-1)}{ \sum_{i=1}^k\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_{i.}\right)^2/(n-k)}\]

has, under \(H_0\), an F distribution with k-1 and n-k degrees of freedom.
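Here is a small simulation sketch (not part of the derivation) that makes this plausible: we repeatedly generate data under \(H_0\), using the group sizes of our data set, compute the scaled statistic and compare its 95th percentile with that of the F(k-1, n-k) distribution.

B <- 1e4
ni <- c(39, 19, 36); k <- length(ni); n.sim <- sum(ni)
g <- rep(1:k, times=ni)
Fsim <- replicate(B, {
  x <- rnorm(n.sim)   # under H0 all groups have the same mean
  xbar.i <- tapply(x, g, mean)
  num <- sum(ni*(xbar.i-mean(x))^2)/(k-1)
  denom <- sum((x-xbar.i[g])^2)/(n.sim-k)
  num/denom
})
round(c(simulated=as.numeric(quantile(Fsim, 0.95)),
        theory=qf(0.95, k-1, n.sim-k)), 3)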

Let’s find all the relevant numbers for our data set:

x.. <- mean(mothers$Length)      # overall mean
sum((mothers$Length-x..)^2)      # total sum of squares
## [1] 1066.955
round(x.., 2)
## [1] 49.55
n <- tapply(mothers$Length, mothers$Status, length)
n
##       Drug Free First Trimester      Throughout 
##              39              19              36
x. <- tapply(mothers$Length, mothers$Status, mean)
round(x., 2)
##       Drug Free First Trimester      Throughout 
##            51.1            49.3            48.0
num <- sum(n*(x.-x..)^2)   # between-group sum of squares
num
## [1] 181.3749
# within-group residuals: each observation minus its group mean
residuals <- 0*mothers$Length
st <- unique(mothers$Status)
for(i in 1:3)
  residuals[mothers$Status==st[i]] <- 
  mothers$Length[mothers$Status==st[i]]-x.[i]
denom <- sum(residuals^2)
denom
## [1] 885.58
mean.square <- c(num/2, denom/91)   # divide by k-1 = 2 and n-k = 91 degrees of freedom
round(mean.square, 2)
## [1] 90.69  9.73
Fs <- mean.square[1]/mean.square[2]
round(Fs, 3)
## [1] 9.319
1-pf(Fs, 2, 91)
## [1] 0.0002080747

Now the information is usually summarized in an ANOVA table:

df <- data.frame(DF=c(2, 91),
       SS=c(181.375, 885.58),
       Mean=c(90.69, 9.73),
       F=c(9.319, ""),
       p.value=c(0.0002, ""))
rownames(df) <- c("Status", "Residuals")
kable.nice(df)
DF SS Mean F p.value
Status 2 181.375 90.69 9.319 0.0002
Residuals 91 885.580 9.73

Or of course we can use R:

summary(aov(Length~Status, data=mothers))
##             Df Sum Sq Mean Sq F value   Pr(>F)
## Status       2  181.4   90.69   9.319 0.000208
## Residuals   91  885.6    9.73
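As a cross-check, base R's oneway.test with var.equal=TRUE carries out the same pooled-variance F test and should reproduce the F statistic and p-value above:

oneway.test(Length~Status, data=mothers, var.equal=TRUE)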

The basic F test is rarely of great interest; ANOVA becomes more interesting when we test more specific hypotheses.

A contrast is an expression of the form

\(\sum a_i \mu_i\)

where \((a_1,..,a_k)\) is such that \(\sum a_i=0\).

With this notation we can write many interesting hypotheses:

  • basic F test:

\(H_0: \sum a_i\mu_i =0\)

for all \((a_1,..,a_k)\) such that \(\sum a_i=0\)

  • multiple comparison test:

\(H_0: \sum a_i \mu_i =0\) for (1,-1,0), (1,0,-1) and (0,1,-1)

  • a test of interest in a specific situation:

\(H_0: \sum a_i \mu_i =0\) for (1/2, 1/2, -1)

tests whether

\((\mu_1+ \mu_2)/2= \mu_3\)

To test \(H_0: \sum a_i \mu_i=0\) vs \(H_a: \sum a_i \mu_i \ne 0\) use the test statistic

\[T=\frac{\vert \sum_{i=1}^k a_i\bar{x}_{i.}\vert}{s_p\sqrt{\sum_{i=1}^k a_i^2/n_i}}\]

where

\[s_p^2=\frac1{n-k}\sum_{i=1}^k\sum_{j=1}^{n_i} (x_{ij}-\bar{x}_{i.})^2\] is the pooled estimate of the variance; under \(H_0\) we then have \(T\sim t(n-k)\).

Example (7.4.2)

We have

\[H_0: \sum a_i \mu _i =0\text{ vs }H_a: \sum a_i \mu_i \ne 0\]

for \(a=(0,1,-1)\), that is, a comparison of the First Trimester and Throughout groups.

so

a <- c(0, 1, -1)
Ts <- abs(sum(x.*a))/sqrt(mean.square[2]*(1/n[2]+1/n[3]))
out <- round(c(Ts, 1-pt(Ts, sum(n)-3)), 3)   # df = n - k = 91
names(out) <- c("T", "p-value")
out
##       T p-value 
##   1.470   0.073
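The same calculation can be wrapped into a small helper; contrast.test below is a hypothetical function sketching the formula for \(T\) (it reuses x., n and mean.square from above and is not an existing R routine):

# sketch of a reusable contrast test (hypothetical helper)
contrast.test <- function(a, xbar, ni, sp2) {
  # a: contrast coefficients (must sum to 0), xbar: group means,
  # ni: group sizes, sp2: pooled variance estimate
  stopifnot(abs(sum(a)) < 1e-10)
  Ts <- abs(sum(a*xbar))/sqrt(sp2*sum(a^2/ni))
  c(T=Ts, p.value=1-pt(Ts, sum(ni)-length(ni)))
}
round(contrast.test(c(0, 1, -1), x., n, mean.square[2]), 3)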

One important point to remember here is that the classical ANOVA method is just a straightforward application of the likelihood ratio test.

Bayesian Inference

This will of course always depend on the priors on \((\mu_1,.., \mu_k, \sigma)\). If we choose noninformative priors, for example a prior proportional to \(1/\sigma^2\), then we essentially recover the classical ANOVA above.
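As a minimal sketch of what this looks like in practice (assuming flat priors on the \(\mu_i\) and the \(1/\sigma^2\) prior just mentioned, under which \(\sigma^2\) given the data has a scaled inverse chi-square distribution with n-k degrees of freedom and, given \(\sigma^2\), each \(\mu_i\) is normal around its group mean), we can draw directly from the posterior, reusing x., n and denom from above:

B <- 1e4
# sigma^2 | data: scaled inverse chi-square, scale = denom/(n-k), df = n-k
sigma2 <- denom/rchisq(B, sum(n)-3)
# mu_i | sigma^2, data: normal around the group mean with variance sigma^2/n_i
mu <- sapply(1:3, function(i) rnorm(B, x.[i], sqrt(sigma2/n[i])))
# posterior probability that the drug-free mean exceeds the throughout mean
mean(mu[, 1] > mu[, 3])
# 95% credible interval for their difference
round(quantile(mu[, 1]-mu[, 3], c(0.025, 0.975)), 2)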