Example (8.1.1)

We have data on the number of shoes sold by year and type:

kable.nice(head(shoesales), do.row.names = FALSE)

Sales   Year   Type
 1539      1   Mens
12984      1   Kids
25809      1   Ladies
 5742      2   Mens
30058      2   Kids
34764      2   Ladies

There are two ways to look at this problem:

  • as a regression problem with response Sales and predictors Year and Type, where Type is a categorical variable that has to be coded with so-called dummy variables.

  • as a one-way ANOVA problem with some additional information, called a covariate. This is then called analysis of covariance (ANCOVA).

From the above it is clear that we have here a blend of regression and ANOVA.

Regression Analysis

In order to study this as a regression problem we first need to code the categorical variable. The obvious way to do this is to assign numbers, for example Mens=1, Kids=2 and Ladies=3. This, however, is usually a bad idea because it does two things:

  • it introduces order (Mens=1<2=Kids)
  • it introduces a scale (Kids-Mens = 2-1 = 3-2 = Ladies-Kids)

The better way to do this is to introduce two dummy variables:

  • d1 = 1 if Mens, 0 otherwise
  • d2 = 1 if Kids, 0 otherwise

so if d1+d2=0 it has to be Ladies.
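In practice one rarely builds these dummy variables by hand: given a factor, R's model.matrix() creates them automatically. A small sketch with a made-up toy factor:

```r
# Toy factor, made up for illustration. model.matrix() codes the three
# levels as an intercept plus two dummy columns; the first level in
# alphabetical order ("Kids") serves as the baseline.
Type <- factor(c("Mens", "Kids", "Ladies", "Mens"))
model.matrix(~Type)
```

This is exactly the coding lm() uses internally when a factor appears on the right-hand side of a formula.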

Example (8.1.2)

d1=ifelse(shoesales$Type=="Mens", 1, 0)
d2=ifelse(shoesales$Type=="Kids", 1, 0)
y=shoesales$Sales
X=data.frame(Sales=y, Year=shoesales$Year, d1=d1, d2=d2)
fit.shoe1=lm(Sales~., data=X)
summary(fit.shoe1)
## 
## Call:
## lm(formula = Sales ~ ., data = X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12963.7  -3433.5   -469.7   3349.1  22146.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  28416.0     1917.8  14.817  < 2e-16
## Year          1551.6      115.4  13.440  < 2e-16
## d1          -26986.9     1875.7 -14.388  < 2e-16
## d2          -14212.4     1875.7  -7.577 1.65e-10
## 
## Residual standard error: 6361 on 65 degrees of freedom
## Multiple R-squared:  0.8565, Adjusted R-squared:  0.8498 
## F-statistic: 129.3 on 3 and 65 DF,  p-value: < 2.2e-16

However, notice the following

x=seq(min(shoesales$Year), max(shoesales$Year), length=100)
z1=coef(fit.shoe1)[1]+coef(fit.shoe1)[2]*x+coef(fit.shoe1)[3]
z2=coef(fit.shoe1)[1]+coef(fit.shoe1)[2]*x+coef(fit.shoe1)[4]
z3=coef(fit.shoe1)[1]+coef(fit.shoe1)[2]*x
df=data.frame(Year=rep(x, 3),
              Sales=c(z1, z2, z3),
              Type=rep(c("Mens", "Kids", "Ladies"), each=100))
ggplot(data=df, aes(Year, Sales, color=Type)) +
  geom_line()

and so we see that this fits parallel lines!

Here is how to fit separate lines for each group:

fit.shoe2=lm(Sales~(.)^2, data=X)
summary(fit.shoe2)
## 
## Call:
## lm(formula = Sales ~ (.)^2, data = X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11588.7  -3433.0   -256.7   2947.3  16121.3 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  21185.6     2443.6   8.670 2.42e-12
## Year          2154.1      178.2  12.087  < 2e-16
## d1          -14185.1     3455.7  -4.105 0.000119
## d2           -5322.9     3455.7  -1.540 0.128490
## Year:d1      -1066.8      252.0  -4.233 7.64e-05
## Year:d2       -740.8      252.0  -2.939 0.004594
## d1:d2             NA         NA      NA       NA
## 
## Residual standard error: 5669 on 63 degrees of freedom
## Multiple R-squared:  0.8895, Adjusted R-squared:  0.8807 
## F-statistic: 101.4 on 5 and 63 DF,  p-value: < 2.2e-16
z1=coef(fit.shoe2)[1]+coef(fit.shoe2)[2]*x+coef(fit.shoe2)[3]+coef(fit.shoe2)[5]*x
z2=coef(fit.shoe2)[1]+coef(fit.shoe2)[2]*x+coef(fit.shoe2)[4]+coef(fit.shoe2)[6]*x
z3=coef(fit.shoe2)[1]+coef(fit.shoe2)[2]*x
df=data.frame(Year=rep(x, 3),
              Sales=c(z1, z2, z3),
              Type=rep(c("Mens", "Kids", "Ladies"), each=100))
ggplot(data=df, aes(Year, Sales, color=Type)) +
  geom_line()

ANCOVA Model

The model can be written as

\[\pmb{y=Z\alpha+X\beta+\epsilon}\]

where \(\pmb{Z}\) contains 0’s and 1’s, \(\pmb{\alpha}\) contains \(\mu\) and parameters such as \(\alpha_i\), \(\beta_j\) and \(\gamma_{ij}\) representing factors and interactions; \(\pmb{X}\) a matrix of covariate values and \(\pmb{\beta}\) their coefficients.

Example (8.1.3)

For the shoesales data we have

\[ \pmb{y=Z\alpha+X\beta+\epsilon} =\\ \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0\\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & 0\\ 1 & 0 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 0 & 1 \\ \end{pmatrix} \begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix} + \begin{pmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1n_1}\\ x_{21}\\ \vdots \\ x_{3n_3} \end{pmatrix}\beta \]

Estimation

We will assume that \(\pmb{Z}\) is less than full-rank and that \(\pmb{X}\) is full-rank. We can write the model in the form

\[ \begin{aligned} &\pmb{y=Z\alpha+X\beta+\epsilon} = \\ &\begin{pmatrix} \pmb{Z} & \pmb{X}\end{pmatrix} \begin{pmatrix} \pmb{\alpha} \\ \pmb{\beta}\end{pmatrix}+\pmb{\epsilon} = \\ &\pmb{U\theta+\epsilon} \end{aligned} \]

The normal equations are

\[\pmb{U'U\hat{\theta}=U'y}\]

which is equal to

\[ \begin{pmatrix} \pmb{Z'Z} & \pmb{Z'X} \\ \pmb{X'Z} & \pmb{X'X} \end{pmatrix} \begin{pmatrix} \pmb{\hat{\alpha}} \\ \pmb{\hat{\beta}} \end{pmatrix}= \begin{pmatrix} \pmb{Z'y} \\ \pmb{X'y} \end{pmatrix} \]

which corresponds to the two equations

\[ \begin{aligned} &\pmb{Z'Z\hat{\alpha}} + \pmb{Z'X\hat{\beta}} = \pmb{Z'y}\\ &\pmb{X'Z\hat{\alpha}} + \pmb{X'X\hat{\beta}} = \pmb{X'y} \end{aligned} \]

using a generalized inverse for \(\pmb{Z'Z}\) we find

\[\pmb{\hat{\alpha}} = (\pmb{Z'Z})^{-}\pmb{Z'y} - (\pmb{Z'Z})^{-}\pmb{Z'X\hat{\beta}} = \pmb{\hat{\alpha}}_0 - (\pmb{Z'Z})^{-}\pmb{Z'X\hat{\beta}}\]

where \(\pmb{\hat{\alpha}}_0\) is a solution of the normal equations without the covariates.

To find \(\pmb{\hat{\beta}}\) we substitute this into the second equation:

\[ \begin{aligned} &\pmb{X'Z}\left[(\pmb{Z'Z})^{-}\pmb{Z'y} - (\pmb{Z'Z})^{-}\pmb{Z'X\hat{\beta}}\right] + \pmb{X'X\hat{\beta}} = \pmb{X'y} \\ &\pmb{X'}\left[ \pmb{I} - \pmb{Z}(\pmb{Z'Z})^{-}\pmb{Z'} \right]\pmb{X}\pmb{\hat{\beta}} = \pmb{X'y} - \pmb{X'Z}(\pmb{Z'Z})^{-}\pmb{Z'y} \\ &\pmb{X'}\left[ \pmb{I} - \pmb{P} \right]\pmb{X}\pmb{\hat{\beta}} = \pmb{X'}(\pmb{I-P})\pmb{y} \\ \end{aligned} \]

where \(\pmb{P=Z}(\pmb{Z'Z})^{-}\pmb{Z'}\). Because \(\pmb{X}\) is a set of covariates it is (almost always) independent of \(\pmb{Z}\), and so \(\pmb{X'}\left[ \pmb{I} - \pmb{P} \right]\pmb{X}\) is non-singular. So we find

\[\pmb{\hat{\beta}} = \left[\pmb{X'}\left( \pmb{I} - \pmb{P} \right)\pmb{X}\right]^{-1}\pmb{X'(I-P)y} = \pmb{E}_{xx}^{-1}\pmb{e}_{xy}\]
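We can check this formula numerically. The following sketch uses made-up data (two groups, one covariate, true slope 2) and compares the projection formula with the coefficient that lm() reports:

```r
# Hypothetical data: two groups with different means, one covariate x
set.seed(1)
g <- rep(1:2, each=10)
x <- rnorm(20)
y <- 5*(g==1) + 8*(g==2) + 2*x + rnorm(20)
# less-than-full-rank design matrix Z and covariate matrix X
Z <- cbind(1, g==1, g==2)*1
X <- matrix(x, ncol=1)
P <- Z %*% MASS::ginv(t(Z)%*%Z) %*% t(Z)
betahat <- solve(t(X)%*%(diag(20)-P)%*%X) %*% t(X)%*%(diag(20)-P)%*%y
c(betahat, coef(lm(y ~ factor(g) + x))["x"])   # the two should agree
```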

Example (8.1.4)

Let’s find the estimates for the shoesales data. For that we first need to reorder the data by Type:

shoesales=shoesales[order(shoesales$Type), ]
X=matrix(as.numeric(shoesales[ ,"Year"]),ncol=1)
y=matrix(as.numeric(shoesales[ ,"Sales"]),ncol=1)
ni=table(shoesales$Type)
ni
## 
##   Kids Ladies   Mens 
##     23     23     23
n=sum(ni)
Z=matrix(0, n, 4)
Z[, 1]=1
Z[1:ni[1], 2]=1
Z[(1+ni[1]):(ni[1]+ni[2]), 3]=1
Z[(1+ni[1]+ni[2]):n, 4]=1
library(MASS)
gZZ=ginv(t(Z)%*%Z)
P=Z%*%gZZ%*%t(Z)
betahat=solve(t(X)%*%(diag(n)-P)%*%X)%*%t(X)%*%(diag(n)-P)%*%y
alpha0hat=gZZ%*%t(Z)%*%y
c(alpha0hat)
## [1] 24976.239  7846.109 22058.500 -4928.370
alphahat=alpha0hat-gZZ%*%t(Z)%*%X%*%betahat
c(alphahat, betahat)
## [1] 11012.182  3191.423 17403.814 -9583.055  1551.562

Now for the sum of squares we find

\[ \begin{aligned} &\text{SSE} = \text{SSE}_{y.x} = \pmb{y'y-\hat{\theta}'U'y}=\\ &\pmb{y'y}- \begin{pmatrix} \pmb{\hat{\alpha}'} & \pmb{\hat{\beta}'} \end{pmatrix} \begin{pmatrix} \pmb{Z'y} \\ \pmb{X'y} \end{pmatrix} = \\ &\pmb{y'y}-\pmb{\hat{\alpha}'Z'y} -\pmb{\hat{\beta}'X'y} = \\ &\pmb{y'y}- \left[\pmb{\hat{\alpha}}_0' - \pmb{\hat{\beta}'X'Z}(\pmb{Z'Z})^{-} \right] \pmb{Z'y} -\pmb{\hat{\beta}'X'y} = \\ &\pmb{y'y}- \pmb{\hat{\alpha}}_0'\pmb{Z'y} + \pmb{\hat{\beta}'X'Z}(\pmb{Z'Z})^{-} \pmb{Z'y} -\pmb{\hat{\beta}'X'y} = \\ &\pmb{y'y}- \pmb{\hat{\alpha}}_0'\pmb{Z'y} - \pmb{\hat{\beta}'X'}\left[\pmb{I}-\pmb{Z}(\pmb{Z'Z})^{-} \pmb{Z'} \right]\pmb{y} = \\ &\text{SSE}_y - \pmb{\hat{\beta}'X'}\left[\pmb{I}-\pmb{P} \right]\pmb{y} \end{aligned} \]

where \(\text{SSE}_y\) is the error sum of squares for the ANOVA model without covariates.

Setting \(e_{yy}=\text{SSE}_y = \pmb{y'(I-P)y}\) we can also write

\[\text{SSE}_{y.x} = e_{yy}-e'_{xy}\pmb{E}_{xx}^{-1}\pmb{e}_{xy}\]

It can be shown that \(\pmb{E}_{xx}\) is positive definite, so that \(e'_{xy}\pmb{E}_{xx}^{-1}\pmb{e}_{xy}>0\), and so the use of covariates reduces the sum of squares, thereby leading to smaller variances.

Example (8.1.5)

Continuing the analysis of the shoesales data:

Exx=t(X)%*%(diag(n)-P)%*%X
exy=t(X)%*%(diag(n)-P)%*%y
eyy=t(y)%*%(diag(n)-P)%*%y
sse=eyy-t(exy)%*%solve(Exx)%*%exy
c(sse, eyy)
## [1] 2629827256 9938524862

Testing Hypotheses

As always we will now assume that \(\pmb{\epsilon}\sim N_n(\pmb{0},\sigma^2\pmb{I})\). The hypothesis \(H_0:\alpha_1=\dots=\alpha_k=0\) can be written as

\[H_0: \begin{pmatrix} \pmb{C}_1 & \pmb{0} \end{pmatrix} \begin{pmatrix} \pmb{\alpha} \\ \pmb{\beta} \end{pmatrix} =\pmb{0}\]

and we can use a general linear hypothesis test.

A test that is often of interest is \(H_0: \pmb{\beta}=\pmb{0}\), that is, a test of whether any of the covariates are useful. For a general linear hypothesis test of \(H_0: \pmb{\beta}=\pmb{0}\) we need \(cov(\pmb{\hat{\beta}})\), where \(\pmb{\hat{\beta}}=\left[\pmb{X'}\left( \pmb{I} - \pmb{P} \right)\pmb{X}\right]^{-1}\pmb{X'(I-P)y}\). Since by (4.3.19) \(\pmb{I-P}\) is idempotent, we have

\[ \begin{aligned} &cov(\pmb{\hat{\beta}}) = \\ &\pmb{[X'(I-P)X}]^{-1}\pmb{X'(I-P)}\sigma^2\pmb{I(I-P)X}\pmb{[X'(I-P)X}]^{-1} = \\ &\sigma^2\pmb{[X'(I-P)X}]^{-1}\pmb{X'(I-P)}\pmb{(I-P)X}\pmb{[X'(I-P)X}]^{-1} = \\ &\sigma^2\pmb{[X'(I-P)X}]^{-1}\pmb{[X'(I-P)X]}\pmb{[X'(I-P)X}]^{-1} = \\ &\sigma^2\pmb{[X'(I-P)X}]^{-1} \end{aligned} \]
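So the estimated covariance matrix is \(s^2[\pmb{X'(I-P)X}]^{-1}\) with \(s^2=\text{SSE}/df\). A sketch on made-up data, checking the resulting standard error against the one summary(lm()) reports:

```r
# Hypothetical data: two groups, one covariate
set.seed(2)
g <- rep(1:2, each=15)
x <- rnorm(30)
y <- 3*(g==1) + 7*(g==2) + 1.5*x + rnorm(30)
Z <- cbind(1, g==1, g==2)*1
X <- matrix(x, ncol=1)
IP <- diag(30) - Z %*% MASS::ginv(t(Z)%*%Z) %*% t(Z)   # I - P
Exx <- t(X)%*%IP%*%X
exy <- t(X)%*%IP%*%y
eyy <- t(y)%*%IP%*%y
betahat <- solve(Exx)%*%exy
sse <- c(eyy - t(exy)%*%solve(Exx)%*%exy)
s2 <- sse/(30 - 3)            # error df: n minus rank of (Z X)
se <- sqrt(s2*solve(Exx))     # estimated standard error of beta-hat
c(se, summary(lm(y ~ factor(g) + x))$coefficients["x", "Std. Error"])
```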

So now

\[\text{SSH}=\pmb{\hat{\beta}'}\pmb{X'(I-P)X\hat{\beta}}=\pmb{e'_{xy}E_{xx}^{-1}e_{xy}}\]

ANOVA table for test of covariates in ANCOVA design

\[ \begin{array}{cccc} \text{Source} & \text{df} & \text{SS} & \text{F} \\ \hline \\ \text{Covariates} & k & \text{SSH}=\pmb{e'_{xy}E_{xx}^{-1}e_{xy}} & \frac{\text{SSH}/k}{\text{SSE}/(n-k-1)}\\ \text{Error} & n-k-1 & \text{SSE}=\pmb{e_{yy}-e'_{xy}E_{xx}^{-1}e_{xy}} & \\ \text{Total} & n-1 & \text{SST}=\pmb{e_{yy}} &\\ \hline \\ \end{array} \]

Example (8.1.6)

Let’s test for Year in shoesales:

k=1
ssh=t(exy)%*%solve(Exx)%*%exy
FTS = (ssh/1)/(sse/(n-k-1))
round(c(ssh, sse/c(1, n-k-1), FTS, 1-pf(FTS, 1, n-k-1)))
## [1] 7308697607 2629827256   39251153        186          0

or with R:

summary(aov(Sales~., data=shoesales))
##             Df    Sum Sq   Mean Sq F value Pr(>F)
## Year         1 7.309e+09 7.309e+09   180.6 <2e-16
## Type         2 8.383e+09 4.192e+09   103.6 <2e-16
## Residuals   65 2.630e+09 4.046e+07

In the above we have assumed that there is no interaction between the predictors. If we suspect that there is, we should run

summary(aov(Sales~(.)^2, data=shoesales))
##             Df    Sum Sq   Mean Sq F value   Pr(>F)
## Year         1 7.309e+09 7.309e+09  227.39  < 2e-16
## Type         2 8.383e+09 4.192e+09  130.41  < 2e-16
## Year:Type    2 6.049e+08 3.024e+08    9.41 0.000266
## Residuals   63 2.025e+09 3.214e+07

and indeed there is such an interaction.

Balanced One-Way Model with One Covariate

In the case where we have two predictors, one categorical and one continuous, and a balanced design the formulas simplify:

\[\pmb{\hat{\alpha}} = \pmb{\hat{\alpha}}_0 - (\pmb{Z'Z})^{-}\pmb{Z'X\hat{\beta}}= \begin{pmatrix} 0 \\ \bar{y}_{1.}-\hat{\beta}\bar{x}_{1.} \\ \bar{y}_{2.}-\hat{\beta}\bar{x}_{2.} \\ \vdots \\ \bar{y}_{k.}-\hat{\beta}\bar{x}_{k.} \\ \end{pmatrix}\]

\[ \begin{aligned} &\pmb{E}_{xx} = e_{xx} = \sum_{ij} (x_{ij}-\bar{x}_{i.})^2 \\ &\pmb{e}_{xy} = e_{xy} = \sum_{ij} (x_{ij}-\bar{x}_{i.})(y_{ij}-\bar{y}_{i.}) \\ &\pmb{e}_{yy} = e_{yy} = \sum_{ij} (y_{ij}-\bar{y}_{i.})^2 \\ &\hat{\beta}=\frac{e_{xy}}{e_{xx}} \\ &\text{SSE}_{y.x} = e_{yy}-e^2_{xy}/e_{xx} \end{aligned} \] and \(\text{SSE}_{y.x}\) has \(k(n-1)-1\) degrees of freedom, where \(k\) is the number of groups and \(n\) is the number of observations per group.
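A quick numerical check of these scalar formulas, on made-up balanced data with k=3 groups and n=5 observations each:

```r
# Hypothetical balanced one-way data with one covariate
set.seed(3)
k <- 3; n <- 5
g <- factor(rep(1:k, each=n))
x <- rnorm(k*n)
y <- c(2, 5, 9)[as.numeric(g)] + 1.8*x + rnorm(k*n)
xbar <- ave(x, g); ybar <- ave(y, g)   # group means, repeated per row
exx <- sum((x - xbar)^2)
exy <- sum((x - xbar)*(y - ybar))
eyy <- sum((y - ybar)^2)
betahat <- exy/exx
sse <- eyy - exy^2/exx
c(betahat, coef(lm(y ~ g + x))["x"])   # the two should agree
```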

ANOVA table for test of one covariate in One-Way ANOVA design

\[ \begin{array}{cccc} \text{Source} & \text{df} & \text{SS} & \text{F} \\ \hline \\ \text{Covariate} & 1 & \text{SSH}=e_{xy}^2/e_{xx} & \frac{\text{SSH}/1}{\text{SSE}/(k(n-1)-1)}\\ \text{Error} & k(n-1)-1 & \text{SSE}=e_{yy}-e_{xy}^2/e_{xx} & \\ \text{Total} & k(n-1) & \text{SST}=e_{yy} &\\ \hline \\ \end{array} \]
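The F test from this table can again be checked on made-up balanced data; it agrees with the covariate row of R's sequential ANOVA table:

```r
# Hypothetical balanced data: k=3 groups, n=6 observations per group
set.seed(4)
k <- 3; n <- 6
g <- factor(rep(1:k, each=n))
x <- rnorm(k*n)
y <- c(10, 14, 20)[as.numeric(g)] + 2.5*x + rnorm(k*n)
xbar <- ave(x, g); ybar <- ave(y, g)
exx <- sum((x - xbar)^2)
exy <- sum((x - xbar)*(y - ybar))
eyy <- sum((y - ybar)^2)
ssh <- exy^2/exx
sse <- eyy - ssh
FTS <- (ssh/1)/(sse/(k*(n-1) - 1))
c(FTS, anova(lm(y ~ g + x))["x", "F value"])   # the two should agree
```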