We have data on the number of shoes sold by year and type:
kable.nice(head(shoesales), do.row.names = FALSE)
Sales | Year | Type |
---|---|---|
1539 | 1 | Mens |
12984 | 1 | Kids |
25809 | 1 | Ladies |
5742 | 2 | Mens |
30058 | 2 | Kids |
34764 | 2 | Ladies |
There are two ways to look at this problem:
as a regression problem with response Sales and predictors Year and Type, where Type is a categorical variable that has to be coded with so-called dummy variables.
as a one-way ANOVA problem with some additional information, called a covariate. This is then called Analysis of Covariance (ANCOVA).
From the above it is clear that we have here a blend of regression and ANOVA.
In order to study this as a regression problem we first need to code the categorical variable. The obvious way to do this is to assign numbers, for example Mens=1, Kids=2 and Ladies=3. This however is usually a bad idea because it does two things: it imposes an ordering on the types and it forces the distances between them to be equal, neither of which is justified here.
The better way to do this is to introduce two dummy variables: d1, which is 1 if Type is Mens and 0 otherwise, and d2, which is 1 if Type is Kids and 0 otherwise; so if d1+d2=0 it has to be Ladies.
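In fact R does this coding automatically when the variable is stored as a factor, and `model.matrix` shows the dummy variables it creates. A minimal sketch with made-up data (the levels mirror the shoesales types):

```r
# made-up three-level factor; R constructs the dummy (indicator) columns itself
Type <- factor(c("Mens", "Kids", "Ladies", "Mens"))
# with the default treatment contrasts the first level (alphabetically "Kids")
# becomes the baseline; the other two levels get indicator columns
M <- model.matrix(~ Type)
M
```

This is why `lm(Sales ~ Year + Type, data=shoesales)` would give the same fit as the manual coding that follows, up to the choice of baseline level.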
d1=ifelse(shoesales$Type=="Mens", 1, 0)
d2=ifelse(shoesales$Type=="Kids", 1, 0)
y=shoesales$Sales
X=data.frame(Sales=y, Year=shoesales$Year, d1=d1, d2=d2)
fit.shoe1=lm(Sales~., data=X)
summary(fit.shoe1)
##
## Call:
## lm(formula = Sales ~ ., data = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12963.7 -3433.5 -469.7 3349.1 22146.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28416.0 1917.8 14.817 < 2e-16
## Year 1551.6 115.4 13.440 < 2e-16
## d1 -26986.9 1875.7 -14.388 < 2e-16
## d2 -14212.4 1875.7 -7.577 1.65e-10
##
## Residual standard error: 6361 on 65 degrees of freedom
## Multiple R-squared: 0.8565, Adjusted R-squared: 0.8498
## F-statistic: 129.3 on 3 and 65 DF, p-value: < 2.2e-16
However, notice the following
x=seq(min(shoesales$Year), max(shoesales$Year), length=100)
z1=coef(fit.shoe1)[1]+coef(fit.shoe1)[2]*x+coef(fit.shoe1)[3]
z2=coef(fit.shoe1)[1]+coef(fit.shoe1)[2]*x+coef(fit.shoe1)[4]
z3=coef(fit.shoe1)[1]+coef(fit.shoe1)[2]*x
df=data.frame(Year=rep(x, 3),
Sales=c(z1, z2, z3),
Type=rep(c("Mens", "Kids", "Ladies"), each=100))
ggplot(data=df, aes(Year, Sales, color=Type)) +
geom_line()
and so we see that this fits parallel lines!
Here is how to fit separate lines for each group:
fit.shoe2=lm(Sales~(.)^2, data=X)
summary(fit.shoe2)
##
## Call:
## lm(formula = Sales ~ (.)^2, data = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11588.7 -3433.0 -256.7 2947.3 16121.3
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21185.6 2443.6 8.670 2.42e-12
## Year 2154.1 178.2 12.087 < 2e-16
## d1 -14185.1 3455.7 -4.105 0.000119
## d2 -5322.9 3455.7 -1.540 0.128490
## Year:d1 -1066.8 252.0 -4.233 7.64e-05
## Year:d2 -740.8 252.0 -2.939 0.004594
## d1:d2 NA NA NA NA
##
## Residual standard error: 5669 on 63 degrees of freedom
## Multiple R-squared: 0.8895, Adjusted R-squared: 0.8807
## F-statistic: 101.4 on 5 and 63 DF, p-value: < 2.2e-16
z1=coef(fit.shoe2)[1]+coef(fit.shoe2)[2]*x+coef(fit.shoe2)[3]+coef(fit.shoe2)[5]*x
z2=coef(fit.shoe2)[1]+coef(fit.shoe2)[2]*x+coef(fit.shoe2)[4]+coef(fit.shoe2)[6]*x
z3=coef(fit.shoe2)[1]+coef(fit.shoe2)[2]*x
df=data.frame(Year=rep(x, 3),
Sales=c(z1, z2, z3),
Type=rep(c("Mens", "Kids", "Ladies"), each=100))
ggplot(data=df, aes(Year, Sales, color=Type)) +
geom_line()
The model can be written as
\[\pmb{y=Z\alpha+X\beta+\epsilon}\]
where \(\pmb{Z}\) contains 0’s and 1’s, \(\pmb{\alpha}\) contains \(\mu\) and parameters such as \(\alpha_i\), \(\beta_j\) and \(\gamma_{ij}\) representing factors and interactions; \(\pmb{X}\) a matrix of covariate values and \(\pmb{\beta}\) their coefficients.
For the shoesales data we have
\[ \pmb{y=Z\alpha+X\beta+\epsilon} =\\ \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0\\ \vdots & \vdots & \vdots & \vdots \\ 1 & 1 & 0 & 0\\ 1 & 0 & 1 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 1& 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & 0 & 0& 1 \\ \end{pmatrix} \begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\\alpha_3 \end{pmatrix} + \begin{pmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1n_1}\\x_{21}\\ \vdots \\ x_{3n_3} \end{pmatrix}\beta \]
We will assume that \(\pmb{Z}\) is less than full rank and that \(\pmb{X}\) is full rank. We can write the model in the form
\[ \begin{aligned} &\pmb{y=Z\alpha+X\beta+\epsilon} = \\ &\begin{pmatrix} \pmb{Z} & \pmb{X}\end{pmatrix} \begin{pmatrix} \pmb{\alpha} \\ \pmb{\beta}\end{pmatrix}+\pmb{\epsilon} = \\ &\pmb{U\theta+\epsilon} \end{aligned} \]
The normal equations are
\[\pmb{U'U\hat{\theta}=U'y}\]
which is equal to
\[ \begin{pmatrix} \pmb{Z'Z} & \pmb{Z'X} \\ \pmb{X'Z} & \pmb{X'X} \end{pmatrix} \begin{pmatrix} \pmb{\hat{\alpha}} \\ \pmb{\hat{\beta}} \end{pmatrix}= \begin{pmatrix} \pmb{Z'y} \\ \pmb{X'y} \end{pmatrix} \]
which gives the two equations
\[ \begin{aligned} &\pmb{Z'Z\hat{\alpha}} + \pmb{Z'X\hat{\beta}} = \pmb{Z'y}\\ &\pmb{X'Z\hat{\alpha}} + \pmb{X'X\hat{\beta}} = \pmb{X'y} \end{aligned} \]
using a generalized inverse for \(\pmb{Z'Z}\) we find
\[\pmb{\hat{\alpha}} = (\pmb{Z'Z})^{-}\pmb{Z'y} - (\pmb{Z'Z})^{-}\pmb{Z'X\hat{\beta}} = \pmb{\hat{\alpha}}_0 - (\pmb{Z'Z})^{-}\pmb{Z'X\hat{\beta}}\]
where \(\pmb{\hat{\alpha}}_0\) is a solution of the normal equations without the covariates.
To find \(\pmb{\hat{\beta}}\) we substitute this into the second equation:
\[ \begin{aligned} &\pmb{X'Z}\left[(\pmb{Z'Z})^{-}\pmb{Z'y} - (\pmb{Z'Z})^{-}\pmb{Z'X\hat{\beta}}\right] + \pmb{X'X\hat{\beta}} = \pmb{X'y} \\ &\pmb{X'}\left[ \pmb{I} - \pmb{Z}(\pmb{Z'Z})^{-}\pmb{Z'} \right]\pmb{X}\pmb{\hat{\beta}} = \pmb{X'y} - \pmb{X'Z}(\pmb{Z'Z})^{-}\pmb{Z'y} \\ &\pmb{X'}\left[ \pmb{I} - \pmb{P} \right]\pmb{X}\pmb{\hat{\beta}} = \pmb{X'}(\pmb{I-P})\pmb{y} \\ \end{aligned} \]
where \(\pmb{P=Z}(\pmb{Z'Z})^{-}\pmb{Z'}\). Because \(\pmb{X}\) is a set of covariates, its columns are (almost always) linearly independent of the columns of \(\pmb{Z}\), and so \(\pmb{X'}\left[ \pmb{I} - \pmb{P} \right]\pmb{X}\) is nonsingular. So we find
\[\pmb{\hat{\beta}} = \left[\pmb{X'}\left( \pmb{I} - \pmb{P} \right)\pmb{X}\right]^{-1}\pmb{X'(I-P)y} = \pmb{E}_{xx}^{-1}\pmb{e}_{xy}\]
Let’s find the estimates for the shoesales data. For that we first need to reorder the data by Type:
shoesales=shoesales[order(shoesales$Type), ]
X=matrix(as.numeric(shoesales[ ,"Year"]),ncol=1)
y=matrix(as.numeric(shoesales[ ,"Sales"]),ncol=1)
ni=table(shoesales$Type)
ni
##
## Kids Ladies Mens
## 23 23 23
n=sum(ni)
Z=matrix(0, n, 4)
Z[, 1]=1
Z[1:ni[1], 2]=1
Z[(1+ni[1]):(ni[1]+ni[2]), 3]=1
Z[(1+ni[1]+ni[2]):n, 4]=1
library(MASS)
gZZ=ginv(t(Z)%*%Z)
P=Z%*%gZZ%*%t(Z)
betahat=solve(t(X)%*%(diag(n)-P)%*%X)%*%t(X)%*%(diag(n)-P)%*%y
alpha0hat=gZZ%*%t(Z)%*%y
c(alpha0hat)
## [1] 24976.239 7846.109 22058.500 -4928.370
alphahat=alpha0hat-gZZ%*%t(Z)%*%X%*%betahat
c(alphahat, betahat)
## [1] 11012.182 3191.423 17403.814 -9583.055 1551.562
Now for the sum of squares we find
\[ \begin{aligned} &\text{SSE} = \text{SSE}_{y.x} = \pmb{y'y-\hat{\theta}'U'y}=\\ &\pmb{y'y}- \begin{pmatrix} \pmb{\hat{\alpha}'} & \pmb{\hat{\beta}'} \end{pmatrix} \begin{pmatrix} \pmb{Z'y} \\ \pmb{X'y} \end{pmatrix} = \\ &\pmb{y'y}-\pmb{\hat{\alpha}'Z'y} -\pmb{\hat{\beta}'X'y} = \\ &\pmb{y'y}- \left[\pmb{\hat{\alpha}}_0' - \pmb{\hat{\beta}'X'Z}(\pmb{Z'Z})^{-} \right] \pmb{Z'y} -\pmb{\hat{\beta}'X'y} = \\ &\pmb{y'y}- \pmb{\hat{\alpha}}_0'\pmb{Z'y} + \pmb{\hat{\beta}'X'Z}(\pmb{Z'Z})^{-} \pmb{Z'y} -\pmb{\hat{\beta}'X'y} = \\ &\pmb{y'y}- \pmb{\hat{\alpha}}_0'\pmb{Z'y} - \pmb{\hat{\beta}'X'}\left[\pmb{I}-\pmb{Z}(\pmb{Z'Z})^{-} \pmb{Z'} \right]\pmb{y} = \\ &\text{SSE}_y - \pmb{\hat{\beta}'X'}\left[\pmb{I}-\pmb{P} \right]\pmb{y} \end{aligned} \]
where SSEy is the error sum of squares for the ANOVA model without covariates.
Setting \(e_{yy}=\text{SSE}_y = \pmb{y'(I-P)y}\) we can also write
\[\text{SSE}_{y.x} = e_{yy}-e'_{xy}\pmb{E}_{xx}^{-1}\pmb{e}_{xy}\]
It can be shown that \(\pmb{E}_{xx}\) is positive definite, so that \(e'_{xy}\pmb{E}_{xx}^{-1}\pmb{e}_{xy}>0\), and so the use of covariates reduces the sum of squares, thereby leading to smaller variances.
Continuing the analysis of the shoesales data:
Exx=t(X)%*%(diag(n)-P)%*%X
exy=t(X)%*%(diag(n)-P)%*%y
eyy=t(y)%*%(diag(n)-P)%*%y
sse=eyy-t(exy)%*%solve(Exx)%*%exy
c(sse, eyy)
## [1] 2629827256 9938524862
As always we will now assume that \(\pmb{\epsilon}\sim N_n(\pmb{0},\sigma^2\pmb{I})\). The hypothesis \(H_0:\alpha_1=\dots=\alpha_k=0\) can be written as
\[H_0: \begin{pmatrix} \pmb{C}_1 & \pmb{0} \end{pmatrix} \begin{pmatrix} \pmb{\alpha} \\ \pmb{\beta} \end{pmatrix} =\pmb{0}\]
and we can use a general linear hypothesis test.
A test that is often of interest is \(H_0: \pmb{\beta}=0\), that is, a test of whether any of the covariates are useful. For a general linear hypothesis test of \(H_0: \pmb{\beta}=0\) we need \(cov(\pmb{\hat{\beta}})\), where \(\pmb{\hat{\beta}}=\left[\pmb{X'}\left( \pmb{I} - \pmb{P} \right)\pmb{X}\right]^{-1}\pmb{X'(I-P)y}\). Since by (4.3.19) \(\pmb{I-P}\) is idempotent, we have
\[ \begin{aligned} &cov(\pmb{\hat{\beta}}) = \\ &\pmb{[X'(I-P)X}]^{-1}\pmb{X'(I-P)}\sigma^2\pmb{I(I-P)X}\pmb{[X'(I-P)X}]^{-1} = \\ &\sigma^2\pmb{[X'(I-P)X}]^{-1}\pmb{X'(I-P)}\pmb{(I-P)X}\pmb{[X'(I-P)X}]^{-1} = \\ &\sigma^2\pmb{[X'(I-P)X}]^{-1}\pmb{[X'(I-P)X]}\pmb{[X'(I-P)X}]^{-1} = \\ &\sigma^2\pmb{[X'(I-P)X}]^{-1} \end{aligned} \]
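This identity can be checked numerically. The following sketch uses simulated data (group sizes, means, slope and noise level are all made up) and compares the standard error obtained from \(\sigma^2[\pmb{X'(I-P)X}]^{-1}\), with \(\sigma^2\) estimated by SSE divided by the error degrees of freedom, to the one `lm` reports for the covariate:

```r
# simulated data to check the covariance formula; all numbers made up
set.seed(111)
k <- 3; m <- 10; n <- k * m
g <- factor(rep(1:k, each = m))          # three groups of ten
x <- rnorm(n)                            # one covariate
y <- c(5, 8, 11)[g] + 2 * x + rnorm(n)   # true slope beta = 2

Z <- model.matrix(~ g)                   # full-rank design for the groups
P <- Z %*% solve(t(Z) %*% Z) %*% t(Z)    # projection onto the columns of Z
IP <- diag(n) - P

Exx <- c(t(x) %*% IP %*% x)              # X'(I-P)X (a scalar here)
betahat <- c(t(x) %*% IP %*% y) / Exx
sse <- c(t(y) %*% IP %*% y) - betahat^2 * Exx
s2 <- sse / (n - k - 1)                  # error df: n - 3 group parameters - 1 slope
se.beta <- sqrt(s2 / Exx)                # sqrt of sigma^2 [X'(I-P)X]^{-1}

fit <- lm(y ~ g + x)
rbind(formula = c(betahat, se.beta),
      lm      = coef(summary(fit))["x", 1:2])
```

The two rows agree: partialling out \(\pmb{Z}\) by multiplying with \(\pmb{I-P}\) is exactly what the full regression does implicitly for the covariate.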
So now
\[\text{SSH}=\pmb{\hat{\beta}'X'(I-P)X\hat{\beta}}=\pmb{e'_{xy}E_{xx}^{-1}e_{xy}}\]
ANOVA table for test of covariates in ANCOVA design
\[ \begin{array}{cccc} \text{Source} & \text{df} & \text{SS} & \text{F} \\ \hline \\ \text{Covariates} & k & \text{SSH}=\pmb{e'_{xy}E_{xx}^{-1}e_{xy}} & \frac{\text{SSH}/k}{\text{SSE}/(n-k-1)}\\ \text{Error} & n-k-1 & \text{SSE}=\pmb{e_{yy}-e'_{xy}E_{xx}^{-1}e_{xy}} & \\ \text{Total} & n-1 & \text{SST}=\pmb{e_{yy}} &\\ \hline \\ \end{array} \]
Let’s test for Year in shoesales:
k=1
ssh=t(exy)%*%solve(Exx)%*%exy
FTS = (ssh/1)/(sse/(n-k-1))
round(c(ssh, sse/c(1, n-k-1), FTS, 1-pf(FTS, 1, n-k-1)))
## [1] 7308697607 2629827256 39251153 186 0
or with R (note that R uses 65 error degrees of freedom here, since it also accounts for the three group effects, which is why its F statistic of 180.6 differs slightly from the 186 above):
summary(aov(Sales~., data=shoesales))
## Df Sum Sq Mean Sq F value Pr(>F)
## Year 1 7.309e+09 7.309e+09 180.6 <2e-16
## Type 2 8.383e+09 4.192e+09 103.6 <2e-16
## Residuals 65 2.630e+09 4.046e+07
In the above we have assumed that there is no interaction between the predictors. If we suspect that there is one, we should run
summary(aov(Sales~(.)^2, data=shoesales))
## Df Sum Sq Mean Sq F value Pr(>F)
## Year 1 7.309e+09 7.309e+09 227.39 < 2e-16
## Type 2 8.383e+09 4.192e+09 130.41 < 2e-16
## Year:Type 2 6.049e+08 3.024e+08 9.41 0.000266
## Residuals 63 2.025e+09 3.214e+07
and indeed there is such an interaction.
In the case where we have two predictors, one categorical and one continuous, and a balanced design the formulas simplify:
\[\pmb{\hat{\alpha}} = \pmb{\hat{\alpha}}_0 - (\pmb{Z'Z})^{-}\pmb{Z'X\hat{\beta}}= \begin{pmatrix} 0 \\ \bar{y}_{1.}-\hat{\beta}\bar{x}_{1.} \\ \bar{y}_{2.}-\hat{\beta}\bar{x}_{2.} \\ \vdots \\ \bar{y}_{k.}-\hat{\beta}\bar{x}_{k.} \\ \end{pmatrix}\]
\[ \begin{aligned} &\pmb{E}_{xx} = e_{xx} = \sum_{ij} (x_{ij}-\bar{x}_{i.})^2 \\ &\pmb{e}_{xy} = e_{xy} = \sum_{ij} (x_{ij}-\bar{x}_{i.})(y_{ij}-\bar{y}_{i.}) \\ &\pmb{e}_{yy} = e_{yy} = \sum_{ij} (y_{ij}-\bar{y}_{i.})^2 \\ &\hat{\beta}=\frac{e_{xy}}{e_{xx}} \\ &\text{SSE}_{y.x} = e_{yy}-e^2_{xy}/e_{xx} \end{aligned} \] and \(\text{SSE}_{y.x}\) has k(n-1)-1 degrees of freedom, where k is the number of groups and n is the number of observations per group.
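These within-group sums are easy to check numerically. Here is a sketch on simulated balanced data (group count, group size and all coefficients made up), verifying that the pooled within-group slope \(e_{xy}/e_{xx}\) agrees with the slope from `lm`:

```r
# simulated balanced one-way layout with one covariate; all numbers made up
set.seed(222)
k <- 3; m <- 8                           # 3 groups, 8 observations each
g <- rep(1:k, each = m)
x <- rnorm(k * m)                        # covariate
y <- c(10, 20, 30)[g] + 1.5 * x + rnorm(k * m)

xbar <- ave(x, g); ybar <- ave(y, g)     # group means of x and y
exx <- sum((x - xbar)^2)
exy <- sum((x - xbar) * (y - ybar))
eyy <- sum((y - ybar)^2)

betahat <- exy / exx                     # pooled within-group slope
sse <- eyy - exy^2 / exx                 # SSE_{y.x}, on k(m-1)-1 degrees of freedom

fit <- lm(y ~ factor(g) + x)
c(betahat, unname(coef(fit)["x"]))       # the two slope estimates agree
```

The residual sum of squares of the `lm` fit also equals \(e_{yy}-e_{xy}^2/e_{xx}\), so the simplified formulas recover the full ANCOVA fit.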
ANOVA table for test of one covariate in One-Way ANOVA design
\[ \begin{array}{cccc} \text{Source} & \text{df} & \text{SS} & \text{F} \\ \hline \\ \text{Covariate} & 1 & \text{SSH}=e_{xy}^2/e_{xx} & \frac{\text{SSH}}{\text{SSE}/(k(n-1)-1)}\\ \text{Error} & k(n-1)-1 & \text{SSE}=e_{yy}-e_{xy}^2/e_{xx} & \\ \text{Total} & k(n-1) & \text{SST}=e_{yy} &\\ \hline \\ \end{array} \]