
Example (8.1.1)

We have data on the number of shoes sold by year and type:

kable.nice(head(shoesales), do.row.names = FALSE)
Sales Year Type
1539 1 Mens
12984 1 Kids
25809 1 Ladies
5742 2 Mens
30058 2 Kids
34764 2 Ladies

There are two ways to look at this problem:

  • as a regression problem with response Sales and predictors Year and Type, where Type is a categorical variable that has to be coded numerically using so-called dummy variables.

  • as a one-way ANOVA problem with some additional information, called a covariate. This is then called Analysis of Covariance (ANCOVA).

From the above it is clear that we have here a blend of regression and ANOVA.

Regression Analysis

In order to study this as a regression problem we first need to code the categorical variable. The obvious way to do this is to assign numbers, for example Mens=1, Kids=2 and Ladies=3. This however is usually a bad idea because it does two things:

  • it introduces an order (Mens=1 < 2=Kids)
  • it introduces a scale (Kids-Mens = 2-1 = 3-2 = Ladies-Kids)

The better way to do this is to introduce two dummy variables:

  • d1 = 1 if Mens, 0 otherwise
  • d2 = 1 if Kids, 0 otherwise

so if d1+d2=0 it has to be Ladies.
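
This is exactly the coding R builds automatically when Type is used as a factor in a model formula; here is a quick illustration (R's default treatment contrasts drop one level, chosen alphabetically, as the reference):

# R's treatment coding: one 0/1 indicator column per non-reference level,
# playing the same role as d1 and d2 above
head(model.matrix(~Type, data=shoesales))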

Example (8.1.2)

d1=ifelse(shoesales$Type=="Mens", 1, 0)
d2=ifelse(shoesales$Type=="Kids", 1, 0)
y=shoesales$Sales
X=data.frame(Sales=y, Year=shoesales$Year, d1=d1, d2=d2)
fit.shoe1=lm(Sales~., data=X)
summary(fit.shoe1)
## 
## Call:
## lm(formula = Sales ~ ., data = X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12963.7  -3433.5   -469.7   3349.1  22146.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  28416.0     1917.8  14.817  < 2e-16
## Year          1551.6      115.4  13.440  < 2e-16
## d1          -26986.9     1875.7 -14.388  < 2e-16
## d2          -14212.4     1875.7  -7.577 1.65e-10
## 
## Residual standard error: 6361 on 65 degrees of freedom
## Multiple R-squared:  0.8565, Adjusted R-squared:  0.8498 
## F-statistic: 129.3 on 3 and 65 DF,  p-value: < 2.2e-16

However, notice the following:

x=seq(min(shoesales$Year), max(shoesales$Year), length=100)
z1=coef(fit.shoe1)[1]+coef(fit.shoe1)[2]*x+coef(fit.shoe1)[3]
z2=coef(fit.shoe1)[1]+coef(fit.shoe1)[2]*x+coef(fit.shoe1)[4]
z3=coef(fit.shoe1)[1]+coef(fit.shoe1)[2]*x
df=data.frame(Year=rep(x, 3),
              Sales=c(z1, z2, z3),
              Type=rep(c("Mens", "Kids", "Ladies"), each=100))
ggplot(data=df, aes(Year, Sales, color=Type)) +
  geom_line()

and so we see that this fits parallel lines!
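
This happens because in the additive model the dummy variables only shift the intercept, while all three types share the same slope for Year:

$$\widehat{Sales} = \begin{cases} (\hat{\beta}_0 + \hat{\beta}_{d_1}) + \hat{\beta}_{Year}\, Year & \text{Mens} \\ (\hat{\beta}_0 + \hat{\beta}_{d_2}) + \hat{\beta}_{Year}\, Year & \text{Kids} \\ \hat{\beta}_0 + \hat{\beta}_{Year}\, Year & \text{Ladies} \end{cases}$$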

Here is how to fit separate lines for each group:

fit.shoe2=lm(Sales~(.)^2, data=X)
summary(fit.shoe2)
## 
## Call:
## lm(formula = Sales ~ (.)^2, data = X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11588.7  -3433.0   -256.7   2947.3  16121.3 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  21185.6     2443.6   8.670 2.42e-12
## Year          2154.1      178.2  12.087  < 2e-16
## d1          -14185.1     3455.7  -4.105 0.000119
## d2           -5322.9     3455.7  -1.540 0.128490
## Year:d1      -1066.8      252.0  -4.233 7.64e-05
## Year:d2       -740.8      252.0  -2.939 0.004594
## d1:d2             NA         NA      NA       NA
## 
## Residual standard error: 5669 on 63 degrees of freedom
## Multiple R-squared:  0.8895, Adjusted R-squared:  0.8807 
## F-statistic: 101.4 on 5 and 63 DF,  p-value: < 2.2e-16
z1=coef(fit.shoe2)[1]+coef(fit.shoe2)[2]*x+coef(fit.shoe2)[3]+coef(fit.shoe2)[5]*x
z2=coef(fit.shoe2)[1]+coef(fit.shoe2)[2]*x+coef(fit.shoe2)[4]+coef(fit.shoe2)[6]*x
z3=coef(fit.shoe2)[1]+coef(fit.shoe2)[2]*x
df=data.frame(Year=rep(x, 3),
              Sales=c(z1, z2, z3),
              Type=rep(c("Mens", "Kids", "Ladies"), each=100))
ggplot(data=df, aes(Year, Sales, color=Type)) +
  geom_line()
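
The same separate-lines fit can also be obtained directly from the factor, without building the dummy variables by hand (a minimal sketch; the name fit.shoe3 is just for illustration, and the coefficients are labeled differently but describe the same three lines):

# Year*Type expands to Year + Type + Year:Type, so each shoe type
# gets its own intercept and its own slope
fit.shoe3=lm(Sales~Year*Type, data=shoesales)
coef(fit.shoe3)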

ANCOVA Model

The model can be written as

$$y = Z\alpha + X\beta + \epsilon$$

where $Z$ contains 0's and 1's, $\alpha$ contains $\mu$ and parameters such as $\alpha_i$, $\beta_j$ and $\gamma_{ij}$ representing factors and interactions, $X$ is a matrix of covariate values and $\beta$ contains their coefficients.

Example (8.1.3)

For the shoesales data we have

$$y = Z\alpha + X\beta + \epsilon = \begin{pmatrix} 1&1&0&0 \\ \vdots&\vdots&\vdots&\vdots \\ 1&1&0&0 \\ 1&0&1&0 \\ \vdots&\vdots&\vdots&\vdots \\ 1&0&1&0 \\ 1&0&0&1 \\ \vdots&\vdots&\vdots&\vdots \\ 1&0&0&1 \end{pmatrix} \begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix} + \begin{pmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1n_1} \\ x_{21} \\ \vdots \\ x_{3n_3} \end{pmatrix} \beta + \epsilon$$

Estimation

We will assume that $Z$ is less than full rank and that $X$ is full rank. We can write the model in the form

$$y = Z\alpha + X\beta + \epsilon = \begin{pmatrix} Z & X \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} + \epsilon = U\theta + \epsilon$$

The normal equations are

$$U'U\hat{\theta} = U'y$$

which is equal to

$$\begin{pmatrix} Z'Z & Z'X \\ X'Z & X'X \end{pmatrix} \begin{pmatrix} \hat{\alpha} \\ \hat{\beta} \end{pmatrix} = \begin{pmatrix} Z'y \\ X'y \end{pmatrix}$$

which corresponds to the two equations

$$\begin{aligned} Z'Z\hat{\alpha} + Z'X\hat{\beta} &= Z'y \\ X'Z\hat{\alpha} + X'X\hat{\beta} &= X'y \end{aligned}$$

Using a generalized inverse $(Z'Z)^-$ for $Z'Z$ we find

$$\hat{\alpha} = (Z'Z)^- Z'y - (Z'Z)^- Z'X\hat{\beta} = \hat{\alpha}_0 - (Z'Z)^- Z'X\hat{\beta}$$

where $\hat{\alpha}_0$ is a solution of the normal equations of the model without the covariates.

To find $\hat{\beta}$ we substitute this into the second equation:

$$\begin{aligned} X'Z\left[(Z'Z)^- Z'y - (Z'Z)^- Z'X\hat{\beta}\right] + X'X\hat{\beta} &= X'y \\ X'\left[I - Z(Z'Z)^- Z'\right]X\hat{\beta} &= X'y - X'Z(Z'Z)^- Z'y \\ X'\left[I - P\right]X\hat{\beta} &= X'(I - P)y \end{aligned}$$

where $P = Z(Z'Z)^- Z'$. Because $X$ is a matrix of covariates its columns are (almost always) linearly independent of the columns of $Z$, and so $X'(I-P)X$ is non-singular. So we find

$$\hat{\beta} = \left[X'(I-P)X\right]^{-1} X'(I-P)y = E_{xx}^{-1} e_{xy}$$

Example (8.1.4)

Let's find the estimates for the shoesales data. For that we need to reorder the data as well:

shoesales=shoesales[order(shoesales$Type), ]
X=matrix(as.numeric(shoesales[ ,"Year"]),ncol=1)
y=matrix(as.numeric(shoesales[ ,"Sales"]),ncol=1)
ni=table(shoesales$Type)
ni
## 
##   Kids Ladies   Mens 
##     23     23     23
n=sum(ni)
Z=matrix(0, n, 4)
Z[, 1]=1
Z[1:ni[1], 2]=1
Z[(1+ni[1]):(ni[1]+ni[2]), 3]=1
Z[(1+ni[1]+ni[2]):n, 4]=1
library(MASS)
gZZ=ginv(t(Z)%*%Z)
P=Z%*%gZZ%*%t(Z)
betahat=solve(t(X)%*%(diag(n)-P)%*%X)%*%t(X)%*%(diag(n)-P)%*%y
alpha0hat=gZZ%*%t(Z)%*%y
c(alpha0hat)
## [1] 24976.239  7846.109 22058.500 -4928.370
alphahat=alpha0hat-gZZ%*%t(Z)%*%X%*%betahat
c(alphahat, betahat)
## [1] 11012.182  3191.423 17403.814 -9583.055  1551.562
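
As a quick consistency check, $\hat{\beta}$ should agree with the coefficient of Year in the dummy-variable regression fit.shoe1 from Example (8.1.2):

# both are estimates of the common slope (about 1551.56 here)
c(betahat, coef(fit.shoe1)["Year"])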

Now for the sum of squares we find

$$\begin{aligned} SSE = SSE_{y.x} &= y'y - \hat{\theta}'U'y \\ &= y'y - \begin{pmatrix} \hat{\alpha}' & \hat{\beta}' \end{pmatrix} \begin{pmatrix} Z'y \\ X'y \end{pmatrix} \\ &= y'y - \hat{\alpha}'Z'y - \hat{\beta}'X'y \\ &= y'y - \left[\hat{\alpha}_0' - \hat{\beta}'X'Z(Z'Z)^-\right]Z'y - \hat{\beta}'X'y \\ &= y'y - \hat{\alpha}_0'Z'y + \hat{\beta}'X'Z(Z'Z)^- Z'y - \hat{\beta}'X'y \\ &= y'y - \hat{\alpha}_0'Z'y - \hat{\beta}'X'\left[I - Z(Z'Z)^- Z'\right]y \\ &= SSE_y - \hat{\beta}'X'\left[I - P\right]y \end{aligned}$$

where $SSE_y$ is the error sum of squares for the ANOVA model without covariates.

Setting $e_{yy} = SSE_y = y'(I-P)y$ we can also write

$$SSE_{y.x} = e_{yy} - e_{xy}'E_{xx}^{-1}e_{xy}$$

It can be shown that $E_{xx}$ is positive definite, so that $e_{xy}'E_{xx}^{-1}e_{xy} > 0$, and so the use of covariates reduces the error sum of squares, thereby leading to smaller variances.

Example (8.1.5)

Continuing the analysis of the shoesales data:

Exx=t(X)%*%(diag(n)-P)%*%X
exy=t(X)%*%(diag(n)-P)%*%y
eyy=t(y)%*%(diag(n)-P)%*%y
sse=eyy-t(exy)%*%solve(Exx)%*%exy
c(sse, eyy)
## [1] 2629827256 9938524862
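
As a check, SSE agrees with the residual sum of squares of the dummy-variable regression fit.shoe1 from Example (8.1.2):

# residual sum of squares of fit.shoe1; should equal sse above
sum(resid(fit.shoe1)^2)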

Testing Hypotheses

As always we will now assume that $\epsilon \sim N_n(0, \sigma^2 I)$. The hypothesis $H_0: \alpha_1 = \dots = \alpha_k = 0$ can be written as

$$H_0: \begin{pmatrix} C_1 & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = 0$$

and we can use a general linear hypothesis test.

A test that is often of interest is $H_0: \beta = 0$, that is, a test of whether any of the covariates are useful. For a general linear hypothesis test of $H_0: \beta = 0$ we need $cov(\hat{\beta})$, where $\hat{\beta} = [X'(I-P)X]^{-1}X'(I-P)y$. Since by (4.3.19) $I-P$ is idempotent we have

$$\begin{aligned} cov(\hat{\beta}) &= [X'(I-P)X]^{-1}X'(I-P)\,\sigma^2 I\,(I-P)'X[X'(I-P)X]^{-1} \\ &= \sigma^2 [X'(I-P)X]^{-1}X'(I-P)(I-P)X[X'(I-P)X]^{-1} \\ &= \sigma^2 [X'(I-P)X]^{-1}\left[X'(I-P)X\right][X'(I-P)X]^{-1} \\ &= \sigma^2 [X'(I-P)X]^{-1} \end{aligned}$$
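
For the shoesales data this reproduces the standard error that lm reported for Year in Example (8.1.2) (a quick check, reusing sse, Exx and n from above; the error degrees of freedom are n-4=65 here because Z and the covariate together have rank 4):

# estimate sigma^2 from the ANCOVA fit, then the standard error of beta-hat
sigma2.hat=c(sse)/(n-4)
c(sqrt(sigma2.hat/c(Exx)), summary(fit.shoe1)$coefficients["Year", "Std. Error"])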

So now

$$SSH = \hat{\beta}'\,X'(I-P)X\,\hat{\beta} = e_{xy}'E_{xx}^{-1}e_{xy}$$

ANOVA table for test of covariates in ANCOVA design

Source        df       SS                                          F
Covariates    k        $SSH = e_{xy}'E_{xx}^{-1}e_{xy}$            $\frac{SSH/k}{SSE/(n-k-1)}$
Error         n-k-1    $SSE = e_{yy} - e_{xy}'E_{xx}^{-1}e_{xy}$
Total         n-1      $SST = e_{yy}$

Example (8.1.6)

Let’s test for Year in shoesales:

k=1
ssh=t(exy)%*%solve(Exx)%*%exy
FTS = (ssh/1)/(sse/(n-k-1))
round(c(ssh, sse/c(1, n-k-1), FTS, 1-pf(FTS, 1, n-k-1)))
## [1] 7308697607 2629827256   39251153        186          0

or with R:

summary(aov(Sales~., data=shoesales))
##             Df    Sum Sq   Mean Sq F value Pr(>F)
## Year         1 7.309e+09 7.309e+09   180.6 <2e-16
## Type         2 8.383e+09 4.192e+09   103.6 <2e-16
## Residuals   65 2.630e+09 4.046e+07

In the above we have assumed that there is no interaction between the predictors. If we suspect that there is one, we should run

summary(aov(Sales~(.)^2, data=shoesales))
##             Df    Sum Sq   Mean Sq F value   Pr(>F)
## Year         1 7.309e+09 7.309e+09  227.39  < 2e-16
## Type         2 8.383e+09 4.192e+09  130.41  < 2e-16
## Year:Type    2 6.049e+08 3.024e+08    9.41 0.000266
## Residuals   63 2.025e+09 3.214e+07

and indeed there is such an interaction.
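
The same F test for the interaction can also be obtained by comparing the two regression fits fit.shoe1 and fit.shoe2 from the Regression Analysis section:

# parallel-lines model vs separate-lines model
anova(fit.shoe1, fit.shoe2)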

Balanced One-Way Model with One Covariate

In the case where we have two predictors, one categorical and one continuous, and a balanced design the formulas simplify:

$$\hat{\alpha} = \hat{\alpha}_0 - (Z'Z)^- Z'X\hat{\beta} = \begin{pmatrix} 0 \\ \bar{y}_{1.} - \hat{\beta}\bar{x}_{1.} \\ \bar{y}_{2.} - \hat{\beta}\bar{x}_{2.} \\ \vdots \\ \bar{y}_{k.} - \hat{\beta}\bar{x}_{k.} \end{pmatrix}$$

$$\begin{aligned} E_{xx} = e_{xx} &= \sum_{ij} (x_{ij} - \bar{x}_{i.})^2 \\ e_{xy} &= \sum_{ij} (x_{ij} - \bar{x}_{i.})(y_{ij} - \bar{y}_{i.}) \\ e_{yy} &= \sum_{ij} (y_{ij} - \bar{y}_{i.})^2 \\ \hat{\beta} &= e_{xy}/e_{xx} \\ SSE_{y.x} &= e_{yy} - e_{xy}^2/e_{xx} \end{aligned}$$

and $SSE_{y.x}$ has k(n-1)-1 degrees of freedom, where k is the number of groups and n is the number of observations per group.
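
Since the shoesales data is balanced (23 observations per type), these formulas can be checked directly (a minimal sketch, reusing the reordered shoesales and the matrices X and y from Example (8.1.4)):

# within-group means of the covariate and the response
xbar=ave(c(X), shoesales$Type)
ybar=ave(c(y), shoesales$Type)
e.xx=sum((c(X)-xbar)^2)
e.xy=sum((c(X)-xbar)*(c(y)-ybar))
e.yy=sum((c(y)-ybar)^2)
c(beta.hat=e.xy/e.xx, SSE.y.x=e.yy-e.xy^2/e.xx)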

ANOVA table for test of one covariate in One-Way ANOVA design

Source        df          SS                                F
Covariate     1           $SSH = e_{xy}^2/e_{xx}$           $\frac{SSH/1}{SSE/(k(n-1)-1)}$
Error         k(n-1)-1    $SSE = e_{yy} - e_{xy}^2/e_{xx}$
Total         k(n-1)      $SST = e_{yy}$