We have data on the number of shoes sold by year and type:
kable.nice(head(shoesales), do.row.names = FALSE)
Sales | Year | Type |
---|---|---|
1539 | 1 | Mens |
12984 | 1 | Kids |
25809 | 1 | Ladies |
5742 | 2 | Mens |
30058 | 2 | Kids |
34764 | 2 | Ladies |
There are two ways to look at this problem:

- as a regression problem with response Sales and predictors Year and Type, where the categorical variable Type enters the model through so-called dummy variables.
- as a one-way ANOVA problem with some additional information, called a covariate. This is then called analysis of covariance (ANCOVA).
From the above it is clear that we have here a blend of regression and ANOVA.
In order to study this as a regression problem we first need to code the categorical variable. The obvious way to do this is to assign numbers, for example Mens=1, Kids=2 and Ladies=3. This however is usually a bad idea because it does two things: it imposes an ordering on the types, and it forces the "distances" between them to be equal, neither of which is justified here.
The better way is to introduce two dummy variables:

- d1 = 1 if the type is Mens, and 0 otherwise
- d2 = 1 if the type is Kids, and 0 otherwise

so if d1+d2=0 it has to be Ladies.
d1=ifelse(shoesales$Type=="Mens", 1, 0)
d2=ifelse(shoesales$Type=="Kids", 1, 0)
y=shoesales$Sales
X=data.frame(Sales=y, Year=shoesales$Year, d1=d1, d2=d2)
fit.shoe1=lm(Sales~., data=X)
summary(fit.shoe1)
##
## Call:
## lm(formula = Sales ~ ., data = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12963.7 -3433.5 -469.7 3349.1 22146.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28416.0 1917.8 14.817 < 2e-16
## Year 1551.6 115.4 13.440 < 2e-16
## d1 -26986.9 1875.7 -14.388 < 2e-16
## d2 -14212.4 1875.7 -7.577 1.65e-10
##
## Residual standard error: 6361 on 65 degrees of freedom
## Multiple R-squared: 0.8565, Adjusted R-squared: 0.8498
## F-statistic: 129.3 on 3 and 65 DF, p-value: < 2.2e-16
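As an aside, R builds equivalent dummy variables automatically when Type is treated as a factor. Here is a minimal check (the name fit.auto is just for illustration; R picks a different baseline level, so the coefficients are labeled differently, but the fitted model is the same):
fit.auto=lm(Sales~Year+Type, data=shoesales)
coef(fit.auto)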
However, notice the following:
x=seq(min(shoesales$Year), max(shoesales$Year), length=100)
z1=coef(fit.shoe1)[1]+coef(fit.shoe1)[2]*x+coef(fit.shoe1)[3]
z2=coef(fit.shoe1)[1]+coef(fit.shoe1)[2]*x+coef(fit.shoe1)[4]
z3=coef(fit.shoe1)[1]+coef(fit.shoe1)[2]*x
df=data.frame(Year=rep(x, 3),
Sales=c(z1, z2, z3),
Type=rep(c("Mens", "Kids", "Ladies"), each=100))
ggplot(data=df, aes(Year, Sales, color=Type)) +
geom_line()
and so we see that this fits parallel lines!
Here is how to fit separate lines for each group:
fit.shoe2=lm(Sales~(.)^2, data=X)
summary(fit.shoe2)
##
## Call:
## lm(formula = Sales ~ (.)^2, data = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11588.7 -3433.0 -256.7 2947.3 16121.3
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21185.6 2443.6 8.670 2.42e-12
## Year 2154.1 178.2 12.087 < 2e-16
## d1 -14185.1 3455.7 -4.105 0.000119
## d2 -5322.9 3455.7 -1.540 0.128490
## Year:d1 -1066.8 252.0 -4.233 7.64e-05
## Year:d2 -740.8 252.0 -2.939 0.004594
## d1:d2 NA NA NA NA
##
## Residual standard error: 5669 on 63 degrees of freedom
## Multiple R-squared: 0.8895, Adjusted R-squared: 0.8807
## F-statistic: 101.4 on 5 and 63 DF, p-value: < 2.2e-16
# coefficient order: (Intercept), Year, d1, d2, Year:d1, Year:d2
z1=coef(fit.shoe2)[1]+coef(fit.shoe2)[2]*x+coef(fit.shoe2)[3]+coef(fit.shoe2)[5]*x
z2=coef(fit.shoe2)[1]+coef(fit.shoe2)[2]*x+coef(fit.shoe2)[4]+coef(fit.shoe2)[6]*x
z3=coef(fit.shoe2)[1]+coef(fit.shoe2)[2]*x
df=data.frame(Year=rep(x, 3),
Sales=c(z1, z2, z3),
Type=rep(c("Mens", "Kids", "Ladies"), each=100))
ggplot(data=df, aes(Year, Sales, color=Type)) +
geom_line()
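Equivalently, R can fit these separate lines directly by including an interaction between Year and the factor Type (a sketch; fit.int is just an illustrative name):
fit.int=lm(Sales~Year*Type, data=shoesales)
coef(fit.int)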
The model can be written as
$$y = Z\alpha + X\beta + \epsilon$$

where $Z$ contains 0's and 1's, $\alpha$ contains $\mu$ and parameters such as $\alpha_i$, $\beta_j$ and $\gamma_{ij}$ representing factors and interactions, $X$ is a matrix of covariate values and $\beta$ their coefficients.
For the shoesales data we have
$$y = Z\alpha + X\beta + \epsilon =
\begin{pmatrix}
1 & 1 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots \\
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
\vdots & \vdots & \vdots & \vdots \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
\vdots & \vdots & \vdots & \vdots \\
1 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{pmatrix} +
\begin{pmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1n_1} \\ x_{21} \\ \vdots \\ x_{3n_3} \end{pmatrix}\beta$$
We will assume that $Z$ is less than full rank and that $X$ is full rank. We can write the model in the form
$$y = Z\alpha + X\beta + \epsilon = \begin{pmatrix} Z & X \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} + \epsilon = U\theta + \epsilon$$
The normal equations are
$$U'U\hat{\theta} = U'y$$
which is equal to
$$\begin{pmatrix} Z'Z & Z'X \\ X'Z & X'X \end{pmatrix}\begin{pmatrix} \hat{\alpha} \\ \hat{\beta} \end{pmatrix} = \begin{pmatrix} Z'y \\ X'y \end{pmatrix}$$
which is equivalent to the two equations
$$\begin{aligned}
Z'Z\hat{\alpha} + Z'X\hat{\beta} &= Z'y \\
X'Z\hat{\alpha} + X'X\hat{\beta} &= X'y
\end{aligned}$$
Using a generalized inverse of $Z'Z$ we find

$$\hat{\alpha} = (Z'Z)^{-}Z'y - (Z'Z)^{-}Z'X\hat{\beta} = \hat{\alpha}_0 - (Z'Z)^{-}Z'X\hat{\beta}$$

where $\hat{\alpha}_0$ is a solution of the normal equations without the covariates.
To find $\hat{\beta}$ we substitute this into the second equation:
$$\begin{aligned}
X'Z\left[(Z'Z)^{-}Z'y - (Z'Z)^{-}Z'X\hat{\beta}\right] + X'X\hat{\beta} &= X'y \\
X'\left[I - Z(Z'Z)^{-}Z'\right]X\hat{\beta} &= X'y - X'Z(Z'Z)^{-}Z'y \\
X'\left[I - P\right]X\hat{\beta} &= X'(I-P)y
\end{aligned}$$

where $P = Z(Z'Z)^{-}Z'$. Because $X$ is a set of covariates it is (almost always) linearly independent of $Z$, and so $X'[I-P]X$ is nonsingular. So we find

$$\hat{\beta} = \left[X'(I-P)X\right]^{-1}X'(I-P)y = E_{xx}^{-1}e_{xy}$$

where $E_{xx} = X'(I-P)X$ and $e_{xy} = X'(I-P)y$.
Let's find the estimates for the shoesales data. For that we first need to reorder the data by Type, so that the rows of $Z$ come in group blocks:
shoesales=shoesales[order(shoesales$Type), ]
X=matrix(as.numeric(shoesales[ ,"Year"]),ncol=1)
y=matrix(as.numeric(shoesales[ ,"Sales"]),ncol=1)
ni=table(shoesales$Type)
ni
##
## Kids Ladies Mens
## 23 23 23
n=sum(ni)
Z=matrix(0, n, 4)
Z[, 1]=1
Z[1:ni[1], 2]=1
Z[(1+ni[1]):(ni[1]+ni[2]), 3]=1
Z[(1+ni[1]+ni[2]):n, 4]=1
library(MASS)
gZZ=ginv(t(Z)%*%Z)
P=Z%*%gZZ%*%t(Z)
betahat=solve(t(X)%*%(diag(n)-P)%*%X)%*%t(X)%*%(diag(n)-P)%*%y
alpha0hat=gZZ%*%t(Z)%*%y
c(alpha0hat)
## [1] 24976.239 7846.109 22058.500 -4928.370
alphahat=alpha0hat-gZZ%*%t(Z)%*%X%*%betahat
c(alphahat, betahat)
## [1] 11012.182 3191.423 17403.814 -9583.055 1551.562
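As a sanity check (a sketch, not part of the derivation): the group intercepts $\mu+\alpha_i$ from this less-than-full-rank solution should match the ones implied by fit.shoe1, where Ladies is the baseline (d1=d2=0) and the groups here are ordered Kids, Ladies, Mens:
# mu + alpha_i from the generalized-inverse solution (Kids, Ladies, Mens)
c(alphahat[1]+alphahat[2:4])
# the same quantities implied by the dummy-variable fit fit.shoe1
coef(fit.shoe1)[1]+c(coef(fit.shoe1)[4], 0, coef(fit.shoe1)[3])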
Now for the sum of squares we find
$$\begin{aligned}
SSE = SSE_{y.x} &= y'y - \hat{\theta}'U'y = y'y - \begin{pmatrix} \hat{\alpha}' & \hat{\beta}' \end{pmatrix}\begin{pmatrix} Z'y \\ X'y \end{pmatrix} = y'y - \hat{\alpha}'Z'y - \hat{\beta}'X'y \\
&= y'y - \left[\hat{\alpha}_0' - \hat{\beta}'X'Z(Z'Z)^{-}\right]Z'y - \hat{\beta}'X'y \\
&= y'y - \hat{\alpha}_0'Z'y + \hat{\beta}'X'Z(Z'Z)^{-}Z'y - \hat{\beta}'X'y \\
&= y'y - \hat{\alpha}_0'Z'y - \hat{\beta}'X'\left[I - Z(Z'Z)^{-}Z'\right]y \\
&= SSE_y - \hat{\beta}'X'\left[I - P\right]y
\end{aligned}$$
where $SSE_y$ is the error sum of squares for the ANOVA model without covariates.
Setting $e_{yy} = SSE_y = y'(I-P)y$ we can also write

$$SSE_{y.x} = e_{yy} - e_{xy}'E_{xx}^{-1}e_{xy}$$

It can be shown that $E_{xx}$ is positive definite, so that $e_{xy}'E_{xx}^{-1}e_{xy} > 0$, and so the use of covariates reduces the error sum of squares, thereby leading to smaller variances.
Continuing the analysis of the shoesales data:
Exx=t(X)%*%(diag(n)-P)%*%X
exy=t(X)%*%(diag(n)-P)%*%y
eyy=t(y)%*%(diag(n)-P)%*%y
sse=eyy-t(exy)%*%solve(Exx)%*%exy
c(sse, eyy)
## [1] 2629827256 9938524862
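The first of these should also agree with the residual sum of squares of the regression fit from before; a quick check:
deviance(fit.shoe1)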
As always we will now assume that $\epsilon \sim N_n(0, \sigma^2 I)$. The hypothesis $H_0: \alpha_1 = \dots = \alpha_k = 0$ can be written as

$$H_0: \begin{pmatrix} C_1 & 0 \end{pmatrix}\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = 0$$
and we can use a general linear hypothesis test.
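In R, one way to carry out this test is an extra-sum-of-squares comparison of the fits with and without the Type effects (a sketch; the fit names are just for illustration):
fit.red=lm(Sales~Year, data=shoesales)
fit.full=lm(Sales~Year+Type, data=shoesales)
anova(fit.red, fit.full)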
A test that is often of interest is $H_0: \beta = 0$, that is, a test of whether any of the covariates is useful. For a general linear hypothesis test of $H_0: \beta = 0$ we need $cov(\hat{\beta})$, where $\hat{\beta} = [X'(I-P)X]^{-1}X'(I-P)y$. Since $I-P$ is idempotent by (4.3.19), we have

$$\begin{aligned}
cov(\hat{\beta}) &= \left[X'(I-P)X\right]^{-1}X'(I-P)\,\sigma^2 I\,(I-P)X\left[X'(I-P)X\right]^{-1} \\
&= \sigma^2\left[X'(I-P)X\right]^{-1}X'(I-P)(I-P)X\left[X'(I-P)X\right]^{-1} \\
&= \sigma^2\left[X'(I-P)X\right]^{-1}\left[X'(I-P)X\right]\left[X'(I-P)X\right]^{-1} \\
&= \sigma^2\left[X'(I-P)X\right]^{-1}
\end{aligned}$$
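For the shoesales data this formula is easy to check numerically (a sketch; sse, Exx and fit.shoe1 are as computed above, 65 is the residual degrees of freedom of fit.shoe1, and sigma2.hat is an illustrative name):
sigma2.hat=c(sse)/65    # MSE from the dummy-variable regression
# sigma^2 [X'(I-P)X]^{-1} versus the variance of the Year coefficient in fit.shoe1
c(sigma2.hat*solve(Exx), vcov(fit.shoe1)["Year", "Year"])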
So now
$$SSH = \hat{\beta}'\left[X'(I-P)X\right]\hat{\beta} = e_{xy}'E_{xx}^{-1}e_{xy}$$
ANOVA table for the test of the covariates in an ANCOVA design:

Source | df | SS | F |
---|---|---|---|
Covariates | $k$ | $SSH=e_{xy}'E_{xx}^{-1}e_{xy}$ | $\frac{SSH/k}{SSE/(n-k-1)}$ |
Error | $n-k-1$ | $SSE=e_{yy}-e_{xy}'E_{xx}^{-1}e_{xy}$ | |
Total | $n-1$ | $SST=e_{yy}$ | |
Let’s test for Year in shoesales:
k=1
ssh=t(exy)%*%solve(Exx)%*%exy
FTS = (ssh/1)/(sse/(n-k-1))
round(c(ssh, sse/c(1, n-k-1), FTS, 1-pf(FTS, 1, n-k-1)))
## [1] 7308697607 2629827256 39251153 186 0
or with R:
summary(aov(Sales~., data=shoesales))
## Df Sum Sq Mean Sq F value Pr(>F)
## Year 1 7.309e+09 7.309e+09 180.6 <2e-16
## Type 2 8.383e+09 4.192e+09 103.6 <2e-16
## Residuals 65 2.630e+09 4.046e+07
In the above we have assumed that there is no interaction between the predictors. If we suspect that there is one, we should run
summary(aov(Sales~(.)^2, data=shoesales))
## Df Sum Sq Mean Sq F value Pr(>F)
## Year 1 7.309e+09 7.309e+09 227.39 < 2e-16
## Type 2 8.383e+09 4.192e+09 130.41 < 2e-16
## Year:Type 2 6.049e+08 3.024e+08 9.41 0.000266
## Residuals 63 2.025e+09 3.214e+07
and indeed there is such an interaction.
In the case where we have two predictors, one categorical and one continuous, and a balanced design the formulas simplify:
$$\hat{\alpha} = \hat{\alpha}_0 - (Z'Z)^{-}Z'X\hat{\beta} = \begin{pmatrix} 0 \\ \bar{y}_{1.} - \hat{\beta}\bar{x}_{1.} \\ \bar{y}_{2.} - \hat{\beta}\bar{x}_{2.} \\ \vdots \\ \bar{y}_{k.} - \hat{\beta}\bar{x}_{k.} \end{pmatrix}$$

$$\begin{aligned}
E_{xx} = e_{xx} &= \sum_{ij}(x_{ij} - \bar{x}_{i.})^2 \\
e_{xy} &= \sum_{ij}(x_{ij} - \bar{x}_{i.})(y_{ij} - \bar{y}_{i.}) \\
e_{yy} &= \sum_{ij}(y_{ij} - \bar{y}_{i.})^2 \\
\hat{\beta} &= e_{xy}/e_{xx} \\
SSE_{y.x} &= e_{yy} - e_{xy}^2/e_{xx}
\end{aligned}$$

and $SSE_{y.x}$ has $k(n-1)-1$ degrees of freedom, where k is the number of groups and n is the number of observations per group.
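These quantities are easy to compute directly. A minimal check for the shoesales data (which is balanced, with 23 observations per type), using group-mean centering via ave(); the names xc and yc are just for illustration:
xc=shoesales$Year-ave(shoesales$Year, shoesales$Type)     # x_ij - xbar_i.
yc=shoesales$Sales-ave(shoesales$Sales, shoesales$Type)   # y_ij - ybar_i.
c(sum(xc*yc)/sum(xc^2), sum(yc^2)-sum(xc*yc)^2/sum(xc^2)) # betahat and SSE_y.x
These should reproduce the values of betahat and sse found above.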
ANOVA table for the test of one covariate in a one-way ANOVA design:

Source | df | SS | F |
---|---|---|---|
Covariate | $1$ | $SSH=e_{xy}^2/e_{xx}$ | $\frac{SSH/1}{SSE/(k(n-1)-1)}$ |
Error | $k(n-1)-1$ | $SSE=e_{yy}-e_{xy}^2/e_{xx}$ | |
Total | $k(n-1)-2$ | $SST=e_{yy}$ | |