We start with a test of whether any of the predictor variables are useful. Let \(\pmb{\beta}_1 = (\beta_1\text{ }...\text{ }\beta_k)'\); we wish to test
\[H_0: \pmb{\beta}_1 = \pmb{0}\]
To find a test we will use the centered model
\[\pmb{y} = \begin{pmatrix} \pmb{j} & \pmb{X}_c \end{pmatrix} \begin{pmatrix} \alpha \\ \pmb{\beta}_1\end{pmatrix}+\pmb{\epsilon}\]
where \(\pmb{X}_c = [\pmb{I}-(1/n)\pmb{J}]\pmb{X}_1\) is the centered matrix and \(\pmb{X}_1\) is \(\pmb{X}\) without the column of 1’s. The corrected total sum of squares is given by
\[ \begin{aligned} &\sum (y_i-\bar{y})^2 = \\ &\pmb{\hat{\beta}}_1'\pmb{X}_c'\pmb{y} +\left[\sum (y_i-\bar{y})^2 - \pmb{\hat{\beta}}_1'\pmb{X}_c'\pmb{y}\right]= \\ &\pmb{\hat{\beta}}_1'\pmb{X}_c'\pmb{X}_c\pmb{\hat{\beta}}_1 + \text{SSE} = \\ &\text{SSR}+\text{SSE} \end{aligned} \]
using (6.4.5).
Recall the following formulas:
\[ \begin{aligned} &\sum (y_i-\bar{y})^2 = \pmb{y}'[\pmb{I}-(1/n)\pmb{J}]\pmb{y}\\ &\pmb{\hat{\beta}}_1 = (\pmb{X}_c'\pmb{X}_c)^{-1}\pmb{X}_c'\pmb{y}\\ &\text{SSE} =\sum (y_i-\bar{y})^2- \pmb{\hat{\beta}}_1'\pmb{X}_c'\pmb{y}\\ \end{aligned} \]
so we have
\[ \begin{aligned} &\text{SSR}+\text{SSE} =\sum (y_i-\bar{y})^2 = \pmb{y}'[\pmb{I}-(1/n)\pmb{J}]\pmb{y} =\\ &\pmb{y'X}_c(\pmb{X}_c'\pmb{X}_c)^{-1}\pmb{X}_c'\pmb{y} + \pmb{y}'[\pmb{I}-(1/n)\pmb{J}]\pmb{y} -\pmb{y'X}_c(\pmb{X}_c'\pmb{X}_c)^{-1}\pmb{X}_c'\pmb{y}= \\ &\pmb{y}'\pmb{H}_c\pmb{y} +\pmb{y}'[\pmb{I}-(1/n)\pmb{J}-\pmb{H}_c]\pmb{y} \end{aligned} \]
where \(\pmb{H}_c=\pmb{X}_c(\pmb{X}_c'\pmb{X}_c)^{-1}\pmb{X}_c'\).
The matrices above have the following properties: \(\pmb{H}_c\) and \(\pmb{I}-(1/n)\pmb{J}-\pmb{H}_c\) are symmetric and idempotent, with ranks \(k\) and \(n-k-1\) respectively, and \(\pmb{H}_c[\pmb{I}-(1/n)\pmb{J}-\pmb{H}_c]=\pmb{0}\).
proof follows from direct calculation
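As a quick numerical check of this decomposition, here is a minimal sketch on simulated data (the objects below are hypothetical and separate from the houseprice example used later); the two quadratic forms add up to the corrected total sum of squares:
n=50
X1=matrix(runif(3*n), n, 3)
y=cbind(X1%*%c(1, 2, -1)+rnorm(n))
Xc=X1
for(j in 1:3) Xc[, j]=Xc[, j]-mean(Xc[, j])   # center each column
Hc=Xc%*%solve(t(Xc)%*%Xc)%*%t(Xc)
J=matrix(1, n, n)
# SSR + SSE versus the corrected total sum of squares
c(t(y)%*%Hc%*%y + t(y)%*%(diag(n)-J/n-Hc)%*%y, sum((y-mean(y))^2))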
If \(\pmb{y}\sim N_n(\pmb{X\beta}, \sigma^2\pmb{I})\) then
i. \(\text{SSR}/\sigma^2=\pmb{\hat{\beta}}_1'\pmb{X}_c'\pmb{X}_c\pmb{\hat{\beta}}_1/\sigma^2 \sim \chi^2(k, \lambda_1)\), where \(\lambda_1=\pmb{\beta}_1'\pmb{X}_c'\pmb{X}_c\pmb{\beta}_1/(2\sigma^2)\)
ii. \(\text{SSE}/\sigma^2=[\sum (y_i-\bar{y})^2-\pmb{\hat{\beta}}_1'\pmb{X}_c'\pmb{X}_c\pmb{\hat{\beta}}_1]/\sigma^2 \sim \chi^2(n-k-1)\)
iii. SSR and SSE are independent
proof i. and ii. follow from the calculation above and (5.2.2). The proof of iii. is omitted.
Under the conditions of theorem (6.6.2) let
\[F=\frac{\text{SSR}/k}{\text{SSE}/(n-k-1)} \]
then
if \(H_0:\pmb{\beta}_1=\pmb{0}\) is false \(F\sim F(k, n-k-1, \lambda_1)\), where \(\lambda_1=\pmb{\beta}_1'\pmb{X}_c'\pmb{X}_c\pmb{\beta}_1/(2\sigma^2)\)
if \(H_0:\pmb{\beta}_1=\pmb{0}\) is true \(F\sim F(k, n-k-1)\)
proof follows from theorem (6.6.2) and the definition of the F distribution as the ratio of two independent chi-square random variables, each divided by its degrees of freedom
The test \(H_0:\pmb{\beta}_1=\pmb{0}\) is done as follows: Reject \(H_0\) if \(F\ge F_{\alpha, k, n-k-1}\), where \(F_{\alpha, k, n-k-1}\) is the upper \(\alpha\) percentile of a (central) F distribution with k and n-k-1 degrees of freedom.
The p value of the test is given by \(P(X>F)\), where \(X\sim F(k, n-k-1)\).
The result of such a test is usually presented in the form of an ANOVA table, which looks as follows:
\[ \begin{array}{ccccc} \text{Source} & \text{df} & \text{SS} & \text{MS} & \text{F}\\ \hline \text{Due to }\beta_1 & k & \text{SSR} & \text{SSR}/k & F\\ \text{Error} & n-k-1 & \text{SSE} & \text{SSE}/(n-k-1) & \\ \text{Total} & n-1 & \text{SST} & & \\ \hline \end{array} \]
We run this test for the houseprice data:
A=as.matrix(houseprice)
n=nrow(A);k=ncol(A)-1
c(n, k, n-k-1)
## [1] 28 4 23
y=A[, 1, drop=FALSE]
Xc=A[, -1]
xbar=apply(Xc, 2, mean)
for(j in 1:4) Xc[ ,j]=Xc[ ,j]-xbar[j]
beta1hat= (solve(t(Xc)%*%Xc)%*%t(Xc))%*%y
round(c(beta1hat), 4)
## [1] 0.0857 -26.4931 -9.2862 37.3807
\[SSR=\pmb{\hat{\beta}}_1'\pmb{X}_c'\pmb{y}\]
ssr=t(beta1hat)%*%t(Xc)%*%y
ssr/c(1, k)
## [1] 33670.654 8417.663
\[SSE=\sum(y_i-\bar{y})^2-SSR\]
sst=sum((y-mean(y))^2)
sst
## [1] 37992.52
sse=sst-ssr
sse/c(1, n-k-1)
## [1] 4321.8636 187.9071
FTS = (ssr/k)/(sse/(n-k-1))
FTS
## Price
## Price 44.79694
\[ \begin{array}{ccccc} \text{Source} & \text{df} & \text{SS} & \text{MS} & \text{F}\\ \hline \text{Due to }\beta_1 & 4 & 33670 & 8418 & 44.8\\ \text{Error} & 23 & 4322 & 187.9 & \\ \text{Total} & 27 & 37992 & & \\ \hline \end{array} \]
If we test at the 5% level the critical value is
qf(0.95, k, n-k-1)
## [1] 2.795539
\(F=44.8>2.8\), and so we reject the null hypothesis; at least some of the variables are useful for predicting the house prices.
The p value of the test is
1-pf(FTS, k, n-k-1)
## Price
## Price 1.55808e-10
Here is the printout of the lm command:
summary(lm(Price~., data=houseprice))
##
## Call:
## lm(formula = Price ~ ., data = houseprice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.018 -5.943 1.860 5.947 30.955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67.61984 17.70818 -3.819 0.000882
## Sqfeet 0.08571 0.01076 7.966 4.62e-08
## Floors -26.49306 9.48952 -2.792 0.010363
## Bedrooms -9.28622 6.82985 -1.360 0.187121
## Baths 37.38067 12.26436 3.048 0.005709
##
## Residual standard error: 13.71 on 23 degrees of freedom
## Multiple R-squared: 0.8862, Adjusted R-squared: 0.8665
## F-statistic: 44.8 on 4 and 23 DF, p-value: 1.558e-10
and we find the F statistic, degrees of freedom and p value in the last line.
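If we want these numbers programmatically, summary.lm stores them in its fstatistic component (the F value and the two degrees of freedom), and the p value can then be recomputed as before:
fit=lm(Price~., data=houseprice)
fs=summary(fit)$fstatistic
fs
1-pf(fs[1], fs[2], fs[3])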
Say we have split \(\pmb{\beta} = (\pmb{\beta}_1\text{ }\pmb{\beta}_2)'\) and we wish to test \(H_0:\pmb{\beta}_2=0\). Then we can partition \(\pmb{X}\) accordingly, so the model becomes
\[\pmb{y}=\pmb{X_1\beta_1}+\pmb{X_2\beta_2}+\pmb{\epsilon}\]
where the intercept \(\beta_0\) is included in \(\pmb{\beta}_1\).
We define matrices
\[\pmb{H}=\pmb{X}(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\] \[\pmb{H}_1=\pmb{X}_1(\pmb{X}_1'\pmb{X}_1)^{-1}\pmb{X}_1'\]
so that \(\pmb{Hy}\) is the vector of fitted values under the full model and \(\pmb{H_1y}\) is the vector of fitted values under the reduced model that holds when the null hypothesis is true.
\(\pmb{H}-\pmb{H}_1\) is idempotent with rank h, where h is the number of elements in \(\pmb{\beta}_2\)
proof omitted
If \(\pmb{y}\sim N_n(\pmb{X\beta}, \sigma^2\pmb{I})\) then
\(\pmb{y'(I-H)y}/\sigma^2\sim \chi^2(n-k-1)\)
\(\pmb{y'(H-H_1)y}/\sigma^2\sim \chi^2(h, \lambda_1)\), where
\[\lambda_1=\pmb{\beta}_2'[\pmb{X}_2'\pmb{X}_2-\pmb{X}_2'\pmb{X}_1(\pmb{X}_1'\pmb{X}_1)^{-1}\pmb{X}_1'\pmb{X}_2]\pmb{\beta}_2/(2\sigma^2)\]
proof omitted
If \(\pmb{y}\sim N_n(\pmb{X\beta}, \sigma^2\pmb{I})\) and we define
\[F=\frac{\pmb{y'(H-H_1)y}/h}{\pmb{y'(I-H)y}/(n-k-1)}\]
then
if \(H_0:\pmb{\beta_2}=0\) is false \(F\sim F(h, n-k-1,\lambda_1)\)
if \(H_0:\pmb{\beta_2}=0\) is true \(F\sim F(h, n-k-1)\)
proof same as above
The test therefore rejects the null hypothesis if \(F\ge F_{\alpha, h, n-k-1}\).
Let’s check this on some simulated data:
n=100
x1=runif(n)
x2=runif(n)
x3=runif(n)
x4=runif(n)
y=x1+x2+rnorm(n,0,0.1)
round(c(cor(y, x1), cor(y, x2), cor(y, x3), cor(y, x4)), 3)
## [1] 0.634 0.612 -0.093 0.094
X=cbind(1,x1,x2,x3,x4)
X1=cbind(1,x1,x2)
k=ncol(X)-1
h=ncol(X1)-1
H=X%*%solve(t(X)%*%X)%*%t(X)
H1=X1%*%solve(t(X1)%*%X1)%*%t(X1)
num=t(y)%*%(H-H1)%*%y/h
denom=t(y)%*%(diag(nrow(H))-H)%*%y/(n-k-1)
FTS=num/denom
c(FTS, qf(0.95, h, n-k-1), 1-pf(FTS, h, n-k-1))
## [1] 0.8161282 3.0922174 0.4452157
X=cbind(1,x1,x3,x2,x4)
X1=cbind(1,x1,x3)
k=ncol(X)-1
h=ncol(X1)-1
H=X%*%solve(t(X)%*%X)%*%t(X)
H1=X1%*%solve(t(X1)%*%X1)%*%t(X1)
num=t(y)%*%(H-H1)%*%y/h
denom=t(y)%*%(diag(nrow(H))-H)%*%y/(n-k-1)
FTS=num/denom
c(FTS, qf(0.95, h, n-k-1), 1-pf(FTS, h, n-k-1))
## [1] 248.023486 3.092217 0.000000
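The same partial F test can also be carried out by fitting the reduced and the full model with lm and comparing them with anova; for this second test, as a quick check, the F value should reproduce the 248.02 computed by hand:
reduced=lm(y~x1+x3)
full=lm(y~x1+x2+x3+x4)
anova(reduced, full)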
Let’s consider the houseprice data again. First note the correlations of the predictors with Price:
round(cor(houseprice)[1, -1], 3)
## Sqfeet Floors Bedrooms Baths
## 0.915 0.291 0.605 0.653
The variable with the smallest correlation with Price is Floors, so one might want to test whether Floors is a useful predictor variable. Let’s see:
A=as.matrix(houseprice)
n=nrow(A)
k=ncol(A)-1
X=cbind(1, A[, -1])
y=A[, 1, drop=FALSE]
X1=X[, -3]
H=X%*%solve(t(X)%*%X)%*%t(X)
H1=X1%*%solve(t(X1)%*%X1)%*%t(X1)
h=1
num=t(y)%*%(H-H1)%*%y/h
denom=t(y)%*%(diag(nrow(H))-H)%*%y/(n-k-1)
FTS=num/denom
c(FTS, qf(0.95, h, n-k-1))
## [1] 7.794275 4.279344
\(7.79>4.28\), so we reject the null hypothesis; Floors is a (borderline) useful predictor.
The p-value is
1-pf(FTS, 1, n-k-1)
## Price
## Price 0.01036282
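Note that this agrees with the Floors row of the summary(lm()) output above: \((-2.792)^2\approx 7.79\), and the p values match. As a quick check, the same test can also be run by comparing the two nested models directly:
fit.full=lm(Price~., data=houseprice)
fit.red=lm(Price~.-Floors, data=houseprice)
anova(fit.red, fit.full)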
Comment: using the correlations with the response to decide whether a variable is useful in a multiple regression problem turns out to be a bad idea. We will return to this important topic soon.
We have the results from an experiment designed to determine how much the speed of a washing machine affects the wear on a new fabric. The machine was run at 5 different speeds (measured in rpm), with six pieces of fabric at each speed.
ggplot(data=fabricwear, aes(Speed, Wear)) +
geom_point()
The scatterplot makes it clear that a linear model is not going to work, so we will try a polynomial model. Because the predictor variable takes large values we first rescale it:
n=length(fabricwear$Speed)
x=(fabricwear$Speed-110)/(190-110)
X=cbind(1, x, x^2)
y=cbind(fabricwear$Wear)
k=ncol(X)
X1=X[, -k]
H=X%*%solve(t(X)%*%X)%*%t(X)
H1=X1%*%solve(t(X1)%*%X1)%*%t(X1)
num=t(y)%*%(H-H1)%*%y
denom=t(y)%*%(diag(nrow(H))-H)%*%y/(n-k-1)
round(c(num/denom, qf(0.95, 1, n-k-1)), 2)
## [1] 78.91 4.23
\(78.91\ge 4.23\), and so we reject the null hypothesis; the quadratic term improves the fit significantly.
How about a cubic term?
X=cbind(1, x, x^2,x^3)
k=ncol(X)
X1=X[, -k]
H=X%*%solve(t(X)%*%X)%*%t(X)
H1=X1%*%solve(t(X1)%*%X1)%*%t(X1)
num=t(y)%*%(H-H1)%*%y
denom=t(y)%*%(diag(nrow(H))-H)%*%y/(n-k-1)
round(c(num/denom, qf(0.95, 1, n-k-1)), 2)
## [1] 0.78 4.24
\(0.78<4.24\), and so a cubic term is not needed.
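We could also have let lm do the work, fitting the quadratic and the cubic model and comparing them with anova; as a quick check, the F value should agree with the 0.78 above:
quad=lm(fabricwear$Wear~x+I(x^2))
cubic=lm(fabricwear$Wear~x+I(x^2)+I(x^3))
anova(quad, cubic)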
Recall the definition of \(R^2\):
\[R^2=\frac{\pmb{\hat{\beta}'X'y}-n\bar{y}^2}{\pmb{y'y}-n\bar{y}^2}\]
The F statistics derived above can be written in terms of \(R^2\). For the test of \(H_0:\pmb{\beta}_1=\pmb{0}\) we have
\[F=\frac{R^2/k}{(1-R^2)/(n-k-1)}\]
and for the test of \(H_0:\pmb{\beta}_2=\pmb{0}\)
\[F=\frac{(R^2-R_r^2)/h}{(1-R^2)/(n-k-1)}\]
where \(R_r^2\) is the \(R^2\) of the reduced model.
proof direct calculation
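For the houseprice data we can check the first formula with the numbers from the summary output (\(R^2=0.8862\), \(k=4\), \(n=28\)); it reproduces \(F\approx 44.8\) up to rounding:
R2=0.8862
n=28; k=4
(R2/k)/((1-R2)/(n-k-1))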
Let \(\pmb{C}\) be a \(q\times (k+1)\) matrix of constants. Then a hypothesis of the form
\[H_0: \pmb{C\beta}=\pmb{0}\]
is called a general linear hypothesis.
The test for the full model corresponds to \(\pmb{C}= \begin{pmatrix} \pmb{0} & \pmb{I}_k \end{pmatrix}\).
The test for a subset of predictors corresponds to \(\pmb{C}= \begin{pmatrix} \pmb{0} & \pmb{I}_h \end{pmatrix}\).
We can also use this to test many other hypotheses.
Say we have \(\pmb{\beta}=(\beta_0, \beta_1, ..., \beta_4)'\) and we want to test
\[H_0: \beta_2-2\beta_3=\beta_2+\beta_4=0\]
then we can write this as follows
\[ H_0: \begin{pmatrix} 0 & 0 & 1 & -2 & 0 \\ 0 & 0 & 1 & 0 & 1\\ \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \\ \end{pmatrix}=\pmb{0} \]
If \(\pmb{y}\sim N_n(\pmb{X\beta}, \sigma^2\pmb{I})\) and \(\pmb{C}\) is a \(q\times (k+1)\) matrix of constants, then
\(\pmb{C\hat{\beta}}\sim N_q\left( \pmb{C\beta}, \sigma^2\pmb{C(X'X)^{-1}C'} \right)\)
SSH/\(\sigma^2\) = \((\pmb{C\hat{\beta}})'[\pmb{C(X'X)^{-1}C'}]^{-1} \pmb{C\hat{\beta}}/\sigma^2\sim \chi^2(q,\lambda)\)
where \(\lambda=(\pmb{C\beta})'[\pmb{C(X'X)^{-1}C'}]^{-1}\pmb{C\beta}/(2\sigma^2)\)
SSE/\(\sigma^2\) = \(\pmb{y'[I-X(X'X)^{-1}X']y}/\sigma^2\sim \chi^2(n-k-1)\)
SSH and SSE are independent
proof omitted
Say \(\pmb{y}\sim N_n(\pmb{X\beta}, \sigma^2\pmb{I})\) and let \(\pmb{C}\) be a \(q\times (k+1)\) matrix of constants.
Let
\[F=\frac{\text{SSH}/q}{\text{SSE}/(n-k-1)}\]
If \(H_0: \pmb{C\beta}=0\) is false, then \(F\sim F(q, n-k-1, \lambda )\), where \(\lambda=(\pmb{C\beta)})'[\pmb{C(X'X)^{-1}C'}]^{-1}\pmb{C\beta}/(2\sigma^2)\)
If \(H_0: \pmb{C\beta}=0\) is true, then \(F\sim F(q, n-k-1)\),
proof same as theorems above
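As an illustration, here is a sketch of this test for the houseprice data, using the matrix \(\pmb{C}\) from the example above; the hypothesis itself is chosen purely for illustration and has no particular practical meaning here. The code simply follows the SSH formula of the theorem:
A=as.matrix(houseprice)
y=A[, 1, drop=FALSE]
X=cbind(1, A[, -1])
n=nrow(X); k=ncol(X)-1
C=rbind(c(0, 0, 1, -2, 0), c(0, 0, 1, 0, 1))
q=nrow(C)
G=solve(t(X)%*%X)                 # (X'X)^{-1}
betahat=G%*%t(X)%*%y
sse=sum((y-X%*%betahat)^2)
ssh=t(C%*%betahat)%*%solve(C%*%G%*%t(C))%*%(C%*%betahat)
FTS=(ssh/q)/(sse/(n-k-1))
c(FTS, qf(0.95, q, n-k-1), 1-pf(FTS, q, n-k-1))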
If we want to test a single condition, for example \(H_0: \beta_1=\beta_2\), we can use the test above but the calculations simplify. In this case \(\pmb{C}\) is a matrix of one row, so we can write the null hypothesis as \(H_0: \pmb{a'\beta}=0\), and the test statistic becomes
\[F=\frac{(\pmb{a'\hat{\beta}})^2}{s^2\pmb{a'(X'X)^{-1}a}}\]
where \(s^2=\text{SSE}/(n-k-1)\). Under the null hypothesis \(F\) has an F distribution with 1 and \(n-k-1\) degrees of freedom.
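For a single constraint the computation becomes one line. Here is a sketch, reusing betahat, G, sse, n and k from the chunk above and testing the (again purely illustrative) hypothesis \(H_0:\beta_1=\beta_2\), that is \(\pmb{a}=(0,1,-1,0,0)'\):
a=cbind(c(0, 1, -1, 0, 0))
FTS=c(t(a)%*%betahat)^2/c(sse/(n-k-1)*t(a)%*%G%*%a)
c(FTS, qf(0.95, 1, n-k-1), 1-pf(FTS, 1, n-k-1))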
Also consider the test for a single variable, that is \(H_0:\beta_j=0\). This can be done with the above and \(\pmb{a}=(0,...,1,...,0)'\), with the 1 in the position corresponding to \(\beta_j\), which leads to
\[F=\frac{\hat{\beta}_j^2}{s^2g_{jj}}\]
where \(g_{jj}\) is the jth diagonal element of \(\pmb{(X'X)}^{-1}\).
Since the square of a random variable with a t distribution with n degrees of freedom has an F distribution with 1 and n degrees of freedom, we also have the following test: Let
\[T=\frac{\hat{\beta}_j}{s\sqrt{g_{jj}}}\]
and reject the null hypothesis if \(|T|\ge t(\alpha/2,n-k-1)\).
We can now make our own summary(lm(..)) routine:
mysummarylm=function(formula, data) {
cl <- match.call()
cat("Call:\n")
print(cl)
mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data", "subset",
"weights", "na.action", "offset"),
names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())
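# the first column of the model frame is the response, the remaining columns are the predictors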
y=mf[, 1]
X=mf[, -1]
if(is.data.frame(X)) X=as.matrix(X)
lbls=c("(Intercept)", colnames(X))
n=length(y)
k=ncol(X)
X=cbind(1, X)
y=cbind(y)
ybar=mean(y)
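# G = (X'X)^{-1}; its diagonal entries g_jj are needed for the coefficient standard errors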
G=solve(t(X)%*%X)
betahat= (solve(t(X)%*%X)%*%t(X))%*%y
yhat=X%*%betahat
residuals=y-yhat
sse = sum((y-yhat)^2)
ssr=sum( (yhat-ybar)^2 )
qres=c(round(quantile(residuals, c(0, 0.25, 0.5, 0.75, 1)), 3))
names(qres)=c("Min", "1Q", "Median", "3Q", "Max")
cat("\nResiduals:\n")
print(qres)
out=as.data.frame(matrix(0, k+1, 4))
colnames(out)=c("Estimate", " Std.Error", "t value", "p-value")
rownames(out)=lbls
for(i in 1:(k+1)) {
TS=betahat[i]/sqrt(sse/(n-k-1)*G[i, i])
out[i, ]=c(round(betahat[i], 5),
round(sqrt(sse/(n-k-1)*G[i, i]), 5),
round(TS, 3),
round(2*(1-pt(abs(TS), n-k-1)), 6)
)
}
cat("\nCoefficients:\n")
print(out)
cat("\nResidual standard error: ", round(sqrt(sse/(n-k-1)), 2), " on ", n-k-1, " degrees of freedom\n")
R2=cor(y,yhat)^2
R2adj=1-((n-1)*(1-R2)/(n-k-1))
cat("Multiple R-squared : ", round(R2, 4), ", Adjusted R-squared: ", round(R2adj, 4), "\n")
FTS= (ssr/k)/(sse/(n-k-1))
cat("F-statistic: ", round(FTS, 1), " on ", k, " and ", n-k-1, " DF, p-value: ", 1-pf(FTS, k, n-k-1),"\n")
}
summary(lm(Price~., data=houseprice))
##
## Call:
## lm(formula = Price ~ ., data = houseprice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.018 -5.943 1.860 5.947 30.955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67.61984 17.70818 -3.819 0.000882
## Sqfeet 0.08571 0.01076 7.966 4.62e-08
## Floors -26.49306 9.48952 -2.792 0.010363
## Bedrooms -9.28622 6.82985 -1.360 0.187121
## Baths 37.38067 12.26436 3.048 0.005709
##
## Residual standard error: 13.71 on 23 degrees of freedom
## Multiple R-squared: 0.8862, Adjusted R-squared: 0.8665
## F-statistic: 44.8 on 4 and 23 DF, p-value: 1.558e-10
mysummarylm(Price~., data=houseprice)
## Call:
## mysummarylm(formula = Price ~ ., data = houseprice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.018 -5.943 1.860 5.947 30.955
##
## Coefficients:
## Estimate Std.Error t value p-value
## (Intercept) -67.61984 17.70818 -3.819 0.000882
## Sqfeet 0.08571 0.01076 7.966 0.000000
## Floors -26.49306 9.48952 -2.792 0.010363
## Bedrooms -9.28622 6.82985 -1.360 0.187121
## Baths 37.38067 12.26436 3.048 0.005709
##
## Residual standard error: 13.71 on 23 degrees of freedom
## Multiple R-squared : 0.8862 , Adjusted R-squared: 0.8665
## F-statistic: 44.8 on 4 and 23 DF, p-value: 1.55808e-10
The one item in the printout we have not discussed, and will not derive, is the adjusted \(R^2\):
\[R^2_{adj}=1-\left[\frac{(n-1)(1-R^2)}{n-k-1}\right]\]
The idea here is to penalize a model for the number of predictors. As k increases \(R^2_{adj}\) (might) go down, whereas \(R^2\) never does. In this way one could use \(R^2_{adj}\) for model selection. However, there are better ways to do this, as we will see shortly.
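For the houseprice fit the formula reproduces, up to rounding, the adjusted R-squared of 0.8665 printed by summary:
R2=0.8862; n=28; k=4
1-(n-1)*(1-R2)/(n-k-1)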