In this section we will use a purely geometric argument to derive the least squares estimators.
To start we will view the parameters \(\pmb{\beta}=\begin{pmatrix} \beta_0&\beta_1&...&\beta_k \end{pmatrix}'\) as a point in (k+1)-dimensional space, called the parameter space. The vector \(\pmb{y}\) can likewise be viewed as a point in n-dimensional space, called the data space. The matrix \(\pmb{X}\) can be written as a partitioned matrix with k+1 columns as
\[\pmb{X} = \begin{pmatrix} \pmb{j} & \pmb{x}_1 & ... & \pmb{x}_k \end{pmatrix}\]
All the columns of this matrix are n-dimensional vectors and so are again points in data space. Because \(\pmb{X}\) is assumed to be of rank k+1, these vectors are linearly independent. The set of all linear combinations of these vectors forms a subspace of data space. The elements of this subspace are of the form
\[\pmb{Xb} = b_0\pmb{j}+b_1 \pmb{x_1}+..+b_k\pmb{x_k}\]
where \(\pmb{b}\) is a (k+1)-vector of scalars, that is, any vector in parameter space. It can be shown that this is indeed a subspace, that is, it is closed under addition and scalar multiplication. It is said to be the subspace spanned by the columns of \(\pmb{X}\), and it is called the prediction space.
The columns of \(\pmb{X}\) form a basis of the prediction space.
Under the multiple regression model \(\pmb{y}=\pmb{X\beta}+\pmb{\epsilon}\) we see that \(\pmb{y}\) is the sum of a vector in prediction space, \(E[\pmb{y}]=\pmb{X\beta}\), and a vector of random errors \(\pmb{\epsilon}\). Here neither \(\pmb{\beta}\) nor \(\pmb{\epsilon}\) is known.
In this setup multiple regression is the problem of finding a reasonable estimate of \(E[\pmb{y}]\) in prediction space and then finding the corresponding vector in parameter space.
What do we mean by “reasonable”? An obvious answer is in terms of distance, namely the point in prediction space closest to \(\pmb{y}\). Of course one needs to say how distance is defined, and in general there are many choices; here we will use the usual Euclidean distance. Then from geometry we find that the residual vector \(\pmb{\hat{\epsilon}}=\pmb{y}-\pmb{\hat{y}}\) has to be perpendicular to the prediction space.
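To see why, suppose \(\pmb{\hat{\epsilon}}\) is perpendicular to the prediction space and let \(\pmb{Xb}\) be any other point in that space. Since \(\pmb{\hat{y}}-\pmb{Xb}\) also lies in the prediction space, the Pythagorean theorem gives
\[\|\pmb{y}-\pmb{Xb}\|^2 = \|\pmb{y}-\pmb{\hat{y}}\|^2+\|\pmb{\hat{y}}-\pmb{Xb}\|^2 \ge \|\pmb{y}-\pmb{\hat{y}}\|^2\]
so \(\pmb{\hat{y}}\) is indeed the closest point.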
Because the prediction space is spanned by the columns of \(\pmb{X}\), the point \(\pmb{\hat{y}}\) must be such that \(\pmb{\hat{\epsilon}}\) is orthogonal to the columns of \(\pmb{X}\). From (4.1.12) we know that this means
\[\pmb{X'\hat{\epsilon}}=\pmb{0}\]
so
\[ \begin{aligned} \pmb{0}=\pmb{X}'\pmb{\hat{\epsilon}} &= \pmb{X}'(\pmb{y}-\pmb{\hat{y}} ) \\ &= \pmb{X}'(\pmb{y}-\pmb{X\hat{\beta}} ) \\ &= \pmb{X}'\pmb{y}-\pmb{X}'\pmb{X\hat{\beta}} \end{aligned} \]
and so again we arrive at the normal equations \(\pmb{X}'\pmb{X\hat{\beta}}=\pmb{X}'\pmb{y}\)!
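As a quick numerical check of this orthogonality (a minimal sketch, assuming the houseprice data used throughout these notes is loaded):
A=as.matrix(houseprice)
y=A[, 1, drop=FALSE]
X=cbind(1, A[, -1])
betahat=solve(t(X)%*%X)%*%t(X)%*%y
epsilonhat=y-X%*%betahat       # residual vector
round(t(X)%*%epsilonhat, 5)    # should be (numerically) the zero vector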
We can rewrite the regression model as follows:
\[ \begin{aligned} y_i &= \beta_0+\sum_{j=1}^k \beta_j x_{ij} +\epsilon_{i} \\ &= \alpha+\sum_{j=1}^k \beta_j (x_{ij}-\bar{x}_{j})+\epsilon_{i} \end{aligned} \]
where
\[\alpha=\beta_0+\sum_{j=1}^k \beta_ j\bar{x}_{j}\]
and \(\bar{x}_j=\frac1n \sum_{i=1}^n x_{ij}\) are the sample means of the predictor variables. This is called the model in centered form.
In matrix notation we have
\[\pmb{y}= \begin{pmatrix} \pmb{j} & \pmb{X}_c \end{pmatrix} \begin{pmatrix} \alpha \\ \pmb{\beta}_1 \end{pmatrix} +\pmb{\epsilon}\]
where \(\pmb{\beta}_1 = \begin{pmatrix} \beta_1 & ... & \beta_k \end{pmatrix}'\), \[ \pmb{X}_c = \left(\pmb{I}-\frac1n \pmb{J} \right)\pmb{X}_1= \begin{pmatrix} x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & ... &x_{1k} - \bar{x}_k \\ x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & ... &x_{2k} - \bar{x}_k \\ \vdots & \vdots & & \vdots \\ x_{n1} - \bar{x}_1 & x_{n2} - \bar{x}_2 & ... &x_{nk} - \bar{x}_k \\ \end{pmatrix} \]
and \(\pmb{X}_1\) is \(\pmb{X}\) without the column of 1’s.
In this form the normal equations become
\[\begin{pmatrix} \pmb{j} & \pmb{X}_c \end{pmatrix}'\begin{pmatrix} \pmb{j} & \pmb{X}_c \end{pmatrix} \begin{pmatrix} \hat{\alpha} \\ \pmb{\hat{\beta}}_1 \end{pmatrix}= \begin{pmatrix} \pmb{j} & \pmb{X}_c \end{pmatrix}'\pmb{y}\]
but
\[\begin{pmatrix} \pmb{j} & \pmb{X}_c \end{pmatrix}'\begin{pmatrix} \pmb{j} & \pmb{X}_c \end{pmatrix}=\begin{pmatrix} n & \pmb{0}' \\ \pmb{0} & \pmb{X}'_c\pmb{X}_c \end{pmatrix}\]
and
\[\begin{pmatrix} \pmb{j} & \pmb{X}_c \end{pmatrix}'\pmb{y} = \begin{pmatrix} n\bar{y} \\ \pmb{X}_c'\pmb{y} \end{pmatrix}\]
and so the least squares estimates are given by
\[ \begin{aligned} &\hat{\alpha} =\bar{y} \\ &\pmb{\hat{\beta}}_1= (\pmb{X}_c'\pmb{X}_c)^{-1}\pmb{X_c}'\pmb{y} \end{aligned} \]
Notice that the slope estimators \(\pmb{\hat{\beta}}_1\) are the same as in the original model, and that
\[\hat{\beta}_0 = \hat{\alpha}- \sum \hat{\beta}_j \bar{x}_j\]
For the houseprice data we find
Standard form:
A=as.matrix(houseprice)
n=nrow(A)
y=A[, 1, drop=FALSE]
X=cbind(1, A[, -1])
betahat= (solve(t(X)%*%X)%*%t(X))%*%y
round(c(betahat), 4)
## [1] -67.6198 0.0857 -26.4931 -9.2862 37.3807
Centered form:
A=as.matrix(houseprice)
n=nrow(A)
y=A[, 1, drop=FALSE]
Xc=A[, -1]
xbar=apply(Xc, 2, mean)
for(j in 1:4) Xc[ ,j]=Xc[ ,j]-xbar[j]
beta1hat= (solve(t(Xc)%*%Xc)%*%t(Xc))%*%y
round(c(beta1hat), 4)
## [1] 0.0857 -26.4931 -9.2862 37.3807
round(mean(y)-sum(beta1hat*xbar), 4)
## [1] -67.6198
Let \(s_{xx}\) be the matrix of sample variances and covariances of the predictors in the centered form, that is
\[(s_{xx})_{ij}= \frac1{n-1}\sum_{k=1}^n (x_{ki}-\bar{x}_i)(x_{kj}-\bar{x}_j)\]
and \(s_{yx}\) be the vector with
\[(s_{yx})_{i}= \frac1{n-1}\sum_{k=1}^n (x_{ki}-\bar{x}_i)(y_{k}-\bar{y})\]
then we find \(s_{xx}=\pmb{X}_c'\pmb{X}_c/(n-1)\) and \(s_{yx}=\pmb{X}_c'\pmb{y}/(n-1)\), and using this we can write
\[\pmb{\hat{\beta}}_1=s_{xx}^{-1}s_{yx}\]
\[\hat{\beta}_0=\bar{y} - s_{yx}'s_{xx}^{-1}\pmb{\bar{x}}\]
where \(\pmb{\bar{x}}\) is the vector of sample means of the predictors.
For the houseprice data we find
A=as.matrix(houseprice)
n=nrow(A)
y=A[, 1, drop=FALSE]
Xc=A[, -1]
xbar=apply(Xc, 2, mean)
for(j in 1:4) Xc[ ,j]=Xc[ ,j]-xbar[j]
Sxx=t(Xc)%*%Xc/(n-1)
syx=t(Xc)%*%y/(n-1)
beta1hat=solve(Sxx)%*%syx
beta0hat=mean(y)-t(syx)%*%solve(Sxx)%*%cbind(xbar)
round(c(beta0hat, beta1hat), 4)
## [1] -67.6198 0.0857 -26.4931 -9.2862 37.3807
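In fact \(s_{xx}\) is just the sample covariance matrix of the predictors and \(s_{yx}\) the vector of their sample covariances with \(\pmb{y}\), so the same estimates can also be computed with R's cov routine (a short alternative sketch):
A=as.matrix(houseprice)
Sxx=cov(A[, -1])                 # sample covariance matrix of the predictors
syx=cov(A[, -1], A[, 1])         # covariances of the predictors with y
beta1hat=solve(Sxx)%*%syx
beta0hat=mean(A[, 1])-t(syx)%*%solve(Sxx)%*%cbind(colMeans(A[, -1]))
round(c(beta0hat, beta1hat), 4)  # same estimates as above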
Say we have a model of the form \(\pmb{y=X_1\beta_1+X_2\beta_2+\epsilon}\) and a reduced model \(\pmb{y=X_1\beta_1^*+\epsilon^*}\). Let \(\pmb{\hat{\beta}_1}\) be the estimator of \(\pmb{\beta_1}\) in the full model and \(\pmb{\hat{\beta}^*_1}\) the estimator of \(\pmb{\beta^*_1}\) in the reduced model. Then it is generally not true that \(\pmb{\hat{\beta}_1}=\pmb{\hat{\beta}^*_1}\).
Let’s use Sqfeet and Floors for \(\pmb{X}_1\) and Bedrooms and Baths for \(\pmb{X}_2\):
colnames(houseprice)
## [1] "Price" "Sqfeet" "Floors" "Bedrooms" "Baths"
fit=lm(Price~., data=houseprice)
fit1=lm(Price~Sqfeet+Floors, data=houseprice)
coef(fit)
## (Intercept) Sqfeet Floors Bedrooms Baths
## -67.61983705 0.08570823 -26.49305703 -9.28622097 37.38067201
coef(fit1)
## (Intercept) Sqfeet Floors
## -60.876065 0.091688 -4.149169
and we see that the estimators for Sqfeet and Floors differ.
However:
If \(\pmb{X_1'X_2=O}\), then \(\pmb{\hat{\beta}_1}=\pmb{\hat{\beta}^*_1}\).
proof
We have \(\pmb{\hat{\beta}^*_1=(X_1'X_1)^{-1}X_1'y}\). For \(\pmb{\hat{\beta}_1}\) we partition the normal equations of the full model, writing \(\pmb{X}=\begin{pmatrix} \pmb{X}_1 & \pmb{X}_2 \end{pmatrix}\) and using \(\pmb{X_1'X_2=O}\):
\[ \begin{pmatrix} \pmb{\hat{\beta}_1} \\ \pmb{\hat{\beta}_2} \end{pmatrix}= \begin{pmatrix} \pmb{X_1'X_1} & \pmb{X_1'X_2} \\ \pmb{X_2'X_1} & \pmb{X_2'X_2} \end{pmatrix}^{-1} \begin{pmatrix} \pmb{X_1'y} \\ \pmb{X_2'y} \end{pmatrix}=\\ \begin{pmatrix} \pmb{X_1'X_1} & \pmb{O} \\ \pmb{O} & \pmb{X_2'X_2} \end{pmatrix}^{-1} \begin{pmatrix} \pmb{X_1'y} \\ \pmb{X_2'y} \end{pmatrix}=\\ \begin{pmatrix} (\pmb{X_1'X_1})^{-1} & \pmb{O} \\ \pmb{O} & (\pmb{X_2'X_2})^{-1} \end{pmatrix} \begin{pmatrix} \pmb{X_1'y} \\ \pmb{X_2'y} \end{pmatrix}=\\ \begin{pmatrix} (\pmb{X_1'X_1})^{-1}\pmb{X_1'y} \\ (\pmb{X_2'X_2})^{-1}\pmb{X_2'y} \end{pmatrix} \]
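Here is a small simulated check of this result (a sketch with artificial data; x2 is constructed to be orthogonal to both the intercept column and x1):
set.seed(111)
x1=rnorm(50)
x2=resid(lm(rnorm(50)~x1))   # orthogonal to the columns of X1=(j x1)
y=1+2*x1+3*x2+rnorm(50)
coef(lm(y~x1))               # reduced model
coef(lm(y~x1+x2))            # full model: intercept and x1 coefficient unchanged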
We can use this theorem to obtain estimators of \(\pmb{\beta_2}\) as follows:
1. Regress \(\pmb{y}\) on \(\pmb{X_1}\) and calculate the residuals \(\pmb{y-\hat{y}(X_1)}\), where \(\pmb{\hat{y}(X_1)}=\pmb{X_1\hat{\beta}^*_1}=\pmb{X_1(X_1'X_1)^{-1}X_1'y}\).
2. Regress the columns of \(\pmb{X_2}\) on \(\pmb{X_1}\) and obtain the residuals \(\pmb{X_{2.1}=X_2-\hat{X}_2(X_1)}\). If \(\pmb{X}_2\) is written in terms of its columns as \(\pmb{X}_2 =\begin{pmatrix} \pmb{x}_{21} & .. & \pmb{x}_{2j} & ..& \pmb{x}_{2p}\end{pmatrix}\), then the regression coefficient vector for \(\pmb{x}_{2j}\) is \(\pmb{b}_{j}=\pmb{(X_1'X_1)^{-1}X_1'x_{2j}}\), and so
\[\pmb{\hat{x}}_{2j}=\pmb{X_1(X_1'X_1)^{-1}X_1'x_{2j}}\] Taking all columns of \(\pmb{X}_2\) together we get
\[\pmb{\hat{X}_2(X_1)}=\pmb{X_1(X_1'X_1)^{-1}X_1'X_2}=\pmb{X_1A}\] where \(\pmb{A}=\pmb{(X_1'X_1)^{-1}X_1'X_2}\) is called the alias matrix. Note that \(\pmb{X_{2.1}=X_2-\hat{X}_2(X_1)}\) is orthogonal to \(\pmb{X}_1\):
\[\pmb{X_1'X_{2.1}=X_1'X_2}-\pmb{X_1'X_1(X_1'X_1)^{-1}X_1'X_2}=\pmb{O}\]
3. Regress \(\pmb{y-\hat{y}(X_1)}\) on \(\pmb{X_{2.1}}\). Since \(\pmb{X_{2.1}}\) is orthogonal to \(\pmb{X}_1\), we obtain the same \(\pmb{\hat{\beta}}_2\) as in the full model.
A=as.matrix(houseprice)
y=A[, 1, drop=FALSE]
X1=cbind(1, A[, 2:3])        # intercept, Sqfeet, Floors
X2=A[, 4:5, drop=FALSE]      # Bedrooms, Baths
# step 1: residuals from regressing y on X1
yhatX1 = X1%*%solve(t(X1)%*%X1)%*%t(X1)%*%y
y1=y-yhatX1
# step 2: residuals from regressing the columns of X2 on X1
X2hat.X1=X1%*%solve(t(X1)%*%X1)%*%t(X1)%*%X2
X2.1 = X2-X2hat.X1
round(t(X1)%*%X2.1, 5)
## Bedrooms Baths
## 0 0
## Sqfeet 0 0
## Floors 0 0
round(c(solve(t(X2.1)%*%X2.1)%*%t(X2.1)%*%y1), 2)  # step 3: regress y1 on X2.1
## [1] -9.29 37.38
round(coef(lm(Price~., data=houseprice)), 2)
## (Intercept) Sqfeet Floors Bedrooms Baths
## -67.62 0.09 -26.49 -9.29 37.38