Multiple Regression

The Model

Definition (6.2.1)

We have a response vector \(\pmb{y}=\begin{pmatrix} y_1 & ... & y_n \end{pmatrix}'\), a vector of regression coefficients \(\pmb{\beta}=\begin{pmatrix}\beta_0&\beta_1&...&\beta_k \end{pmatrix}'\) and a predictor matrix

\[ \pmb{X} = \begin{pmatrix} 1 & x_{11} & ... & x_{1k} \\ \vdots & \vdots & & \vdots\\ 1 & x_{n1} & ... & x_{nk}\\ \end{pmatrix} \]

We also have a vector of errors \(\pmb{\epsilon}=\begin{pmatrix} \epsilon_1 & ... &\epsilon_n \end{pmatrix}'\). Then a model of the form

\[y_i = \beta_0+\beta_1x_{i1}+...+\beta_kx_{ik}+\epsilon_i\] \(i=1,...,n\).

or

\[\pmb{y}=\pmb{X\beta}+\pmb{\epsilon}\]

is called a multiple regression model. \(\pmb{X}\) is called the design matrix. As in the simple regression case we assume for now that \(\pmb{X}\) is fixed and not random.

Note that this includes models like

\[y_i = \beta_0+\beta_1x_{i}+\beta_2x_{i}^2+\epsilon_i\]

\[y_i = \beta_0+\beta_1\log(x_{i})+\epsilon_i\]

because these models are still linear in the coefficients \(\beta_j\).
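A quick illustration in R (the data here are simulated, so all numbers are purely illustrative): the quadratic model above is fit with R's linear model routine by simply adding \(x^2\) as an additional predictor column.

set.seed(111)
x=runif(50, 0, 10)
y=1 + 2*x - 0.3*x^2 + rnorm(50)      # simulated data from a quadratic mean function
coef(lm(y ~ x + I(x^2)))             # linear in the coefficients, so lm can fit it
# estimates should be close to (1, 2, -0.3)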

The assumptions are the same as in the simple regression model:

  1. \(E[\epsilon_i]=0\) (model is correct)
  2. \(var(\epsilon_i)=\sigma^2\) (equal variance, homoscedasticity)
  3. \(cov(\epsilon_i, \epsilon_j)=0\) for \(i\ne j\) (uncorrelated errors)

Estimation of \(\pmb{\beta}\) and \(\sigma^2\)

Analogous to the simple regression case we can use the method of least squares and estimate the coefficients by minimizing

\[\sum_{i=1}^n (y_i - \beta_0-\beta_1x_{i1}-...-\beta_kx_{ik})^2\]

Again one could differentiate this expression with respect to each coefficient and solve the resulting system of equations; however, the matrix formulation gives a cleaner solution:

Theorem (6.2.2)

If \(\pmb{y}=\pmb{X\beta}+\pmb{\epsilon}\), where \(\pmb{X}\) is an \(n\times (k+1)\) matrix of rank \(k+1<n\), then the vector \(\pmb{\hat{\beta}}\) that minimizes the least squares criterion is given by

\[\pmb{\hat{\beta}} = \pmb{(X'X)}^{-1}\pmb{X'y}\]

proof

we have, for any \(\pmb{\beta}\),

\[ \begin{aligned} &(\pmb{y}-\pmb{X\beta})'(\pmb{y}-\pmb{X\beta}) = \\ &\pmb{y}'(\pmb{y}-\pmb{X\beta}) -(\pmb{X\beta})'(\pmb{y}-\pmb{X\beta}) = \\ &\pmb{y}'\pmb{y}-\pmb{y}'\pmb{X\beta} -(\pmb{X\beta})'\pmb{y}+\pmb{\beta'X'}\pmb{X\beta} = \\ &\pmb{y}'\pmb{y}-2\pmb{y}'\pmb{X\beta}+\pmb{\beta'X'}\pmb{X\beta} \\ \end{aligned} \] because \((\pmb{X\beta})'\pmb{y}=\pmb{y}'\pmb{X\beta}\) is a scalar.

By (4.3.20) and (4.3.21) we have

\[ \begin{aligned} &\partial (\pmb{y}-\pmb{X\beta})'(\pmb{y}-\pmb{X\beta}) /\partial \pmb{\beta} = \\ &\partial (\pmb{y}'\pmb{y}-2\pmb{y}'\pmb{X\beta}+\pmb{\beta'X'}\pmb{X\beta}) /\partial \pmb{\beta} = \\ &\partial (\pmb{y}'\pmb{y})/\partial \pmb{\beta}-\partial (2\pmb{y}'\pmb{X\beta})/\partial \pmb{\beta} +\partial (\pmb{\beta'X'}\pmb{X\beta}) /\partial \pmb{\beta} = \\ &\pmb{0}-2\pmb{X}'\pmb{y}+2\pmb{X'X\beta} \end{aligned} \]

Setting this equal to \(\pmb{0}\) shows that the minimizer \(\pmb{\hat{\beta}}\) satisfies

\[\pmb{X'X\hat{\beta}}=\pmb{X}'\pmb{y}\]

Since \(\pmb{X}\) has full rank, \(\pmb{X'X}\) is nonsingular and therefore has an inverse, and so we have the result.

Definition

\[\pmb{X'X\hat{\beta}}=\pmb{X}'\pmb{y}\]

are called the normal equations.

Example (6.2.3)

Prices of 28 residences located 30 miles south of a large metropolitan area.

kable.nice(houseprice, do.row.names = FALSE)
Price Sqfeet Floors Bedrooms Baths
69.0 1500.000 1 2 1.0
118.5 1880.952 1 2 2.0
104.0 1976.190 1 3 2.0
116.5 1880.952 1 3 2.0
121.5 1880.952 1 3 2.0
125.0 1976.190 1 3 2.0
128.0 2357.143 2 3 2.5
129.9 2166.667 1 3 1.7
133.0 2166.667 2 3 2.5
135.0 2166.667 2 3 2.5
137.5 2357.143 2 3 2.5
139.9 2166.667 1 3 2.0
143.9 2261.905 2 3 2.5
147.9 2547.619 2 3 2.5
154.9 2357.143 2 3 2.5
160.0 2738.095 2 3 2.0
169.0 2357.143 1 3 2.0
169.9 2642.857 1 3 2.0
125.0 2166.667 1 4 2.0
134.9 2166.667 1 4 2.0
139.9 2547.619 1 4 2.0
147.0 2642.857 1 4 2.0
159.0 2261.905 1 4 2.0
169.9 2547.619 2 4 3.0
178.9 2738.095 1 4 2.0
194.5 2833.333 2 4 3.0
219.9 2928.571 1 4 2.5
269.0 3309.524 2 4 3.0

Here we want to predict Price from the other four variables. First a look at the data:

par(mfrow=c(2, 2))
plot(houseprice$Sqfeet, houseprice$Price, pch=20)   # Price vs Sqfeet
boxplot(Price~Floors, data=houseprice)              # Price by number of floors
boxplot(Price~Bedrooms, data=houseprice)            # Price by number of bedrooms
boxplot(Price~Baths, data=houseprice)               # Price by number of baths

A=as.matrix(houseprice)
y=A[, 1, drop=FALSE]              # response: Price
X=cbind(1, A[, -1])               # design matrix: intercept column plus the four predictors
beta=solve(t(X)%*%X)%*%t(X)%*%y   # least squares estimate (6.2.2)
round(c(beta), 3)
## [1] -67.620   0.086 -26.493  -9.286  37.381
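As a check, the same coefficients are returned by R's built-in lm routine (a one-line sketch, assuming as above that the houseprice data frame is available):

coef(lm(Price ~ ., data=houseprice))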

We can write

\[ \pmb{X'X} = \begin{pmatrix} n & \sum_i x_{i1} & \sum_i x_{i2} & ... & \sum_i x_{ik} \\ \sum_i x_{i1} & \sum_i x_{i1}^2 & \sum_i x_{i1}x_{i2} & ... & \sum_i x_{i1}x_{ik}\\ \vdots & \vdots &\vdots & & \vdots \\ \sum_i x_{ik} & \sum_i x_{i1}x_{ik} & \sum_i x_{i2}x_{ik} & ... & \sum_i x_{ik}^2 \end{pmatrix} \]

and

\[ \pmb{X'y} = \begin{pmatrix} \sum_i y_i \\ \sum_i x_{i1}y_i \\ \vdots \\ \sum_i x_{ik}y_i \end{pmatrix} \]

Say \(\pmb{\hat{\beta}}=(\pmb{X'X})^{-1}\pmb{X'y}\), then

\[\pmb{\hat{y}}=\pmb{X\hat{\beta}}\] are called the fitted values and

\[\pmb{\hat{\epsilon}} = \pmb{y}-\pmb{X\hat{\beta}}=\pmb{y}-\pmb{\hat{y}}\]

are called the residuals.
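For the houseprice data the fitted values and residuals can be computed directly from the quantities found in example (6.2.3) (a short sketch, assuming X, y and beta from that example are still in the workspace):

yhat=X%*%beta                    # fitted values
ehat=y-yhat                      # residuals
round(head(cbind(y, yhat, ehat)), 1)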

Example (6.2.4)

Let’s study a simple regression problem as a special case of a multiple regression problem. Then we have

\[ \pmb{X} = \begin{pmatrix} 1 & x_1 \\ \vdots& \vdots \\ 1 & x_n \\ \end{pmatrix} \]

\[ \pmb{X'X} = \begin{pmatrix} n & \sum_i x_{i} \\ \sum_i x_{i} & \sum_i x_{i}^2 \end{pmatrix}\\ \pmb{X'y} = \begin{pmatrix} \sum_i y_i \\ \sum_i x_{i}y_i \\ \end{pmatrix}\\ (\pmb{X'X})^{-1} = \frac1{n\sum_i x_{i}^2-(\sum_i x_{i})^2} \begin{pmatrix} \sum_i x_{i}^2 & -\sum_i x_{i} \\ -\sum_i x_{i} & n \end{pmatrix}\\ \text{ }\\ (\pmb{X'X})^{-1}\pmb{X'y} = \\ \frac1{n\sum_i x_{i}^2-(\sum_i x_{i})^2} \begin{pmatrix} \sum_i x_{i}^2 & -\sum_i x_{i} \\ -\sum_i x_{i} & n \end{pmatrix} \begin{pmatrix} \sum_i y_i \\ \sum_i x_{i}y_i \\ \end{pmatrix}=\\ \frac1{n\sum_i x_{i}^2-(\sum_i x_{i})^2} \begin{pmatrix} (\sum_i x_{i}^2)(\sum_i y_i)-\sum_i x_{i}\sum_i x_{i}y_i \\ (-\sum_i x_{i})(\sum_i y_i)+n\sum_i x_{i}y_i \end{pmatrix} \]

which is the same as (6.1.2).
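Here is a quick numerical check with simulated data (the data are made up; only the agreement of the two computations matters):

set.seed(222)
x=runif(30); y=2+3*x+rnorm(30, 0, 0.5)
X=cbind(1, x)
round(c(solve(t(X)%*%X)%*%t(X)%*%y), 4)      # matrix formula
b1=cov(x, y)/var(x); b0=mean(y)-b1*mean(x)   # simple regression formulas
round(c(b0, b1), 4)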

Example (6.2.4a)

Another special case is the no-intercept model \(y_i=\beta x_i+\epsilon_i\), so

\[\pmb{X} = \begin{pmatrix} x_1 & ... &x_n \end{pmatrix}'\]

\[\pmb{X'X}=\sum x_i^2\]

and

\[\hat{\beta} = (\pmb{X'X})^{-1}\pmb{X'y}=\frac{\sum x_iy_i}{\sum x_i^2}\]
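A minimal check with simulated data (values made up): the formula above agrees with R's no-intercept fit, specified as y ~ x - 1.

set.seed(333)
x=runif(25, 1, 5); y=1.5*x+rnorm(25, 0, 0.3)
sum(x*y)/sum(x^2)          # beta hat by hand
coef(lm(y ~ x - 1))        # R's no-intercept fit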

Properties of Least Squares Estimators

We will assume that \(\pmb{X}\) is fixed and of full rank, unless stated otherwise.

Theorem (6.2.5)

If \(E[\pmb{y}]=\pmb{X\beta}\), then \(\pmb{\hat{\beta}}\) is an unbiased estimator of \(\pmb{\beta}\).

proof

\[ \begin{aligned} &E[\pmb{\hat{\beta}}] = E[\pmb{(X'X})^{-1}\pmb{X'y}] = \\ &(\pmb{X'X})^{-1}\pmb{X'}E[\pmb{y}] = \\ &(\pmb{X'X})^{-1}\pmb{X'X\beta} = \pmb{\beta}\\ \end{aligned} \]

Theorem (6.2.6)

If \(cov(\pmb{y})=\sigma^2\pmb{I}\), then \(cov(\pmb{\hat{\beta}})=\sigma^2(\pmb{X'X})^{-1}\).

proof

Using (5.1.11) we have

\[ \begin{aligned} &cov(\pmb{\hat{\beta}}) = \\ &cov(\pmb{(X'X})^{-1}\pmb{X'y}) = \\ &[\pmb{(X'X})^{-1}\pmb{X'}]cov(\pmb{y})[\pmb{(X'X})^{-1}\pmb{X'}]' = \\ &\pmb{(X'X})^{-1}\pmb{X'}\sigma^2\pmb{X}\pmb{(X'X})^{-1}= \\ &\sigma^2\pmb{(X'X})^{-1}(\pmb{X'}\pmb{X})\pmb{(X'X})^{-1} =\\ &\sigma^2\pmb{(X'X})^{-1} \end{aligned} \]
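Both (6.2.5) and (6.2.6) are easy to illustrate with a small simulation (the design, \(\pmb{\beta}\) and \(\sigma\) below are made up): the average of the simulated \(\pmb{\hat{\beta}}\)'s should be close to \(\pmb{\beta}\), and their sample covariance matrix close to \(\sigma^2(\pmb{X'X})^{-1}\).

set.seed(444)
n=50; beta=c(1, 2, -1); sigma=2
X=cbind(1, runif(n), rnorm(n))                     # a fixed design matrix
XtXinv=solve(t(X)%*%X)
B=t(replicate(10000,
      c(XtXinv%*%t(X)%*%(X%*%beta+rnorm(n, 0, sigma)))))
round(colMeans(B), 2)                              # close to beta
round(cov(B), 3)                                   # close to sigma^2 (X'X)^{-1}
round(sigma^2*XtXinv, 3)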

Example (6.2.7)

In the case of simple linear regression we have

\[ \begin{aligned} &cov(\pmb{\hat{\beta}}) = cov\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix}=\\ &\begin{pmatrix} var(\hat{\beta}_0) & cov(\hat{\beta}_0,\hat{\beta}_1) \\ cov(\hat{\beta}_0,\hat{\beta}_1) & var(\hat{\beta}_1) \end{pmatrix} = \sigma^2(\pmb{X'X})^{-1} = \\ &\sigma^2\frac1{n\sum_i x_{i}^2-(\sum_i x_{i})^2} \begin{pmatrix} \sum_i x_{i}^2 & -\sum_i x_{i} \\ -\sum_i x_{i} & n \end{pmatrix} = \\ &\frac{\sigma^2}{\sum_i (x_{i}- \bar{x})^2} \begin{pmatrix} \overline{x^2} & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix} \\ \end{aligned} \]

so we find

\[ \begin{aligned} &var(\hat{\beta}_0) =\frac{\sigma^2\overline{x^2}}{\sum_i (x_{i}- \bar{x})^2} \\ &var(\hat{\beta}_1) =\frac{\sigma^2}{\sum_i (x_{i}- \bar{x})^2} \\ &cov(\hat{\beta}_0,\hat{\beta}_1) =\frac{-\sigma^2\overline{x}}{\sum_i (x_{i}- \bar{x})^2} \end{aligned} \]

Note that if \(\bar{x}>0\), \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are negatively correlated, and vice versa.

Example (6.2.7a)

For the no-intercept model we find

\[var(\hat{\beta}) =\sigma^2(\pmb{X'X})^{-1} = \sigma^2/(\sum_i x_i^2)\]

Theorem (6.2.8)

Gauss-Markov

If \(E[\pmb{y}]=\pmb{X\beta}\) and \(cov(\pmb{y})=\sigma^2\pmb{I}\), then the least squares estimators have the smallest variance among all linear unbiased estimators.

proof

Any linear estimator can be written in the form \(\pmb{Ay}\). To be unbiased we have to have \(E[\pmb{Ay}]=\pmb{\beta}\), and therefore we find

\[E[\pmb{Ay}]=\pmb{A}E[\pmb{y}]=\pmb{AX\beta}=\pmb{\beta}\]

and since this has to hold for all possible \(\pmb{\beta}\) we have

\[\pmb{AX}=\pmb{I}\]

The covariance of \(\pmb{Ay}\) is given by

\[cov(\pmb{Ay}) = \pmb{A}cov(\pmb{y})\pmb{A}'= \sigma^2\pmb{A}\pmb{A}'\]

The variances of the \(\hat{\beta}_j\)’s are the diagonal elements of \(\sigma^2\pmb{A}\pmb{A}'\), and so we need to choose \(\pmb{A}\), subject to \(\pmb{AX}=\pmb{I}\) so that the diagonal elements are minimized.

Let’s write

\[\pmb{A}\pmb{A}' = \left(\pmb{A}-(\pmb{X'X})^{-1}\pmb{X}'+(\pmb{X'X})^{-1}\pmb{X}'\right)\left(\pmb{A}-(\pmb{X'X})^{-1}\pmb{X}'+(\pmb{X'X})^{-1}\pmb{X}'\right)'=\\(a+b)(a+b)'\]

where \(a=\pmb{A}-(\pmb{X'X})^{-1}\pmb{X}'\) and \(b=(\pmb{X'X})^{-1}\pmb{X}'\)

now \((a+b)(a+b)'=aa'+ab'+ba'+bb'\), and we find

\[ \begin{aligned} &ab'=\left(\pmb{A}-(\pmb{X'X})^{-1}\pmb{X}'\right)\left((\pmb{X'X})^{-1}\pmb{X}'\right)' = \\ &\left(\pmb{A}-(\pmb{X'X})^{-1}\pmb{X}'\right)\pmb{X}(\pmb{X'X})^{-1} = \\ &\pmb{A}\pmb{X}(\pmb{X'X})^{-1} - (\pmb{X'X})^{-1}\pmb{X}'\pmb{X}(\pmb{X'X})^{-1} = \\ &\pmb{I}(\pmb{X'X})^{-1} - (\pmb{X'X})^{-1} = \pmb{0}\\ \end{aligned} \]

also \(ba'=(ab')'=\pmb{0}\) and

\[ \begin{aligned} &bb' = (\pmb{X'X})^{-1}\pmb{X}'\left((\pmb{X'X})^{-1}\pmb{X}'\right)' = \\ &(\pmb{X'X})^{-1}\pmb{X'X}(\pmb{X'X})^{-1} = \\ &(\pmb{X'X})^{-1} \end{aligned} \] because \((\pmb{X'X})^{-1}\) is symmetric. So we find

\[\pmb{A}\pmb{A}' =\left(\pmb{A}-(\pmb{X'X})^{-1}\pmb{X}'\right)\left(\pmb{A}-(\pmb{X'X})^{-1}\pmb{X}'\right)'+(\pmb{X'X})^{-1}\]

The matrix \(\left(\pmb{A}-(\pmb{X'X})^{-1}\pmb{X}'\right)\left(\pmb{A}-(\pmb{X'X})^{-1}\pmb{X}'\right)'\) is positive semidefinite by (4.2.10), so its diagonal elements are greater than or equal to 0. They are all equal to 0 exactly when \(\pmb{A}=(\pmb{X'X})^{-1}\pmb{X}'\), that is, for the least squares estimator.

Comment

This theorem is quite remarkable: it requires NO distributional assumptions on the \(\pmb{y}\)'s beyond the form of the mean and the covariance matrix; in particular, normality is not needed!

Definition (6.2.9)

An estimator is called BLUE (best linear unbiased estimator) if it is linear, unbiased and has minimum variance among all linear unbiased estimators.

So the Gauss-Markov theorem states that (under its conditions) least squares estimators are BLUE.
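One way to see the theorem in action numerically (a sketch; the design, weights and \(\sigma^2\) below are made up): any matrix of the form \(\pmb{A}=(\pmb{X'WX})^{-1}\pmb{X'W}\) with \(\pmb{W}\) positive definite satisfies \(\pmb{AX}=\pmb{I}\), so \(\pmb{Ay}\) is a linear unbiased estimator, but its variances are never smaller than those of least squares.

set.seed(555)
n=20; X=cbind(1, rnorm(n), runif(n)); sigma2=1
W=diag(runif(n, 0.5, 2))                          # arbitrary positive weights
A=solve(t(X)%*%W%*%X)%*%t(X)%*%W                  # AX = I, so Ay is unbiased
round(A%*%X, 10)                                  # identity matrix
rbind(ls=diag(sigma2*solve(t(X)%*%X)),            # least squares variances
      alt=diag(sigma2*A%*%t(A)))                  # alternative: never smaller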

Corollary (6.2.10)

If \(E[\pmb{y}]=\pmb{X\beta}\) and \(cov(\pmb{y})=\sigma^2\pmb{I}\), then the BLUE of \(\pmb{a'\beta}\) is \(\pmb{a'\hat{\beta}}\).


The theorem also shows that the variance of the estimator depends on \(\pmb{X}\). Since \(\pmb{X}\) is fixed before the experiment, it is often under the control of the researcher. In that case it is usually a good idea to choose the columns of \(\pmb{X}\) to be orthogonal, so that \(\pmb{X'X}\) is diagonal; this typically maximizes the power of the associated hypothesis tests.
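A small illustration with a made-up \(\pm 1\) design for two balanced factors: the columns of \(\pmb{X}\) are orthogonal, so \(\pmb{X'X}\) is diagonal and the coefficient estimates are uncorrelated.

X=cbind(1, rep(c(-1, 1), each=4), rep(c(-1, 1), 4))   # orthogonal +-1 design
t(X)%*%X                                              # diagonal
solve(t(X)%*%X)                                       # so cov(beta hat) is diagonal as well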

Theorem (6.2.11)

If \(\pmb{x}= (1, x_1, ..., x_k)'\) and \(\pmb{z}= (1, c_1x_1, ..., c_kx_k)'\) with all \(c_j\ne 0\), then

\[\pmb{\hat{\beta}}'\pmb{x}=\pmb{\hat{\beta}}_z'\pmb{z}\]

where \(\pmb{\hat{\beta}}_z\) is the least squares estimator of the regression of \(\pmb{y}\) on \(\pmb{z}\).

proof omitted

This theorem states that least squares predictions are invariant under rescaling of the predictors (changes of scale); the coefficients themselves simply rescale accordingly.

Corollary (6.2.12)

\(\pmb{\hat{y}}\) is invariant to a full-rank transformation of \(\pmb{X}\)
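A quick check of this for the houseprice data (assuming the houseprice data frame is available): measuring Sqfeet in thousands of square feet changes the coefficients but leaves the fitted values unchanged.

A=as.matrix(houseprice)
y=A[, 1, drop=FALSE]
X=cbind(1, A[, -1])
Z=X; Z[, 2]=Z[, 2]/1000                       # Sqfeet rescaled to 1000s of square feet
yhat.X=X%*%solve(t(X)%*%X)%*%t(X)%*%y
yhat.Z=Z%*%solve(t(Z)%*%Z)%*%t(Z)%*%y
max(abs(yhat.X-yhat.Z))                       # essentially 0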

Estimation of \(\sigma^2\)

As in simple regression, least squares does not give us an estimate of \(\sigma^2\). By the assumptions we have \(E[y_i]= \pmb{x}_i' \pmb{\beta}\) and so

\[\sigma^2 = E[(y_i-\pmb{x}_i' \pmb{\beta})^2]\]

Again we can estimate \(\sigma^2\) with the average of these terms, so

\[s^2=\frac1{n-k-1}\sum (y_i-\pmb{x}_i' \pmb{\hat{\beta}})^2\]

where we use n-k-1 so that \(s^2\) is unbiased, see below. Note that by the above corollary \(\pmb{x}_i' \pmb{\hat{\beta}}\) is BLUE for \(\pmb{x}_i' \pmb{\beta}\).

By the proof of (6.2.2) we can write

\[ \begin{aligned} &s^2=\frac1{n-k-1}\sum (y_i-\pmb{x}_i' \pmb{\hat{\beta}})^2 = \\ &\frac1{n-k-1} (\pmb{y}-\pmb{X\hat{\beta}})'(\pmb{y}-\pmb{X\hat{\beta}}) = \\ &\frac1{n-k-1} (\pmb{y}'\pmb{y}-2\pmb{\hat{\beta}'X'y}+\pmb{\hat{\beta}'X'X\hat{\beta}}) = \\ &\frac1{n-k-1} (\pmb{y}'\pmb{y}-\pmb{\hat{\beta}'X'y}) = \frac{\text{SSE}}{n-k-1}\\ \end{aligned} \]

where the normal equations \(\pmb{X'X\hat{\beta}}=\pmb{X'y}\) give \(\pmb{\hat{\beta}'X'X\hat{\beta}}=\pmb{\hat{\beta}'X'y}\).

Theorem (6.2.13)

If \(E[\pmb{y}]=\pmb{X\beta}\) and \(cov(\pmb{y})=\sigma^2\pmb{I}\), then

\[E[s^2]=\sigma^2\]

proof

\[ \begin{aligned} &\text{SSE} = \\ &\pmb{y}'\pmb{y}-\pmb{\hat{\beta}'X'y} = \\ &\pmb{y}'\pmb{y}-[(\pmb{X'X})^{-1}\pmb{X'y}]'\pmb{X'y} = \\ &\pmb{y}'\pmb{y}-\pmb{y'X}(\pmb{X'X})^{-1}\pmb{X'y} = \\ &\pmb{y}'\left[ \pmb{I}-\pmb{X}(\pmb{X'X})^{-1}\pmb{X}'\right]\pmb{y} \end{aligned} \]

Using (5.3.3) we find

\[ \begin{aligned} &E[\text{SSE} ] = \\ &tr\left\{\left[ \pmb{I}-\pmb{X}(\pmb{X'X})^{-1}\pmb{X}'\right]\sigma^2\pmb{I} \right\} +E[\pmb{y}]'\left[ \pmb{I}-\pmb{X}(\pmb{X'X})^{-1}\pmb{X}'\right]E[\pmb{y}] = \\ &\sigma^2tr\left\{ \pmb{I}-\pmb{X}(\pmb{X'X})^{-1}\pmb{X}' \right\} +\pmb{\beta' X'}\left[ \pmb{I}-\pmb{X}(\pmb{X'X})^{-1}\pmb{X}'\right]\pmb{X\beta} = \\ &\sigma^2\left\{n-tr(\pmb{X}(\pmb{X'X})^{-1}\pmb{X}') \right\} +\pmb{\beta'X'X\beta}-\pmb{\beta'X'X\beta} = \\ &\sigma^2\left\{n-tr((\pmb{X'X})^{-1}\pmb{X'X}) \right\} =\\ &\sigma^2\left\{n-tr(\pmb{I}_{k+1}) \right\} = (n-k-1)\sigma^2 \end{aligned} \]

because \(tr(\pmb{AB})=tr(\pmb{BA})\) and \((\pmb{X'X})^{-1}\pmb{X'X}=\pmb{I}_{k+1}\), the \((k+1)\times (k+1)\) identity matrix.
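The trace identity in the last step is easy to verify numerically, for instance for the houseprice design matrix (assuming the data frame is available): \(tr\left(\pmb{X}(\pmb{X'X})^{-1}\pmb{X}'\right)=k+1=5\).

X=cbind(1, as.matrix(houseprice)[, -1])       # houseprice design matrix, k = 4 predictors
sum(diag(X%*%solve(t(X)%*%X)%*%t(X)))         # trace = k+1 = 5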

Corollary (6.2.14)

\(s^2(\pmb{X'X})^{-1}\) is an unbiased estimator of \(cov(\pmb{\hat{\beta}})\).


Theorem (6.2.15)

If \(E[\pmb{\epsilon}]=\pmb{0}\), \(cov(\pmb{\epsilon})=\sigma^2\pmb{I}\) and \(E[\epsilon_i^4]=3\sigma^4\), then \(s^2\) is the best (minimum variance) quadratic unbiased estimator of \(\sigma^2\).

proof omitted

Example (6.2.16)

Say \(Z\sim N(0, 1)\), then

\[ \begin{aligned} &\int_{-\infty}^\infty x^4 \exp \left\{-x^2/2 \right\}dx = \\ &\int_{-\infty}^\infty x^3 \left[x\exp \left\{-x^2/2 \right\}\right]dx = \\ &\left. x^3\left(-\exp \left\{-x^2/2 \right\}\right)\right\vert_{-\infty}^\infty - \int_{-\infty}^\infty 3x^2 \left(-\exp \left\{-x^2/2 \right\}\right)dx = \\ &3\int_{-\infty}^\infty x^2 \exp \left\{-x^2/2 \right\}dx \end{aligned} \]

using integration by parts with \(u=x^3\) and \(dv=x\exp\left\{-x^2/2\right\}dx\). Therefore

\[ \begin{aligned} &E[Z^4] =\int_{-\infty}^\infty x^4\frac1{\sqrt{2\pi}}\exp \left\{-x^2/2 \right\}dx = \\ &3\int_{-\infty}^\infty x^2 \frac1{\sqrt{2\pi}}\exp \left\{-x^2/2 \right\}dx = 3E[Z^2]=3\\ \end{aligned} \]

Let \(X\sim N(0, \sigma^2)\), then

\[E[X^4]=E[(\sigma Z)^4] = 3\sigma^4\]

so the condition of theorem (6.2.15) is fulfilled if the errors have a normal distribution.
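A quick Monte Carlo sanity check (sample size and \(\sigma\) made up): for \(X\sim N(0,\sigma^2)\) with \(\sigma=2\) the simulated fourth moment should be close to \(3\sigma^4=48\).

set.seed(666)
mean(rnorm(1e6, 0, 2)^4)      # should be close to 3*2^4 = 48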

Example (6.2.17)

For the houseprice data we find

A=as.matrix(houseprice)
n=nrow(A)
y=A[, 1, drop=FALSE]                      # response: Price
X=cbind(1, A[, -1])                       # design matrix with intercept column
betahat=solve(t(X)%*%X)%*%t(X)%*%y        # least squares estimate
sse=c(t(y)%*%y-t(betahat)%*%t(X)%*%y)     # SSE = y'y - betahat'X'y
sse
## [1] 4321.864
sse/(n-4-1)
## [1] 187.9071
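As a check, the same value is returned by R's lm summary, which reports \(s\) rather than \(s^2\):

summary(lm(Price ~ ., data=houseprice))$sigma^2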