
Multiple Regression

The Model

Definition (6.2.1)

We have a response vector $\pmb{y}=(y_1, ..., y_n)'$, a vector of regression coefficients $\pmb{\beta}=(\beta_0, \beta_1, ..., \beta_k)'$ and a predictor matrix

$$\pmb{X}=\begin{pmatrix}
1 & x_{11} & ... & x_{1k} \\
\vdots & \vdots & & \vdots \\
1 & x_{n1} & ... & x_{nk}
\end{pmatrix}$$

We also have a vector of errors $\pmb{\epsilon}=(\epsilon_1, ..., \epsilon_n)'$. Then a model of the form

$$y_i=\beta_0+\beta_1x_{i1}+...+\beta_kx_{ik}+\epsilon_i \quad i=1,..,n$$

or

$$\pmb{y}=\pmb{X}\pmb{\beta}+\pmb{\epsilon}$$

is called a multiple regression model. $\pmb{X}$ is called the design matrix. As in the simple regression case we assume for now that $\pmb{X}$ is fixed and not random.

Note that this includes models like

$$y_i=\beta_0+\beta_1x_i+\beta_2x_i^2$$

$$y_i=\beta_0+\beta_1\log(x_i)$$

because these models are linear in the coefficients $\beta_j$.

The assumptions are the same as in the simple regression model:

  1. $E[\epsilon_i]=0$ (model is correct)
  2. $\text{var}(\epsilon_i)=\sigma^2$ (equal variance, homoscedasticity)
  3. $\text{cov}(\epsilon_i,\epsilon_j)=0$ for $i\ne j$ (independence)
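
Here is a small simulation sketch that generates data satisfying this model and these assumptions (the sample size, coefficients and error standard deviation are made up for illustration):

set.seed(111)
n=50
x1=runif(n, 0, 10)            # predictors, treated as fixed from here on
x2=runif(n, 0, 10)
X=cbind(1, x1, x2)            # design matrix with a column of 1's
beta=c(1, 2, -0.5)            # (made-up) true coefficients
eps=rnorm(n, 0, 3)            # errors: mean 0, equal variance, independent
y=c(X%*%beta+eps)             # y = X beta + epsilon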

Estimation of $\pmb{\beta}$ and $\sigma^2$

Analogous to the simple regression case we can use the method of least squares and estimate the coefficients by minimizing

$$\sum_{i=1}^n\left(y_i-\beta_0-\beta_1x_{i1}-...-\beta_kx_{ik}\right)^2$$

Again one can differentiate this expression and solve the resulting system of equations; however, here is a better approach:

Theorem (6.2.2)

If $\pmb{y}=\pmb{X}\pmb{\beta}+\pmb{\epsilon}$, where $\pmb{X}$ is an $n\times(k+1)$ matrix of rank $k+1<n$, then the vector $\hat{\pmb{\beta}}$ that minimizes the least squares criterion is given by

$$\hat{\pmb{\beta}}=(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y}$$

proof

we have

$$
\begin{aligned}
\hat{\pmb{\epsilon}}'\hat{\pmb{\epsilon}} &=(\pmb{y}-\pmb{X}\hat{\pmb{\beta}})'(\pmb{y}-\pmb{X}\hat{\pmb{\beta}}) \\
&=\pmb{y}'(\pmb{y}-\pmb{X}\hat{\pmb{\beta}})-(\pmb{X}\hat{\pmb{\beta}})'(\pmb{y}-\pmb{X}\hat{\pmb{\beta}}) \\
&=\pmb{y}'\pmb{y}-\pmb{y}'\pmb{X}\hat{\pmb{\beta}}-(\pmb{X}\hat{\pmb{\beta}})'\pmb{y}+\hat{\pmb{\beta}}'\pmb{X}'\pmb{X}\hat{\pmb{\beta}} \\
&=\pmb{y}'\pmb{y}-2\pmb{y}'\pmb{X}\hat{\pmb{\beta}}+\hat{\pmb{\beta}}'\pmb{X}'\pmb{X}\hat{\pmb{\beta}}
\end{aligned}
$$

because $\pmb{y}'\pmb{X}\hat{\pmb{\beta}}$ is a scalar.

By (4.3.20) and (4.3.21) we have

$$
\begin{aligned}
\partial\,\hat{\pmb{\epsilon}}'\hat{\pmb{\epsilon}}/\partial\pmb{\beta} &=\partial\left(\pmb{y}'\pmb{y}-2\pmb{y}'\pmb{X}\hat{\pmb{\beta}}+\hat{\pmb{\beta}}'\pmb{X}'\pmb{X}\hat{\pmb{\beta}}\right)/\partial\pmb{\beta} \\
&=\partial\left(\pmb{y}'\pmb{y}\right)/\partial\pmb{\beta}-\partial\left(2\pmb{y}'\pmb{X}\hat{\pmb{\beta}}\right)/\partial\pmb{\beta}+\partial\left(\hat{\pmb{\beta}}'\pmb{X}'\pmb{X}\hat{\pmb{\beta}}\right)/\partial\pmb{\beta} \\
&=\pmb{0}-2\pmb{X}'\pmb{y}+2\pmb{X}'\pmb{X}\hat{\pmb{\beta}}=\pmb{0} \\
\Rightarrow\ \pmb{X}'\pmb{X}\hat{\pmb{\beta}} &=\pmb{X}'\pmb{y}
\end{aligned}
$$

$\pmb{X}'\pmb{X}$ is of full rank and therefore has an inverse, and so we have the result.

Definition

$$\pmb{X}'\pmb{X}\hat{\pmb{\beta}}=\pmb{X}'\pmb{y}$$

are called the normal equations.

Example (6.2.3)

Prices of 28 residences located 30 miles south of a large metropolitan area.

kable.nice(houseprice, do.row.names = FALSE)
Price Sqfeet Floors Bedrooms Baths
69.0 1500.000 1 2 1.0
118.5 1880.952 1 2 2.0
104.0 1976.190 1 3 2.0
116.5 1880.952 1 3 2.0
121.5 1880.952 1 3 2.0
125.0 1976.190 1 3 2.0
128.0 2357.143 2 3 2.5
129.9 2166.667 1 3 1.7
133.0 2166.667 2 3 2.5
135.0 2166.667 2 3 2.5
137.5 2357.143 2 3 2.5
139.9 2166.667 1 3 2.0
143.9 2261.905 2 3 2.5
147.9 2547.619 2 3 2.5
154.9 2357.143 2 3 2.5
160.0 2738.095 2 3 2.0
169.0 2357.143 1 3 2.0
169.9 2642.857 1 3 2.0
125.0 2166.667 1 4 2.0
134.9 2166.667 1 4 2.0
139.9 2547.619 1 4 2.0
147.0 2642.857 1 4 2.0
159.0 2261.905 1 4 2.0
169.9 2547.619 2 4 3.0
178.9 2738.095 1 4 2.0
194.5 2833.333 2 4 3.0
219.9 2928.571 1 4 2.5
269.0 3309.524 2 4 3.0

Here we want to predict the Price from the other four variables. First we take a look at the data:

par(mfrow=c(2, 2))
plot(houseprice$Sqfeet, houseprice$Price, pch=20)
boxplot(Price~Floors, data=houseprice)
boxplot(Price~Bedrooms, data=houseprice)
boxplot(Price~Baths, data=houseprice)

A=as.matrix(houseprice)
y=A[, 1, drop=FALSE]               # response: Price
X=cbind(1, A[, -1])                # design matrix: column of 1's plus the four predictors
beta=solve(t(X)%*%X)%*%t(X)%*%y    # least squares estimate (X'X)^{-1}X'y
round(c(beta), 3)
## [1] -67.620   0.086 -26.493  -9.286  37.381
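
As a check (assuming houseprice is stored as a data frame), R's built-in lm function should reproduce these numbers:

# fit via lm; the coefficients should match the matrix calculation above
round(coef(lm(Price~., data=houseprice)), 3)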

We can write

$$\pmb{X}'\pmb{X}=\begin{pmatrix}
n & \sum_i x_{i1} & \sum_i x_{i2} & ... & \sum_i x_{ik} \\
\sum_i x_{i1} & \sum_i x_{i1}^2 & \sum_i x_{i1}x_{i2} & ... & \sum_i x_{i1}x_{ik} \\
\vdots & & & & \vdots \\
\sum_i x_{ik} & \sum_i x_{i1}x_{ik} & \sum_i x_{i2}x_{ik} & ... & \sum_i x_{ik}^2
\end{pmatrix}$$

and

$$\pmb{X}'\pmb{y}=\begin{pmatrix} \sum_i y_i \\ \sum_i x_{i1}y_i \\ \vdots \\ \sum_i x_{ik}y_i \end{pmatrix}$$

Say $\hat{\pmb{\beta}}=(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y}$, then

$$\hat{\pmb{y}}=\pmb{X}\hat{\pmb{\beta}}$$

are called the fitted values and

$$\hat{\pmb{\epsilon}}=\pmb{y}-\pmb{X}\hat{\pmb{\beta}}=\pmb{y}-\hat{\pmb{y}}$$

are called the residuals.
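
For the houseprice data the fitted values and residuals can be computed directly (a sketch, reusing the objects X, y and beta from Example (6.2.3)):

yhat=X%*%beta       # fitted values X betahat
ehat=y-yhat         # residuals y - yhat
round(head(cbind(y, yhat, ehat)), 1)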

Example (6.2.4)

Let’s study a simple regression problem as a special case of a multiple regression problem. Then we have

$$\pmb{X}=\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$$

$$
\begin{aligned}
\pmb{X}'\pmb{X} &=\begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} \qquad
\pmb{X}'\pmb{y}=\begin{pmatrix} \sum_i y_i \\ \sum_i x_iy_i \end{pmatrix} \\
(\pmb{X}'\pmb{X})^{-1} &=\frac1{n\sum_i x_i^2-\left(\sum_i x_i\right)^2}\begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix} \\
(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y} &=\frac1{n\sum_i x_i^2-\left(\sum_i x_i\right)^2}\begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix}\begin{pmatrix} \sum_i y_i \\ \sum_i x_iy_i \end{pmatrix} \\
&=\frac1{n\sum_i x_i^2-\left(\sum_i x_i\right)^2}\begin{pmatrix} \left(\sum_i x_i^2\right)\left(\sum_i y_i\right)-\sum_i x_i\sum_i x_iy_i \\ -\left(\sum_i x_i\right)\left(\sum_i y_i\right)+n\sum_i x_iy_i \end{pmatrix}
\end{aligned}
$$

which is the same as (6.1.2).
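
A quick numerical check with some made-up data (the values of x, y and the sample size are arbitrary):

set.seed(222)
x=runif(20, 0, 10)
y=2+3*x+rnorm(20)
X=cbind(1, x)
round(c(solve(t(X)%*%X)%*%t(X)%*%y), 3)   # matrix formula
round(coef(lm(y~x)), 3)                   # same numbers from lm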

Example (6.2.4a)

Another special case is the no-intercept model $y_i=\beta x_i+\epsilon_i$, so

$$\pmb{X}=\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$$

$$\pmb{X}'\pmb{X}=\sum_i x_i^2$$

and

$$\hat{\beta}=(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y}=\frac{\sum_i x_iy_i}{\sum_i x_i^2}$$
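
Again a quick check with made-up data:

set.seed(333)
x=runif(25, 1, 5)
y=2.5*x+rnorm(25)
sum(x*y)/sum(x^2)        # betahat from the formula above
coef(lm(y~x-1))          # no-intercept fit in R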

Properties of Least Squares Estimators

We will assume that $\pmb{X}$ is fixed and of full rank unless stated otherwise.

Theorem (6.2.5)

If $E[\pmb{y}]=\pmb{X}\pmb{\beta}$, then $\hat{\pmb{\beta}}$ is an unbiased estimator of $\pmb{\beta}$.

proof

$$E[\hat{\pmb{\beta}}]=E\left[(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y}\right]=(\pmb{X}'\pmb{X})^{-1}\pmb{X}'E[\pmb{y}]=(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{X}\pmb{\beta}=\pmb{\beta}$$

Theorem (6.2.6)

If $\text{cov}(\pmb{y})=\sigma^2\pmb{I}$, then $\text{cov}(\hat{\pmb{\beta}})=\sigma^2(\pmb{X}'\pmb{X})^{-1}$.

proof

Using (5.1.11) we have

$$
\begin{aligned}
\text{cov}(\hat{\pmb{\beta}}) &=\text{cov}\left((\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y}\right) \\
&=\left[(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right]\text{cov}(\pmb{y})\left[(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right]' \\
&=(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\,\sigma^2\pmb{I}\,\pmb{X}(\pmb{X}'\pmb{X})^{-1} \\
&=\sigma^2(\pmb{X}'\pmb{X})^{-1}(\pmb{X}'\pmb{X})(\pmb{X}'\pmb{X})^{-1}=\sigma^2(\pmb{X}'\pmb{X})^{-1}
\end{aligned}
$$
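
Both results are easy to verify with a small simulation (a sketch; the design, beta and sigma are made up):

set.seed(444)
n=30; sigma=2
x1=runif(n, 0, 10); x2=runif(n, 0, 10)
X=cbind(1, x1, x2)                     # fixed design
beta=c(1, 2, -0.5)
B=10000
betahat=matrix(0, B, 3)
for(i in 1:B) {
  y=c(X%*%beta)+rnorm(n, 0, sigma)
  betahat[i, ]=c(solve(t(X)%*%X)%*%t(X)%*%y)
}
round(colMeans(betahat), 3)            # close to beta: unbiasedness
round(cov(betahat), 4)                 # close to sigma^2 (X'X)^{-1}:
round(sigma^2*solve(t(X)%*%X), 4)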

Example (6.2.7)

In the case of simple linear regression we have

$$
\begin{aligned}
\text{cov}(\hat{\pmb{\beta}}) &=\text{cov}\begin{pmatrix}\hat{\beta}_0\\ \hat{\beta}_1\end{pmatrix}
=\begin{pmatrix}\text{var}(\hat{\beta}_0) & \text{cov}(\hat{\beta}_0,\hat{\beta}_1)\\ \text{cov}(\hat{\beta}_0,\hat{\beta}_1) & \text{var}(\hat{\beta}_1)\end{pmatrix} \\
&=\sigma^2(\pmb{X}'\pmb{X})^{-1}
=\frac{\sigma^2}{n\sum_i x_i^2-\left(\sum_i x_i\right)^2}\begin{pmatrix}\sum_i x_i^2 & -\sum_i x_i\\ -\sum_i x_i & n\end{pmatrix} \\
&=\frac{\sigma^2}{\sum_i (x_i-\bar{x})^2}\begin{pmatrix}\overline{x^2} & -\bar{x}\\ -\bar{x} & 1\end{pmatrix}
\end{aligned}
$$

so we find

$$
\begin{aligned}
\text{var}(\hat{\beta}_0) &=\frac{\sigma^2\,\overline{x^2}}{\sum_i (x_i-\bar{x})^2} \\
\text{var}(\hat{\beta}_1) &=\frac{\sigma^2}{\sum_i (x_i-\bar{x})^2} \\
\text{cov}(\hat{\beta}_0,\hat{\beta}_1) &=\frac{-\sigma^2\,\bar{x}}{\sum_i (x_i-\bar{x})^2}
\end{aligned}
$$

Note that if $\bar{x}>0$, $\hat{\beta}_0$ and $\hat{\beta}_1$ are negatively correlated, and vice versa.

Example (6.2.7a)

For the no-intercept model we find

$$\text{var}(\hat{\beta})=\sigma^2(\pmb{X}'\pmb{X})^{-1}=\sigma^2/\left(\sum_i x_i^2\right)$$

Theorem (6.2.8)

Gauss-Markov

If $E[\pmb{y}]=\pmb{X}\pmb{\beta}$ and $\text{cov}(\pmb{y})=\sigma^2\pmb{I}$, then the least squares estimators have the smallest variance among all linear unbiased estimators.

proof

Any linear estimator can be written in the form $\pmb{A}\pmb{y}$. To be unbiased we have to have $E[\pmb{A}\pmb{y}]=\pmb{\beta}$, and therefore we find

$$E[\pmb{A}\pmb{y}]=\pmb{A}E[\pmb{y}]=\pmb{A}\pmb{X}\pmb{\beta}=\pmb{\beta}$$

and since this has to hold for all possible $\pmb{\beta}$ we have

$$\pmb{A}\pmb{X}=\pmb{I}$$

The covariance of $\pmb{A}\pmb{y}$ is given by

$$\text{cov}(\pmb{A}\pmb{y})=\pmb{A}\,\text{cov}(\pmb{y})\,\pmb{A}'=\sigma^2\pmb{A}\pmb{A}'$$

The variances of the $\hat{\beta}_j$'s are the diagonal elements of $\sigma^2\pmb{A}\pmb{A}'$, and so we need to choose $\pmb{A}$, subject to $\pmb{A}\pmb{X}=\pmb{I}$, so that the diagonal elements are minimized.

Let’s write

$$\pmb{A}\pmb{A}'=\left(\pmb{A}-(\pmb{X}'\pmb{X})^{-1}\pmb{X}'+(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)\left(\pmb{A}-(\pmb{X}'\pmb{X})^{-1}\pmb{X}'+(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)'=(\pmb{a}+\pmb{b})(\pmb{a}+\pmb{b})'$$

where $\pmb{a}=\pmb{A}-(\pmb{X}'\pmb{X})^{-1}\pmb{X}'$ and $\pmb{b}=(\pmb{X}'\pmb{X})^{-1}\pmb{X}'$.

Now $(\pmb{a}+\pmb{b})(\pmb{a}+\pmb{b})'=\pmb{a}\pmb{a}'+\pmb{a}\pmb{b}'+\pmb{b}\pmb{a}'+\pmb{b}\pmb{b}'$, and we find

$$
\begin{aligned}
\pmb{a}\pmb{b}' &=\left(\pmb{A}-(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)\left((\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)' \\
&=\pmb{A}\pmb{X}(\pmb{X}'\pmb{X})^{-1}-(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{X}(\pmb{X}'\pmb{X})^{-1} \\
&=\pmb{I}(\pmb{X}'\pmb{X})^{-1}-(\pmb{X}'\pmb{X})^{-1}=\pmb{0}
\end{aligned}
$$

Also $\pmb{b}\pmb{a}'=\pmb{0}$ and

$$\pmb{b}\pmb{b}'=(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\left((\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)'=(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{X}(\pmb{X}'\pmb{X})^{-1}=(\pmb{X}'\pmb{X})^{-1}$$

because $\pmb{X}'\pmb{X}$ is symmetric. So we find

$$\pmb{A}\pmb{A}'=\left(\pmb{A}-(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)\left(\pmb{A}-(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)'+(\pmb{X}'\pmb{X})^{-1}$$

The matrix $\left(\pmb{A}-(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)\left(\pmb{A}-(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)'$ is positive semidefinite by (4.2.10), and so its diagonal elements are greater than or equal to 0. They are clearly equal to 0 if $\pmb{A}=(\pmb{X}'\pmb{X})^{-1}\pmb{X}'$.

Comment

This theorem is quite remarkable because it makes NO assumptions about the distribution of the $y_i$'s!

Definition (6.2.9)

An estimator is called BLUE (best linear unbiased estimator) if it is linear, unbiased and has minimum variance among all linear unbiased estimators.

So the Gauss-Markov theorem states that (under its conditions) least squares estimators are BLUE.
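
A small simulation sketch illustrating the theorem (all numbers are made up): both the least squares estimator and the "use only the first half of the data" estimator are linear and unbiased, but the least squares estimator has the smaller variances:

set.seed(555)
n=40; sigma=2
x=runif(n, 0, 10)
X=cbind(1, x)                     # fixed design
beta=c(1, 2)                      # (made-up) true coefficients
X1=X[1:20, ]                      # design of the first half of the data
B=10000
b.ls=b.half=matrix(0, B, 2)
for(i in 1:B) {
  y=c(X%*%beta)+rnorm(n, 0, sigma)
  b.ls[i, ]=c(solve(t(X)%*%X)%*%t(X)%*%y)              # least squares, all data
  b.half[i, ]=c(solve(t(X1)%*%X1)%*%t(X1)%*%y[1:20])   # linear and unbiased, but wasteful
}
round(rbind(colMeans(b.ls), colMeans(b.half)), 3)             # both close to beta
round(rbind(apply(b.ls, 2, var), apply(b.half, 2, var)), 4)   # LS variances are smaller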

Corollary (6.2.10)

If $E[\pmb{y}]=\pmb{X}\pmb{\beta}$ and $\text{cov}(\pmb{y})=\sigma^2\pmb{I}$, then the BLUE of $\pmb{a}'\pmb{\beta}$ is $\pmb{a}'\hat{\pmb{\beta}}$.


The theorem also shows that the variance of the estimator depends on $\pmb{X}$. As this is assumed to be fixed before the experiment, it is often under the control of the researcher. In this case it is often a good idea to choose the columns of $\pmb{X}$ to be orthogonal, so that $\pmb{X}'\pmb{X}$ is diagonal. This often maximizes the power of the corresponding hypothesis tests.
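
For example, in a (hypothetical) $2^2$ factorial design with $\pm 1$ coding and two replicates per level combination, the columns of X are orthogonal and X'X is diagonal:

x1=rep(c(-1, 1), each=4)
x2=rep(c(-1, 1), times=4)
X=cbind(1, x1, x2)
t(X)%*%X      # diagonal, so the coefficient estimates are uncorrelated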

Theorem (6.2.11)

If $\pmb{x}=(1,x_1,..,x_k)'$ and $\pmb{z}=(1,c_1x_1,..,c_kx_k)'$, then

$$\hat{\pmb{\beta}}'\pmb{x}=\hat{\pmb{\beta}}_z'\pmb{z}$$

where $\hat{\pmb{\beta}}_z$ is the least squares estimator of the regression of $\pmb{y}$ on $\pmb{z}$.

proof omitted

This theorem states that least squares estimators are invariant under simple scalar multiplication, or changes of scale.

Corollary (6.2.12)

$\hat{\pmb{y}}$ is invariant to a full-rank transformation of $\pmb{X}$.
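
A quick check with the houseprice data (assuming it is a data frame): rescaling Sqfeet to thousands of square feet is a full-rank transformation of X; the coefficient of Sqfeet changes, but the fitted values do not:

fit1=lm(Price~Sqfeet+Floors+Bedrooms+Baths, data=houseprice)
fit2=lm(Price~I(Sqfeet/1000)+Floors+Bedrooms+Baths, data=houseprice)
all.equal(fitted(fit1), fitted(fit2))    # TRUE: identical fitted values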

Estimation of $\sigma^2$

As in simple regression, least squares does not give us an estimate of $\sigma^2$. By the assumptions we have $E[y_i]=\pmb{x}_i'\pmb{\beta}$, where $\pmb{x}_i'$ denotes the $i$th row of $\pmb{X}$, and so

$$\sigma^2=E\left[(y_i-\pmb{x}_i'\pmb{\beta})^2\right]$$

Again we can estimate $\sigma^2$ with the average of these terms, so

$$s^2=\frac1{n-k-1}\sum_{i=1}^n\left(y_i-\pmb{x}_i'\hat{\pmb{\beta}}\right)^2$$

where we divide by $n-k-1$ so that $s^2$ is unbiased, see below. Note that by the above corollary $\pmb{x}_i'\hat{\pmb{\beta}}$ is the BLUE of $\pmb{x}_i'\pmb{\beta}$.

By the proof of (6.2.2) we can write

$$
\begin{aligned}
s^2 &=\frac1{n-k-1}\sum_{i=1}^n\left(y_i-\pmb{x}_i'\hat{\pmb{\beta}}\right)^2 \\
&=\frac1{n-k-1}(\pmb{y}-\pmb{X}\hat{\pmb{\beta}})'(\pmb{y}-\pmb{X}\hat{\pmb{\beta}}) \\
&=\frac1{n-k-1}\left(\pmb{y}'\pmb{y}-\hat{\pmb{\beta}}'\pmb{X}'\pmb{y}\right)=\frac{SSE}{n-k-1}
\end{aligned}
$$

Theorem (6.2.13)

If $E[\pmb{y}]=\pmb{X}\pmb{\beta}$ and $\text{cov}(\pmb{y})=\sigma^2\pmb{I}$, then

$$E[s^2]=\sigma^2$$

proof

$$
\begin{aligned}
SSE &=\pmb{y}'\pmb{y}-\hat{\pmb{\beta}}'\pmb{X}'\pmb{y} \\
&=\pmb{y}'\pmb{y}-\left[(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y}\right]'\pmb{X}'\pmb{y} \\
&=\pmb{y}'\pmb{y}-\pmb{y}'\pmb{X}(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y} \\
&=\pmb{y}'\left[\pmb{I}-\pmb{X}(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right]\pmb{y}
\end{aligned}
$$

Using (5.3.3) we find

$$
\begin{aligned}
E[SSE] &=\text{tr}\left\{\left[\pmb{I}-\pmb{X}(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right]\sigma^2\pmb{I}\right\}+E[\pmb{y}]'\left[\pmb{I}-\pmb{X}(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right]E[\pmb{y}] \\
&=\sigma^2\text{tr}\left\{\pmb{I}-\pmb{X}(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right\}+\pmb{\beta}'\pmb{X}'\left[\pmb{I}-\pmb{X}(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right]\pmb{X}\pmb{\beta} \\
&=\sigma^2\left\{n-\text{tr}\left(\pmb{X}(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)\right\}+\pmb{\beta}'\pmb{X}'\pmb{X}\pmb{\beta}-\pmb{\beta}'\pmb{X}'\pmb{X}\pmb{\beta} \\
&=\sigma^2\left\{n-\text{tr}\left((\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{X}\right)\right\} \\
&=\sigma^2\left\{n-\text{tr}\left(\pmb{I}_{k+1}\right)\right\}=(n-k-1)\sigma^2
\end{aligned}
$$

because $\pmb{X}'\pmb{X}$ is a $(k+1)\times(k+1)$ matrix.

Corollary (6.2.14)

$s^2(\pmb{X}'\pmb{X})^{-1}$ is an unbiased estimator of $\text{cov}(\hat{\pmb{\beta}})$.


Theorem (6.2.15)

If $E[\pmb{\epsilon}]=\pmb{0}$, $\text{cov}(\pmb{\epsilon})=\sigma^2\pmb{I}$ and $E[\epsilon_i^4]=3\sigma^4$, then $s^2$ is the best (minimum variance) quadratic unbiased estimator of $\sigma^2$.

proof omitted

Example (6.2.16)

Say $Z\sim N(0,1)$, then

$$
\begin{aligned}
\int_{-\infty}^{\infty} x^4\exp\left\{-x^2/2\right\}dx &=\int_{-\infty}^{\infty} x^3\left[x\exp\left\{-x^2/2\right\}\right]dx \\
&=-x^3\exp\left\{-x^2/2\right\}\Big|_{-\infty}^{\infty}+\int_{-\infty}^{\infty} 3x^2\exp\left\{-x^2/2\right\}dx \\
&=3\int_{-\infty}^{\infty} x^2\exp\left\{-x^2/2\right\}dx \\
E[Z^4] &=\int_{-\infty}^{\infty} x^4\frac1{\sqrt{2\pi}}\exp\left\{-x^2/2\right\}dx \\
&=3\int_{-\infty}^{\infty} x^2\frac1{\sqrt{2\pi}}\exp\left\{-x^2/2\right\}dx=3
\end{aligned}
$$

Let $X\sim N(0,\sigma^2)$, then

$$E[X^4]=E[(\sigma Z)^4]=3\sigma^4$$

so the condition of theorem (6.2.15) is fulfilled if the errors have a normal distribution.
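
A one-line Monte Carlo check of this moment calculation:

mean(rnorm(1e6)^4)    # should be close to 3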

Example (6.2.17)

For the houseprice data we find

A=as.matrix(houseprice)
n=nrow(A)
y=A[, 1, drop=FALSE]
X=cbind(1, A[, -1])
betahat=(solve(t(X)%*%X)%*%t(X))%*%y     # least squares estimate
sse=c(t(y)%*%y-t(betahat)%*%t(X)%*%y)    # SSE = y'y - betahat'X'y
sse
## [1] 4321.864
sse/(n-4-1)    # s^2 = SSE/(n-k-1), here k=4
## [1] 187.9071
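
The same numbers come out of R's lm fit (again assuming houseprice is a data frame): summary(fit)$sigma is $s$ and vcov(fit) is $s^2(\pmb{X}'\pmb{X})^{-1}$:

fit=lm(Price~., data=houseprice)
summary(fit)$sigma^2      # estimate of sigma^2, matches sse/(n-4-1) above
vcov(fit)                 # estimated covariance matrix s^2 (X'X)^{-1}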