We have a response vector $\pmb{y}=(y_1, .., y_n)'$, a vector of regression coefficients $\pmb{\beta}=(\beta_0, \beta_1, .., \beta_k)'$ and a predictor matrix

$$
\pmb{X} = \begin{pmatrix}
1 & x_{11} & \dots & x_{1k} \\
\vdots & \vdots &  & \vdots \\
1 & x_{n1} & \dots & x_{nk}
\end{pmatrix}
$$

We also have a vector of errors $\pmb{\epsilon}=(\epsilon_1, .., \epsilon_n)'$. Then a model of the form

$$
y_i = \beta_0 + \beta_1 x_{i1} + .. + \beta_k x_{ik}\text{, } \quad i=1,..,n
$$

or

$$
\pmb{y} = \pmb{X}\pmb{\beta} + \pmb{\epsilon}
$$

is called a multiple regression model. $\pmb{X}$ is called the design matrix. As in the simple regression case we assume for now that $\pmb{X}$ is fixed and not random.
Note that this includes models like
$$
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2
$$

$$
y_i = \beta_0 + \beta_1 \log(x_i)
$$

because these models are linear in the coefficients $\beta_j$.
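For example, a minimal sketch with simulated data: both models above can be fit with R's `lm`, since `lm` only requires a model that is linear in the coefficients:

```r
# both models are linear in the coefficients, so lm() can fit them,
# even though they are nonlinear functions of x (simulated data)
set.seed(111)
x = runif(50, 1, 10)
y = 2 + 3*x - 0.5*x^2 + rnorm(50, 0, 2)
round(coef(lm(y ~ x + I(x^2))), 3)   # y = b0 + b1*x + b2*x^2
round(coef(lm(y ~ log(x))), 3)       # y = b0 + b1*log(x)
```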
The assumptions are the same as in the simple regression model.
As in the simple regression case, we can use the method of least squares and estimate the coefficients by minimizing

$$
\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - .. - \beta_k x_{ik})^2
$$

Again one could differentiate this expression and solve the resulting system of equations; however, here is a better approach:

If $\pmb{y} = \pmb{X}\pmb{\beta} + \pmb{\epsilon}$, where $\pmb{X}$ is an $n\times(k+1)$ matrix of rank $k+1<n$, then the vector $\hat{\pmb{\beta}}$ that minimizes the least squares criterion is given by

$$
\hat{\pmb{\beta}} = (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y}
$$
proof
we have
$$
\begin{aligned}
\hat{\pmb{\epsilon}}'\hat{\pmb{\epsilon}} &= (\pmb{y}-\pmb{X}\hat{\pmb{\beta}})'(\pmb{y}-\pmb{X}\hat{\pmb{\beta}}) \\
&= \pmb{y}'(\pmb{y}-\pmb{X}\hat{\pmb{\beta}}) - (\pmb{X}\hat{\pmb{\beta}})'(\pmb{y}-\pmb{X}\hat{\pmb{\beta}}) \\
&= \pmb{y}'\pmb{y} - \pmb{y}'\pmb{X}\hat{\pmb{\beta}} - \hat{\pmb{\beta}}'\pmb{X}'\pmb{y} + \hat{\pmb{\beta}}'\pmb{X}'\pmb{X}\hat{\pmb{\beta}} \\
&= \pmb{y}'\pmb{y} - 2\pmb{y}'\pmb{X}\hat{\pmb{\beta}} + \hat{\pmb{\beta}}'\pmb{X}'\pmb{X}\hat{\pmb{\beta}}
\end{aligned}
$$

because $\pmb{y}'\pmb{X}\hat{\pmb{\beta}}$ is a scalar and therefore equal to its transpose $\hat{\pmb{\beta}}'\pmb{X}'\pmb{y}$.
By (4.3.20) and (4.3.21) we have
$$
\begin{aligned}
\partial\,\hat{\pmb{\epsilon}}'\hat{\pmb{\epsilon}}/\partial\pmb{\beta} &= \partial\left(\pmb{y}'\pmb{y} - 2\pmb{y}'\pmb{X}\hat{\pmb{\beta}} + \hat{\pmb{\beta}}'\pmb{X}'\pmb{X}\hat{\pmb{\beta}}\right)/\partial\pmb{\beta} \\
&= \partial(\pmb{y}'\pmb{y})/\partial\pmb{\beta} - \partial(2\pmb{y}'\pmb{X}\hat{\pmb{\beta}})/\partial\pmb{\beta} + \partial(\hat{\pmb{\beta}}'\pmb{X}'\pmb{X}\hat{\pmb{\beta}})/\partial\pmb{\beta} \\
&= \pmb{0} - 2\pmb{X}'\pmb{y} + 2\pmb{X}'\pmb{X}\hat{\pmb{\beta}} = \pmb{0} \\
\pmb{X}'\pmb{X}\hat{\pmb{\beta}} &= \pmb{X}'\pmb{y}
\end{aligned}
$$
$\pmb{X}'\pmb{X}$ is of full rank and therefore has an inverse, and so we have the result.
Prices of 28 residences located 30 miles south of a large metropolitan area.
```r
kable.nice(houseprice, do.row.names = FALSE)
```
Price | Sqfeet | Floors | Bedrooms | Baths |
---|---|---|---|---|
69.0 | 1500.000 | 1 | 2 | 1.0 |
118.5 | 1880.952 | 1 | 2 | 2.0 |
104.0 | 1976.190 | 1 | 3 | 2.0 |
116.5 | 1880.952 | 1 | 3 | 2.0 |
121.5 | 1880.952 | 1 | 3 | 2.0 |
125.0 | 1976.190 | 1 | 3 | 2.0 |
128.0 | 2357.143 | 2 | 3 | 2.5 |
129.9 | 2166.667 | 1 | 3 | 1.7 |
133.0 | 2166.667 | 2 | 3 | 2.5 |
135.0 | 2166.667 | 2 | 3 | 2.5 |
137.5 | 2357.143 | 2 | 3 | 2.5 |
139.9 | 2166.667 | 1 | 3 | 2.0 |
143.9 | 2261.905 | 2 | 3 | 2.5 |
147.9 | 2547.619 | 2 | 3 | 2.5 |
154.9 | 2357.143 | 2 | 3 | 2.5 |
160.0 | 2738.095 | 2 | 3 | 2.0 |
169.0 | 2357.143 | 1 | 3 | 2.0 |
169.9 | 2642.857 | 1 | 3 | 2.0 |
125.0 | 2166.667 | 1 | 4 | 2.0 |
134.9 | 2166.667 | 1 | 4 | 2.0 |
139.9 | 2547.619 | 1 | 4 | 2.0 |
147.0 | 2642.857 | 1 | 4 | 2.0 |
159.0 | 2261.905 | 1 | 4 | 2.0 |
169.9 | 2547.619 | 2 | 4 | 3.0 |
178.9 | 2738.095 | 1 | 4 | 2.0 |
194.5 | 2833.333 | 2 | 4 | 3.0 |
219.9 | 2928.571 | 1 | 4 | 2.5 |
269.0 | 3309.524 | 2 | 4 | 3.0 |
Here we want to predict Price from the other four variables, so we first look at the relationship of Price with each predictor:
```r
# scatterplot of Price vs Sqfeet, boxplots of Price by the discrete predictors
par(mfrow=c(2, 2))
plot(houseprice$Sqfeet, houseprice$Price, pch=20)
boxplot(Price ~ Floors, data = houseprice)
boxplot(Price ~ Bedrooms, data = houseprice)
boxplot(Price ~ Baths, data = houseprice)
```
```r
# least squares by hand: beta-hat = (X'X)^{-1} X'y
A=as.matrix(houseprice)
y=A[, 1, drop=FALSE]
X=cbind(1, A[, -1])
beta=solve(t(X)%*%X)%*%t(X)%*%y
round(c(beta), 3)
## [1] -67.620 0.086 -26.493 -9.286 37.381
```
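As a quick check (assuming the `houseprice` data frame used above), R's built-in `lm` solves the same normal equations and should return the same coefficients:

```r
# same fit via lm(); should match the matrix computation above
fit = lm(Price ~ Sqfeet + Floors + Bedrooms + Baths, data = houseprice)
round(coef(fit), 3)
```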
We can write
$$
\pmb{X}'\pmb{X} = \begin{pmatrix}
n & \sum_i x_{i1} & \sum_i x_{i2} & \dots & \sum_i x_{ik} \\
\sum_i x_{i1} & \sum_i x_{i1}^2 & \sum_i x_{i1}x_{i2} & \dots & \sum_i x_{i1}x_{ik} \\
\vdots & \vdots & \vdots & \dots & \vdots \\
\sum_i x_{ik} & \sum_i x_{i1}x_{ik} & \sum_i x_{i2}x_{ik} & \dots & \sum_i x_{ik}^2
\end{pmatrix}
$$
and
$$
\pmb{X}'\pmb{y} = \begin{pmatrix}
\sum_i y_i \\ \sum_i x_{i1}y_i \\ \vdots \\ \sum_i x_{ik}y_i
\end{pmatrix}
$$
Say $\hat{\pmb{\beta}} = (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y}$, then

$$
\hat{\pmb{y}} = \pmb{X}\hat{\pmb{\beta}}
$$

are called the fitted values and

$$
\hat{\pmb{\epsilon}} = \pmb{y} - \pmb{X}\hat{\pmb{\beta}} = \pmb{y} - \hat{\pmb{y}}
$$
are called the residuals.
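A minimal sketch in R, assuming the matrices `X`, `y` and the vector `beta` computed for the houseprice data above:

```r
# fitted values and residuals from the matrix formulas
yhat = X %*% beta       # fitted values X beta-hat
epshat = y - yhat       # residuals
out = cbind(y, yhat, epshat)
colnames(out) = c("Price", "fitted", "residual")
round(head(out), 1)
```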
Let’s study simple regression as a special case of multiple regression. Then we have

$$
\pmb{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
$$

$$
\begin{aligned}
\pmb{X}'\pmb{X} &= \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix} \qquad
\pmb{X}'\pmb{y} = \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix} \\
(\pmb{X}'\pmb{X})^{-1} &= \frac{1}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2} \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix} \\
(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y} &= \frac{1}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2} \begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix} \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix} \\
&= \frac{1}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2} \begin{pmatrix} \left(\sum_i x_i^2\right)\left(\sum_i y_i\right) - \left(\sum_i x_i\right)\left(\sum_i x_i y_i\right) \\ -\left(\sum_i x_i\right)\left(\sum_i y_i\right) + n\sum_i x_i y_i \end{pmatrix}
\end{aligned}
$$

which is the same as (6.1.2).
Another special case is the no-intercept model $y_i = \beta x_i + \epsilon_i$, so

$$
\pmb{X} = (x_1, .., x_n)'
$$

$$
\pmb{X}'\pmb{X} = \sum x_i^2
$$

and

$$
\hat{\beta} = (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y} = \frac{\sum x_i y_i}{\sum x_i^2}
$$
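A quick numerical illustration of the no-intercept formula (a sketch with simulated data; `lm` fits the no-intercept model via the `- 1` formula term):

```r
# no-intercept model: beta-hat = sum(x*y)/sum(x^2)
set.seed(222)
x = runif(30, 0, 5)
y = 2.5*x + rnorm(30)
c(by.hand = sum(x*y)/sum(x^2), by.lm = unname(coef(lm(y ~ x - 1))))
```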
We will assume that $\pmb{X}$ is fixed and of full rank, unless stated otherwise.
If $E[\pmb{y}] = \pmb{X}\pmb{\beta}$, then $\hat{\pmb{\beta}}$ is an unbiased estimator of $\pmb{\beta}$.
proof
$$
E[\hat{\pmb{\beta}}] = E[(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y}] = (\pmb{X}'\pmb{X})^{-1}\pmb{X}'E[\pmb{y}] = (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{X}\pmb{\beta} = \pmb{\beta}
$$
If $cov(\pmb{y}) = \sigma^2\pmb{I}$, then $cov(\hat{\pmb{\beta}}) = \sigma^2(\pmb{X}'\pmb{X})^{-1}$.
proof
Using (5.1.11) we have
$$
\begin{aligned}
cov(\hat{\pmb{\beta}}) &= cov\left((\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y}\right) \\
&= \left[(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right]cov(\pmb{y})\left[(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right]' \\
&= (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\,\sigma^2\pmb{I}\,\pmb{X}(\pmb{X}'\pmb{X})^{-1} \\
&= \sigma^2(\pmb{X}'\pmb{X})^{-1}(\pmb{X}'\pmb{X})(\pmb{X}'\pmb{X})^{-1} \\
&= \sigma^2(\pmb{X}'\pmb{X})^{-1}
\end{aligned}
$$
In the case of simple linear regression we have
$$
\begin{aligned}
cov(\hat{\pmb{\beta}}) &= cov\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix}
= \begin{pmatrix} var(\hat{\beta}_0) & cov(\hat{\beta}_0,\hat{\beta}_1) \\ cov(\hat{\beta}_0,\hat{\beta}_1) & var(\hat{\beta}_1) \end{pmatrix}
= \sigma^2(\pmb{X}'\pmb{X})^{-1} \\
&= \frac{\sigma^2}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}\begin{pmatrix} \sum_i x_i^2 & -\sum_i x_i \\ -\sum_i x_i & n \end{pmatrix}
= \frac{\sigma^2}{\sum_i (x_i-\bar{x})^2}\begin{pmatrix} \overline{x^2} & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}
\end{aligned}
$$
so we find
$$
var(\hat{\beta}_0) = \frac{\sigma^2\,\overline{x^2}}{\sum_i (x_i-\bar{x})^2} \qquad
var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_i (x_i-\bar{x})^2} \qquad
cov(\hat{\beta}_0,\hat{\beta}_1) = \frac{-\sigma^2\,\bar{x}}{\sum_i (x_i-\bar{x})^2}
$$
Note that if $\bar{x}>0$, then $\hat{\beta}_0$ and $\hat{\beta}_1$ are negatively correlated, and vice versa.
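A small Monte Carlo sketch to check these formulas (a fixed design with known $\sigma$ is assumed; new errors are drawn repeatedly):

```r
# Monte Carlo check of the variance/covariance formulas in simple regression
set.seed(333)
n = 20; sigma = 2
x = runif(n, 0, 10)                       # fixed design
cf = t(replicate(10000, coef(lm(rnorm(n, 1 + 2*x, sigma) ~ x))))
Sxx = sum((x - mean(x))^2)
rbind(theory = c(var.b0 = sigma^2*mean(x^2)/Sxx,
                 var.b1 = sigma^2/Sxx,
                 cov.b0.b1 = -sigma^2*mean(x)/Sxx),
      simulation = c(var(cf[, 1]), var(cf[, 2]), cov(cf[, 1], cf[, 2])))
```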
Gauss-Markov
If $E[\pmb{y}] = \pmb{X}\pmb{\beta}$ and $cov(\pmb{y}) = \sigma^2\pmb{I}$, then the least squares estimators have the smallest variance among all linear unbiased estimators.
proof
Any linear estimator can be written in the form $\pmb{A}\pmb{y}$. To be unbiased we have to have $E[\pmb{A}\pmb{y}] = \pmb{\beta}$, and therefore we find

$$
E[\pmb{A}\pmb{y}] = \pmb{A}E[\pmb{y}] = \pmb{A}\pmb{X}\pmb{\beta} = \pmb{\beta}
$$

and since this has to hold for all possible $\pmb{\beta}$ we have

$$
\pmb{A}\pmb{X} = \pmb{I}
$$

The covariance of $\pmb{A}\pmb{y}$ is given by

$$
cov(\pmb{A}\pmb{y}) = \pmb{A}\,cov(\pmb{y})\,\pmb{A}' = \sigma^2\pmb{A}\pmb{A}'
$$

The variances of the $\hat{\beta}_j$'s are the diagonal elements of $\sigma^2\pmb{A}\pmb{A}'$, and so we need to choose $\pmb{A}$, subject to $\pmb{A}\pmb{X} = \pmb{I}$, so that the diagonal elements are minimized.
Let’s write
$$
\pmb{A}\pmb{A}' = \left(\pmb{A} - (\pmb{X}'\pmb{X})^{-1}\pmb{X}' + (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)\left(\pmb{A} - (\pmb{X}'\pmb{X})^{-1}\pmb{X}' + (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)' = (\pmb{a}+\pmb{b})(\pmb{a}+\pmb{b})'
$$

where $\pmb{a} = \pmb{A} - (\pmb{X}'\pmb{X})^{-1}\pmb{X}'$ and $\pmb{b} = (\pmb{X}'\pmb{X})^{-1}\pmb{X}'$.

Now $(\pmb{a}+\pmb{b})(\pmb{a}+\pmb{b})' = \pmb{a}\pmb{a}' + \pmb{a}\pmb{b}' + \pmb{b}\pmb{a}' + \pmb{b}\pmb{b}'$, and we find

$$
\begin{aligned}
\pmb{a}\pmb{b}' &= \left(\pmb{A} - (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)\left((\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)' \\
&= \pmb{A}\pmb{X}(\pmb{X}'\pmb{X})^{-1} - (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{X}(\pmb{X}'\pmb{X})^{-1} \\
&= \pmb{I}(\pmb{X}'\pmb{X})^{-1} - (\pmb{X}'\pmb{X})^{-1} = \pmb{0}
\end{aligned}
$$

also $\pmb{b}\pmb{a}' = (\pmb{a}\pmb{b}')' = \pmb{0}$ and

$$
\pmb{b}\pmb{b}' = (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\left((\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)' = (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{X}(\pmb{X}'\pmb{X})^{-1} = (\pmb{X}'\pmb{X})^{-1}
$$

because $\pmb{X}'\pmb{X}$ is symmetric. So we find

$$
\pmb{A}\pmb{A}' = \left(\pmb{A} - (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)\left(\pmb{A} - (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)' + (\pmb{X}'\pmb{X})^{-1}
$$

The matrix $\left(\pmb{A} - (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)\left(\pmb{A} - (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)'$ is positive semidefinite by (4.2.10), and so its diagonal elements are greater than or equal to 0. They are equal to 0 exactly when $\pmb{A} = (\pmb{X}'\pmb{X})^{-1}\pmb{X}'$, that is, for the least squares estimator.
Comment
This theorem is quite remarkable because it has NO requirements on the distribution of the $y_i$'s!
An estimator is called BLUE (best linear unbiased estimator) if it is linear, unbiased and has minimum variance among all linear unbiased estimators.
So the Gauss-Markov theorem states that (under its conditions) least squares estimators are BLUE.
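As a small illustration (a sketch with simulated simple-regression data; the "two-point" slope below is just one arbitrary competing linear unbiased estimator):

```r
# Gauss-Markov illustration: compare the OLS slope with another linear
# unbiased estimator, the "two-point" slope (y_n - y_1)/(x_n - x_1)
set.seed(555)
n = 20; sigma = 1
x = sort(runif(n, 0, 10))                 # fixed design
sim = replicate(10000, {
  y = 1 + 2*x + rnorm(n, 0, sigma)
  c(ols = unname(coef(lm(y ~ x))[2]), two.point = (y[n] - y[1])/(x[n] - x[1]))
})
rowMeans(sim)        # both are unbiased for beta1 = 2 ...
apply(sim, 1, var)   # ... but OLS has the smaller variance
```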
If $E[\pmb{y}] = \pmb{X}\pmb{\beta}$ and $cov(\pmb{y}) = \sigma^2\pmb{I}$, then the BLUE of $\pmb{a}'\pmb{\beta}$ is $\pmb{a}'\hat{\pmb{\beta}}$.
The theorem also shows that the variance of the estimator depends on $\pmb{X}$. As $\pmb{X}$ is assumed to be fixed before the experiment, it is often under the control of the researcher. In this case it is often a good idea to choose $\pmb{X}$ to be orthogonal, so that $\pmb{X}'\pmb{X}$ is diagonal; this often maximizes the power of the associated hypothesis tests.
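As an illustration (a sketch using a standard $2^2$ factorial design with $\pm1$ coding, not data from this section), an orthogonal design gives a diagonal $\pmb{X}'\pmb{X}$, so the estimated coefficients are uncorrelated:

```r
# a 2^2 factorial design with +/-1 coding is orthogonal: X'X is diagonal
X = cbind(Intercept = 1, x1 = rep(c(-1, 1), each = 2), x2 = rep(c(-1, 1), 2))
t(X) %*% X
```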
If $\pmb{x} = (1, x_1, .., x_k)'$ and $\pmb{z} = (1, c_1x_1, .., c_kx_k)'$, then

$$
\hat{\pmb{\beta}}'\pmb{x} = \hat{\pmb{\beta}}_z'\pmb{z}
$$

where $\hat{\pmb{\beta}}_z$ is the least squares estimator of the regression of $\pmb{y}$ on $\pmb{z}$.
proof omitted
This theorem states that least squares estimation is invariant under simple scalar multiplication, or changes of scale, of the predictors: the estimated mean response does not change.
More generally, $\hat{\pmb{y}}$ is invariant to a full-rank transformation of $\pmb{X}$.
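A quick check of this invariance with the houseprice data (a sketch assuming the data frame from above; Sqfeet is rescaled to thousands of square feet):

```r
# fitted values do not change when a predictor is rescaled
fit1 = lm(Price ~ Sqfeet + Floors + Bedrooms + Baths, data = houseprice)
hp2 = houseprice
hp2$Sqfeet = hp2$Sqfeet/1000      # Sqfeet now in 1000s of square feet
fit2 = lm(Price ~ Sqfeet + Floors + Bedrooms + Baths, data = hp2)
all.equal(fitted(fit1), fitted(fit2))
```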
As in simple regression, least squares does not give us an estimate of $\sigma^2$. By the assumptions we have $E[y_i] = \pmb{x}_i'\pmb{\beta}$ and so

$$
\sigma^2 = E[(y_i - \pmb{x}_i'\pmb{\beta})^2]
$$

Again we can estimate $\sigma^2$ with the average of these terms, so

$$
s^2 = \frac{1}{n-k-1}\sum (y_i - \pmb{x}_i'\hat{\pmb{\beta}})^2
$$

where we use $n-k-1$ so that $s^2$ is unbiased, see below. Note that by the above corollary $\pmb{x}_i'\hat{\pmb{\beta}}$ is BLUE for $\pmb{x}_i'\pmb{\beta}$.
By the proof of (6.2.2) we can write
$$
s^2 = \frac{1}{n-k-1}\sum(y_i - \pmb{x}_i'\hat{\pmb{\beta}})^2 = \frac{1}{n-k-1}(\pmb{y}-\pmb{X}\hat{\pmb{\beta}})'(\pmb{y}-\pmb{X}\hat{\pmb{\beta}}) = \frac{1}{n-k-1}\left(\pmb{y}'\pmb{y} - \hat{\pmb{\beta}}'\pmb{X}'\pmb{y}\right) = \frac{SSE}{n-k-1}
$$
If $E[\pmb{y}] = \pmb{X}\pmb{\beta}$ and $cov(\pmb{y}) = \sigma^2\pmb{I}$, then

$$
E[s^2] = \sigma^2
$$
proof
$$
\begin{aligned}
SSE &= \pmb{y}'\pmb{y} - \hat{\pmb{\beta}}'\pmb{X}'\pmb{y} = \pmb{y}'\pmb{y} - \left[(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y}\right]'\pmb{X}'\pmb{y} \\
&= \pmb{y}'\pmb{y} - \pmb{y}'\pmb{X}(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y} = \pmb{y}'\left[\pmb{I} - \pmb{X}(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right]\pmb{y}
\end{aligned}
$$
Using (5.3.3) we find
$$
\begin{aligned}
E[SSE] &= tr\left\{\left[\pmb{I} - \pmb{X}(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right]\sigma^2\pmb{I}\right\} + E[\pmb{y}]'\left[\pmb{I} - \pmb{X}(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right]E[\pmb{y}] \\
&= \sigma^2 tr\left\{\pmb{I} - \pmb{X}(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right\} + \pmb{\beta}'\pmb{X}'\left[\pmb{I} - \pmb{X}(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right]\pmb{X}\pmb{\beta} \\
&= \sigma^2\left\{n - tr\left(\pmb{X}(\pmb{X}'\pmb{X})^{-1}\pmb{X}'\right)\right\} + \pmb{\beta}'\pmb{X}'\pmb{X}\pmb{\beta} - \pmb{\beta}'\pmb{X}'\pmb{X}\pmb{\beta} \\
&= \sigma^2\left\{n - tr\left((\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{X}\right)\right\} = \sigma^2\left\{n - tr(\pmb{I}_{k+1})\right\} = (n-k-1)\sigma^2
\end{aligned}
$$
because $\pmb{X}'\pmb{X}$ is a $(k+1)\times(k+1)$ matrix.
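A Monte Carlo sketch checking $E[s^2]=\sigma^2$ (a fixed simulated design with $k=2$ predictors and known $\sigma=2$ is assumed):

```r
# Monte Carlo check that E[s^2] = sigma^2
set.seed(444)
n = 25; k = 2; sigma = 2
X = cbind(1, matrix(runif(n*k), n, k))    # fixed design with k predictors
beta = c(1, 2, -1)
s2 = replicate(10000, {
  y = X %*% beta + rnorm(n, 0, sigma)
  betahat = solve(t(X) %*% X) %*% t(X) %*% y
  sum((y - X %*% betahat)^2)/(n - k - 1)
})
c(mean.of.s2 = mean(s2), sigma.squared = sigma^2)
```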
If $E[\pmb{\epsilon}] = \pmb{0}$, $cov(\pmb{\epsilon}) = \sigma^2\pmb{I}$ and $E[\epsilon_i^4] = 3\sigma^4$, then $s^2$ is the best (minimum variance) quadratic unbiased estimator of $\sigma^2$.
proof omitted
Say $Z\sim N(0,1)$, then
$$
\begin{aligned}
\int_{-\infty}^{\infty} x^4\exp\{-x^2/2\}\,dx &= \int_{-\infty}^{\infty} x^3\left[x\exp\{-x^2/2\}\right]dx \\
&= x^3\left(-\exp\{-x^2/2\}\right)\Big|_{-\infty}^{\infty} - \int_{-\infty}^{\infty} 3x^2\left(-\exp\{-x^2/2\}\right)dx \\
&= 3\int_{-\infty}^{\infty} x^2\exp\{-x^2/2\}\,dx \\
E[Z^4] &= \int_{-\infty}^{\infty} x^4\frac{1}{\sqrt{2\pi}}\exp\{-x^2/2\}\,dx = 3\int_{-\infty}^{\infty} x^2\frac{1}{\sqrt{2\pi}}\exp\{-x^2/2\}\,dx = 3\,E[Z^2] = 3
\end{aligned}
$$
Let $X\sim N(0,\sigma^2)$, then

$$
E[X^4] = E[(\sigma Z)^4] = \sigma^4 E[Z^4] = 3\sigma^4
$$
so the condition of theorem 6.2.15 is fulfilled if the errors have a normal distribution.
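The moment $E[Z^4]=3$ can also be checked numerically in R:

```r
# numerical check that E[Z^4] = 3 for Z ~ N(0, 1)
integrate(function(x) x^4*dnorm(x), -Inf, Inf)
```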
For the houseprice data we find
```r
# SSE = y'y - betahat'X'y, and s^2 = SSE/(n-k-1) with k=4 predictors
A=as.matrix(houseprice)
n=nrow(A)
y=A[, 1, drop=FALSE]
X=cbind(1, A[, -1])
betahat=solve(t(X)%*%X)%*%t(X)%*%y
sse=c(t(y)%*%y-t(betahat)%*%t(X)%*%y)
sse
## [1] 4321.864
sse/(n-4-1)
## [1] 187.9071
```
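This agrees with the built-in estimate (a sketch assuming the `houseprice` data frame; `summary(fit)$sigma` is the residual standard error $s$, so we square it):

```r
# the built-in fit gives the same estimate: summary(fit)$sigma is s
fit = lm(Price ~ Sqfeet + Floors + Bedrooms + Baths, data = houseprice)
summary(fit)$sigma^2
```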