We now add the assumption
\[\pmb{y}\sim N_n(\pmb{X\beta}, \sigma^2\pmb{I})\]
Under this assumption it is now possible to use maximum likelihood for estimation. We denote the likelihood function by \(L(\pmb{\beta}, \sigma^2)\) and the log likelihood function by \(l(\pmb{\beta}, \sigma^2)= \log L(\pmb{\beta}, \sigma^2)\).
If \(\pmb{y}\sim N_n(\pmb{X\beta}, \sigma^2\pmb{I})\), where \(\pmb{X}\) is \(n\times (k+1)\) of full rank \(k+1<n\), the maximum likelihood estimators are
\[ \begin{aligned} \pmb{\hat{\beta}} &= (\pmb{X}'\pmb{X})^{-1}\pmb{X}'\pmb{y} \\ \hat{\sigma}^2 &=\frac1n (\pmb{y}-\pmb{X\hat{\beta}})'(\pmb{y}-\pmb{X\hat{\beta}}) \end{aligned} \]
so the MLE of \(\pmb{\beta}\) is the least squares estimator. Note that the MLE of \(\sigma^2\) uses the divisor \(n\) rather than \(n-k-1\), so it is not the unbiased estimator \(s^2\).
proof (sketch)
\[l(\pmb{\beta}, \sigma^2) = -\frac{n}2\log(2\pi)-\frac{n}2\log\sigma^2-\frac1{2\sigma^2}(\pmb{y}-\pmb{X\beta})'(\pmb{y}-\pmb{X\beta})\]
Taking partial derivatives with respect to \(\pmb{\beta}\) and \(\sigma^2\) and setting them equal to 0 yields the desired result.
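Explicitly, the score equations are
\[ \begin{aligned} \frac{\partial l}{\partial \pmb{\beta}} &= \frac1{\sigma^2}\pmb{X}'(\pmb{y}-\pmb{X\beta}) = \pmb{0}\\ \frac{\partial l}{\partial \sigma^2} &= -\frac{n}{2\sigma^2}+\frac1{2\sigma^4}(\pmb{y}-\pmb{X\beta})'(\pmb{y}-\pmb{X\beta}) = 0 \end{aligned} \]
The first is just the system of normal equations \(\pmb{X}'\pmb{X}\pmb{\beta}=\pmb{X}'\pmb{y}\), whose solution is \(\pmb{\hat{\beta}}\), and substituting \(\pmb{\hat{\beta}}\) into the second equation and solving for \(\sigma^2\) gives \(\hat{\sigma}^2\).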
Under the assumptions of theorem (6.4.1) we find
i. \(\pmb{\hat{\beta}}\sim N_{k+1}(\pmb{\beta}, \sigma^2(\pmb{X}'\pmb{X})^{-1})\)
ii. \(n\hat{\sigma}^2 /\sigma^2\sim \chi^2(n-k-1)\)
iii. \(\pmb{\hat{\beta}}\) and \(\hat{\sigma}^2\) are independent
proof
i. follows from (5.2.8), and ii. and iii. have been shown before for the least squares estimators.
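As a quick illustration of ii., here is a small simulation sketch with a made-up design matrix, \(n=20\), \(k=2\) and arbitrary parameter values: since a \(\chi^2(n-k-1)\) random variable has mean \(n-k-1\), the average of the simulated values of \(n\hat{\sigma}^2/\sigma^2\) should be close to \(20-2-1=17\).
n=20
X=cbind(1, runif(n), runif(n))
beta=c(1, 2, -1)
sigma=0.5
B=10000
z=rep(0, B)
for(i in 1:B) {
y=X%*%beta+rnorm(n, 0, sigma)
betahat=solve(t(X)%*%X)%*%t(X)%*%y
sigma2hat=sum((y-X%*%betahat)^2)/n   # MLE of sigma^2
z[i]=n*sigma2hat/sigma^2
}
round(c(mean(z), n-2-1), 2)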
Under the assumptions of theorem (6.4.1), \(\pmb{\hat{\beta}}\) and \(\hat{\sigma}^2\) are jointly sufficient statistics for \((\pmb{\beta}, \sigma^2)\).
proof omitted
Under the assumptions of theorem (6.4.1), \(\pmb{\hat{\beta}}\) and \(s^2=\frac{n}{n-k-1}\hat{\sigma}^2\) have minimum variance among all unbiased estimators.
proof omitted
Note that this is a much stronger result than the Gauss-Markov theorem, which was about linear unbiased estimators.
We can partition the total sum of squares \(\text{SST}=\sum_i (y_i-\bar{y})^2\) as \(\text{SST}=\text{SSR}+\text{SSE}\), where
\[\text{SSR} = \pmb{\hat{\beta}}_1'\pmb{X}_c'\pmb{y}\]
is called the regression sum of squares.
\[R^2 = \frac{\text{SSR}}{\text{SST}} = \frac{\pmb{\hat{\beta}}_1'\pmb{X}_c'\pmb{X}_c\pmb{\hat{\beta}}_1}{\sum (y_i-\bar{y})^2} =\frac{\pmb{\hat{\beta}}'\pmb{X}'\pmb{y}-n\bar{y}^2}{\pmb{y'y}-n\bar{y}^2}=\frac{\sum_i(\hat{y}_i-\bar{y})^2}{\sum_i(y_i-\bar{y})^2} \]
is called the coefficient of determination. It provides a measure of how well the model fits the data.
For the houseprice data we find
A=as.matrix(houseprice)
n=nrow(A)
y=A[, 1, drop=FALSE]  # response: Price
X=cbind(1, A[, -1])   # design matrix with a column of 1's for the intercept
betahat= (solve(t(X)%*%X)%*%t(X))%*%y
(t(betahat)%*%t(X)%*%y-n*mean(y)^2)/(t(y)%*%y-n*mean(y)^2)  # R^2
## Price
## Price 0.8862443
so this model has an \(R^2\) of 88.6%.
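As a check, we can also verify the partition SST = SSR + SSE numerically, continuing with the objects computed above; the two numbers printed should agree:
yhat=X%*%betahat
SST=sum((y-mean(y))^2)
SSR=sum((yhat-mean(y))^2)
SSE=sum((y-yhat)^2)
round(c(SST, SSR+SSE), 2)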
Of course we can also use R:
summary(lm(Price~., data=houseprice))
##
## Call:
## lm(formula = Price ~ ., data = houseprice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.018 -5.943 1.860 5.947 30.955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67.61984 17.70818 -3.819 0.000882
## Sqfeet 0.08571 0.01076 7.966 4.62e-08
## Floors -26.49306 9.48952 -2.792 0.010363
## Bedrooms -9.28622 6.82985 -1.360 0.187121
## Baths 37.38067 12.26436 3.048 0.005709
##
## Residual standard error: 13.71 on 23 degrees of freedom
## Multiple R-squared: 0.8862, Adjusted R-squared: 0.8665
## F-statistic: 44.8 on 4 and 23 DF, p-value: 1.558e-10
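Note that the residual standard error of 13.71 reported by R is \(s=\sqrt{\text{SSE}/(n-k-1)}\), based on the unbiased estimator, not on the MLE \(\hat{\sigma}^2=\text{SSE}/n\) from theorem (6.4.1). A quick check, reusing the SSE computed above (here \(k+1=5\)):
c(sqrt(SSE/(n-5)), sqrt(SSE/n))  # s vs the MLE sigma-hat
The first number should reproduce the 13.71 from the R output, and the second is the slightly smaller MLE.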
Here are some properties of \(R^2\):
\(0\le R^2\le 1\)
if \(\pmb{\hat{\beta}}_1=\pmb{0}\), then \(R^2=0\)
\(R^2 = \text{cor}(\pmb{y},\pmb{\hat{y}})^2\), that is, \(R^2\) is the square of the correlation between the observed and the predicted y's. For the houseprice data:
fit=lm(Price~., data=houseprice)
summary(fit)
##
## Call:
## lm(formula = Price ~ ., data = houseprice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.018 -5.943 1.860 5.947 30.955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67.61984 17.70818 -3.819 0.000882
## Sqfeet 0.08571 0.01076 7.966 4.62e-08
## Floors -26.49306 9.48952 -2.792 0.010363
## Bedrooms -9.28622 6.82985 -1.360 0.187121
## Baths 37.38067 12.26436 3.048 0.005709
##
## Residual standard error: 13.71 on 23 degrees of freedom
## Multiple R-squared: 0.8862, Adjusted R-squared: 0.8665
## F-statistic: 44.8 on 4 and 23 DF, p-value: 1.558e-10
yhat=predict(fit)
cor(houseprice$Price, yhat)^2
## [1] 0.8862443
Some more properties of \(R^2\), each illustrated below:
Adding a variable to the predictors never decreases \(R^2\).
if \(\pmb{\beta}_1=\pmb{0}\), then \(E[R^2]=k/(n-1)\)
\(R^2\) is invariant under a full-rank linear transformation of the x's and under a scale transformation of the y's, but not under a linear transformation of the x's and the y's simultaneously.
\(R^2 = \pmb{s}_{yx}'\pmb{S}_{xx}^{-1}\pmb{s}_{yx}/s_y^2\), where \(\pmb{S}_{xx}\) is the sample covariance matrix of the x's, \(\pmb{s}_{yx}\) the vector of sample covariances between y and the x's, and \(s_y^2\) the sample variance of y (see the check at the end of this section).
x1=sort(runif(100))
x2=sort(runif(100))
y= cbind(1+2*x1+rnorm(100))  # y depends on x1 only
X=cbind(1, x1)
betahat= (solve(t(X)%*%X)%*%t(X))%*%y
yhat=X%*%betahat
cor(y,yhat)^2
## [,1]
## [1,] 0.2402542
X=cbind(1, x1, x2)
betahat= (solve(t(X)%*%X)%*%t(X))%*%y
yhat=X%*%betahat
cor(y,yhat)^2
## [,1]
## [1,] 0.2516723
and so we see that although y was generated without any use of x2, including x2 in the model still increases \(R^2\), at least by a little bit.
Next we check the claim about \(E[R^2]\): with \(n=10\), \(k=2\) and all slopes equal to 0, the average \(R^2\) over many simulation runs should be close to \(k/(n-1)=2/9\).
R2=rep(0, 1000)
for(i in 1:1000) {
x1=sort(runif(10))
x2=sort(runif(10))
y= cbind(1+rnorm(10))  # y does not depend on x1 or x2
X=cbind(1, x1, x2)
betahat= (solve(t(X)%*%X)%*%t(X))%*%y
yhat=X%*%betahat
R2[i]=cor(y,yhat)^2
}
round(c(mean(R2), 2/9), 3)
## [1] 0.218 0.222
Finally, to illustrate the invariance property we generate data that depends on both x1 and x2:
x1=sort(runif(100))
x2=sort(runif(100))
y= cbind(1+2*x1+3*x2+rnorm(100))
X=cbind(1, x1, x2)
betahat= (solve(t(X)%*%X)%*%t(X))%*%y
yhat=X%*%betahat
cor(y,yhat)^2
## [,1]
## [1,] 0.6903922
#scale transformation of y
y1=rnorm(1)*y
betahat= (solve(t(X)%*%X)%*%t(X))%*%y1
yhat=X%*%betahat
cor(y1,yhat)^2
## [,1]
## [1,] 0.6903922
# full-rank (with probability 1) linear transformation of the x's
z1=rnorm(1)*x1+rnorm(1)*x2
z2=rnorm(1)*x1+rnorm(1)*x2
Z=cbind(1, z1, z2)
betahat= (solve(t(Z)%*%Z)%*%t(Z))%*%y
yhat=Z%*%betahat
cor(y,yhat)^2
## [,1]
## [1,] 0.6903922
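Lastly, the covariance form of \(R^2\) from the list above can be checked directly for the houseprice data (a sketch, assuming as before that Price is the first column of houseprice); this should reproduce the value 0.8862443 found earlier:
S=cov(houseprice)
s_yx=S[-1, 1]    # sample covariances between Price and the x's
S_xx=S[-1, -1]   # sample covariance matrix of the x's
s_y2=S[1, 1]     # sample variance of Price
t(s_yx)%*%solve(S_xx)%*%s_yx/s_y2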