Recall that all our calculations have been based on a number of assumptions, namely that the form of the model (linear in the predictors) is correct, that the errors have equal variance and that they follow a normal distribution. Some or all of these assumptions have to hold for our analysis to be valid, so the question arises: how does one check them? This is known as regression diagnostics.
The residuals are defined by
\[\hat{\epsilon}_i = y_i-\hat{y}_i\]
Now
\[\pmb{\hat{y}} = \pmb{X\hat{\beta}}=\pmb{X(X'X)}^{-1}\pmb{X'y}=\pmb{Hy}\]
where
\[\pmb{H}=\pmb{X(X'X)}^{-1}\pmb{X'}\]
is called the hat matrix.
Note
\[\pmb{HX}=\pmb{X(X'X)}^{-1}\pmb{X'X}=\pmb{X}\]
and
\[\pmb{\hat{\epsilon}}= \pmb{y}-\pmb{\hat{y}}=\pmb{(I-H)y}= \pmb{(I-H)(X\beta+\epsilon)}=\pmb{(I-H)\epsilon}\]
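A quick numerical check of these identities, using a small simulated data set (the objects x, X, y and H here are just for illustration):

x <- 1:10
X <- cbind(1, x)                     # design matrix with an intercept column
y <- 5 + 2*x + rnorm(10)
H <- X %*% solve(t(X)%*%X) %*% t(X)  # hat matrix
fit0 <- lm(y~x)
round(max(abs(H%*%X - X)), 10)                        # HX = X
round(max(abs(H%*%y - fitted(fit0))), 10)             # Hy are the fitted values
round(max(abs((diag(10)-H)%*%y - resid(fit0))), 10)   # (I-H)y are the residuals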
If \(E[\pmb{y}]=\pmb{X\beta}\) and \(cov(\pmb{y})=\sigma^2\pmb{I}\), then

i. \(E[\pmb{\hat{\epsilon}}]=\pmb{0}\)
ii. \(cov(\pmb{\hat{\epsilon}})=\sigma^2(\pmb{I}-\pmb{H})\)
iii. \(cov(\pmb{\hat{\epsilon}}, \pmb{y})=\sigma^2(\pmb{I}-\pmb{H})\)
iv. \(cov(\pmb{\hat{\epsilon}}, \pmb{\hat{y}})=\pmb{O}\)

proof: straightforward calculations
From iv we see that the residuals and the fitted values should be uncorrelated. This allows a check of the model with the residual vs fits plot:
Linear model is good:
x <- 1:50
y <- 5 + 2*x + rnorm(50, 0, 5)
fit <- lm(y~x)
df <- data.frame(Fits=fitted(fit),
                 Residuals=residuals(fit))
ggplot(data=df, aes(Fits, Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0)
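Indeed, for the fit above the correlation between the fitted values and the residuals is (numerically) zero:

round(cor(fitted(fit), residuals(fit)), 10)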
Linear model is bad:
x <- 1:50
y <- 0.1*x^2+rnorm(50, 0, 10)
fit <- lm(y~x)
df <- data.frame(Fits=fitted(fit),
                 Residuals=residuals(fit))
ggplot(data=df, aes(Fits, Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0)
The U-shaped pattern in the residuals vs. fits plot is a very common one when the linear model is bad.
The hat matrix \(\pmb{H}\) is symmetric and idempotent
proof
\[ \begin{aligned} \pmb{H}' &= \left[\pmb{X(X'X)}^{-1}\pmb{X'}\right]' \\ &= \pmb{X}\left[\pmb{X(X'X)}^{-1}\right]' \\ &= \pmb{X}\left[\pmb{(X'X)}^{-1}\right]'\pmb{X}' \\ &= \pmb{X}\pmb{(X'X)}^{-1}\pmb{X}' =\pmb{H} \end{aligned} \]
\[ \begin{aligned} \pmb{HH} &= \left[\pmb{X(X'X)}^{-1}\pmb{X'}\right]\left[\pmb{X(X'X)}^{-1}\pmb{X'}\right] \\ &= \pmb{X(X'X)}^{-1}(\pmb{X'X})\pmb{(X'X)}^{-1}\pmb{X'} \\ &= \pmb{X(X'X)}^{-1}\pmb{X'}=\pmb{H} \end{aligned} \]
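This is also easy to verify numerically, for example for the design matrix of the small simulated example above:

X <- cbind(1, 1:10)
H <- X %*% solve(t(X)%*%X) %*% t(X)
round(max(abs(H - t(H))), 10)  # symmetric: H = H'
round(max(abs(H%*%H - H)), 10) # idempotent: HH = H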
For the centered model we have
\[\pmb{H} = \frac1n \pmb{J}+\pmb{H}_c=\frac 1n \pmb{J}+\pmb{X_c(X_c'X_c)}^{-1}\pmb{X_c'}\]
proof straightforward
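A quick numerical check of this decomposition for a single predictor (again using the small illustrative example):

x <- 1:10
X <- cbind(1, x)
H <- X %*% solve(t(X)%*%X) %*% t(X)
Xc <- scale(x, scale=FALSE)              # centered predictor
Hc <- Xc %*% solve(t(Xc)%*%Xc) %*% t(Xc) # hat matrix of the centered model
J <- matrix(1, 10, 10)
round(max(abs(H - (J/10 + Hc))), 10)     # should be (essentially) 0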
For the elements of the hat matrix we have (i) \(\frac1n\le h_{ii}\le 1\) and (ii) \(h_{ik}^2\le \frac14\) for \(i\ne k\).

proof
\[h_{ii} = \left[\frac 1n \pmb{J}+\pmb{X_c(X_c'X_c)}^{-1}\pmb{X_c'}\right]_{ii}\ge\frac 1n \pmb{J}_{ii}=\frac1n\]
because the diagonal elements of \(\pmb{H}_c=\pmb{X_c(X_c'X_c)}^{-1}\pmb{X_c'}\) are nonnegative.
\(\pmb{H}\) is symmetric and idempotent, so with \(\pmb{h}_i\) the ith row of \(\pmb{H}\)
\[ \begin{aligned} &h_{ii} = [\pmb{HH}]_{ii}=\pmb{h}_i'\pmb{h}_i = \sum_j h_{ij}^2 = h_{ii}^2 + \sum_{j\ne i} h_{ij}^2 \\ &1 = h_{ii} +\sum_{j\ne i} \frac{h_{ij}^2}{h_{ii}} \end{aligned} \]
(dividing by \(h_{ii}>0\)), and since the sum is nonnegative, \(h_{ii}\le 1\).
For \(k\ne i\)
\[ \begin{aligned} &h_{ii} = h_{ii}^2+h_{ik}^2+ \sum_{j\ne i,k} h_{ij}^2 \\ &h_{ii}-h_{ii}^2 = h_{ik}^2+ \sum_{j\ne i,k} h_{ij}^2 \ge h_{ik}^2 \\ &h_{ik}^2\le h_{ii}-h_{ii}^2\le \frac14 \end{aligned} \]
where the last inequality holds because \(t-t^2\le\frac14\) for all \(t\).
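For the hat matrix of the small example these bounds are easy to check:

X <- cbind(1, 1:10)
H <- X %*% solve(t(X)%*%X) %*% t(X)
range(diag(H))             # all h_ii between 1/10 and 1
range(H[row(H) != col(H)]) # all off-diagonal h_ik between -1/2 and 1/2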
An outlier is any observation that is unusual given the model.
For example, say \(X\sim N(0,1)\); then any observation outside \([-3, 3]\) is an outlier.
A graphical check for outliers is again the residual vs fits plot:
x <- 1:50
y <- 5 + 2*x + rnorm(50, 0, 5)
y[1] = 5 + 2*x[1] + 25
fit <- lm(y~x)
df <- data.frame(Fits=fitted(fit),
                 Residuals=residuals(fit))
ggplot(data=df, aes(Fits, Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0)
Note that by (6.9.2ii) \(var(\hat{\epsilon}_i)=\sigma^2(1-h_{ii})\) and by (6.9.6i) \(h_{ii}\le 1\), therefore \(var(\hat{\epsilon}_i)\) will be small if \(h_{ii}\) is near 1. Moreover it can be shown that \(h_{ii}\) will be large for observations \(\pmb{x}_i\) far from \(\bar{\pmb{x}}\). These issues lead to the following definition.
The studentized residuals are defined by
\[r_i = \frac{\hat{\epsilon}_i}{s\sqrt{1-h_{ii}}}\]
The idea is to correct for the fact that the \(\hat{\epsilon}_i\) have unequal variances; the studentized residuals \(r_i\) all have roughly the same variance.
Let’s consider the houseprice data:
A=as.matrix(houseprice)
n=nrow(A)
k=ncol(A)-1
X=cbind(1, A[, -1])
y=A[, 1, drop=FALSE]
G=solve(t(X)%*%X)
betahat= G%*%t(X)%*%y
yhat=X%*%betahat
sse = t(y)%*%(diag(n)-X%*%G%*%t(X))%*%y/(n-k-1) # note: this is s^2 = SSE/(n-k-1)
epsilonhat=y-yhat
H=X%*%G%*%t(X) # hat matrix
df=data.frame(yhat=as.numeric(yhat), epsilonhat=as.numeric(epsilonhat))
ggplot(data=df, aes(yhat, epsilonhat)) +
  geom_point() +
  geom_abline(slope=0, intercept=0)
There are two potential outliers, one on the left and one in the upper right corner.
df$studres = as.numeric(epsilonhat)/sqrt(as.numeric(sse)*(1-diag(H))) # studentized residuals
ggplot(data=df, aes(yhat, studres)) +
  geom_point() +
  geom_abline(slope=0, intercept=0)
which looks the same here.
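R has these built in: for a model fit with lm, hatvalues gives the \(h_{ii}\) and rstandard gives the studentized residuals (R calls them standardized residuals). They should agree with the values computed by hand above:

fit <- lm(A[, 1] ~ A[, -1]) # the same model, now fit with lm
round(max(abs(hatvalues(fit) - diag(H))), 10)
round(max(abs(rstandard(fit) - df$studres)), 10)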
The Cook’s distance is defined by
\[D_i = \frac{(\pmb{\hat{\beta}}_{(i)}-\pmb{\hat{\beta}})'\pmb{X'X}(\pmb{\hat{\beta}}_{(i)}-\pmb{\hat{\beta}})}{(k+1)s^2}\]
where \(\pmb{\hat{\beta}}_{(i)}\) is the least squares estimator computed with the ith observation removed.
The distribution of \(D_i\) is known, but as a simple rule-of-thumb any observation with \(D_i>1\) should be considered influential.
Let’s find and plot the Cook distances for the houseprice data:
A=as.matrix(houseprice)
n=nrow(A)
k=ncol(A)-1
X=cbind(1, A[, -1])
y=A[, 1, drop=FALSE]
G=solve(t(X)%*%X)
betahat= G%*%t(X)%*%y
sse = t(y)%*%(diag(n)-X%*%G%*%t(X))%*%y/(n-k-1)
D=rep(0, n)
for(i in 1:n) {
  Xtmp=X[-i, ] # leave out the ith observation
  ytmp=y[-i, ]
  betaihat= solve(t(Xtmp)%*%Xtmp)%*%t(Xtmp)%*%ytmp
  D[i]=t(betaihat-betahat)%*%t(X)%*%X%*%(betaihat-betahat)/(k+1)/sse
}
df=data.frame(index=1:n, D=D)
ggplot(data=df, aes(index, D)) +
  geom_point() +
  geom_abline(slope=0, intercept = 1)
So the two observations noted earlier as potential outliers do not have a Cook's distance over 1, and are therefore OK.
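There is also a built-in routine for this: cooks.distance, applied to an lm fit of the same model, should reproduce the distances from the explicit loop above:

fit <- lm(A[, 1] ~ A[, -1])
round(max(abs(cooks.distance(fit) - D)), 5) # should be (essentially) 0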
We will again draw the residuals vs. fits plot and check whether the variance (or spread) of the dots changes as we move along the x axis.
Equal Variance ok:
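For example, simulated data where the errors have a constant standard deviation:

x <- runif(50, 0, 10)
y <- 5 + 2*x + rnorm(50, 0, 3)
fit <- lm(y~x)
df <- data.frame(Fits=fitted(fit),
                 Residuals=residuals(fit))
ggplot(data=df, aes(Fits, Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0)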
Equal Variance not ok:
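and simulated data where the error standard deviation grows with x, so the spread of the residuals increases from left to right:

x <- runif(50, 0, 10)
y <- 5 + 2*x + rnorm(50, 0, x)
fit <- lm(y~x)
df <- data.frame(Fits=fitted(fit),
                 Residuals=residuals(fit))
ggplot(data=df, aes(Fits, Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0)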
This can be a tricky one to decide, especially if there are few observations.
To check the normal assumption we can draw the normal plot of the residuals. If the assumption is ok the dots will follow along a straight line.
Normal assumption OK:
x <- runif(50, 0, 10)
y1 <- 1+2*x+rnorm(50, 0, 3)
fit1=lm(y1~x)
df <- data.frame(Residuals=resid(fit1),
                 Fits = fitted(fit1))
ggplot(data=df, aes(sample=Residuals)) +
  geom_qq() + geom_qq_line()
Normal assumption not OK:
y2 <- 1+2*x+rt(50, 1)
fit2=lm(y2~x)
df <- data.frame(Residuals=resid(fit2),
                 Fits = fitted(fit2))
ggplot(data=df, aes(sample=Residuals)) +
  geom_qq() + geom_qq_line()
In addition one can do formal hypothesis tests, so-called goodness-of-fit tests. A good test for normality is the Shapiro-Wilk test:
shapiro.test(resid(fit1))
##
## Shapiro-Wilk normality test
##
## data: resid(fit1)
## W = 0.97152, p-value = 0.2666
shapiro.test(resid(fit2))
##
## Shapiro-Wilk normality test
##
## data: resid(fit2)
## W = 0.40013, p-value = 5.303e-13