Regression Diagnostics

Recall that all our calculations have been based on a number of assumptions, namely

  1. \(E[\pmb{y}]=\pmb{X\beta}\) (model is correct)
  2. \(var(\epsilon_i)=\sigma^2\) (equal variance, homoscedasticity)
  3. \(cov(\epsilon_i, \epsilon_j)=0\) (independence)
  4. \(\pmb{y}\sim N_n(\pmb{X\beta}, \sigma^2\pmb{I})\) (normal residuals)

Some or all of these assumptions have to hold for our analysis to be valid, so the question arises: how does one check them? This is known as regression diagnostics.
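Most of the diagnostics below will be built by hand, but note that R's plot method for lm objects already produces a standard set of them (residuals vs. fitted values, a normal quantile plot of the residuals, a scale-location plot and a leverage plot). A minimal sketch, using simulated data:

# built-in diagnostic plots for a fitted linear model
x <- 1:50
y <- 5 + 2*x + rnorm(50, 0, 5)
fit <- lm(y~x)
par(mfrow=c(2, 2))  # arrange the four default plots in a 2x2 grid
plot(fit)           # residuals vs fitted, normal QQ, scale-location, residuals vs leverage
par(mfrow=c(1, 1))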

Residuals

Definition (6.9.1)

  • The residuals are defined as

\[\hat{\epsilon}_i = y_i-\hat{y}_i\]

  • the hat matrix is defined by

\[\pmb{\hat{y}} = \pmb{X\hat{\beta}}=\pmb{X(X'X)}^{-1}\pmb{X'y}=\pmb{Hy}\]

so

\[\pmb{H}=\pmb{X(X'X)}^{-1}\pmb{X'}\]


Note

\[\pmb{HX}=\pmb{X(X'X)}^{-1}\pmb{X'X}=\pmb{X}\]

and

\[\pmb{\hat{\epsilon}}= \pmb{y}-\pmb{\hat{y}}=\pmb{(I-H)y}= \pmb{(I-H)}(\pmb{X\beta}+\pmb{\epsilon})= \pmb{(I-H)\epsilon}\]

since \(\pmb{HX}=\pmb{X}\) implies \(\pmb{(I-H)X\beta}=\pmb{0}\).
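A quick numerical check of these identities; a minimal sketch with simulated data:

# verify HX = X, Hy = yhat and (I-H)y = residuals numerically
set.seed(111)
x <- 1:20
y <- 5 + 2*x + rnorm(20, 0, 5)
X <- cbind(1, x)                  # design matrix with intercept
H <- X%*%solve(t(X)%*%X)%*%t(X)   # hat matrix
fit <- lm(y~x)
all.equal(H%*%X, X, check.attributes=FALSE)
all.equal(as.vector(H%*%y), as.vector(fitted(fit)))
all.equal(as.vector((diag(20)-H)%*%y), as.vector(residuals(fit)))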

Theorem (6.9.2)

If \(E[\pmb{y}]=\pmb{X\beta}\) and \(cov(\pmb{y})=\sigma^2\pmb{I}\), then

  1. \(E[\pmb{\hat{\epsilon}}]= \pmb{0}\)
  2. \(cov(\pmb{\hat{\epsilon}})= \sigma^2(\pmb{I-H})\)
  3. \(cov(\pmb{\hat{\epsilon},y})= \sigma^2(\pmb{I-H})\)
  4. \(cov(\pmb{\hat{\epsilon},\hat{y}})= \pmb{O}\)
  5. \(\bar{\hat{\epsilon}}=0\)
  6. \(\pmb{\hat{\epsilon}'y}=\text{SSE} = \pmb{y}'(\pmb{I-H})\pmb{y}\)
  7. \(\pmb{\hat{\epsilon}'\hat{y}}= \pmb{0}\)
  8. \(\pmb{\hat{\epsilon}'X}= \pmb{0}'\)

proof straightforward calculations
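Parts v, vii and viii are also easy to verify numerically; a minimal sketch with simulated data:

# numerical check of parts v, vii and viii of theorem (6.9.2)
set.seed(222)
x <- 1:30
y <- 5 + 2*x + rnorm(30, 0, 5)
fit <- lm(y~x)
X <- cbind(1, x)
ehat <- residuals(fit)
yhat <- fitted(fit)
mean(ehat)      # part v: zero up to rounding error
sum(ehat*yhat)  # part vii: zero up to rounding error
t(ehat)%*%X     # part viii: a row vector of zeros (up to rounding error)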

From iv we see that the residuals and the fitted values should be uncorrelated. This allows a check of the model with the residual vs fits plot:

Example (6.9.3)

Linear model is good:

library(ggplot2)
x <- 1:50
y <- 5 + 2*x + rnorm(50, 0, 5) 
fit <- lm(y~x)
df <- data.frame(Fits=fitted(fit),
                 Residuals=residuals(fit))
ggplot(data=df, aes(Fits, Residuals)) +
            geom_point() +
            geom_hline(yintercept = 0)        

Linear model is bad:

x <- 1:50
y <- 0.1*x^2+rnorm(50, 0, 10) 
fit <- lm(y~x)
df <- data.frame(Fits=fitted(fit),
                 Residuals=residuals(fit))
ggplot(data=df, aes(Fits, Residuals)) +
            geom_point() +
            geom_hline(yintercept = 0)  

A U-shaped pattern like this in the residuals vs. fits plot is a very common one when the linear model is wrong.

The Hat Matrix

Theorem (6.9.4)

The hat matrix \(\pmb{H}\) is symmetric and idempotent

proof

\[ \begin{aligned} &\pmb{H}' = \left[\pmb{X(X'X)}^{-1}\pmb{X'}\right]' = \\ &\pmb{X}\left[\pmb{X(X'X)}^{-1}\right]' = \\ &\pmb{X}\left[\pmb{(X'X)}^{-1}\right]'\pmb{X}' = \\ &\pmb{X}\pmb{(X'X)}^{-1}\pmb{X}' =\pmb{H} \end{aligned} \]

\[ \begin{aligned} &\pmb{HH} =\\ &\left[\pmb{X(X'X)}^{-1}\pmb{X'}\right]\left[\pmb{X(X'X)}^{-1}\pmb{X'}\right] = \\ &\pmb{X(X'X)}^{-1}(\pmb{X'X})\pmb{(X'X)}^{-1}\pmb{X'} = \\ &\pmb{X(X'X)}^{-1}\pmb{X'}=\pmb{H} \end{aligned} \]

Theorem (6.9.5)

for the centered model we have

\[\pmb{H} = \frac1n \pmb{J}+\pmb{H}_c=\frac1n \pmb{J}+\pmb{X_c(X_c'X_c)}^{-1}\pmb{X_c'}\]

proof straightforward

Theorem (6.9.6)

  1. \(\frac1n \le h_{ii} \le 1\), i=1,2,..,n
  2. \(-\frac12 \le h_{ij} \le \frac12; i\ne j\)
  3. \(tr(\pmb{H})=\sum h_{ii} = k+1\)

proof

  1. \(\pmb{X_c'X_c}\), and therefore also \(\pmb{(X_c'X_c)}^{-1}\), is positive definite, so the diagonal elements of \(\pmb{X_c(X_c'X_c)}^{-1}\pmb{X_c'}\) are nonnegative. Therefore

\[h_{ii} = \left[\frac1n \pmb{J}+\pmb{X_c(X_c'X_c)}^{-1}\pmb{X_c'}\right]_{ii}\ge\frac1n \pmb{J}_{ii}=\frac1n\]

\(\pmb{H}\) is symmetric and idempotent, so

\[ \begin{aligned} &h_{ii} = [\pmb{HH}]_{ii} = \sum_j h_{ij}^2 = h_{ii}^2 + \sum_{j\ne i} h_{ij}^2 \\ &1 = h_{ii} +\sum_{j\ne i} \frac{h_{ij}^2}{h_{ii}} \end{aligned} \]

but \(h_{ii}>0\), so the sum is nonnegative and therefore \(h_{ii}\le 1\)

  2. Using the above we can write

\[ \begin{aligned} &h_{ii} = h_{ii}^2+h_{ik}^2+ \sum_{j\ne i,k} h_{ij}^2 \\ &h_{ii}-h_{ii}^2 = h_{ik}^2+ \sum_{j\ne i,k} h_{ij}^2 \ge h_{ik}^2 \\ &h_{ik}^2\le h_{ii}-h_{ii}^2\le \frac14 \end{aligned} \]

and so \(-\frac12 \le h_{ik} \le \frac12\). (The last inequality uses the fact that \(t-t^2\le\frac14\) for all \(t\).)

  3. omitted
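These bounds are also easy to confirm numerically; a minimal sketch with a simulated design matrix:

# numerical check of theorem (6.9.6): bounds on the h_ij and tr(H) = k+1
set.seed(333)
n <- 25; k <- 2
X <- cbind(1, runif(n), runif(n))   # intercept plus k=2 predictors
H <- X%*%solve(t(X)%*%X)%*%t(X)
h <- diag(H)
c(1/n, min(h), max(h), 1)           # all h_ii between 1/n and 1
range(H[row(H)!=col(H)])            # off-diagonal elements between -1/2 and 1/2
sum(h)                              # trace of H equals k+1 = 3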

Outliers

Definition (6.9.7)

An outlier is any observation that is unusual given the model.

Example (6.9.8)

Say \(X\sim N(0,1)\); then any observation outside [-3, 3] would be considered an outlier.

A graphical check for outliers is again the residual vs fits plot:

x <- 1:50
y <- 5 + 2*x + rnorm(50, 0, 5) 
y[1] = 5 + 2*x[1] + 25  # turn the first observation into an outlier
fit <- lm(y~x)
df <- data.frame(Fits=fitted(fit),
                 Residuals=residuals(fit))
ggplot(data=df, aes(Fits, Residuals)) +
            geom_point() +
            geom_hline(yintercept = 0)        


Note that by (6.9.2ii) \(var(\hat{\epsilon}_i)=\sigma^2(1-h_{ii})\) and by (6.9.6i) \(h_{ii}\le 1\), therefore \(var(\hat{\epsilon}_i)\) will be small if \(h_{ii}\) is near 1. Moreover it can be shown that \(h_{ii}\) will be large for observations \(\pmb{x}_i\) far from \(\pmb{\bar{x}}\). These issues lead to the

Definition (6.9.9)

The studentized residuals are defined by

\[r_i = \frac{\hat{\epsilon}_i}{s\sqrt{1-h_{ii}}}\]

The idea is to standardize each residual by an estimate of its own standard deviation, so that observations with large \(h_{ii}\) (those far from \(\pmb{\bar{x}}\)) no longer automatically have small residuals.
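In R the studentized residuals do not have to be computed by hand: hatvalues extracts the \(h_{ii}\) from a fitted lm object and rstandard returns exactly the \(r_i\) above. A minimal sketch, using simulated data:

# compare the formula for r_i with R's rstandard() and hatvalues()
set.seed(444)
x <- 1:40
y <- 5 + 2*x + rnorm(40, 0, 5)
fit <- lm(y~x)
h <- hatvalues(fit)                # leverages h_ii
s <- summary(fit)$sigma            # s = sqrt(SSE/(n-k-1))
r <- residuals(fit)/(s*sqrt(1-h))  # studentized residuals via the formula
all.equal(as.vector(r), as.vector(rstandard(fit)))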

Example (6.9.10)

Let’s consider the houseprice data:

A=as.matrix(houseprice)
n=nrow(A)
k=ncol(A)-1
X=cbind(1, A[, -1])  # design matrix: intercept plus the predictors
y=A[, 1, drop=FALSE] # response is the first column
G=solve(t(X)%*%X)
betahat= G%*%t(X)%*%y
yhat=X%*%betahat
H=X%*%G%*%t(X)       # hat matrix
s2 = t(y)%*%(diag(n)-H)%*%y/(n-k-1)  # s^2 = SSE/(n-k-1)
epsilonhat=y-yhat    # residuals
df=data.frame(epsilonhat=as.vector(epsilonhat), yhat=as.vector(yhat))
ggplot(data=df, aes(yhat, epsilonhat)) +
  geom_point() +
  geom_abline(slope=0, intercept=0)

There are two potential outliers, one on the left and one in the upper right corner.

df$studres = as.vector(epsilonhat)/sqrt(c(s2)*(1-diag(H)))  # r_i = e_i/(s*sqrt(1-h_ii))
ggplot(data=df, aes(yhat, studres)) +
  geom_point() +
  geom_abline(slope=0, intercept=0)

which looks the same here.

Leverage or Influential Observations

Definition (6.9.11)

The Cook’s distance is defined by

\[D_i = \frac{(\pmb{\hat{\beta}}_{(i)}-\pmb{\hat{\beta}})'\pmb{X'X}(\pmb{\hat{\beta}}_{(i)}-\pmb{\hat{\beta}})}{(k+1)s^2}\]

where \(\pmb{\hat{\beta}}_{(i)}\) is the least squares estimator computed with the ith observation removed.

The distribution of \(D_i\) can be compared to an F distribution, but as a simple rule of thumb any observation with \(D_i>1\) should be considered influential.

Example (6.9.12)

Let’s find and plot the Cook distances for the houseprice data:

A=as.matrix(houseprice)
n=nrow(A)
k=ncol(A)-1
X=cbind(1, A[, -1])
y=A[, 1, drop=FALSE]
G=solve(t(X)%*%X)
betahat= G%*%t(X)%*%y
sse = t(y)%*%(diag(n)-X%*%G%*%t(X))%*%y/(n-k-1)  # this is s^2 = SSE/(n-k-1)
D=rep(0, n)      # vector of Cook's distances
for(i in 1:n) {
  Xtmp=X[-i, ]   # drop the ith observation
  ytmp=y[-i, ]
  betaihat= solve(t(Xtmp)%*%Xtmp)%*%t(Xtmp)%*%ytmp
  D[i]=t(betaihat-betahat)%*%t(X)%*%X%*%(betaihat-betahat)/(k+1)/sse
}
df=data.frame(index=1:n, D=D)
ggplot(data=df, aes(index, D)) +
  geom_point() +
  geom_abline(slope=0, intercept = 1)

So the two observations noted earlier as potential outliers do not have a Cook's distance over 1, and are therefore not a cause for concern.
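Cook's distance can also be computed without refitting the model \(n\) times, using the algebraically equivalent form \(D_i = \frac{r_i^2}{k+1}\cdot\frac{h_{ii}}{1-h_{ii}}\); base R's cooks.distance function does the same for a fitted lm object. A minimal sketch, reusing X, G, betahat, sse and D from the code above:

# shortcut formula D_i = r_i^2 * h_ii / ((k+1)*(1-h_ii)), no leave-one-out refits
h = diag(X%*%G%*%t(X))          # leverages h_ii
e = as.vector(y - X%*%betahat)  # residuals
r = e/sqrt(c(sse)*(1-h))        # studentized residuals (sse above is s^2)
D.short = r^2*h/((k+1)*(1-h))
round(cbind(D, D.short), 3)     # the two computations agree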

Equal Variance

We will again draw the Residuals vs. Fits plot and check whether the variance (or spread) of the dots changes as you go along the x axis.

Example (6.9.13)

Equal Variance ok:
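A minimal sketch, simulating data with constant error variance (in the style of Example (6.9.3)):

# simulated data where the errors have constant variance
x <- 1:50
y <- 5 + 2*x + rnorm(50, 0, 5)
fit <- lm(y~x)
df <- data.frame(Fits=fitted(fit),
                 Residuals=residuals(fit))
ggplot(data=df, aes(Fits, Residuals)) +
            geom_point() +
            geom_hline(yintercept = 0)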

Equal Variance not ok:
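A minimal sketch, simulating data where the error standard deviation grows with x, so the spread of the residuals increases from left to right:

# simulated data where the error standard deviation grows with x
x <- 1:50
y <- 5 + 2*x + rnorm(50, 0, x/2)
fit <- lm(y~x)
df <- data.frame(Fits=fitted(fit),
                 Residuals=residuals(fit))
ggplot(data=df, aes(Fits, Residuals)) +
            geom_point() +
            geom_hline(yintercept = 0)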

This can be a tricky one to decide, especially if there are few observations.

Normal Distribution

To check the normal assumption we can draw the normal plot of the residuals. If the assumption is ok the dots will follow along a straight line.

Normal assumption OK:

x <- runif(50, 0, 10)
y1 <- 1+2*x+rnorm(50, 0, 3) 
fit1=lm(y1~x)
df <- data.frame(Residuals=resid(fit1), 
            Fits = fitted(fit1))
ggplot(data=df, aes(sample=Residuals)) +
           geom_qq() + geom_qq_line()       

Normal assumption not OK:

y2 <- 1+2*x+rt(50, 1) 
fit2=lm(y2~x)
df <- data.frame(Residuals=resid(fit2), 
            Fits = fitted(fit2))
ggplot(data=df, aes(sample=Residuals)) +
           geom_qq() + geom_qq_line()       

In addition one can do formal hypothesis tests, so-called goodness-of-fit tests. A good test for normality is the Shapiro-Wilk test:

shapiro.test(resid(fit1))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit1)
## W = 0.97152, p-value = 0.2666
shapiro.test(resid(fit2))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit2)
## W = 0.40013, p-value = 5.303e-13