This page explains the assumptions behind the method of least squares regression and how to check them.
Recall that we are fitting a model of the form
\[ y=\beta_0+\beta_1x \]
There are three assumptions:
The model is good (that is, the relationship is linear and not, say, quadratic, exponential or something else)
The residuals have a normal distribution
The residuals have equal variance (are homoscedastic)
We are already familiar with the second and third assumptions from ANOVA and correlation.
We can check these assumptions using two graphs:
Residuals vs. Fits plot: this is just what it says, a scatterplot of the residuals (on the y-axis) vs. the fitted values (on the x-axis).
Normal plot of residuals
R draws both of these graphs automatically.
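To see what goes into these two graphs, here is a minimal sketch that draws them by hand in base R. The data is simulated purely for illustration, and the names `x`, `y`, `fit`, `res` and `fits` are assumptions made for this example, not part of the course routines.

```r
# Simulated data for illustration only (not one of the course datasets)
set.seed(1)
x <- runif(50, 0, 10)
y <- 5 + 2 * x + rnorm(50, sd = 1.5)

fit <- lm(y ~ x)        # least squares fit
res <- resid(fit)       # residuals
fits <- fitted(fit)     # fitted values

# Residuals vs. Fits plot: residuals on the y-axis, fitted values on the x-axis
plot(fits, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # dashed reference line at zero

# Normal plot of residuals
qqnorm(res)
qqline(res)
```

Because the simulated relationship really is linear with normal errors of constant spread, both graphs should look unremarkable here.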
1) Good Model
For this assumption, draw the Residuals vs. Fits plot and check for any pattern.
Example:
Linear model is good:
Linear model is bad:
The U-shaped pattern in the Residuals vs. Fits plot is a very common one when the linear model is bad.
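The U shape can be reproduced with a small simulation. This is only a sketch: the data is made up with a quadratic relationship, and a straight line is then fitted to it using base R's `lm`. All variable names are assumptions for this example.

```r
# Simulated data with a quadratic relationship (illustration only)
set.seed(2)
x <- seq(0, 10, length.out = 60)
y <- 1 + 0.5 * x^2 + rnorm(60, sd = 2)

bad_fit <- lm(y ~ x)    # straight line fitted to curved data
res <- resid(bad_fit)
fits <- fitted(bad_fit)

plot(fits, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Average residual at the left end, in the middle, and at the right end:
# positive at both ends and negative in the middle gives the U shape
mean(res[x < 2])
mean(res[x > 2 & x < 8])
mean(res[x > 8])
```

The line runs below the data at both ends and above it in the middle, which is exactly what shows up as a U in the plot.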
2) Residuals have a Normal Distribution
For this assumption, draw the normal probability plot and see whether the dots form a straight line, just as we have done many times by now.
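As a rough sketch, the normal plot of the residuals can be drawn in base R with `qqnorm` and `qqline`. The data is simulated with normal errors, so the dots should follow the line closely. The correlation computed at the end is only an informal numeric companion to the graph and an assumption of this example, not a replacement for looking at the plot.

```r
# Simulated data with normal errors (illustration only)
set.seed(3)
x <- runif(40, 0, 10)
y <- 3 + 1.5 * x + rnorm(40)
res <- resid(lm(y ~ x))

qq <- qqnorm(res)   # draws the plot and returns the plotted coordinates
qqline(res)         # adds the reference line

# Correlation between theoretical and observed quantiles;
# values close to 1 mean the dots lie close to a straight line
cor(qq$x, qq$y)
```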
3) Residuals have Equal Variance
Previously we could check the standard deviations within the groups and see whether they differed by more than a factor of 3. Now, though, we don’t have groups. Instead we will again draw the Residuals vs. Fits plot and check whether the variance (or spread) of the dots changes as you go along the x-axis.
Equal Variance ok:
Equal Variance not ok:
This can be a tricky one to decide, especially if there are few observations.
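To get a feel for the difference, the sketch below simulates one dataset with constant spread and one where the spread grows with x, and draws the two Residuals vs. Fits plots side by side. The helper `spread_ratio` is hypothetical and introduced only for this illustration: it compares the standard deviation of the residuals in the upper and lower half of the fitted values. It is a crude aid for building intuition, not a formal check.

```r
# Simulated data (illustration only)
set.seed(4)
x <- runif(200, 1, 10)
y_equal <- 2 + 3 * x + rnorm(200, sd = 2)            # constant spread
y_unequal <- 2 + 3 * x + rnorm(200, sd = 0.5 * x)    # spread grows with x

fit_equal <- lm(y_equal ~ x)
fit_unequal <- lm(y_unequal ~ x)

par(mfrow = c(1, 2))    # two plots side by side
plot(fitted(fit_equal), resid(fit_equal),
     xlab = "Fitted values", ylab = "Residuals", main = "Equal variance")
abline(h = 0, lty = 2)
plot(fitted(fit_unequal), resid(fit_unequal),
     xlab = "Fitted values", ylab = "Residuals", main = "Unequal variance")
abline(h = 0, lty = 2)
par(mfrow = c(1, 1))    # reset the plot layout

# Hypothetical helper: ratio of residual spread, upper half vs. lower half of the fits
spread_ratio <- function(fit) {
  res <- resid(fit)
  fits <- fitted(fit)
  upper <- fits > median(fits)
  sd(res[upper]) / sd(res[!upper])
}

spread_ratio(fit_equal)     # should be near 1
spread_ratio(fit_unequal)   # should be clearly above 1
```

With only a handful of observations such a ratio becomes very unstable, which is one reason this assumption is hard to judge on small datasets.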
Let’s check the assumptions for the wine consumption data:
attach(wine)
slr(Heart.Disease.Deaths, Wine.Consumption)
## The least squares regression equation is:
## Heart.Disease.Deaths = 260.563 - 22.969 Wine.Consumption
## R^2 = 71.03%
The normal plot is fine, and the Residuals vs. Fits plot is fine as far as the linear model assumption goes. There is, though, an appearance of unequal variance. This judgement is made more difficult here because there is very little data in the left half of the graph, and naturally a few dots won’t have a large spread. It will take time for you to be able to judge these graphs properly. In fact this one is ok. Not great, but ok.
Note that the final decision on whether the assumptions are justified is ALWAYS made based on the Residuals vs. Fits plot and the Normal plot of residuals.