Example: consider the dataset ethanol. We will (for now) only consider the problem of modeling NOx as a function of E.
attach(ethanol)
plot(E, NOx)
shows a fairly strong relationship, but also a very non-linear one. Moreover, the transformations we have considered previously (sqrt, log, Box-Cox) only work for monotone relationships, which is clearly not the case here. We can try to find a reasonable model for this problem by including higher-order terms such as x², x³, etc. So we will fit models of the form:
Quadratic: y = β₀ + β₁x + β₂x²
Cubic: y = β₀ + β₁x + β₂x² + β₃x³
Biquadratic: y = β₀ + β₁x + β₂x² + β₃x³ + β₄x⁴
Notice that these are still linear models, namely linear in the β's.
One problem with including polynomial terms in a regression model is that terms like x² and x³ are (almost) always highly correlated, and including highly correlated predictors can lead to serious problems. In R we can deal with this by using the poly() function, which transforms x, x², x³, ... into a set of orthogonal polynomials (that is, with correlation 0), and the fitting is then done in this basis. We can recover the coefficients of the standard polynomial with the function poly.transform.
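As a quick illustration (a minimal sketch, assuming the ethanol data from the lattice package in R), compare the correlations of the raw powers of E with those of the columns produced by poly():
library(lattice)                 # ethanol dataset
cor(ethanol$E^2, ethanol$E^3)    # raw powers: correlation very close to 1
Z <- poly(ethanol$E, 3)          # orthogonal polynomial basis
round(cor(Z), 10)                # columns are uncorrelated
fit <- lm(NOx ~ poly(E, 3), data = ethanol)   # fit the cubic in this basis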
ethanol.fun(1,1)
draws the scatterplot overlaid with the fitted polynomial whose degree is given by the second argument (here 1), together with the residuals vs. fits plot.
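ethanol.fun is a helper function written for this class; here is a minimal sketch of what a call like this might do (the function show.fit and its details are hypothetical):
# hypothetical stand-in for ethanol.fun(1, deg)
show.fit <- function(deg) {
    fit <- lm(NOx ~ poly(E, deg), data = ethanol)
    par(mfrow = c(1, 2))
    plot(ethanol$E, ethanol$NOx, xlab = "E", ylab = "NOx")
    E.grid <- seq(min(ethanol$E), max(ethanol$E), length = 250)
    lines(E.grid, predict(fit, data.frame(E = E.grid)))
    plot(fitted(fit), residuals(fit), xlab = "Fits", ylab = "Residuals")
    abline(h = 0, lty = 2)
    par(mfrow = c(1, 1))
}
show.fit(1)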
A difficult question is this: at what point should we stop adding higher-order terms? One solution is to add terms until the one of highest order is no longer significant, and then use the previous model. In the ethanol case this leads to a model of degree 6. In general, though, we prefer models with as few terms as possible (parsimonious models), and this approach tends to lead to rather large models.
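With orthogonal polynomials this check is easy to automate; a sketch that prints the p-value of the highest-order coefficient for degrees 1 through 8:
# p-value of the highest-order term for each degree (sketch)
for (deg in 1:8) {
    fit <- lm(NOx ~ poly(E, deg), data = ethanol)
    p <- summary(fit)$coefficients[deg + 1, 4]
    cat("degree:", deg, " p-value of highest term:", round(p, 4), "\n")
}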
As in multiple regression, adding another term to a model will never decrease R², even if the extra term is useless. Nevertheless, a consideration of R² can be interesting:
ethanol.fun(2)
plots R² vs. degree, and we see first a big jump from 1 to 2, and then another (smaller) jump from 3 to 4. After that the increases are all very small. So including a fourth-order term helps a lot, but including any higher-order term improves the model only marginally. Therefore a model of degree 4 seems to do a good job of balancing the quality of the fit against the simplicity of the model.
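Computing this curve ourselves takes only a few lines (a sketch):
# R² as a function of the polynomial degree (sketch)
r2 <- sapply(1:8, function(deg)
    summary(lm(NOx ~ poly(E, deg), data = ethanol))$r.squared)
plot(1:8, r2, type = "b", xlab = "Degree", ylab = "R^2")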
Example: let's again have a look at the US Temperature dataset. We already discussed the fact that we need to change Longitude to -Longitude to get the correct orientation; this is done in
ustemp.fun(1)
ustemp.fun(2)
plots Jantemp vs. the predictors; we see a strong relationship with Latitude and a weak (if any) relationship with Longitude.
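Assuming the data is in a data frame ustemp with columns Jantemp, Latitude and Longitude (the names used here; adjust if your copy differs), these plots amount to:
# Jantemp vs. each predictor (sketch; data frame and column names assumed)
par(mfrow = c(1, 2))
plot(ustemp$Latitude, ustemp$Jantemp, xlab = "Latitude", ylab = "Jantemp")
plot(ustemp$Longitude, ustemp$Jantemp, xlab = "Longitude", ylab = "Jantemp")
par(mfrow = c(1, 1))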
ustemp.fun(3)
fits the model Jantemp ~ Latitude + Longitude and draws the residuals vs. fits plot, using the studentized residuals. There is a clear U-shaped pattern in this graph.
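In plain R this might look as follows (a sketch; rstudent() computes the studentized residuals):
# linear fit, residuals vs. fits with studentized residuals (sketch)
fit1 <- lm(Jantemp ~ Latitude + Longitude, data = ustemp)
plot(fitted(fit1), rstudent(fit1), xlab = "Fits", ylab = "Studentized residuals")
abline(h = 0, lty = 2)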
ustemp.fun(4)
adds a quadratic term, fitting Jantemp ~ Latitude + poly(Longitude, 2), and draws the residuals vs. fits plot. But again there is a clear U-shaped pattern.
ustemp.fun(5)
adds a cubic term, fits Jantemp ~ Latitude + poly(Longitude, 3), and draws the residuals vs. fits plot. Now everything appears to be ok.
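The whole sequence of fits can be reproduced in a few lines (sketch):
# Longitude polynomials of degree 1, 2 and 3; compare the residual plots (sketch)
par(mfrow = c(1, 3))
for (deg in 1:3) {
    fit <- lm(Jantemp ~ Latitude + poly(Longitude, deg), data = ustemp)
    plot(fitted(fit), rstudent(fit), main = paste("Degree", deg),
         xlab = "Fits", ylab = "Studentized residuals")
    abline(h = 0, lty = 2)
}
par(mfrow = c(1, 1))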
ustemp.fun(6)
checks the assumptions, which are not bad. Some of the diagnostic plots point out problems with observations 6, 24 and 52. We print out the summary of the fit and also have a look at these observations.
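In R these checks could be done along these lines (a sketch):
# diagnostic plots, summary, and a look at the flagged observations (sketch)
fit3 <- lm(Jantemp ~ Latitude + poly(Longitude, 3), data = ustemp)
par(mfrow = c(2, 2)); plot(fit3); par(mfrow = c(1, 1))
summary(fit3)
ustemp[c(6, 24, 52), ]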
Can we visualize such a model? Of course we have three variables, so in general we would need a 3-dimensional graph:
ustemp.fun(7)
but that does not seem to give a nice graph here. Instead we might draw a contour plot:
ustemp.fun(8)
which uses colors to 'code' temperature (red = hot to blue = cold). This is already better, but we can do a nicer version of this graph as well:
ustemp.fun(9)
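One way to produce such a graph ourselves is to predict on a grid and use filled.contour(); a sketch (the color choice, blue = cold to red = hot, mimics the graph above):
# fitted surface as a filled contour plot (sketch)
fit3 <- lm(Jantemp ~ Latitude + poly(Longitude, 3), data = ustemp)
lat <- seq(min(ustemp$Latitude), max(ustemp$Latitude), length = 50)
lon <- seq(min(ustemp$Longitude), max(ustemp$Longitude), length = 50)
grid <- expand.grid(Latitude = lat, Longitude = lon)
z <- matrix(predict(fit3, grid), nrow = length(lat))   # rows = lat, cols = lon
filled.contour(lat, lon, z,
               color.palette = function(n) rev(rainbow(n, end = 0.7)),
               xlab = "Latitude", ylab = "Longitude")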