Finding the best model - Overfitting

Example: Predicting the Usage of Electricity

In Westchester County, north of New York City, Consolidated Edison bills residential customers for electricity on a monthly basis. The company wants to predict residential usage, in order to plan purchases of fuel and budget revenue flow. The data includes information on usage (in kilowatt-hours per day) and average monthly temperature for 55 consecutive months for an all-electric home. Data on consumption of electricity and the temperature in Westchester County, NY.

kable.nice(elusage[1:10, ])
Month Year Usage Temperature
8 1989 24.828 73
9 1989 24.688 67
10 1989 19.310 57
11 1989 59.706 43
12 1989 99.667 26
1 1990 49.333 41
2 1990 59.375 38
3 1990 55.172 46
4 1990 55.517 54
5 1990 25.938 60
ggplot(aes(Temperature, Usage), data=elusage) +
  geom_point()

Let’s find the least squares model:

fit <- lm(Usage~Temperature, data=elusage)
round(fit$coef, 2)
## (Intercept) Temperature 
##      116.72       -1.36

gives the model as

\[ \text{Usage} = 116.72 - 1.36 \text{ Temperature} + \epsilon \]

but

 ggplot(data=data.frame(Fits=fitted(fit),
                        Residuals=residuals(fit)),
        aes(Fits, Residuals)) +
  geom_point() +
  geom_abline(slope = 0)

shows that this is a bad model.

So now we try the

  • quadratic model
quad.fit <- lm(Usage~poly(Temperature, 2),
               data=elusage)

the residual vs fits plot for this model is

 ggplot(aes(Fits, Residuals), 
        data=data.frame(Fits=fitted(quad.fit),
                        Residuals=residuals(quad.fit))) +
  geom_point() +
  geom_abline(slope = 0)

and that is much better.

  • log-log transformation
log.usage <- log(Usage)
log.temp <- log(Temperature)
log.fit <- lm(log.usage~log.temp)
 ggplot(aes(Fits, Residuals), 
        data=data.frame(Fits=fitted(log.fit),
                        Residuals=residuals(log.fit))) +
  geom_point() +
  geom_abline(slope = 0)

and again this looks fine


Now we have two models with good residual vs fits plots. How do we choose among these models? A standard measure of the quality of the fit is the Coefficient of Determination. It is defined as

\[ R^2 = \text{cor}(\text{Observed Values, Predicted Values})^2 100\% \]

the better a model is, the more correlated it’s fitted values and the observed values should be, so if we have a choice of two models, the one with the higher \(R^2\) is better.

Here we find

# Quadratic Model
round(100*summary(quad.fit)$r.squared, 2)
## [1] 84.69
# Log Transfrom Model
round(100*summary(log.fit)$r.squared, 2)      
## [1] 81.12

Now the \(R^2\) of the quadratic model is \(84.69\%\) and that of the log-log transform model is \(81.12\%\), so the quadratic one is better.

Let’s have a look what those models look like:

x <- seq(min(Temperature), max(Temperature), length=100)
y.quad <- predict(quad.fit, 
                  newdata=data.frame(Temperature=x))
y.log <- exp(predict(log.fit, 
                 newdata=data.frame(log.temp=log(x))))
dta <- data.frame(x=c(x, x),
                  y=c(y.quad, y.log),
                  Model=rep(c("Quadratic", "Log"),
                         each=100))
ggplot(data=elusage, aes(Temperature, Usage)) +
  geom_point() +       
  geom_line(data=dta, aes(x, y, color=Model), size=1.2) +
  xlab("Temperature") +
  ylab("Usage")


Could we do even better? Let’s check the cubic model:

cube.fit <- lm(Usage~poly(Temperature,3),
               data=elusage)
round(100*summary(cube.fit)$r.squared, 2)   
## [1] 84.72

and yes, it’s \(R^2=84.72>84.69\)!

but we need to be careful here: the quadratic model is a special case of the cubic model, and so it’s \(R^2\) can never be smaller.

The reason for this is simple: Say we find the best quadratic model, which is \[ \text{Usage} = \hat{\beta}_{02} - \hat{\beta}_{12}\text{ T} + \hat{\beta}_{22}\text{ T}^2 \] Now we add the cubic term T3 as a predictor. One (of many) cubic models is

\[ \text{Usage} = \hat{\beta}_{02} - \hat{\beta}_{12}\text{ T} + \hat{\beta}_{22}\text{ T}^2 +0.0\text{ T}^3 \]

this is of course the same as the quadratic model above, so it has \(R^2=84.69\%\). Only the least squares cubic model is the best cubic model, so it’s \(R^2\) cannot be smaller (and because of statistical fluctuation usually will be even a bit higher, even if the cubic term is not useful).

Question: which of these polynomial models should you use?

Linear Model

Quadratic Model

Cubic Model

Power 11 Model

and this one is perfect, it has \(R^2=100\%\).

Actually, it is always possible to find a polynomial model which fits the data set perfectly, that is it has \(R^2=100\%\)! (Hint: look up Legendre polynomials)

But: we want our models to fit the relationship, not the random fluctuations in the data set.

A model should be parsimoneous, that is as simple as possible.

This is in agreement with one of the fundamental principles of science:

Ockham’s razor, named after William of Ockham

Ockham’s razor is the principle that “entities must not be multiplied beyond necessity” (entia non sunt multiplicanda praeter necessitatem). The popular interpretation of this principle is that the simplest explanation is usually the correct one.

For our problem this means: Use the polynomial model of lowest degree that can’t be improved statistically significantly by adding another power.

Let’s consider again the quadratic and the cubic models: the cubic model is better than the quadratic one (in terms of R2), but is it statistically significantly better?

It turns out we can actually test this:

anova(cube.fit, quad.fit)
## Analysis of Variance Table
## 
## Model 1: Usage ~ poly(Temperature, 3)
## Model 2: Usage ~ poly(Temperature, 2)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     51 4756.9                           
## 2     52 4766.4 -1   -9.4466 0.1013 0.7516

and so the answer is no, the cubic is not better. So we should use the quadratic one!


Warning as always a GOOD MODEL is one with a good residual vs. fits plot. It can happen that both the quadratic and the cubic are bad models and this test fails to reject the null because they are equally bad!


Note: if we have two models, one of which is a special case of the other, we say we have nested models.

Example: quadratic and cubic

Example: y vs x and y vs x, z

In all of those cases the model with more predictors will NEVER have a smaller \(R^2\), so using \(R^2\) would always lead to the model with all the terms, which may not be best.

Choosing between good Models

In choosing the best model (from our short list) proceed as follows: Model is “good” = no pattern in the Residual vs. Fits plot

  1. If a linear model is good, use it, you are done

If the linear model is not good, proceed as follows

  1. check the transformation models and see which of these (if any) are good.

  2. find the best polynomial model using method described above.

  3. Choose as the best of the good models in 2) and 3) the one which has the highest \(R^2\).

Back to Electricity usage. We have found:

  • best transformation model is y vs log of x with \(R^2=82.9\%\)

  • best polynomial model is the quadratic with \(R^2=84.7\%\)

  • so best overall is quadratic.

An important comment

Having a high \(R^2\) is desirable, but neither necessary for a model to be good, nor an indication that a model is good:

Example 1970’s draft data:

## [1] 5.1

the linear model is a good one, even though it has a very low \(R^2=5.1\%\)

Example fabric wear data:

## [1] 88.6

the linear model is bad, even though it has a fairly high \(R^2=88.6\%\).

Example: Lunatics in Massachusettes

The data are from an 1854 survey conducted by the Massachusetts Commission on Lunacy (Mental disorder) under the leadership of Edward Jarvis. Dr. Jarvis was President of the American Statistical Association from 1852 to 1882.

kable.nice(lunatics)
County Number Distance Percent.at.Home
BERKSHIRE 119 97 77
FRANKLIN 84 62 81
HAMPSHIRE 94 54 75
HAMPDEN 105 52 69
WORCESTER 351 20 64
MIDDLESEX 357 14 47
ESSEX 377 10 47
SUFFOLK 458 4 6
NORFOLK 241 14 49
BRISTOL 158 14 60
PLYMOUTH 139 16 68
BARNSTABLE 78 44 76
NANTUCKET 12 77 25
DUKES 19 52 79

We want to find a model to predict the percentage of lunatics kept at home by the distance to the nearest insane asylum.

ggplot(data=lunatics, aes(Distance, Percent.at.Home)) +
  geom_point() +
  labs(x= "Percentage")

First we have a serious outlier. This turns out to be Nantucket, which is not a surprise because it is an island and people had to take a boat to get to the nearest asylum. We will therefore take Nantucket out of the data set:

df <- data.frame(
  Distance=lunatics$Distance[-13],
  Percentage=lunatics$Percent.at.Home[-13])
df <- df[order(df$Distance), ]
ggplot(data=df, aes(Distance, Percentage)) +
  geom_point() 

Now we will fit polynomial models with increasing degrees, see what the residua vs. fits plot looks like and compare consecutive models with the F test:

fits <- as.list(1:5)
for(i in 1:5) {
  fits[[i]] <- lm(Percentage~poly(Distance, i),
                  data=df)
  print(ggplot(data=data.frame(Fits=fitted(fits[[i]]),
                        Residuals=residuals(fits[[i]])),
        aes(Fits, Residuals)) +
     geom_point() +
     geom_abline(slope = 0))
  cat("Degree =", i, "\n")
  cat("R^2 =", 
      round(100*summary(fits[[i]])$r.squared, 1), "\n")
  if(i>1) 
    cat("p value =", 
        round(anova(fits[[i]], fits[[i-1]])[[6]][2], 3))
}

## Degree = 1 
## R^2 = 50.3

## Degree = 2 
## R^2 = 70.7 
## p value = 0.025

## Degree = 3 
## R^2 = 88.7 
## p value = 0.004

## Degree = 4 
## R^2 = 94.1 
## p value = 0.027

## Degree = 5 
## R^2 = 94.2 
## p value = 0.782

so we find that the polynomial of degree 4 is statistically significantly better than the cubic on (p=0.027) but the degree 5 is not stat. signif. better than the degree 4 one (p=0.782). So we will use degree 4.

What does this model look like?

fit <- lm(Percentage~poly(Distance, 4), data=df)
x <- seq(1, 99, length=100)
df1 <- data.frame(x=x, 
        y=predict(fit, newdata=data.frame(Distance=x)))
ggplot(data=df, aes(Distance, Percentage)) +
  geom_point() +
  geom_line(data=df1, aes(x, y),
            color="blue", size=1.2)

and this is not all that good either, the curve on the right of x=75 is basically determined by one observation and the bump at x=85 is an artifact of a power 4 model.