In Westchester County, north of New York City, Consolidated Edison bills residential customers for electricity on a monthly basis. The company wants to predict residential usage in order to plan purchases of fuel and budget revenue flow. The data consist of usage (in kilowatt-hours per day) and average monthly temperature for 55 consecutive months for an all-electric home in Westchester County, NY.
kable.nice(elusage[1:10, ])
Month | Year | Usage (kWh per day) | Temperature (°F) |
---|---|---|---|
8 | 1989 | 24.828 | 73 |
9 | 1989 | 24.688 | 67 |
10 | 1989 | 19.310 | 57 |
11 | 1989 | 59.706 | 43 |
12 | 1989 | 99.667 | 26 |
1 | 1990 | 49.333 | 41 |
2 | 1990 | 59.375 | 38 |
3 | 1990 | 55.172 | 46 |
4 | 1990 | 55.517 | 54 |
5 | 1990 | 25.938 | 60 |
ggplot(aes(Temperature, Usage), data=elusage) +
geom_point()
Let’s find the least squares model:
fit <- lm(Usage~Temperature, data=elusage)
round(fit$coef, 2)
## (Intercept) Temperature
## 116.72 -1.36
gives the model as
\[ \text{Usage} = 116.72 - 1.36 \text{ Temperature} + \epsilon \]
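As a quick illustration (a sketch, not part of the original analysis), the fitted line can be used to predict usage at a given temperature, say 50°F:

# predicted usage at 50 degrees Fahrenheit
predict(fit, newdata=data.frame(Temperature=50))
# by hand, with the rounded coefficients: 116.72 - 1.36*50 = 48.72 kWh per day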
However, the residual vs. fits plot
ggplot(data=data.frame(Fits=fitted(fit),
Residuals=residuals(fit)),
aes(Fits, Residuals)) +
geom_point() +
geom_abline(slope = 0)
shows a clear pattern, so this is a bad model.
So now we try the quadratic model:
quad.fit <- lm(Usage~poly(Temperature, 2),
data=elusage)
the residual vs fits plot for this model is
ggplot(aes(Fits, Residuals),
data=data.frame(Fits=fitted(quad.fit),
Residuals=residuals(quad.fit))) +
geom_point() +
geom_abline(slope = 0)
and that is much better.
Another possibility is a log-log transform model:
log.usage <- log(elusage$Usage)
log.temp <- log(elusage$Temperature)
log.fit <- lm(log.usage~log.temp)
ggplot(aes(Fits, Residuals),
data=data.frame(Fits=fitted(log.fit),
Residuals=residuals(log.fit))) +
geom_point() +
geom_abline(slope = 0)
and again this looks fine.
Now we have two models with good residual vs fits plots. How do we choose among these models? A standard measure of the quality of the fit is the Coefficient of Determination. It is defined as
\[ R^2 = \text{cor}(\text{Observed Values, Predicted Values})^2 \times 100\% \]
The better a model is, the more correlated its fitted values and the observed values should be, so if we have a choice of two models, the one with the higher \(R^2\) is better.
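As an illustration (a small sketch, using the linear fit from above), \(R^2\) can be computed directly from this definition and compared to the value reported by summary:

# R^2 from the definition: squared correlation of observed and fitted values
round(100*cor(elusage$Usage, fitted(fit))^2, 2)
# the same value is reported by summary()
round(100*summary(fit)$r.squared, 2)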
Here we find
# Quadratic Model
round(100*summary(quad.fit)$r.squared, 2)
## [1] 84.69
# Log Transfrom Model
round(100*summary(log.fit)$r.squared, 2)
## [1] 81.12
Now the \(R^2\) of the quadratic model is \(84.69\%\) and that of the log-log transform model is \(81.12\%\), so the quadratic one is better.
Let’s have a look at what these models look like:
x <- seq(min(elusage$Temperature), max(elusage$Temperature), length=100)
y.quad <- predict(quad.fit,
newdata=data.frame(Temperature=x))
# the log-log fit predicts log(Usage), so transform back with exp()
y.log <- exp(predict(log.fit,
newdata=data.frame(log.temp=log(x))))
dta <- data.frame(x=c(x, x),
y=c(y.quad, y.log),
Model=rep(c("Quadratic", "Log"),
each=100))
ggplot(data=elusage, aes(Temperature, Usage)) +
geom_point() +
geom_line(data=dta, aes(x, y, color=Model), size=1.2) +
xlab("Temperature") +
ylab("Usage")
Could we do even better? Let’s check the cubic model:
cube.fit <- lm(Usage~poly(Temperature,3),
data=elusage)
round(100*summary(cube.fit)$r.squared, 2)
## [1] 84.72
and yes, its \(R^2=84.72\% > 84.69\%\)!
But we need to be careful here: the quadratic model is a special case of the cubic model, and so its \(R^2\) can never be smaller.
The reason for this is simple: say we find the best quadratic model, which is \[ \text{Usage} = \hat{\beta}_{02} - \hat{\beta}_{12}\text{ T} + \hat{\beta}_{22}\text{ T}^2 \] Now we add the cubic term \(T^3\) as a predictor. One (of many) cubic models is
\[ \text{Usage} = \hat{\beta}_{02} - \hat{\beta}_{12}\text{ T} + \hat{\beta}_{22}\text{ T}^2 + 0.0\text{ T}^3 \]
This is of course the same as the quadratic model above, so it has \(R^2=84.69\%\). The least squares cubic model is the best of all cubic models, so its \(R^2\) cannot be smaller (and because of statistical fluctuation it will usually be a bit higher, even if the cubic term is not useful).
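We can see this non-decreasing behaviour directly (a quick sketch, fitting polynomials of increasing degree to the electricity data):

# R^2 never decreases as the polynomial degree goes up
for(i in 1:6) {
  poly.fit <- lm(Usage~poly(Temperature, i), data=elusage)
  cat("degree", i, ": R^2 =",
      round(100*summary(poly.fit)$r.squared, 2), "\n")
}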
Taken to the extreme, a polynomial of high enough degree fits the data perfectly, that is it has \(R^2=100\%\).
In fact, as long as the x values are distinct it is always possible to find a polynomial model which fits the data set perfectly, that is with \(R^2=100\%\)! (Hint: look up Lagrange interpolation polynomials)
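Here is a tiny illustration (a sketch with made-up numbers, not from the electricity data): seven points with distinct x values are fit exactly by a polynomial of degree six:

# 7 points with distinct x values -> a degree-6 polynomial goes through all of them
x.demo <- 1:7
y.demo <- c(3, 1, 4, 1, 5, 9, 2)
perfect.fit <- lm(y.demo~poly(x.demo, 6))
round(100*summary(perfect.fit)$r.squared, 2)  # essentially 100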
But: we want our models to fit the relationship, not the random fluctuations in the data set.
A model should be parsimonious, that is, as simple as possible.
This is in agreement with one of the fundamental principles of science:
Ockham’s razor, named after William of Ockham
Ockham’s razor is the principle that “entities must not be multiplied beyond necessity” (entia non sunt multiplicanda praeter necessitatem). The popular interpretation of this principle is that the simplest explanation is usually the correct one.
For our problem this means: Use the polynomial model of lowest degree that can’t be improved statistically significantly by adding another power.
Let’s consider again the quadratic and the cubic models: the cubic model is better than the quadratic one (in terms of \(R^2\)), but is it statistically significantly better?
It turns out we can actually test this:
anova(cube.fit, quad.fit)
## Analysis of Variance Table
##
## Model 1: Usage ~ poly(Temperature, 3)
## Model 2: Usage ~ poly(Temperature, 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 51 4756.9
## 2 52 4766.4 -1 -9.4466 0.1013 0.7516
and so the answer is no (p = 0.7516): the cubic model is not statistically significantly better. So we should use the quadratic one!
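For reference, the F statistic in the table can be reproduced by hand from the residual sums of squares (a sketch; small differences from the table are expected because its values are rounded):

# nested-model F test by hand, using the (rounded) values from the anova table
rss.quad <- 4766.4; df.quad <- 52   # reduced (quadratic) model
rss.cube <- 4756.9; df.cube <- 51   # full (cubic) model
F.stat <- ((rss.quad - rss.cube)/(df.quad - df.cube)) / (rss.cube/df.cube)
round(F.stat, 3)                                                      # about 0.10
round(pf(F.stat, df.quad - df.cube, df.cube, lower.tail=FALSE), 3)    # about 0.75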
Warning: as always, a GOOD MODEL is one with a good residual vs. fits plot. It can happen that both the quadratic and the cubic are bad models and this test fails to reject the null because they are equally bad!
Note: if we have two models, one of which is a special case of the other, we say we have nested models.
Example: quadratic and cubic
Example: y vs x and y vs x, z
In all of those cases the model with more predictors will NEVER have a smaller \(R^2\), so using \(R^2\) would always lead to the model with all the terms, which may not be best.
In choosing the best model (from our short list) we proceed as follows. A model is “good” if there is no pattern in its residual vs. fits plot.

1) If the linear model is good, use it. If it is not, proceed as follows:
2) check the transformation models and see which of these (if any) are good.
3) find the best polynomial model using the method described above (see the sketch below).
4) Choose as the best of the good models in 2) and 3) the one which has the highest \(R^2\).
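Here is a minimal sketch of step 3), using a hypothetical helper function (not from the text) that keeps increasing the polynomial degree until the F test no longer shows a statistically significant improvement:

# fit polynomials of increasing degree; stop when the next degree
# is not a statistically significant improvement
choose_poly_degree <- function(y, x, max_degree = 5, alpha = 0.05) {
  dta <- data.frame(y = y, x = x)
  fit.old <- lm(y ~ x, data = dta)
  for(i in 2:max_degree) {
    fit.new <- lm(y ~ poly(x, i), data = dta)
    p <- anova(fit.new, fit.old)[[6]][2]
    if(p > alpha) return(fit.old)   # higher degree does not help
    fit.old <- fit.new
  }
  fit.old
}
# for example: best.fit <- choose_poly_degree(elusage$Usage, elusage$Temperature)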
Back to the electricity usage data. We have found:

- best transformation model: y vs log of x, with \(R^2=82.9\%\)
- best polynomial model: the quadratic, with \(R^2=84.7\%\)

so the best model overall is the quadratic one.
An important comment
Having a high \(R^2\) is desirable, but neither necessary for a model to be good, nor an indication that a model is good:
Example (1970s draft data): the linear model is a good one, even though it has a very low \(R^2=5.1\%\).
Example (fabric wear data): the linear model is bad, even though it has a fairly high \(R^2=88.6\%\).
Our next example uses data from an 1854 survey conducted by the Massachusetts Commission on Lunacy (an inquiry into mental disorders) under the leadership of Edward Jarvis. Dr. Jarvis was President of the American Statistical Association from 1852 to 1882.
kable.nice(lunatics)
County | Number | Distance | Percent.at.Home |
---|---|---|---|
BERKSHIRE | 119 | 97 | 77 |
FRANKLIN | 84 | 62 | 81 |
HAMPSHIRE | 94 | 54 | 75 |
HAMPDEN | 105 | 52 | 69 |
WORCESTER | 351 | 20 | 64 |
MIDDLESEX | 357 | 14 | 47 |
ESSEX | 377 | 10 | 47 |
SUFFOLK | 458 | 4 | 6 |
NORFOLK | 241 | 14 | 49 |
BRISTOL | 158 | 14 | 60 |
PLYMOUTH | 139 | 16 | 68 |
BARNSTABLE | 78 | 44 | 76 |
NANTUCKET | 12 | 77 | 25 |
DUKES | 19 | 52 | 79 |
We want to find a model to predict the percentage of lunatics kept at home by the distance to the nearest insane asylum.
ggplot(data=lunatics, aes(Distance, Percent.at.Home)) +
geom_point() +
labs(y = "Percentage")
First we have a serious outlier. This turns out to be Nantucket, which is not a surprise because it is an island and people had to take a boat to get to the nearest asylum. We will therefore take Nantucket out of the data set:
df <- data.frame(
Distance=lunatics$Distance[-13],
Percentage=lunatics$Percent.at.Home[-13])
df <- df[order(df$Distance), ]
ggplot(data=df, aes(Distance, Percentage)) +
geom_point()
Now we will fit polynomial models with increasing degrees, see what the residual vs. fits plots look like, and compare consecutive models with the F test:
fits <- as.list(1:5)
for(i in 1:5) {
  # fit the polynomial of degree i
  fits[[i]] <- lm(Percentage~poly(Distance, i),
                  data=df)
  # residual vs. fits plot for this degree
  print(ggplot(data=data.frame(Fits=fitted(fits[[i]]),
                               Residuals=residuals(fits[[i]])),
               aes(Fits, Residuals)) +
        geom_point() +
        geom_abline(slope = 0))
  cat("Degree =", i, "\n")
  cat("R^2 =",
      round(100*summary(fits[[i]])$r.squared, 1), "\n")
  # F test: degree i vs. degree i-1
  if(i>1)
    cat("p value =",
        round(anova(fits[[i]], fits[[i-1]])[[6]][2], 3), "\n")
}
## Degree = 1
## R^2 = 50.3
## Degree = 2
## R^2 = 70.7
## p value = 0.025
## Degree = 3
## R^2 = 88.7
## p value = 0.004
## Degree = 4
## R^2 = 94.1
## p value = 0.027
## Degree = 5
## R^2 = 94.2
## p value = 0.782
so we find that the polynomial of degree 4 is statistically significantly better than the cubic one (p=0.027), but the degree 5 polynomial is not statistically significantly better than the degree 4 one (p=0.782). So we will use degree 4.
What does this model look like?
fit <- lm(Percentage~poly(Distance, 4), data=df)
x <- seq(1, 99, length=100)
df1 <- data.frame(x=x,
y=predict(fit, newdata=data.frame(Distance=x)))
ggplot(data=df, aes(Distance, Percentage)) +
geom_point() +
geom_line(data=df1, aes(x, y),
color="blue", size=1.2)
and this is not all that good either: the curve to the right of x=75 is basically determined by a single observation, and the bump around x=85 is an artifact of the degree-4 polynomial.