In Westchester County, north of New York City, Consolidated Edison bills residential customers for electricity on a monthly basis. The company wants to predict residential usage in order to plan purchases of fuel and budget revenue flow. The data consist of usage (in kilowatt-hours per day) and average monthly temperature for 55 consecutive months for an all-electric home in Westchester County, NY.
kable.nice(elusage[1:10, ])
Month | Year | Usage (kWh per day) | Temperature (°F) |
---|---|---|---|
8 | 1989 | 24.828 | 73 |
9 | 1989 | 24.688 | 67 |
10 | 1989 | 19.310 | 57 |
11 | 1989 | 59.706 | 43 |
12 | 1989 | 99.667 | 26 |
1 | 1990 | 49.333 | 41 |
2 | 1990 | 59.375 | 38 |
3 | 1990 | 55.172 | 46 |
4 | 1990 | 55.517 | 54 |
5 | 1990 | 25.938 | 60 |
ggplot(aes(Temperature, Usage), data=elusage) +
geom_point()
Let’s find the least squares model:
fit <- lm(Usage~Temperature, data=elusage)
round(fit$coef, 2)
## (Intercept) Temperature
## 116.72 -1.36
gives the model as
\[ \text{Usage} = 116.72 - 1.36 \text{ Temperature} + \epsilon \]
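As a quick illustration (a sketch, not part of the original analysis), the fitted line can be used to predict usage at a given temperature, say 50°F:

# predicted usage at 50 degrees Fahrenheit
predict(fit, newdata=data.frame(Temperature=50))
# by hand, with the rounded coefficients: 116.72 - 1.36*50 = 48.72 kWh per day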
However, the residual vs. fits plot
ggplot(data=data.frame(Fits=fitted(fit),
Residuals=residuals(fit)),
aes(Fits, Residuals)) +
geom_point() +
geom_abline(slope = 0)
shows a clear pattern, so this is a bad model.
So now we try the quadratic model:
quad.fit <- lm(Usage~poly(Temperature, 2),
data=elusage)
the residual vs fits plot for this model is
ggplot(aes(Fits, Residuals),
data=data.frame(Fits=fitted(quad.fit),
Residuals=residuals(quad.fit))) +
geom_point() +
geom_abline(slope = 0)
and that is much better.
Another possibility is a log-log transform model:
log.usage <- log(elusage$Usage)
log.temp <- log(elusage$Temperature)
log.fit <- lm(log.usage~log.temp)
ggplot(aes(Fits, Residuals),
data=data.frame(Fits=fitted(log.fit),
Residuals=residuals(log.fit))) +
geom_point() +
geom_abline(slope = 0)
and again this looks fine.
Now we have two models with good residual vs fits plots. How do we choose among these models? A standard measure of the quality of the fit is the Coefficient of Determination. It is defined as
\[ R^2 = \text{cor}(\text{Observed Values, Predicted Values})^2 \times 100\% \]
The better a model is, the more correlated its fitted values and the observed values should be, so if we have a choice of two models, the one with the higher \(R^2\) is better.
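As an illustration (a small sketch, using the linear fit from above), \(R^2\) can be computed directly from this definition and compared to the value reported by summary:

# R^2 from the definition: squared correlation of observed and fitted values
round(100*cor(elusage$Usage, fitted(fit))^2, 2)
# the same value is reported by summary()
round(100*summary(fit)$r.squared, 2)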
Here we find
# Quadratic Model
round(100*summary(quad.fit)$r.squared, 2)
## [1] 84.69
# Log Transfrom Model
round(100*summary(log.fit)$r.squared, 2)
## [1] 81.12
Now the \(R^2\) of the quadratic model is \(84.69\%\) and that of the log-log transform model is \(81.12\%\), so the quadratic one is better.
Let’s have a look at what these models look like:
x <- seq(min(elusage$Temperature), max(elusage$Temperature), length=100)
y.quad <- predict(quad.fit,
newdata=data.frame(Temperature=x))
# the log-log fit predicts log(Usage), so transform back with exp()
y.log <- exp(predict(log.fit,
newdata=data.frame(log.temp=log(x))))
dta <- data.frame(x=c(x, x),
y=c(y.quad, y.log),
Model=rep(c("Quadratic", "Log"),
each=100))
ggplot(data=elusage, aes(Temperature, Usage)) +
geom_point() +
geom_line(data=dta, aes(x, y, color=Model), size=1.2) +
xlab("Temperature") +
ylab("Usage")
Could we do even better? Let’s check the cubic model:
cube.fit <- lm(Usage~poly(Temperature,3),
data=elusage)
round(100*summary(cube.fit)$r.squared, 2)
## [1] 84.72
and yes, its \(R^2=84.72\% > 84.69\%\)!
But we need to be careful here: the quadratic model is a special case of the cubic model, and so its \(R^2\) can never be smaller.
The reason for this is simple: say we find the best quadratic model, which is \[ \text{Usage} = \hat{\beta}_{02} - \hat{\beta}_{12}\text{ T} + \hat{\beta}_{22}\text{ T}^2 \] Now we add the cubic term \(T^3\) as a predictor. One (of many) cubic models is
\[ \text{Usage} = \hat{\beta}_{02} - \hat{\beta}_{12}\text{ T} + \hat{\beta}_{22}\text{ T}^2 + 0.0\text{ T}^3 \]
This is of course the same as the quadratic model above, so it has \(R^2=84.69\%\). The least squares cubic model is the best of all cubic models, so its \(R^2\) cannot be smaller (and because of statistical fluctuation it will usually be a bit higher, even if the cubic term is not useful).
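We can see this non-decreasing behaviour directly (a quick sketch, fitting polynomials of increasing degree to the electricity data):

# R^2 never decreases as the polynomial degree goes up
for(i in 1:6) {
  poly.fit <- lm(Usage~poly(Temperature, i), data=elusage)
  cat("degree", i, ": R^2 =",
      round(100*summary(poly.fit)$r.squared, 2), "\n")
}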
Taken to the extreme, a polynomial of high enough degree fits the data perfectly, that is it has \(R^2=100\%\).
In fact, as long as the x values are distinct it is always possible to find a polynomial model which fits the data set perfectly, that is with \(R^2=100\%\)! (Hint: look up Lagrange interpolation polynomials)
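Here is a tiny illustration (a sketch with made-up numbers, not from the electricity data): seven points with distinct x values are fit exactly by a polynomial of degree six:

# 7 points with distinct x values -> a degree-6 polynomial goes through all of them
x.demo <- 1:7
y.demo <- c(3, 1, 4, 1, 5, 9, 2)
perfect.fit <- lm(y.demo~poly(x.demo, 6))
round(100*summary(perfect.fit)$r.squared, 2)  # essentially 100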
But: we want our models to fit the relationship, not the random fluctuations in the data set.
A model should be parsimonious, that is, as simple as possible.
This is in agreement with one of the fundamental principles of science:
Ockham’s razor, named after William of Ockham
Ockham’s razor is the principle that “entities must not be multiplied beyond necessity” (entia non sunt multiplicanda praeter necessitatem). The popular interpretation of this principle is that the simplest explanation is usually the correct one.
For our problem this means: Use the polynomial model of lowest degree that can’t be improved statistically significantly by adding another power.
Let’s consider again the quadratic and the cubic models: the cubic model is better than the quadratic one (in terms of \(R^2\)), but is it statistically significantly better?
It turns out we can actually test this:
anova(cube.fit, quad.fit)
## Analysis of Variance Table
##
## Model 1: Usage ~ poly(Temperature, 3)
## Model 2: Usage ~ poly(Temperature, 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 51 4756.9
## 2 52 4766.4 -1 -9.4466 0.1013 0.7516
and so the answer is no (p = 0.7516): the cubic model is not statistically significantly better. So we should use the quadratic one!
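For reference, the F statistic in the table can be reproduced by hand from the residual sums of squares (a sketch; small differences from the table are expected because its values are rounded):

# nested-model F test by hand, using the (rounded) values from the anova table
rss.quad <- 4766.4; df.quad <- 52   # reduced (quadratic) model
rss.cube <- 4756.9; df.cube <- 51   # full (cubic) model
F.stat <- ((rss.quad - rss.cube)/(df.quad - df.cube)) / (rss.cube/df.cube)
round(F.stat, 3)                                                      # about 0.10
round(pf(F.stat, df.quad - df.cube, df.cube, lower.tail=FALSE), 3)    # about 0.75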
Warning: as always, a GOOD MODEL is one with a good residual vs. fits plot. It can happen that both the quadratic and the cubic are bad models and this test fails to reject the null because they are equally bad!
Note: if we have two models, one of which is a special case of the other, we say we have nested models.
Example: quadratic and cubic
Example: y vs x and y vs x, z
In all of those cases the model with more predictors will NEVER have a smaller \(R^2\), so using \(R^2\) would always lead to the model with all the terms, which may not be best.
In choosing the best model (from our short list) we proceed as follows. A model is “good” if there is no pattern in its residual vs. fits plot.

1) If the linear model is good, use it. If it is not, proceed as follows:
2) check the transformation models and see which of these (if any) are good.
3) find the best polynomial model using the method described above (see the sketch below).
4) Choose as the best of the good models in 2) and 3) the one which has the highest \(R^2\).
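Here is a minimal sketch of step 3), using a hypothetical helper function (not from the text) that keeps increasing the polynomial degree until the F test no longer shows a statistically significant improvement:

# fit polynomials of increasing degree; stop when the next degree
# is not a statistically significant improvement
choose_poly_degree <- function(y, x, max_degree = 5, alpha = 0.05) {
  dta <- data.frame(y = y, x = x)
  fit.old <- lm(y ~ x, data = dta)
  for(i in 2:max_degree) {
    fit.new <- lm(y ~ poly(x, i), data = dta)
    p <- anova(fit.new, fit.old)[[6]][2]
    if(p > alpha) return(fit.old)   # higher degree does not help
    fit.old <- fit.new
  }
  fit.old
}
# for example: best.fit <- choose_poly_degree(elusage$Usage, elusage$Temperature)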
Back to the electricity usage data. We have found:

- best transformation model: y vs log of x, with \(R^2=82.9\%\)
- best polynomial model: the quadratic, with \(R^2=84.7\%\)

so the best model overall is the quadratic one.
An important comment
Having a high \(R^2\) is desirable, but neither necessary for a model to be good, nor an indication that a model is good:
Example (1970s draft data): the linear model is a good one, even though it has a very low \(R^2=5.1\%\).
Example (fabric wear data): the linear model is bad, even though it has a fairly high \(R^2=88.6\%\).
Our next example uses data from an 1854 survey conducted by the Massachusetts Commission on Lunacy (an inquiry into mental disorders) under the leadership of Edward Jarvis. Dr. Jarvis was President of the American Statistical Association from 1852 to 1882.
kable.nice(lunatics)
County | Number | Distance | Percent.at.Home |
---|---|---|---|
BERKSHIRE | 119 | 97 | 77 |
FRANKLIN | 84 | 62 | 81 |
HAMPSHIRE | 94 | 54 | 75 |
HAMPDEN | 105 | 52 | 69 |
WORCESTER | 351 | 20 | 64 |
MIDDLESEX | 357 | 14 | 47 |
ESSEX | 377 | 10 | 47 |
SUFFOLK | 458 | 4 | 6 |
NORFOLK | 241 | 14 | 49 |
BRISTOL | 158 | 14 | 60 |
PLYMOUTH | 139 | 16 | 68 |
BARNSTABLE | 78 | 44 | 76 |
NANTUCKET | 12 | 77 | 25 |
DUKES | 19 | 52 | 79 |
We want to find a model to predict the percentage of lunatics kept at home by the distance to the nearest insane asylum.
ggplot(data=lunatics, aes(Distance, Percent.at.Home)) +
geom_point() +
labs(y = "Percentage")
First we have a serious outlier. This turns out to be Nantucket, which is not a surprise because it is an island and people had to take a boat to get to the nearest asylum. We will therefore take Nantucket out of the data set:
df <- data.frame(
Distance=lunatics$Distance[-13],
Percentage=lunatics$Percent.at.Home[-13])
df <- df[order(df$Distance), ]
ggplot(data=df, aes(Distance, Percentage)) +
geom_point()
Now we will fit polynomial models with increasing degrees, see what the residual vs. fits plots look like, and compare consecutive models with the F test:
fits <- as.list(1:5)
for(i in 1:5) {
  # fit the polynomial of degree i
  fits[[i]] <- lm(Percentage~poly(Distance, i),
                  data=df)
  # residual vs. fits plot for this degree
  print(ggplot(data=data.frame(Fits=fitted(fits[[i]]),
                               Residuals=residuals(fits[[i]])),
               aes(Fits, Residuals)) +
        geom_point() +
        geom_abline(slope = 0))
  cat("Degree =", i, "\n")
  cat("R^2 =",
      round(100*summary(fits[[i]])$r.squared, 1), "\n")
  # F test: degree i vs. degree i-1
  if(i>1)
    cat("p value =",
        round(anova(fits[[i]], fits[[i-1]])[[6]][2], 3), "\n")
}
## Degree = 1
## R^2 = 50.3
## Degree = 2
## R^2 = 70.7
## p value = 0.025
## Degree = 3
## R^2 = 88.7
## p value = 0.004
## Degree = 4
## R^2 = 94.1
## p value = 0.027
## Degree = 5
## R^2 = 94.2
## p value = 0.782
so we find that the polynomial of degree 4 is statistically significantly better than the cubic one (p=0.027), but the degree 5 polynomial is not statistically significantly better than the degree 4 one (p=0.782). So we will use degree 4.
What does this model look like?
fit <- lm(Percentage~poly(Distance, 4), data=df)
x <- seq(1, 99, length=100)
df1 <- data.frame(x=x,
y=predict(fit, newdata=data.frame(Distance=x)))
ggplot(data=df, aes(Distance, Percentage)) +
geom_point() +
geom_line(data=df1, aes(x, y),
color="blue", size=1.2)
and this is not all that good either: the curve to the right of x=75 is basically determined by a single observation, and the bump around x=85 is an artifact of the degree-4 polynomial.