Prices of residences located 30 miles south of a large metropolitan area, together with several possible predictor variables.
Notice the 1.7 baths!
kable.nice(houseprice)
  | Price | Sqfeet | Floors | Bedrooms | Baths |
---|---|---|---|---|---|
1 | 69.0 | 1500.000 | 1 | 2 | 1.0 |
3 | 118.5 | 1880.952 | 1 | 2 | 2.0 |
4 | 104.0 | 1976.190 | 1 | 3 | 2.0 |
5 | 116.5 | 1880.952 | 1 | 3 | 2.0 |
6 | 121.5 | 1880.952 | 1 | 3 | 2.0 |
7 | 125.0 | 1976.190 | 1 | 3 | 2.0 |
8 | 128.0 | 2357.143 | 2 | 3 | 2.5 |
9 | 129.9 | 2166.667 | 1 | 3 | 1.7 |
10 | 133.0 | 2166.667 | 2 | 3 | 2.5 |
11 | 135.0 | 2166.667 | 2 | 3 | 2.5 |
12 | 137.5 | 2357.143 | 2 | 3 | 2.5 |
13 | 139.9 | 2166.667 | 1 | 3 | 2.0 |
14 | 143.9 | 2261.905 | 2 | 3 | 2.5 |
15 | 147.9 | 2547.619 | 2 | 3 | 2.5 |
16 | 154.9 | 2357.143 | 2 | 3 | 2.5 |
17 | 160.0 | 2738.095 | 2 | 3 | 2.0 |
18 | 169.0 | 2357.143 | 1 | 3 | 2.0 |
19 | 169.9 | 2642.857 | 1 | 3 | 2.0 |
20 | 125.0 | 2166.667 | 1 | 4 | 2.0 |
21 | 134.9 | 2166.667 | 1 | 4 | 2.0 |
22 | 139.9 | 2547.619 | 1 | 4 | 2.0 |
23 | 147.0 | 2642.857 | 1 | 4 | 2.0 |
24 | 159.0 | 2261.905 | 1 | 4 | 2.0 |
25 | 169.9 | 2547.619 | 2 | 4 | 3.0 |
26 | 178.9 | 2738.095 | 1 | 4 | 2.0 |
27 | 194.5 | 2833.333 | 2 | 4 | 3.0 |
28 | 219.9 | 2928.571 | 1 | 4 | 2.5 |
29 | 269.0 | 3309.524 | 2 | 4 | 3.0 |
Let’s go through the list of predictors one by one:
pushViewport(viewport(layout = grid.layout(2, 2)))
pos <- expand.grid(1:2, 1:2)
for(i in 1:4) {
plt <-
ggplot(data=houseprice,
aes_(x = as.name(names(houseprice)[i+1]),
y = as.name(names(houseprice)[1]))) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
print(plt,
vp=viewport(layout.pos.row=pos[i, 1],
layout.pos.col=pos[i, 2]))
cat("Price and ", colnames(houseprice)[i+1],
" ", round(cor(houseprice$Price, houseprice[, i+1]), 3), "\n")
}
## Price and Sqfeet 0.915
## Price and Floors 0.291
## Price and Bedrooms 0.605
## Price and Baths 0.653
fit <- lm(Price~., data=houseprice)
summary(fit)
##
## Call:
## lm(formula = Price ~ ., data = houseprice)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.018 -5.943 1.860 5.947 30.955
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67.61984 17.70818 -3.819 0.000882
## Sqfeet 0.08571 0.01076 7.966 4.62e-08
## Floors -26.49306 9.48952 -2.792 0.010363
## Bedrooms -9.28622 6.82985 -1.360 0.187121
## Baths 37.38067 12.26436 3.048 0.005709
##
## Residual standard error: 13.71 on 23 degrees of freedom
## Multiple R-squared: 0.8862, Adjusted R-squared: 0.8665
## F-statistic: 44.8 on 4 and 23 DF, p-value: 1.558e-10
For the assumptions there is nothing new: as before we need to check the residual vs. fits plot and the normal plot of residuals:
pushViewport(viewport(layout = grid.layout(1, 2)))
df <- data.frame(Residuals=resid(fit),
Fits = fitted(fit))
print(ggplot(data=df, aes(sample=Residuals)) +
geom_qq() + geom_qq_line(),
vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=df, aes(Fits, Residuals)) +
geom_point() +
geom_hline(yintercept = 0),
vp=viewport(layout.pos.row=1, layout.pos.col=2))
This appears to be a good model, and the assumptions of normally distributed residuals with equal variance appear to be satisfied.
Except...
Notice that there is something very strange about this model: the coefficients of Floors and Bedrooms are negative, even though both variables have a positive correlation with Price!
Let’s have a look at the correlations between the predictors:
round(cor(houseprice[, -1]), 3)
## Sqfeet Floors Bedrooms Baths
## Sqfeet 1.000 0.370 0.652 0.628
## Floors 0.370 1.000 -0.018 0.743
## Bedrooms 0.652 -0.018 1.000 0.415
## Baths 0.628 0.743 0.415 1.000
The highest correlation between predictors is r=0.743 (Floors-Baths).
As in the case of polynomial regression, highly correlated predictors are a potential problem. We will look at a solution called principal components at some point. Here, though, we are OK.
We have previously talked about the fact that we want our models to be as simple as possible. Often that means a model with as few predictors as possible. So the question becomes:
Can we eliminate any of our predictors without making the model (stat. signif.) worse?
There are several things one can think of:
Choose based on R2
but we already know this will always lead to the model with all predictors, for the same reason that a cubic model always has an R2 at least as high as the quadratic model.
Note:
Price by Sqfeet, Floors and Bedrooms: R2=80.1%
Price by Floors, Bedrooms and Baths: R2=68.4%
Price by Sqfeet, Bedrooms and Baths: R2=83.5%
Price by Sqfeet, Floors, Bedrooms and Baths: R2=88.6%
So the model with all 4 predictors has a higher R2 than any of the models with just 3, but this will always be so, even if one of the predictors is completely useless.
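These R2 values are easy to reproduce directly; here is a minimal sketch (the helper r2() is just for illustration, and the houseprice data frame is assumed to be loaded as above):
r2 <- function(form)  # percent of variation explained by a candidate model
  round(100*summary(lm(form, data=houseprice))$r.squared, 1)
r2(Price ~ Sqfeet + Floors + Bedrooms)
r2(Price ~ Floors + Bedrooms + Baths)
r2(Price ~ Sqfeet + Bedrooms + Baths)
r2(Price ~ Sqfeet + Floors + Bedrooms + Baths)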
Choose based on Hypothesis Tests
In the summary(fit) above we see that the p-value of Bedrooms is 0.187121 > 0.05, so we eliminate Bedrooms.
This sounds like a good idea AND IT IS WIDELY USED IN REAL LIFE, but it turns out to be a bad one! The reason why is a bit hard to explain, though.
Use nested models test
fit.without.bedrooms <- lm(Price~.-Bedrooms,
data=houseprice)
anova(fit, fit.without.bedrooms)
## Analysis of Variance Table
##
## Model 1: Price ~ Sqfeet + Floors + Bedrooms + Baths
## Model 2: Price ~ (Sqfeet + Floors + Bedrooms + Baths) - Bedrooms
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 4321.9
## 2 24 4669.2 -1 -347.38 1.8487 0.1871
Again, this sounds like a good idea AND AGAIN IT IS WIDELY USED IN REAL LIFE, but it turns out to be a dangerous one! To start, if we have several predictors we might want to eliminate, we immediately face the issue of simultaneous inference.
There are several methods in wide use that are essentially based on this idea, such as forward selection, backward selection and stepwise regression. These are sometimes unavoidable but need to be done with great care!
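For completeness, backward selection can be run in base R with step(), which starts from the full model and drops terms by AIC rather than by p-values; this is only a sketch, and the caveats above still apply:
# backward selection from the full houseprice model, dropping one term at a time
step(fit, direction = "backward")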
What we need is a new idea:
We will find ALL possible models and calculate Mallows' Cp statistic for each. The model with the lowest Cp is best.
library(leaps)
out <- leaps(houseprice[, -1], houseprice$Price,
method = "Cp", nbest=1)
out
## $which
## 1 2 3 4
## 1 TRUE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE TRUE
## 3 TRUE TRUE FALSE TRUE
## 4 TRUE TRUE TRUE TRUE
##
## $label
## [1] "(Intercept)" "1" "2" "3" "4"
##
## $size
## [1] 2 3 4 5
##
## $Cp
## [1] 8.834171 8.812489 4.848657 5.000000
colnames(houseprice)[2:5][out$which[seq_along(out$Cp)[out$Cp==min(out$Cp)], ]]
## [1] "Sqfeet" "Floors" "Baths"
so the best model uses Sqfeet, Floors and Baths.
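As a sanity check, Mallows' Cp for a candidate model can be computed by hand as Cp = RSS_p/MSE_full + 2p - n, where p counts the intercept plus the predictors. A minimal sketch, using the residual sums of squares from the anova() output above:
n <- nrow(houseprice)      # 28 houses
mse.full <- 4321.9/23      # mean squared error of the full model
rss.p <- 4669.2            # RSS of the model without Bedrooms
p <- 4                     # intercept + Sqfeet + Floors + Baths
rss.p/mse.full + 2*p - n   # about 4.85, matching out$Cp above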
To find the final model we use the fit without Bedrooms that we already computed above:
summary(fit.without.bedrooms)
Note that the model with all four predictors has Cp=5.0 (for the full model Cp always equals its number of parameters). But Cp is a statistic; its exact value depends on the sample. So is the model with Sqfeet, Floors and Baths statistically significantly better than the model with all four predictors? We would need a hypothesis test to answer this question, but that is not part of our course.
Prediction works just as it did for simple regression. Say we want to find a 90% interval estimate for a house that has 2000 sqfeet, one floor and two baths. Then
predict(fit.without.bedrooms,
newdata=data.frame(Sqfeet=2000,
Floors=1,
Bedrooms=0,
Baths=2),
interval="prediction", level=0.9)
## fit lwr upr
## 1 122.9797 98.07234 147.887
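If instead we wanted an interval estimate for the mean price of all such houses, we would ask for a confidence interval rather than a prediction interval (same call, different interval type):
predict(fit.without.bedrooms,
        newdata=data.frame(Sqfeet=2000,
                           Floors=1,
                           Bedrooms=0,
                           Baths=2),
        interval="confidence", level=0.9)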
Our next example uses the airpollution data set. The dependent variable for analysis is age-adjusted mortality (called "Mortality"). The data include variables measuring demographic characteristics of the cities, variables measuring climate characteristics, and variables recording the pollution potential of three different air pollutants.
x <- airpollution
rownames(x) <- NULL
kable.nice(head(airpollution[, 1:5], 3))
  | Mortality | JanTemp | JulyTemp | RelHum | Rain |
---|---|---|---|---|---|
AkronOH | 921.87 | 27 | 71 | 59 | 36 |
Albany-Schenectady-TroyNY | 997.87 | 23 | 72 | 57 | 35 |
AllentownBethlehemPA-NJ | 962.35 | 29 | 74 | 54 | 44 |
kable.nice(head(x[, 6:10], 3))
Education | PopDensity | NonWhite | WhiteCollar | Pop |
---|---|---|---|---|
11.4 | 3243 | 8.8 | 42.6 | 660328 |
11.0 | 4281 | 3.5 | 50.7 | 835880 |
9.8 | 4260 | 0.8 | 39.4 | 635481 |
kable.nice(head(x[, 11:16], 3))
Pop.House | Income | HCPot | NOxPot | SO2Pot | NOx |
---|---|---|---|---|---|
3.34 | 29560 | 21 | 15 | 59 | 15 |
3.14 | 31458 | 8 | 10 | 39 | 10 |
3.21 | 31856 | 6 | 6 | 33 | 6 |
We want to look at the scatterplots and the correlations. There are 15 predictors, so there are 15 graphs and correlations.
pos <- expand.grid(1:2, 1:2)
pushViewport(viewport(layout = grid.layout(2, 2)))
for(i in 1:4) {
plt <-
ggplot(data=airpollution,
aes_(x = as.name(names(airpollution)[i+1]),
y = as.name(names(airpollution)[1]))) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
print(plt,
vp=viewport(layout.pos.row=pos[i, 1],
layout.pos.col=pos[i, 2]))
cat("Mortality and ", colnames(airpollution)[i+1],
" ", round(cor(airpollution$Mortality,
airpollution[, i+1]), 3), "\n")
}
## Mortality and JanTemp -0.016
## Mortality and JulyTemp 0.322
## Mortality and RelHum -0.101
## Mortality and Rain 0.433
pushViewport(viewport(layout = grid.layout(2, 2)))
for(i in 5:8) {
plt <-
ggplot(data=airpollution,
aes_(x = as.name(names(airpollution)[i+1]),
y = as.name(names(airpollution)[1]))) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
print(plt,
vp=viewport(layout.pos.row=pos[i-4, 1],
layout.pos.col=pos[i-4, 2]))
cat("Mortality and ", colnames(airpollution)[i+1],
" ", round(cor(airpollution$Mortality,
airpollution[, i+1]), 3), "\n")
}
## Mortality and Education -0.508
## Mortality and PopDensity 0.252
## Mortality and NonWhite 0.647
## Mortality and WhiteCollar -0.289
pushViewport(viewport(layout = grid.layout(2, 2)))
for(i in 9:11) {
plt <-
ggplot(data=airpollution,
aes_(x = as.name(names(airpollution)[i+1]),
y = as.name(names(airpollution)[1]))) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
print(plt,
vp=viewport(layout.pos.row=pos[i-8, 1],
layout.pos.col=pos[i-8, 2]))
cat("Mortality and ", colnames(airpollution)[i+1],
" ", round(cor(airpollution$Mortality,
airpollution[, i+1]), 3), "\n")
}
## Mortality and Pop 0.059
## Mortality and Pop.House 0.368
## Mortality and Income -0.283
pushViewport(viewport(layout = grid.layout(2, 2)))
for(i in 12:15) {
plt <-
ggplot(data=airpollution,
aes_(x = as.name(names(airpollution)[i+1]),
y = as.name(names(airpollution)[1]))) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
print(plt,
vp=viewport(layout.pos.row=pos[i-11, 1],
layout.pos.col=pos[i-11, 2]))
cat("Mortality and ", colnames(airpollution)[i+1],
" ", round(cor(airpollution$Mortality,
airpollution[, i+1]), 3), "\n")
}
## Mortality and HCPot -0.185
## Mortality and NOxPot -0.085
## Mortality and SO2Pot 0.419
## Mortality and NOx -0.085
There are problems with four of the predictors (Pop, HCPot, NOx, and NOxPot): their values are heavily skewed. Let's try the log transform and then check those predictors again.
The easiest way to do this is to make a new data frame:
newair <- airpollution
newair[ ,c("Pop", "HCPot", "NOx", "NOxPot")] <-
log(newair[, c("Pop", "HCPot", "NOx", "NOxPot")])
colnames(newair)[c(10, 13, 14, 16)] <- c("log(Pop)", "log(HCPot)", "log(NOx)", "log(NOxPot)")
pushViewport(viewport(layout = grid.layout(2, 2)))
k <- 0
for(i in c(10, 13, 14, 16)) {
k <- k+1
plt <-
ggplot(data=newair,
aes_(x = as.name(names(newair)[i]),
y = as.name(names(newair)[1]))) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
print(plt,
vp=viewport(layout.pos.row=pos[k, 1],
layout.pos.col=pos[k, 2]))
cat("Mortality and ", colnames(newair)[i],
" ", round(cor(newair$Mortality,
newair[, i]), 3), "\n")
}
## Mortality and log(Pop) 0.085
## Mortality and log(HCPot) 0.125
## Mortality and log(NOx) 0.28
## Mortality and log(NOxPot) 0.28
so in all cases the log transform worked, and we will use newair from now on.
Let’s find the correlations in absolute value of the predictors with the response, in order:
cors <- round(cor(newair), 2)
sort(abs(cors[ ,"Mortality"]), decreasing = TRUE)[-1]
## NonWhite Education Rain SO2Pot Pop.House JulyTemp
## 0.65 0.51 0.43 0.42 0.37 0.32
## WhiteCollar Income log(NOx) log(NOxPot) PopDensity log(HCPot)
## 0.29 0.28 0.28 0.28 0.25 0.13
## RelHum log(Pop) JanTemp
## 0.10 0.09 0.02
Next we look at the correlations between the predictors.
cors[-1, -1]
## JanTemp JulyTemp RelHum Rain Education PopDensity NonWhite
## JanTemp 1.00 0.32 0.09 0.06 0.11 -0.08 0.46
## JulyTemp 0.32 1.00 -0.44 0.47 -0.27 -0.01 0.60
## RelHum 0.09 -0.44 1.00 -0.12 0.19 -0.15 -0.12
## Rain 0.06 0.47 -0.12 1.00 -0.47 0.08 0.30
## Education 0.11 -0.27 0.19 -0.47 1.00 -0.24 -0.21
## PopDensity -0.08 -0.01 -0.15 0.08 -0.24 1.00 -0.01
## NonWhite 0.46 0.60 -0.12 0.30 -0.21 -0.01 1.00
## WhiteCollar 0.21 -0.01 0.01 -0.11 0.49 0.25 -0.06
## log(Pop) 0.32 0.04 -0.02 -0.28 0.27 0.21 0.22
## Pop.House -0.33 0.26 -0.14 0.20 -0.39 -0.17 0.35
## Income 0.20 -0.19 0.13 -0.36 0.51 0.00 -0.10
## log(HCPot) 0.23 -0.41 0.18 -0.48 0.18 0.26 0.15
## log(NOx) 0.18 -0.30 0.10 -0.39 0.03 0.34 0.21
## SO2Pot -0.09 -0.07 -0.12 -0.13 -0.23 0.42 0.16
## log(NOxPot) 0.18 -0.30 0.10 -0.39 0.03 0.34 0.21
## WhiteCollar log(Pop) Pop.House Income log(HCPot) log(NOx)
## JanTemp 0.21 0.32 -0.33 0.20 0.23 0.18
## JulyTemp -0.01 0.04 0.26 -0.19 -0.41 -0.30
## RelHum 0.01 -0.02 -0.14 0.13 0.18 0.10
## Rain -0.11 -0.28 0.20 -0.36 -0.48 -0.39
## Education 0.49 0.27 -0.39 0.51 0.18 0.03
## PopDensity 0.25 0.21 -0.17 0.00 0.26 0.34
## NonWhite -0.06 0.22 0.35 -0.10 0.15 0.21
## WhiteCollar 1.00 0.28 -0.35 0.37 0.16 0.11
## log(Pop) 0.28 1.00 -0.26 0.41 0.48 0.50
## Pop.House -0.35 -0.26 1.00 -0.30 -0.22 -0.12
## Income 0.37 0.41 -0.30 1.00 0.29 0.25
## log(HCPot) 0.16 0.48 -0.22 0.29 1.00 0.94
## log(NOx) 0.11 0.50 -0.12 0.25 0.94 1.00
## SO2Pot -0.06 0.37 -0.01 0.07 0.57 0.68
## log(NOxPot) 0.11 0.50 -0.12 0.25 0.94 1.00
## SO2Pot log(NOxPot)
## JanTemp -0.09 0.18
## JulyTemp -0.07 -0.30
## RelHum -0.12 0.10
## Rain -0.13 -0.39
## Education -0.23 0.03
## PopDensity 0.42 0.34
## NonWhite 0.16 0.21
## WhiteCollar -0.06 0.11
## log(Pop) 0.37 0.50
## Pop.House -0.01 -0.12
## Income 0.07 0.25
## log(HCPot) 0.57 0.94
## log(NOx) 0.68 1.00
## SO2Pot 1.00 0.68
## log(NOxPot) 0.68 1.00
We find that many of the predictors are substantially correlated with each other, and that log(NOx) and log(NOxPot) are perfectly correlated (r = 1). Because of this, interpreting (understanding) the final model will be difficult.
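One way to spot such a pair programmatically is to scan the (rounded) correlation matrix of the predictors for off-diagonal entries equal to 1; a small sketch using the cors matrix computed above:
m <- abs(cors[-1, -1])   # correlations among the predictors only
which(m == 1 & row(m) != col(m), arr.ind = TRUE)   # each such pair appears twice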
Using perfectly correlated predictors is not possible, so we eliminate one of them, say log(NOxPot) (column 16 of newair):
newair <- newair[, -16]
Next we fit a model with all the predictors and check the assumptions:
fit <- lm(Mortality~., data=newair)
pushViewport(viewport(layout = grid.layout(1, 2)))
df <- data.frame(Residuals=resid(fit),
Fits = fitted(fit))
print(ggplot(data=df, aes(sample=Residuals)) +
geom_qq() + geom_qq_line(),
vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=df, aes(Fits, Residuals)) +
geom_point() +
geom_hline(yintercept = 0),
vp=viewport(layout.pos.row=1, layout.pos.col=2))
The normal plot is OK, so there is no problem with the normality assumption. The residual vs. fits plot looks fine, so there is no problem with the equal variance assumption.
Next we use best subset regression to see whether we can find a model with fewer predictors.
out <- leaps(newair[, -1], newair$Mortality,
method = "Cp", nbest=1)
colnames(newair)[-1][out$which[seq_along(out$Cp)[out$Cp==min(out$Cp)], ]]
## [1] "JanTemp" "Rain" "PopDensity" "NonWhite" "WhiteCollar"
## [6] "log(NOx)"
It suggests a model based on JanTemp, Rain, PopDensity, NonWhite, WhiteCollar and log(NOx), with Mallows' Cp=4.32.
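The minimum Cp value itself can be read off the leaps output from above (a small sketch):
round(min(out$Cp), 2)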
df <- newair[, c("Mortality",
"JanTemp",
"Rain",
"PopDensity",
"NonWhite",
"WhiteCollar",
"log(NOx)")]
fit <- lm(Mortality~., data=df)
summary(fit)
##
## Call:
## lm(formula = Mortality ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.749 -24.087 3.249 19.473 93.474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 944.274572 47.170880 20.018 < 2e-16
## JanTemp -1.941998 0.516119 -3.763 0.000428
## Rain 1.924283 0.484198 3.974 0.000219
## PopDensity 0.006437 0.003608 1.784 0.080282
## NonWhite 4.194072 0.620038 6.764 1.18e-08
## WhiteCollar -2.723255 0.947170 -2.875 0.005839
## `log(NOx)` 17.000178 4.879795 3.484 0.001012
##
## Residual standard error: 33.46 on 52 degrees of freedom
## Multiple R-squared: 0.7425, Adjusted R-squared: 0.7127
## F-statistic: 24.98 on 6 and 52 DF, p-value: 1.028e-13
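As with the earlier fits we should also check the assumptions of this smaller model; a sketch with the same two plots as before (df1 is used so the model data frame df is not overwritten):
pushViewport(viewport(layout = grid.layout(1, 2)))
df1 <- data.frame(Residuals=resid(fit),
                  Fits = fitted(fit))
print(ggplot(data=df1, aes(sample=Residuals)) +
        geom_qq() + geom_qq_line(),
      vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=df1, aes(Fits, Residuals)) +
        geom_point() +
        geom_hline(yintercept = 0),
      vp=viewport(layout.pos.row=1, layout.pos.col=2))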
Because the best model still includes one of the pollution variables, we can conclude that pollution adds to the mortality rate.
And we are done!
Our next data set, ustemperature, gives the normal average January minimum temperature in degrees Fahrenheit, together with the latitude and longitude, for 56 U.S. cities. (For each year from 1931 to 1960, the daily minimum temperatures in January were added together and divided by 31. Then the averages for each year were averaged over the 30 years.)
Variables:
City: City
State: State postal abbreviation
JanTemp: Average January minimum temperature in degrees F.
Latitude: Latitude in degrees north of the equator
Longitude: Longitude in degrees west of the prime meridian
kable.nice(head(ustemperature))
City | State | JanTemp | Latitude | Longitude |
---|---|---|---|---|
Mobile | AL | 44 | 31.2 | 88.5 |
Montgomery | AL | 38 | 32.9 | 86.8 |
Phoenix | AZ | 35 | 33.6 | 112.5 |
LittleRock | AR | 31 | 35.4 | 92.8 |
LosAngeles | CA | 47 | 34.3 | 118.7 |
SanFrancisco | CA | 42 | 38.4 | 123.0 |
We want to develop a model that predicts the temperature from the longitude and latitude.
Let’s begin by considering the predictors:
ggplot(data=ustemperature, aes(Longitude, Latitude)) +
geom_point()
this, however, looks wrong: the map is flipped left to right. That is because longitude is given in degrees East or West, and all of the US is in the West. So we need to switch the sign:
ustemperature$Longitude <- -ustemperature$Longitude
We can do even better than just a scatterplot by drawing the points on a map of the US:
library(maps)
usa <- map_data("usa")
ggplot() +
geom_polygon(data = usa,
aes(x=long, y = lat, group = group),
alpha=0.1) +
coord_fixed(1.3) +
geom_point(data=ustemperature,
aes(Longitude, Latitude)) +
labs(x="Longitude", y="Latitude") +
scale_x_continuous(breaks = c(-120, -100, -80),
labels = c("120W", "100W", "80W"))
Now the relationship of Latitude to Temperature
ggplot(data=ustemperature, aes(Latitude, JanTemp)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
seems fairly linear.
ggplot(data=ustemperature, aes(Longitude, JanTemp)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
this does not. Let’s fit a polynomial model in Longitude:
fit.quad <- lm(JanTemp~poly(Longitude, 2),
data=ustemperature)
df <- data.frame(Residuals=resid(fit.quad),
Fits = fitted(fit.quad))
ggplot(data=df, aes(Fits, Residuals)) +
geom_point() +
geom_hline(yintercept = 0)
not so good yet, so
fit.cube <- lm(JanTemp~poly(Longitude, 3),
data=ustemperature)
df <- data.frame(Residuals=resid(fit.cube),
Fits = fitted(fit.cube))
ggplot(data=df, aes(Fits, Residuals)) +
geom_point() +
geom_hline(yintercept = 0)
and that is fine.
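We could also confirm that the cubic term is needed with the nested models test used earlier (a sketch; the quadratic and cubic fits are nested):
anova(fit.quad, fit.cube)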
Now we put it together:
fit <- lm(JanTemp~Latitude + poly(Longitude, 3),
data=ustemperature)
summary(fit)
##
## Call:
## lm(formula = JanTemp ~ Latitude + poly(Longitude, 3), data = ustemperature)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.569 -1.624 0.218 1.472 7.039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 118.39739 3.53301 33.512 < 2e-16
## Latitude -2.35772 0.08998 -26.202 < 2e-16
## poly(Longitude, 3)1 -15.99052 3.26685 -4.895 1.03e-05
## poly(Longitude, 3)2 36.26524 3.47734 10.429 3.02e-14
## poly(Longitude, 3)3 -27.59874 3.30506 -8.350 4.13e-11
##
## Residual standard error: 3.225 on 51 degrees of freedom
## Multiple R-squared: 0.9461, Adjusted R-squared: 0.9419
## F-statistic: 223.9 on 4 and 51 DF, p-value: < 2.2e-16
pushViewport(viewport(layout = grid.layout(1, 2)))
df <- data.frame(Residuals=resid(fit),
Fits = fitted(fit))
print(ggplot(data=df, aes(sample=Residuals)) +
geom_qq() + geom_qq_line(),
vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=df, aes(Fits, Residuals)) +
geom_point() +
geom_hline(yintercept = 0),
vp=viewport(layout.pos.row=1, layout.pos.col=2))
shows that this is indeed a good model.
How can we visualize this model? Let’s do the following:
x <- seq(-130, -70, length=50)
y <- seq(25, 50, length=50)
xy <- expand.grid(x, y)
df <- data.frame(Longitude=xy[, 1],
Latitude=xy[, 2])
df$Temp <- predict(fit, df)
ggplot() +
geom_polygon(data = usa,
aes(x=long, y = lat, group = group),
alpha=0.1) +
coord_fixed(1.3) +
geom_point(data=df, aes(Longitude, Latitude, color=Temp)) +
scale_colour_gradient(low="blue", high="red") +
labs(x="Longitude", y="Latitude", color="Temperature") +
scale_x_continuous(breaks = c(-120, -100, -80),
labels = c("120W", "100W", "80W"))
It would be nicer, though, if we had points only in the US. We can use the data set us.cities in the maps package to get coordinates:
df <- us.cities[, 5:4]
df <- df[df[, 2]<50, ] #not Alaska
df <- df[df[, 2]>25, ] #not Hawaii
colnames(df) <- c("Longitude", "Latitude")
df$Temp <- predict(fit, df)
ggplot() +
geom_polygon(data = usa,
aes(x=long, y = lat, group = group),
alpha=0.1) +
coord_fixed(1.3) +
geom_point(data=df, aes(Longitude, Latitude, color=Temp)) +
scale_colour_gradient(low="blue", high="red") +
labs(x="Longitude", y="Latitude", color="Temperature") +
scale_x_continuous(breaks = c(-120, -100, -80),
labels = c("120W", "100W", "80W"))
How about Puerto Rico?
df <- data.frame(Longitude=-65.1, Latitude=18.1)
predict(fit, df)
## 1
## 66.26492
Clearly this is an extrapolation!
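We can see why with a small sketch: Puerto Rico lies far outside the range of latitudes and longitudes covered by the 56 cities in the data.
range(ustemperature$Latitude)    # Puerto Rico's 18.1 lies south of all of these
range(ustemperature$Longitude)   # and -65.1 lies east of all of these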
Here is another way to understand this model: imagine we take a trip from New York (Long=73.94W, Lat=40.67) to San Francisco (Long=122.45W, Lat=37.77). How does our model say the temperature changes along the way?
df <- data.frame(Longitude=seq(-122, -70, length=250),
Latitude=rep(40, 250))
df$Temp <- predict(fit, df)
ggplot(df, aes(Longitude, Temp)) +
geom_line()
How about a nonparametric fit?
fit.loess <- loess(JanTemp~Longitude+Latitude,
data=ustemperature)
df$Temp <- predict(fit.loess, df)
ggplot(df, aes(Longitude, Temp)) +
geom_line()
which looks fairly similar.
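To compare the two fits directly we can draw both temperature profiles for the same trip in one graph (a sketch, reusing the trip data frame df from above; Temp.lm and Temp.loess are just illustrative names):
df$Temp.lm <- predict(fit, df)           # parametric fit: Latitude + cubic in Longitude
df$Temp.loess <- predict(fit.loess, df)  # nonparametric loess fit
ggplot(df, aes(x = Longitude)) +
  geom_line(aes(y = Temp.lm, color = "lm")) +
  geom_line(aes(y = Temp.loess, color = "loess")) +
  labs(y = "Temp", color = "Fit")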