Multiple Linear Regression

Example: House Prices

Prices of residences located 30 miles south of a large metropolitan area, together with several possible predictor variables.

Notice the 1.7 baths!

kable.nice(houseprice)
Price Sqfeet Floors Bedrooms Baths
1 69.0 1500.000 1 2 1.0
3 118.5 1880.952 1 2 2.0
4 104.0 1976.190 1 3 2.0
5 116.5 1880.952 1 3 2.0
6 121.5 1880.952 1 3 2.0
7 125.0 1976.190 1 3 2.0
8 128.0 2357.143 2 3 2.5
9 129.9 2166.667 1 3 1.7
10 133.0 2166.667 2 3 2.5
11 135.0 2166.667 2 3 2.5
12 137.5 2357.143 2 3 2.5
13 139.9 2166.667 1 3 2.0
14 143.9 2261.905 2 3 2.5
15 147.9 2547.619 2 3 2.5
16 154.9 2357.143 2 3 2.5
17 160.0 2738.095 2 3 2.0
18 169.0 2357.143 1 3 2.0
19 169.9 2642.857 1 3 2.0
20 125.0 2166.667 1 4 2.0
21 134.9 2166.667 1 4 2.0
22 139.9 2547.619 1 4 2.0
23 147.0 2642.857 1 4 2.0
24 159.0 2261.905 1 4 2.0
25 169.9 2547.619 2 4 3.0
26 178.9 2738.095 1 4 2.0
27 194.5 2833.333 2 4 3.0
28 219.9 2928.571 1 4 2.5
29 269.0 3309.524 2 4 3.0

Let’s go through the list of predictors one by one:

pushViewport(viewport(layout = grid.layout(2, 2)))
pos <- expand.grid(1:2, 1:2)
# plot Price against each predictor (with least squares line) and print the correlation
for(i in 1:4) {
  plt <- 
    ggplot(data=houseprice, 
           aes_(x = as.name(names(houseprice)[i+1]), 
                y = as.name(names(houseprice)[1]))) +
    geom_point() +
    geom_smooth(method = "lm", se=FALSE) 
  print(plt, 
  vp=viewport(layout.pos.row=pos[i, 1], 
              layout.pos.col=pos[i, 2]))
  cat("Price and ", colnames(houseprice)[i+1], 
      " ", round(cor(houseprice$Price, houseprice[, i+1]), 3), "\n")
}

## Price and  Sqfeet   0.915 
## Price and  Floors   0.291 
## Price and  Bedrooms   0.605 
## Price and  Baths   0.653
Now we fit the regression model with all four predictors:

fit <- lm(Price~., data=houseprice)
summary(fit)
## 
## Call:
## lm(formula = Price ~ ., data = houseprice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.018  -5.943   1.860   5.947  30.955 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67.61984   17.70818  -3.819 0.000882
## Sqfeet        0.08571    0.01076   7.966 4.62e-08
## Floors      -26.49306    9.48952  -2.792 0.010363
## Bedrooms     -9.28622    6.82985  -1.360 0.187121
## Baths        37.38067   12.26436   3.048 0.005709
## 
## Residual standard error: 13.71 on 23 degrees of freedom
## Multiple R-squared:  0.8862, Adjusted R-squared:  0.8665 
## F-statistic:  44.8 on 4 and 23 DF,  p-value: 1.558e-10

For the assumptions there is nothing new: as before we need to check the residuals vs. fits plot and the normal plot of the residuals:

pushViewport(viewport(layout = grid.layout(1, 2)))
df <- data.frame(Residuals=resid(fit), 
            Fits = fitted(fit))
print(ggplot(data=df, aes(sample=Residuals)) +
           geom_qq() + geom_qq_line(),
    vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=df, aes(Fits, Residuals)) +
            geom_point() +
            geom_hline(yintercept = 0),
    vp=viewport(layout.pos.row=1, layout.pos.col=2))        

This appears to be a good model, and the assumptions of normally distributed residuals with equal variance appear to be o.k.

Except...

Notice that there is something very strange about this model: the coefficients of Floors and Bedrooms are negative, even though both variables have a positive correlation with Price!

Let’s have a look at the correlations between the predictors:

round(cor(houseprice[, -1]), 3)
##          Sqfeet Floors Bedrooms Baths
## Sqfeet    1.000  0.370    0.652 0.628
## Floors    0.370  1.000   -0.018 0.743
## Bedrooms  0.652 -0.018    1.000 0.415
## Baths     0.628  0.743    0.415 1.000

The highest correlation between predictors is r=0.743 (Floors-Baths)

As in the case of polynomial regression, highly correlated predictors are a potential problem. We will look at a solution called principal components at some point. Here we are ok.
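One common way to quantify how strongly the predictors are related to each other is the variance inflation factor. A minimal sketch, assuming the car package is installed (vif is not part of base R):

library(car)
# variance inflation factor of each predictor in the full model;
# values much larger than about 5-10 would signal a multicollinearity problem
vif(fit)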

Variable Selection

We have previously talked about the fact that we want our models to be as simple as possible. Often that means a model with as few predictors as possible. So the question becomes:

Can we eliminate any of our predictors without making the model (stat. signif.) worse?

There are several things one can think of:

Choose based on R2

but we already know this will always lead to the model with all predictors, for the same reason that a cubic model always has an R2 at least as high as the quadratic model.

Note:

Price by Sqfeet, Floors and Bedrooms: R2=80.1%
Price by Floors, Bedrooms and Baths: R2=68.4%
Price by Sqfeet, Bedrooms and Baths: R2=83.5%
Price by Sqfeet, Floors, Bedrooms and Baths: R2=88.2%
so the model with all 4 predictors has a higher R2 than any of the models with just 3, but this will always be so, even if one of the predictors is completely useless (the short sketch below reproduces these numbers).
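We can reproduce these R2 values directly; a minimal sketch, with houseprice loaded as above:

# R^2 (in percent) of each three-predictor model and of the full model
forms <- list(
  Price ~ Sqfeet + Floors + Bedrooms,
  Price ~ Floors + Bedrooms + Baths,
  Price ~ Sqfeet + Bedrooms + Baths,
  Price ~ Sqfeet + Floors + Bedrooms + Baths)
round(100*sapply(forms, 
   function(f) summary(lm(f, data=houseprice))$r.squared), 1)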

Choose based on Hypothesis Tests

In the summary(fit) above we see that the p-value of Bedrooms is 0.187121 > 0.05, so we eliminate Bedrooms.

This sounds like a good idea AND IT IS WIDELY USED IN REAL LIFE, but it turns out to be a bad one! The reason why is a bit hard to explain, though.

Use nested models test

fit.without.bedrooms <- lm(Price~.-Bedrooms,
                           data=houseprice)
anova(fit, fit.without.bedrooms)
## Analysis of Variance Table
## 
## Model 1: Price ~ Sqfeet + Floors + Bedrooms + Baths
## Model 2: Price ~ (Sqfeet + Floors + Bedrooms + Baths) - Bedrooms
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     23 4321.9                           
## 2     24 4669.2 -1   -347.38 1.8487 0.1871

Again, this sounds like a good idea AND AGAIN IT IS WIDELY USED IN REAL LIFE, but it turns out to be a dangerous one! To start, if we have several predictors we might want to eliminate, we immediately face the issue of simultaneous inference.

There are several methods in wide use that are essentially based on this idea, such as forward selection, backward selection and stepwise regression. These are sometimes unavoidable but need to be done with great care!
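For illustration, here is what one of these automatic procedures looks like in R. This is only a sketch: the built-in step function does backward elimination based on AIC rather than on p-values, but the spirit (and the dangers) are the same:

# backward selection starting from the full house price model;
# trace=0 suppresses the step-by-step printout
step(fit, direction="backward", trace=0)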


What we need is a new idea:

Best Subset Regression and Mallows' Cp

We will find ALL possible models and calculate Mallows' Cp statistic for each. The model with the lowest Cp is best.
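For reference, one common form of the statistic for a submodel with p coefficients (counting the intercept) is

$$C_p = \frac{SSE_p}{\hat{\sigma}^2} - n + 2p,$$

where $SSE_p$ is the error sum of squares of the submodel and $\hat{\sigma}^2$ is the mean squared error of the model with all predictors. By construction the model with all predictors has $C_p = p$, which is why it appears with Cp exactly 5.0 in the output below.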

library(leaps)
# exhaustive search over all subsets; nbest=1 keeps only the best model of each size
out <- leaps(houseprice[, -1], houseprice$Price, 
      method = "Cp", nbest=1)
out
## $which
##      1     2     3     4
## 1 TRUE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE  TRUE
## 3 TRUE  TRUE FALSE  TRUE
## 4 TRUE  TRUE  TRUE  TRUE
## 
## $label
## [1] "(Intercept)" "1"           "2"           "3"           "4"          
## 
## $size
## [1] 2 3 4 5
## 
## $Cp
## [1] 8.834171 8.812489 4.848657 5.000000
# predictors included in the model with the smallest Cp
colnames(houseprice)[2:5][out$which[seq_along(out$Cp)[out$Cp==min(out$Cp)], ]]
## [1] "Sqfeet" "Floors" "Baths"

so the best model uses Sqfeet, Floors and Baths.

To find the model we rerun lm, now without Bedrooms. This is the same fit.without.bedrooms model we already used in the nested models test above:

fit.without.bedrooms <- lm(Price~.-Bedrooms,
                           data=houseprice)

Note that the model with all four predictors has Cp=5.0. But Cp is a statistic; its exact value depends on the sample. So is the model with Sqfeet, Floors and Baths statistically significantly better than the one with all four predictors? We would need a hypothesis test to answer this question, but that is not part of our course.

Prediction

Prediction works just as it did for simple regression. Say we want to find a 90% interval estimate for a house that has 2000 sqfeet, one floor and two baths. Then

predict(fit.without.bedrooms, 
        newdata=data.frame(Sqfeet=2000, 
                           Floors=1, 
                           Bedrooms=0,
                           Baths=2), 
        interval="prediction", level=0.9)
##        fit      lwr     upr
## 1 122.9797 98.07234 147.887
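The interval above is a prediction interval for the price of an individual house with these features. If we instead wanted an interval estimate for the mean price of all such houses we would use interval="confidence", which yields a narrower interval. A quick sketch:

# same new house as above, but now an interval for the mean price of such houses
predict(fit.without.bedrooms, 
        newdata=data.frame(Sqfeet=2000, 
                           Floors=1, 
                           Bedrooms=0,
                           Baths=2), 
        interval="confidence", level=0.9)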

Example: Air Pollution and Mortality

The dependent variable for analysis is age adjusted mortality (called “Mortality”). The data include variables measuring demographic characteristics of the cities, variables measuring climate characteristics, and variables recording the pollution potential of three different air pollutants.

x <- airpollution
rownames(x) <- NULL
kable.nice(head(airpollution[, 1:5], 3))
Mortality JanTemp JulyTemp RelHum Rain
AkronOH 921.87 27 71 59 36
Albany-Schenectady-TroyNY 997.87 23 72 57 35
AllentownBethlehemPA-NJ 962.35 29 74 54 44
kable.nice(head(x[, 6:10], 3))
Education PopDensity NonWhite WhiteCollar Pop
11.4 3243 8.8 42.6 660328
11.0 4281 3.5 50.7 835880
9.8 4260 0.8 39.4 635481
kable.nice(head(x[, 11:16], 3))
Pop.House Income HCPot NOxPot SO2Pot NOx
3.34 29560 21 15 59 15
3.14 31458 8 10 39 10
3.21 31856 6 6 33 6

We want to look at the scatterplots and the correlations. There are 15 predictors, so there are 15 graphs and 15 correlations.

pos <- expand.grid(1:2, 1:2)
pushViewport(viewport(layout = grid.layout(2, 2)))
for(i in 1:4) {
   plt <- 
       ggplot(data=airpollution, 
           aes_(x = as.name(names(airpollution)[i+1]), 
                y = as.name(names(airpollution)[1]))) +
    geom_point() +
    geom_smooth(method = "lm", se=FALSE) 
    print(plt, 
    vp=viewport(layout.pos.row=pos[i, 1], 
              layout.pos.col=pos[i, 2]))
    cat("Mortality and ", colnames(airpollution)[i+1], 
      " ", round(cor(airpollution$Mortality, 
                     airpollution[, i+1]), 3), "\n")
}  

## Mortality and  JanTemp   -0.016 
## Mortality and  JulyTemp   0.322 
## Mortality and  RelHum   -0.101 
## Mortality and  Rain   0.433
pushViewport(viewport(layout = grid.layout(2, 2)))
for(i in 5:8) {
   plt <- 
       ggplot(data=airpollution, 
           aes_(x = as.name(names(airpollution)[i+1]), 
                y = as.name(names(airpollution)[1]))) +
    geom_point() +
    geom_smooth(method = "lm", se=FALSE) 
    print(plt, 
    vp=viewport(layout.pos.row=pos[i-4, 1], 
              layout.pos.col=pos[i-4, 2]))
    cat("Mortality and ", colnames(airpollution)[i+1], 
      " ", round(cor(airpollution$Mortality, 
                     airpollution[, i+1]), 3), "\n")
}  

## Mortality and  Education   -0.508 
## Mortality and  PopDensity   0.252 
## Mortality and  NonWhite   0.647 
## Mortality and  WhiteCollar   -0.289
pushViewport(viewport(layout = grid.layout(2, 2)))
for(i in 9:11) {
   plt <- 
      ggplot(data=airpollution, 
           aes_(x = as.name(names(airpollution)[i+1]), 
                y = as.name(names(airpollution)[1]))) +
    geom_point() +
    geom_smooth(method = "lm", se=FALSE) 
    print(plt, 
    vp=viewport(layout.pos.row=pos[i-8, 1], 
              layout.pos.col=pos[i-8, 2]))
    cat("Mortality and ", colnames(airpollution)[i+1], 
      " ", round(cor(airpollution$Mortality, 
                     airpollution[, i+1]), 3), "\n")
}  

## Mortality and  Pop   0.059 
## Mortality and  Pop.House   0.368 
## Mortality and  Income   -0.283
pushViewport(viewport(layout = grid.layout(2, 2)))
for(i in 12:15) {
   plt <- 
      ggplot(data=airpollution, 
           aes_(x = as.name(names(airpollution)[i+1]), 
                y = as.name(names(airpollution)[1]))) +
    geom_point() +
    geom_smooth(method = "lm", se=FALSE) 
    print(plt, 
    vp=viewport(layout.pos.row=pos[i-11, 1], 
              layout.pos.col=pos[i-11, 2]))
    cat("Mortality and ", colnames(airpollution)[i+1], 
      " ", round(cor(airpollution$Mortality, 
                     airpollution[, i+1]), 3), "\n")
}  

## Mortality and  HCPot   -0.185 
## Mortality and  NOxPot   -0.085 
## Mortality and  SO2Pot   0.419 
## Mortality and  NOx   -0.085

There are problems with four predictors (Pop, HCPot, NOx, and NOxPot). Let’s try the log transform and check again for those predictors:

The easiest way to do this is to make a new data frame:

newair <- airpollution
newair[ ,c("Pop", "HCPot", "NOx", "NOxPot")] <- 
  log(newair[, c("Pop", "HCPot", "NOx", "NOxPot")])
colnames(newair)[c(10, 13, 14, 16)] <- c("log(Pop)", "log(HCPot)", "log(NOx)", "log(NOxPot)")
pushViewport(viewport(layout = grid.layout(2, 2)))
k <- 0
for(i in c(10, 13, 14, 16)) {
    k <- k+1
    plt <- 
      ggplot(data=newair, 
           aes_(x = as.name(names(newair)[i]), 
                y = as.name(names(newair)[1]))) +
    geom_point() +
    geom_smooth(method = "lm", se=FALSE) 
    print(plt, 
    vp=viewport(layout.pos.row=pos[k, 1], 
              layout.pos.col=pos[k, 2]))
    cat("Mortality and ", colnames(newair)[i], 
      " ", round(cor(newair$Mortality, 
                     newair[, i]), 3), "\n")
} 

## Mortality and  log(Pop)   0.085 
## Mortality and  log(HCPot)   0.125 
## Mortality and  log(NOx)   0.28 
## Mortality and  log(NOxPot)   0.28

so in all cases the log transform worked, and we will use newair from now on.

Let's find the correlations (in absolute value) of the predictors with the response, sorted in decreasing order:

cors <- round(cor(newair), 2)
sort(abs(cors[ ,"Mortality"]), decreasing = TRUE)[-1]
##    NonWhite   Education        Rain      SO2Pot   Pop.House    JulyTemp 
##        0.65        0.51        0.43        0.42        0.37        0.32 
## WhiteCollar      Income    log(NOx) log(NOxPot)  PopDensity  log(HCPot) 
##        0.29        0.28        0.28        0.28        0.25        0.13 
##      RelHum    log(Pop)     JanTemp 
##        0.10        0.09        0.02

Next we look at the correlations between the predictors.

cors[-1, -1]
##             JanTemp JulyTemp RelHum  Rain Education PopDensity NonWhite
## JanTemp        1.00     0.32   0.09  0.06      0.11      -0.08     0.46
## JulyTemp       0.32     1.00  -0.44  0.47     -0.27      -0.01     0.60
## RelHum         0.09    -0.44   1.00 -0.12      0.19      -0.15    -0.12
## Rain           0.06     0.47  -0.12  1.00     -0.47       0.08     0.30
## Education      0.11    -0.27   0.19 -0.47      1.00      -0.24    -0.21
## PopDensity    -0.08    -0.01  -0.15  0.08     -0.24       1.00    -0.01
## NonWhite       0.46     0.60  -0.12  0.30     -0.21      -0.01     1.00
## WhiteCollar    0.21    -0.01   0.01 -0.11      0.49       0.25    -0.06
## log(Pop)       0.32     0.04  -0.02 -0.28      0.27       0.21     0.22
## Pop.House     -0.33     0.26  -0.14  0.20     -0.39      -0.17     0.35
## Income         0.20    -0.19   0.13 -0.36      0.51       0.00    -0.10
## log(HCPot)     0.23    -0.41   0.18 -0.48      0.18       0.26     0.15
## log(NOx)       0.18    -0.30   0.10 -0.39      0.03       0.34     0.21
## SO2Pot        -0.09    -0.07  -0.12 -0.13     -0.23       0.42     0.16
## log(NOxPot)    0.18    -0.30   0.10 -0.39      0.03       0.34     0.21
##             WhiteCollar log(Pop) Pop.House Income log(HCPot) log(NOx)
## JanTemp            0.21     0.32     -0.33   0.20       0.23     0.18
## JulyTemp          -0.01     0.04      0.26  -0.19      -0.41    -0.30
## RelHum             0.01    -0.02     -0.14   0.13       0.18     0.10
## Rain              -0.11    -0.28      0.20  -0.36      -0.48    -0.39
## Education          0.49     0.27     -0.39   0.51       0.18     0.03
## PopDensity         0.25     0.21     -0.17   0.00       0.26     0.34
## NonWhite          -0.06     0.22      0.35  -0.10       0.15     0.21
## WhiteCollar        1.00     0.28     -0.35   0.37       0.16     0.11
## log(Pop)           0.28     1.00     -0.26   0.41       0.48     0.50
## Pop.House         -0.35    -0.26      1.00  -0.30      -0.22    -0.12
## Income             0.37     0.41     -0.30   1.00       0.29     0.25
## log(HCPot)         0.16     0.48     -0.22   0.29       1.00     0.94
## log(NOx)           0.11     0.50     -0.12   0.25       0.94     1.00
## SO2Pot            -0.06     0.37     -0.01   0.07       0.57     0.68
## log(NOxPot)        0.11     0.50     -0.12   0.25       0.94     1.00
##             SO2Pot log(NOxPot)
## JanTemp      -0.09        0.18
## JulyTemp     -0.07       -0.30
## RelHum       -0.12        0.10
## Rain         -0.13       -0.39
## Education    -0.23        0.03
## PopDensity    0.42        0.34
## NonWhite      0.16        0.21
## WhiteCollar  -0.06        0.11
## log(Pop)      0.37        0.50
## Pop.House    -0.01       -0.12
## Income        0.07        0.25
## log(HCPot)    0.57        0.94
## log(NOx)      0.68        1.00
## SO2Pot        1.00        0.68
## log(NOxPot)   0.68        1.00

We find:

  1. there are sizable correlations between some of the predictors (for example cor(NonWhite, JulyTemp)=0.60). Because of this, interpreting (understanding) the final model will be difficult.

  2. log(NOxPot) and log(NOx) are perfectly correlated.

Using perfectly correlated predictors is not possible, so we eliminate one of them, say log(NOxPot):

newair <- newair[, -16]
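As an aside, the same column could also be dropped by name rather than by position, which is a little more robust to columns being reordered; an equivalent one-liner:

# keep every column except the one named log(NOxPot)
newair <- newair[, colnames(newair) != "log(NOxPot)"]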

Next we fit a model with all the predictors and check the assumptions:

fit <- lm(Mortality~., data=newair)
pushViewport(viewport(layout = grid.layout(1, 2)))
df <- data.frame(Residuals=resid(fit), 
            Fits = fitted(fit))
print(ggplot(data=df, aes(sample=Residuals)) +
           geom_qq() + geom_qq_line(),
    vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=df, aes(Fits, Residuals)) +
            geom_point() +
            geom_hline(yintercept = 0),
    vp=viewport(layout.pos.row=1, layout.pos.col=2))        

The normal plot is ok, so there is no problem with the normality assumption. The residual vs. fits plot looks fine, so there is no problem with the equal variance assumption either.

Next we use the best subset regression to see whether we can find a model with fewer predictors.

out <- leaps(newair[, -1], newair$Mortality, 
      method = "Cp", nbest=1)
colnames(newair)[-1][out$which[seq_along(out$Cp)[out$Cp==min(out$Cp)], ]]
## [1] "JanTemp"     "Rain"        "PopDensity"  "NonWhite"    "WhiteCollar"
## [6] "log(NOx)"

It suggests a model based on JanTemp, Rain, PopDensity, NonWhite, WhiteCollar and log(NOx), with Mallows' Cp=4.32.

df <-  newair[, c("Mortality", 
                  "JanTemp", 
                  "Rain", 
                  "PopDensity", 
                  "NonWhite",
                  "WhiteCollar", 
                  "log(NOx)")]
fit <- lm(Mortality~., data=df)
summary(fit)
## 
## Call:
## lm(formula = Mortality ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.749 -24.087   3.249  19.473  93.474 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 944.274572  47.170880  20.018  < 2e-16
## JanTemp      -1.941998   0.516119  -3.763 0.000428
## Rain          1.924283   0.484198   3.974 0.000219
## PopDensity    0.006437   0.003608   1.784 0.080282
## NonWhite      4.194072   0.620038   6.764 1.18e-08
## WhiteCollar  -2.723255   0.947170  -2.875 0.005839
## `log(NOx)`   17.000178   4.879795   3.484 0.001012
## 
## Residual standard error: 33.46 on 52 degrees of freedom
## Multiple R-squared:  0.7425, Adjusted R-squared:  0.7127 
## F-statistic: 24.98 on 6 and 52 DF,  p-value: 1.028e-13

Because the best model still includes one of the pollution variables, we can conclude that pollution adds to the mortality rate.

And we are done!

Example: US Temperatures

The data gives the normal average January minimum temperature in degrees Fahrenheit with the latitude and longitude of 56 U.S. cities. (For each year from 1931 to 1960, the daily minimum temperatures in January were added together and divided by 31. Then, the averages for each year were averaged over the 30 years.)

Variables:
City: City
State: State postal abbreviation
JanTemp: Average January minimum temperature in degrees F.
Latitude: Latitude in degrees north of the equator
Longitude: Longitude in degrees west of the prime meridian

kable.nice(head(ustemperature))
City State JanTemp Latitude Longitude
Mobile AL 44 31.2 88.5
Montgomery AL 38 32.9 86.8
Phoenix AZ 35 33.6 112.5
LittleRock AR 31 35.4 92.8
LosAngeles CA 47 34.3 118.7
SanFrancisco CA 42 38.4 123.0

We want to develop a model that predicts the temperature from the longitude and latitude.

Let’s begin by considering the predictors:

ggplot(data=ustemperature, aes(Longitude, Latitude)) +
  geom_point() 

This, however, looks wrong: the map is mirrored left to right. That is because longitudes come with an East/West designation, and all of the US is in the West, so we need to switch the sign:

ustemperature$Longitude <- -ustemperature$Longitude

we can do even better than just the scatterplot:

library(maps)
usa <- map_data("usa") 
ggplot() + 
  geom_polygon(data = usa, 
              aes(x=long, y = lat, group = group),
              alpha=0.1) + 
  coord_fixed(1.3) +
  geom_point(data=ustemperature,
             aes(Longitude, Latitude)) +
  labs(x="Longitude", y="Latitude") +
  scale_x_continuous(breaks = c(-120, -100, -80), 
                     labels = c("120W", "100W", "80W"))

Now the relationship of Latitude to Temperature

ggplot(data=ustemperature, aes(Latitude, JanTemp)) +
  geom_point() +
  geom_smooth(method = "lm", se=FALSE)

seems fairly linear.

ggplot(data=ustemperature, aes(Longitude, JanTemp)) +
  geom_point() +
  geom_smooth(method = "lm", se=FALSE)

this does not. Let’s fit a polynomial model in Longitude:

fit.quad <- lm(JanTemp~poly(Longitude, 2), 
               data=ustemperature)
df <- data.frame(Residuals=resid(fit.quad), 
            Fits = fitted(fit.quad))
ggplot(data=df, aes(Fits, Residuals)) +
            geom_point() +
            geom_hline(yintercept = 0)        

not so good yet, so

fit.cube <- lm(JanTemp~poly(Longitude, 3), 
               data=ustemperature)
df <- data.frame(Residuals=resid(fit.cube), 
            Fits = fitted(fit.cube))
ggplot(data=df, aes(Fits, Residuals)) +
            geom_point() +
            geom_hline(yintercept = 0)        

and that is fine.
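We can also back this up with the nested models test we used earlier; a quick sketch comparing the quadratic and the cubic fit:

# does the cubic term improve significantly on the quadratic model?
anova(fit.quad, fit.cube)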

Now we put it together:

fit <- lm(JanTemp~Latitude + poly(Longitude, 3), 
               data=ustemperature)
summary(fit)
## 
## Call:
## lm(formula = JanTemp ~ Latitude + poly(Longitude, 3), data = ustemperature)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.569 -1.624  0.218  1.472  7.039 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)         118.39739    3.53301  33.512  < 2e-16
## Latitude             -2.35772    0.08998 -26.202  < 2e-16
## poly(Longitude, 3)1 -15.99052    3.26685  -4.895 1.03e-05
## poly(Longitude, 3)2  36.26524    3.47734  10.429 3.02e-14
## poly(Longitude, 3)3 -27.59874    3.30506  -8.350 4.13e-11
## 
## Residual standard error: 3.225 on 51 degrees of freedom
## Multiple R-squared:  0.9461, Adjusted R-squared:  0.9419 
## F-statistic: 223.9 on 4 and 51 DF,  p-value: < 2.2e-16
pushViewport(viewport(layout = grid.layout(1, 2)))
df <- data.frame(Residuals=resid(fit), 
            Fits = fitted(fit))
print(ggplot(data=df, aes(sample=Residuals)) +
           geom_qq() + geom_qq_line(),
    vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=df, aes(Fits, Residuals)) +
            geom_point() +
            geom_hline(yintercept = 0),
    vp=viewport(layout.pos.row=1, layout.pos.col=2))        

shows that this is indeed a good model.

How can we visualize this model? Let's do the following:

  • find a fine grid of points in the US
  • for each such point use the model to predict the temperature
  • draw the map with these predictions
x <- seq(-130, -70, length=50)
y <- seq(25, 50, length=50)
xy <- expand.grid(x, y)
df <- data.frame(Longitude=xy[, 1],
                  Latitude=xy[, 2])
df$Temp <- predict(fit, df)
ggplot() + 
  geom_polygon(data = usa, 
              aes(x=long, y = lat, group = group),
              alpha=0.1) + 
  coord_fixed(1.3) +
  geom_point(data=df, aes(Longitude, Latitude, color=Temp)) +
  scale_colour_gradient(low="blue", high="red") +
  labs(x="Longitude", y="Latitude", color="Temperature") +
  scale_x_continuous(breaks = c(-120, -100, -80), 
                     labels = c("120W", "100W", "80W"))

It would be nicer, though, if we had points only in the US. We can use the data set us.cities in the maps package to get coordinates:

df <- us.cities[, 5:4]
df <- df[df[, 2]<50, ] #not Alaska
df <- df[df[, 2]>25, ] #not Hawaii
colnames(df) <- c("Longitude", "Latitude")
df$Temp <- predict(fit, df)
ggplot() + 
  geom_polygon(data = usa, 
              aes(x=long, y = lat, group = group),
              alpha=0.1) + 
  coord_fixed(1.3) +
  geom_point(data=df, aes(Longitude, Latitude, color=Temp)) +
  scale_colour_gradient(low="blue", high="red") +
  labs(x="Longitude", y="Latitude", color="Temperature") +
  scale_x_continuous(breaks = c(-120, -100, -80), 
                     labels = c("120W", "100W", "80W"))

How about Puerto Rico?

df <- data.frame(Longitude=-65.1, Latitude=18.1)
predict(fit, df)
##        1 
## 66.26492

Clearly this is an extrapolation!
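A quick look at the ranges of the two predictors in our data shows why: Puerto Rico lies outside both.

# ranges covered by the data; Puerto Rico is at Latitude 18.1, Longitude -65.1
range(ustemperature$Latitude)
range(ustemperature$Longitude)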

Here is another way to understand this model: imagine we take a trip from New York (Long=-73.94, Lat=40.67) to San Francisco (Long=-122.45, Lat=37.77). How does our model say the temperature changes?

df <- data.frame(Longitude=seq(-122, -70, length=250),
                 Latitude=rep(40, 250))
df$Temp <- predict(fit, df)
ggplot(df, aes(Longitude, Temp)) +
  geom_line()

How about a nonparametric fit?

fit.loess <- loess(JanTemp~Longitude+Latitude,
                   data=ustemperature)
df$Temp <- predict(fit.loess, df)
ggplot(df, aes(Longitude, Temp)) +
  geom_line()

which looks fairly similar.
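To see just how similar, we can draw both fits in one graph; a quick sketch reusing the df with the New York to San Francisco trip:

# predicted temperatures along the trip from both models
df$Temp.lm <- predict(fit, df)
df$Temp.loess <- predict(fit.loess, df)
ggplot(df, aes(Longitude)) +
  geom_line(aes(y=Temp.lm), color="blue") +
  geom_line(aes(y=Temp.loess), color="red") +
  labs(y="Temperature")

The linear model with the cubic longitude term is shown in blue, the loess fit in red.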