Prices of residences located 30 miles south of a large metropolitan area, with several possible predictor variables. Notice the 1.7 baths!
attach(houseprice)
houseprice
## Price Sqfeet Floors Bedrooms Baths
## 1 69.0 1500.000 1 2 1.0
## 3 118.5 1880.952 1 2 2.0
## 4 104.0 1976.190 1 3 2.0
## 5 116.5 1880.952 1 3 2.0
## 6 121.5 1880.952 1 3 2.0
## 7 125.0 1976.190 1 3 2.0
## 8 128.0 2357.143 2 3 2.5
## 9 129.9 2166.667 1 3 1.7
## 10 133.0 2166.667 2 3 2.5
## 11 135.0 2166.667 2 3 2.5
## 12 137.5 2357.143 2 3 2.5
## 13 139.9 2166.667 1 3 2.0
## 14 143.9 2261.905 2 3 2.5
## 15 147.9 2547.619 2 3 2.5
## 16 154.9 2357.143 2 3 2.5
## 17 160.0 2738.095 2 3 2.0
## 18 169.0 2357.143 1 3 2.0
## 19 169.9 2642.857 1 3 2.0
## 20 125.0 2166.667 1 4 2.0
## 21 134.9 2166.667 1 4 2.0
## 22 139.9 2547.619 1 4 2.0
## 23 147.0 2642.857 1 4 2.0
## 24 159.0 2261.905 1 4 2.0
## 25 169.9 2547.619 2 4 3.0
## 26 178.9 2738.095 1 4 2.0
## 27 194.5 2833.333 2 4 3.0
## 28 219.9 2928.571 1 4 2.5
## 29 269.0 3309.524 2 4 3.0
Let’s go through the list of predictors one by one:
splot(Price, Sqfeet, add.line = 1)
cor(Price, Sqfeet)
## [1] 0.9152079
strong positive relationship (r=0.915), could be linear
bplot(Price, Floors)
cor(Price, Floors)
## [1] 0.2910762
weak if any positive relationship (r=0.291)
Note that we used a boxplot here even though Floors is a quantitative predictor. If a predictor takes only a few distinct values (two here!), a boxplot is often the better choice.
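bplot is a course routine; in base R a comparable picture comes from boxplot with a formula interface. A minimal sketch:

# base-R sketch: boxplots of Price for each value of Floors
boxplot(Price ~ Floors, data = houseprice, xlab = "Floors", ylab = "Price")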
bplot(Price, Bedrooms)
cor(Price, Bedrooms)
## [1] 0.6045036
strong positive relationship (r=0.605), could be linear
bplot(Price, Baths)
cor(Price, Baths)
## [1] 0.6525626
strong positive relationship (r=0.652), could be linear
Note that so far there has been no mention of regression, residual vs. fits plots, or normal plots. Making decisions about possible transformations and/or polynomial terms early, based solely on scatterplots and/or boxplots, is usually a good idea.
Now to run the regression we have the routine mlr. As always, the first argument is the response variable, but now the second argument is all the predictors, as a matrix:
mlr(Price, houseprice[, -1])
## The least squares regression equation is:
## Price = -67.62 + 0.086 Sqfeet - 26.493 Floors - 9.286 Bedrooms + 37.381 Baths
## R^2 = 88.6%
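mlr is a routine provided by the course; for readers working in plain R, here is a minimal sketch of the same least squares fit. The coefficients and R^2 should match the output above up to rounding.

# base-R sketch of the fit above
fit <- lm(Price ~ Sqfeet + Floors + Bedrooms + Baths, data = houseprice)
round(coef(fit), 3)        # regression coefficients
summary(fit)$r.squared     # R^2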
For the assumptions there is nothing new: as before we need to check the residual vs. fits plot and the normal plot of residuals.
This appears to be a good model, and the assumptions of normally distributed residuals with equal variance appear to be o.k.
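A base-R sketch of these two diagnostic plots (the fit is the same lm object as in the previous sketch):

# residuals vs. fitted values, and a normal quantile plot of the residuals
fit <- lm(Price ~ Sqfeet + Floors + Bedrooms + Baths, data = houseprice)
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))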
Except... notice that there is something very strange about this model: the coefficients of Floors and Bedrooms are negative, even though both have a positive correlation with Price!
round(cor(houseprice[,-1]),3)
## Sqfeet Floors Bedrooms Baths
## Sqfeet 1.000 0.370 0.652 0.628
## Floors 0.370 1.000 -0.018 0.743
## Bedrooms 0.652 -0.018 1.000 0.415
## Baths 0.628 0.743 0.415 1.000
The highest correlation between predictors is r=0.743 (Floors and Baths). Correlations between predictors like these (multicollinearity) are what can produce the counter-intuitive negative coefficients we just saw.
We have previously talked about the fact that we want our models to be as simple as possible. Often that means a model with as few predictors as possible. So the question becomes:
Can we eliminate any of our predictors without making the model (statistically significantly) worse?
There are several things one can think of:
Choose based on R^2
but we already know this will always lead to the model with all predictors, for the same reason that a cubic model always has an R^2 at least as high as the quadratic model.
Note:
Price by Sqfeet, Floors and Bedrooms: R^2=80.1%
Price by Floors, Bedrooms and Baths: R^2=68.4%
Price by Sqfeet, Bedrooms and Baths: R^2=83.5%
Price by Sqfeet, Floors, Bedrooms and Baths: R^2=88.6%
So the model with all four predictors has a higher R^2 than any of the models with just three, but this will always be so, even if one of the predictors is completely useless.
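To see this concretely, here is a small base-R sketch: junk is a made-up predictor of pure noise, yet adding it cannot lower R^2.

# adding a predictor never decreases R^2, even a useless one
set.seed(1)
junk <- runif(nrow(houseprice))      # pure noise, unrelated to Price
fit3 <- lm(Price ~ Sqfeet + Floors + Baths, data = houseprice)
fit4 <- lm(Price ~ Sqfeet + Floors + Baths + junk, data = houseprice)
c(three.predictors = summary(fit3)$r.squared,
  plus.noise       = summary(fit4)$r.squared)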
Choose based on Hypothesis Tests
mlr(Price, houseprice[, -1], show.tests = TRUE)
## The least squares regression equation is:
## Price = -67.62 + 0.086 Sqfeet - 26.493 Floors - 9.286 Bedrooms + 37.381 Baths
## Variable p value
## Constant 0.000
## Sqfeet 0.000
## Floors 0.0104
## Bedrooms 0.1871
## Baths 0.0057
## R^2 = 88.6%
So the p value of Bedrooms is > 0.05, and Bedrooms would be eliminated.
This sounds like a good idea AND IT IS WIDELY USED IN REAL LIFE, but it turns out to be a bad one! The reason why is a bit hard to explain, though.
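For reference, these individual t-test p values come out of summary of an lm fit in base R; a sketch:

# p values of the individual coefficient t-tests
fit <- lm(Price ~ Sqfeet + Floors + Bedrooms + Baths, data = houseprice)
round(summary(fit)$coefficients[, "Pr(>|t|)"], 4)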
Use nested models test
all.preds <- mlr(Price, houseprice[, -1], return.model = TRUE)
without.bedrooms <- mlr(Price, houseprice[, -c(1, 4)],
return.model = TRUE)
nested.models.test(all.preds, without.bedrooms)
## H0: both models are equally good.
## p value= 0.1871
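nested.models.test is a course routine; in base R the same nested-models (partial F) test is done with anova. A sketch:

# partial F test; H0: the smaller (nested) model is adequate
full    <- lm(Price ~ Sqfeet + Floors + Bedrooms + Baths, data = houseprice)
reduced <- lm(Price ~ Sqfeet + Floors + Baths, data = houseprice)
anova(reduced, full)   # dropping one predictor, so the p value equals the t-test p value (0.1871)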
Again, this sounds like a good idea AND AGAIN IT IS WIDELY USED IN REAL LIFE, but it turns out to be a dangerous one! To start, if we have several predictors we might want to eliminate, we immediately face the issue of simultaneous inference.
There are several methods in wide use that are essentially based on this idea, such as forward selection, backward selection and stepwise regression. These are sometimes unavoidable but need to be done with great care!
What we need is a new idea:
Best Subset Regression and Mallows' Cp
We will find ALL possible models and calculate Mallows' Cp statistic for each. The model with the lowest Cp is best.
mallows(Price, houseprice[, -1])
## Number of Variables Cp Sqfeet Floors Bedrooms Baths
## 1 8.83 X
## 2 8.81 X X
## 3 4.85 X X X
## 4 5 X X X X
So the best model uses Sqfeet, Floors and Baths.
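mallows is a course routine; the same all-subsets search can be done with the leaps package, a sketch of which follows (this assumes leaps is installed). As a side note, Cp for the model with all predictors always equals the number of coefficients, here 4 + 1 = 5, which is why the last row above shows Cp = 5.

# all-subsets search with Mallows' Cp via the leaps package (sketch)
library(leaps)
subs <- regsubsets(Price ~ Sqfeet + Floors + Bedrooms + Baths, data = houseprice)
s <- summary(subs)
cbind(s$which, Cp = round(s$cp, 2))   # best model of each size, with its Cp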
To find the model we rerun mlr, now without Bedrooms:
mlr(Price, houseprice[, -c(1,4)])
## The least squares regression equation is:
## Price = -73.93 + 0.078 Sqfeet - 19.68 Floors + 30.579 Baths
## R^2 = 87.7%
Note that the model with all four predictors has Cp=5.0. But Cp is a statistic; its exact value depends on the sample. So, is the model with Sqfeet, Floors and Baths statistically significantly better than the model with all four predictors? We would need a hypothesis test to answer this question, but that is not part of our course.
For more on Mallows' Cp see page 603 of the textbook.
Prediction works just as it did for simple regression. We have the command mlr.predict. Say we want to find a 90% interval estimate for a house that has 2000 sqfeet, one floor and two baths. Then
mlr.predict(Price, houseprice[ ,-c(1, 4)],
newx=c(2000, 1, 2), interval="PI", conf.level=90)
## Sqfeet Floors Baths Fit Lower Upper
## 2000 1 2 122.98 98.07 147.89
and so a 90% prediction interval is ($98,070, $147,890).
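In base R the same interval comes from predict with interval = "prediction"; a sketch:

# 90% prediction interval for a 2000 sqft, 1 floor, 2 bath house
fit <- lm(Price ~ Sqfeet + Floors + Baths, data = houseprice)
predict(fit, newdata = data.frame(Sqfeet = 2000, Floors = 1, Baths = 2),
        interval = "prediction", level = 0.90)   # should agree with the mlr.predict output above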
If we want to do prediction for a number of cases newx has to be a matrix:
newx <- cbind(c(2000,2100,2200), rep(1,3), rep(2,3))
newx
## [,1] [,2] [,3]
## [1,] 2000 1 2
## [2,] 2100 1 2
## [3,] 2200 1 2
mlr.predict(Price, houseprice[ ,-c(1, 4)],
newx=newx, interval="PI")
## Sqfeet Floors Baths Fit Lower Upper
## 2000 1 2 122.98 92.93 153.03
## 2100 1 2 130.75 100.96 160.54
## 2200 1 2 138.52 108.86 168.18
Finally, leaving out newx does the prediction for every case in the data set:
head(mlr.predict(Price, houseprice[ ,-c(1, 4)],
interval="PI"))
## Sqfeet Floors Baths Fit Lower Upper
## 1500.000 1 1 53.54 18.63 88.46
## 1880.952 1 2 113.73 83.23 144.23
## 1976.190 1 2 121.13 91.01 151.25
## 1880.952 1 2 113.73 83.23 144.23
## 1880.952 1 2 113.73 83.23 144.23
## 1976.190 1 2 121.13 91.01 151.25
Our next example uses the airpollution data set. The dependent variable for analysis is age-adjusted mortality (called "Mortality"). The data include variables measuring demographic characteristics of the cities, variables measuring climate characteristics, and variables recording the pollution potential of three different air pollutants.
attach(airpollution)
head(airpollution)
## Mortality JanTemp JulyTemp RelHum Rain Education
## AkronOH 921.87 27 71 59 36 11.4
## Albany-Schenectady-TroyNY 997.87 23 72 57 35 11.0
## AllentownBethlehemPA-NJ 962.35 29 74 54 44 9.8
## AtlantaGA 982.29 45 79 56 47 11.1
## BaltimoreMD 1071.29 35 77 55 43 9.6
## BirminghamAL 1030.38 45 80 54 53 10.2
## PopDensity NonWhite WhiteCollar Pop
## AkronOH 3243 8.8 42.6 660328
## Albany-Schenectady-TroyNY 4281 3.5 50.7 835880
## AllentownBethlehemPA-NJ 4260 0.8 39.4 635481
## AtlantaGA 3125 27.1 50.2 2138231
## BaltimoreMD 6441 24.4 43.7 2199531
## BirminghamAL 3325 38.5 43.1 883946
## Pop.House Income HCPot NOxPot SO2Pot NOx
## AkronOH 3.34 29560 21 15 59 15
## Albany-Schenectady-TroyNY 3.14 31458 8 10 39 10
## AllentownBethlehemPA-NJ 3.21 31856 6 6 33 6
## AtlantaGA 3.41 32452 18 8 24 8
## BaltimoreMD 3.44 32368 43 38 206 38
## BirminghamAL 3.45 27835 30 32 72 32
Next we want to look at the scatterplots and the correlations of Mortality with each predictor. There are 15 predictors, so there are 15 graphs and correlations.
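The plotting loop itself does not appear in these notes; it presumably follows the same pattern as the one used below for the log-transformed columns. A sketch using the course routines splot and cor:

# scatterplot of Mortality vs. each predictor, plus the correlation
for(i in 2:16) {
  splot(Mortality, airpollution[, i], add.line = 1,
        label_x = colnames(airpollution)[i])
  cat("Correlation: ",
      round(cor(Mortality, airpollution[, i]), 3), "\n")
}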
The scatterplots themselves are not reproduced here; the correlations, one per predictor in the order of the data columns, are:
JanTemp: -0.016
JulyTemp: 0.322
RelHum: -0.101
Rain: 0.433
Education: -0.508
PopDensity: 0.252
NonWhite: 0.647
WhiteCollar: -0.289
Pop: Outliers!
Pop.House: 0.368
Income: -0.283
HCPot: Outliers!
NOxPot: Outliers!
SO2Pot: 0.419
NOx: Outliers!
There are problems (outliers) with four of the predictors: Pop, HCPot, NOx, and NOxPot. Let's try the log transform and check those predictors again.
The easiest way to do this is to make a new data set:
detach(airpollution)
newair <- airpollution
newair[ ,c("Pop", "HCPot", "NOx", "NOxPot")] <- log(newair[, c("Pop", "HCPot", "NOx", "NOxPot")])
colnames(newair)[c(10, 13, 14, 16)] <- c("log(Pop)", "log(HCPot)", "log(NOx)", "log(NOxPot)")
attach(newair)
for(i in c(10, 13, 14, 16)) {
splot(Mortality, newair[, i], add.line=1,
label_x = colnames(newair)[i])
cat("Correlation: ",
round(cor(Mortality, newair[, i]), 3))
}
## Correlation: 0.085
## Correlation: 0.125
## Correlation: 0.28
## Correlation: 0.28
So in all cases the log transform worked, and we will use newair from now on.
Let's find the correlations (in absolute value) of the predictors with the response, sorted in decreasing order:
cors <- round(cor(newair), 2)
sort(abs(cors[ ,"Mortality"]), decreasing = TRUE)[-1]
## NonWhite Education Rain SO2Pot Pop.House JulyTemp
## 0.65 0.51 0.43 0.42 0.37 0.32
## WhiteCollar Income log(NOx) log(NOxPot) PopDensity log(HCPot)
## 0.29 0.28 0.28 0.28 0.25 0.13
## RelHum log(Pop) JanTemp
## 0.10 0.09 0.02
Next we look at the correlations between the predictors.
cors[-1, -1]
## JanTemp JulyTemp RelHum Rain Education PopDensity NonWhite
## JanTemp 1.00 0.32 0.09 0.06 0.11 -0.08 0.46
## JulyTemp 0.32 1.00 -0.44 0.47 -0.27 -0.01 0.60
## RelHum 0.09 -0.44 1.00 -0.12 0.19 -0.15 -0.12
## Rain 0.06 0.47 -0.12 1.00 -0.47 0.08 0.30
## Education 0.11 -0.27 0.19 -0.47 1.00 -0.24 -0.21
## PopDensity -0.08 -0.01 -0.15 0.08 -0.24 1.00 -0.01
## NonWhite 0.46 0.60 -0.12 0.30 -0.21 -0.01 1.00
## WhiteCollar 0.21 -0.01 0.01 -0.11 0.49 0.25 -0.06
## log(Pop) 0.32 0.04 -0.02 -0.28 0.27 0.21 0.22
## Pop.House -0.33 0.26 -0.14 0.20 -0.39 -0.17 0.35
## Income 0.20 -0.19 0.13 -0.36 0.51 0.00 -0.10
## log(HCPot) 0.23 -0.41 0.18 -0.48 0.18 0.26 0.15
## log(NOx) 0.18 -0.30 0.10 -0.39 0.03 0.34 0.21
## SO2Pot -0.09 -0.07 -0.12 -0.13 -0.23 0.42 0.16
## log(NOxPot) 0.18 -0.30 0.10 -0.39 0.03 0.34 0.21
## WhiteCollar log(Pop) Pop.House Income log(HCPot) log(NOx)
## JanTemp 0.21 0.32 -0.33 0.20 0.23 0.18
## JulyTemp -0.01 0.04 0.26 -0.19 -0.41 -0.30
## RelHum 0.01 -0.02 -0.14 0.13 0.18 0.10
## Rain -0.11 -0.28 0.20 -0.36 -0.48 -0.39
## Education 0.49 0.27 -0.39 0.51 0.18 0.03
## PopDensity 0.25 0.21 -0.17 0.00 0.26 0.34
## NonWhite -0.06 0.22 0.35 -0.10 0.15 0.21
## WhiteCollar 1.00 0.28 -0.35 0.37 0.16 0.11
## log(Pop) 0.28 1.00 -0.26 0.41 0.48 0.50
## Pop.House -0.35 -0.26 1.00 -0.30 -0.22 -0.12
## Income 0.37 0.41 -0.30 1.00 0.29 0.25
## log(HCPot) 0.16 0.48 -0.22 0.29 1.00 0.94
## log(NOx) 0.11 0.50 -0.12 0.25 0.94 1.00
## SO2Pot -0.06 0.37 -0.01 0.07 0.57 0.68
## log(NOxPot) 0.11 0.50 -0.12 0.25 0.94 1.00
## SO2Pot log(NOxPot)
## JanTemp -0.09 0.18
## JulyTemp -0.07 -0.30
## RelHum -0.12 0.10
## Rain -0.13 -0.39
## Education -0.23 0.03
## PopDensity 0.42 0.34
## NonWhite 0.16 0.21
## WhiteCollar -0.06 0.11
## log(Pop) 0.37 0.50
## Pop.House -0.01 -0.12
## Income 0.07 0.25
## log(HCPot) 0.57 0.94
## log(NOx) 0.68 1.00
## SO2Pot 1.00 0.68
## log(NOxPot) 0.68 1.00
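With 15 predictors it is easy to miss the problematic pairs in a matrix this size. A small base-R sketch that lists every pair of predictors with |r| > 0.9, using the cors matrix computed above:

# find predictor pairs with |r| > 0.9
p   <- cors[-1, -1]                                   # drop the Mortality row/column
idx <- which(abs(p) > 0.9 & upper.tri(p), arr.ind = TRUE)
data.frame(var1 = rownames(p)[idx[, "row"]],
           var2 = colnames(p)[idx[, "col"]],
           r    = p[idx])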
We find that several of the pollution predictors are highly correlated with each other: log(HCPot) has r=0.94 with both log(NOx) and log(NOxPot), and log(NOx) and log(NOxPot) are perfectly correlated (r=1.00).
Because of this, interpreting (understanding) the final model will be difficult.
Using perfectly correlated predictors is not possible, so we eliminate one of them, say log(NOxPot):
detach(newair)
newair <- newair[, -16]   # drop column 16, the one labeled log(NOxPot)
attach(newair)
Next we fit a model with all the predictors and check the assumptions:
mlr(Mortality, newair[, -1])
## The least squares regression equation is:
## Mortality = 1230.17 - 1.885 JanTemp - 1.793 JulyTemp + 0.532 RelHum + 1.414 Rain - 7.569 Education + 0.004 PopDensity + 5.012 NonWhite - 1.796 WhiteCollar + 2.828 log(Pop) - 34.656 Pop.House - 0.001 Income - 22.638 log(HCPot) + 27.848 log(NOx) + 0.091 SO2Pot
## R^2 = 77%
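For reference, a base-R sketch of the same full fit, taking the logs inside the formula; the R^2 should come out around 77%, as above.

# full model in base R, log transforms applied in the formula
fit.full <- lm(Mortality ~ JanTemp + JulyTemp + RelHum + Rain + Education +
                 PopDensity + NonWhite + WhiteCollar + log(Pop) + Pop.House +
                 Income + log(HCPot) + log(NOx) + SO2Pot,
               data = airpollution)
summary(fit.full)$r.squared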
The residual vs. fits plot looks fine, so there is no problem with the equal variance assumption, and the normal plot is ok, so there is no problem with the normality assumption. The model appears fine.
Next we use best subset regression to see whether we can find a model with fewer predictors.
mallows(Mortality, newair[, -1])
## Number of Variables Cp JanTemp JulyTemp RelHum Rain Education
## 1 56.46
## 2 30.59 X
## 3 18.61 X X
## 4 9.07 X X
## 5 4.98 X X X
## 6 4.32 X X
## 7 4.9 X X
## 8 6.05 X X X
## 9 6.98 X X X
## 10 8.14 X X X X
## 11 9.55 X X X X
## 12 11.29 X X X X X
## 13 13.14 X X X X X
## 14 15 X X X X X
## PopDensity NonWhite WhiteCollar log(Pop) Pop.House Income log(HCPot)
## X
## X
## X
## X
## X
## X X X
## X X X X
## X X X X
## X X X X
## X X X X X
## X X X X X
## X X X X X
## X X X X X X
## X X X X X X X
## log(NOx) SO2Pot
##
##
##
## X
## X
## X
## X
## X
## X X
## X
## X X
## X X
## X X
## X X
It suggests a model based on JanTemp, Rain, PopDensity, NonWhite, WhiteCollar and log(NOx), with Mallows' Cp=4.32.
mlr(Mortality,
newair[, c("JanTemp", "Rain", "PopDensity", "NonWhite",
"WhiteCollar", "log(NOx)")])
## The least squares regression equation is:
## Mortality = 944.275 - 1.942 JanTemp + 1.924 Rain + 0.006 PopDensity + 4.194 NonWhite - 2.723 WhiteCollar + 17 log(NOx)
## R^2 = 74.2%
Because the best model still includes one of the pollution variables (log(NOx)), we can conclude that pollution adds to the mortality rate.
And we are done!