Prices of residences located 30 miles south of a large metropolitan area, with several possible predictor variables. Notice the 1.7 baths!
attach(houseprice)
houseprice
## Price Sqfeet Floors Bedrooms Baths
## 1 69.0 1500.000 1 2 1.0
## 3 118.5 1880.952 1 2 2.0
## 4 104.0 1976.190 1 3 2.0
## 5 116.5 1880.952 1 3 2.0
## 6 121.5 1880.952 1 3 2.0
## 7 125.0 1976.190 1 3 2.0
## 8 128.0 2357.143 2 3 2.5
## 9 129.9 2166.667 1 3 1.7
## 10 133.0 2166.667 2 3 2.5
## 11 135.0 2166.667 2 3 2.5
## 12 137.5 2357.143 2 3 2.5
## 13 139.9 2166.667 1 3 2.0
## 14 143.9 2261.905 2 3 2.5
## 15 147.9 2547.619 2 3 2.5
## 16 154.9 2357.143 2 3 2.5
## 17 160.0 2738.095 2 3 2.0
## 18 169.0 2357.143 1 3 2.0
## 19 169.9 2642.857 1 3 2.0
## 20 125.0 2166.667 1 4 2.0
## 21 134.9 2166.667 1 4 2.0
## 22 139.9 2547.619 1 4 2.0
## 23 147.0 2642.857 1 4 2.0
## 24 159.0 2261.905 1 4 2.0
## 25 169.9 2547.619 2 4 3.0
## 26 178.9 2738.095 1 4 2.0
## 27 194.5 2833.333 2 4 3.0
## 28 219.9 2928.571 1 4 2.5
## 29 269.0 3309.524 2 4 3.0
Let’s go through the list of predictors one by one:
splot(Price, Sqfeet, add.line = 1)
cor(Price, Sqfeet)
## [1] 0.9152079
strong positive relationship (r=0.915), could be linear
bplot(Price, Floors)
cor(Price, Floors)
## [1] 0.2910762
weak if any positive relationship (r=0.291)
Note that we used a boxplot here even though Floors is a quantitative predictor. If a predictor takes only a few distinct values (two here!), a boxplot is often the better choice.
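bplot is a course routine; in base R a comparable picture comes from boxplot with a formula interface. A minimal sketch:

# base-R sketch: boxplots of Price for each value of Floors
boxplot(Price ~ Floors, data = houseprice, xlab = "Floors", ylab = "Price")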
bplot(Price, Bedrooms)
cor(Price, Bedrooms)
## [1] 0.6045036
strong positive relationship (r=0.605), could be linear
bplot(Price, Baths)
cor(Price, Baths)
## [1] 0.6525626
strong positive relationship (r=0.652), could be linear
Note that so far there has been no mention of regression, residual vs. fits plots, or normal plots. Making decisions about possible transformations and/or polynomial terms early, based solely on scatterplots and/or boxplots, is usually a good idea.
Now to run the regression we have the routine mlr. As always, the first argument is the response variable, but now the second argument is all the predictors, as a matrix:
mlr(Price, houseprice[, -1])
## The least squares regression equation is:
## Price = -67.62 + 0.086 Sqfeet - 26.493 Floors - 9.286 Bedrooms + 37.381 Baths
## R^2 = 88.6%
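mlr is a routine provided by the course; for readers working in plain R, here is a minimal sketch of the same least squares fit. The coefficients and R^2 should match the output above up to rounding.

# base-R sketch of the fit above
fit <- lm(Price ~ Sqfeet + Floors + Bedrooms + Baths, data = houseprice)
round(coef(fit), 3)        # regression coefficients
summary(fit)$r.squared     # R^2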
For the assumptions there is nothing new: as before we need to check the residual vs. fits plot and the normal plot of residuals.
This appears to be a good model, and the assumptions of normally distributed residuals with equal variance appear to be o.k.
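A base-R sketch of these two diagnostic plots (the fit is the same lm object as in the previous sketch):

# residuals vs. fitted values, and a normal quantile plot of the residuals
fit <- lm(Price ~ Sqfeet + Floors + Bedrooms + Baths, data = houseprice)
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))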
Except... notice that there is something very strange about this model: the coefficients of Floors and Bedrooms are negative, even though both have a positive correlation with Price!
round(cor(houseprice[,-1]),3)
## Sqfeet Floors Bedrooms Baths
## Sqfeet 1.000 0.370 0.652 0.628
## Floors 0.370 1.000 -0.018 0.743
## Bedrooms 0.652 -0.018 1.000 0.415
## Baths 0.628 0.743 0.415 1.000
The highest correlation between predictors is r=0.743 (Floors and Baths). Correlations between predictors like these (multicollinearity) are what can produce the counter-intuitive negative coefficients we just saw.
We have previously talked about the fact that we want our models to be as simple as possible. Often that means a model with as few predictors as possible. So the question becomes:
Can we eliminate any of our predictors without making the model (statistically significantly) worse?
There are several things one can think of:
Choose based on R^2
but we already know this will always lead to the model with all predictors, for the same reason that a cubic model always has an R^2 at least as high as the quadratic model.
Note:
Price by Sqfeet, Floors and Bedrooms: R^2=80.1%
Price by Floors, Bedrooms and Baths: R^2=68.4%
Price by Sqfeet, Bedrooms and Baths: R^2=83.5%
Price by Sqfeet, Floors, Bedrooms and Baths: R^2=88.6%
So the model with all four predictors has a higher R^2 than any of the models with just three, but this will always be so, even if one of the predictors is completely useless.
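To see this concretely, here is a small base-R sketch: junk is a made-up predictor of pure noise, yet adding it cannot lower R^2.

# adding a predictor never decreases R^2, even a useless one
set.seed(1)
junk <- runif(nrow(houseprice))      # pure noise, unrelated to Price
fit3 <- lm(Price ~ Sqfeet + Floors + Baths, data = houseprice)
fit4 <- lm(Price ~ Sqfeet + Floors + Baths + junk, data = houseprice)
c(three.predictors = summary(fit3)$r.squared,
  plus.noise       = summary(fit4)$r.squared)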
Choose based on Hypothesis Tests
mlr(Price, houseprice[, -1], show.tests = TRUE)
## The least squares regression equation is:
## Price = -67.62 + 0.086 Sqfeet - 26.493 Floors - 9.286 Bedrooms + 37.381 Baths
## Variable p value
## Constant 0.000
## Sqfeet 0.000
## Floors 0.0104
## Bedrooms 0.1871
## Baths 0.0057
## R^2 = 88.6%
So the p value of Bedrooms is > 0.05, and Bedrooms would be eliminated.
This sounds like a good idea AND IT IS WIDELY USED IN REAL LIFE, but it turns out to be a bad one! The reason why is a bit hard to explain, though.
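For reference, these individual t-test p values come out of summary of an lm fit in base R; a sketch:

# p values of the individual coefficient t-tests
fit <- lm(Price ~ Sqfeet + Floors + Bedrooms + Baths, data = houseprice)
round(summary(fit)$coefficients[, "Pr(>|t|)"], 4)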
Use nested models test
all.preds <- mlr(Price, houseprice[, -1], return.model = TRUE)
without.bedrooms <- mlr(Price, houseprice[, -c(1, 4)],
return.model = TRUE)
nested.models.test(all.preds, without.bedrooms)
## H0: both models are equally good.
## p value= 0.1871
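nested.models.test is a course routine; in base R the same nested-models (partial F) test is done with anova. A sketch:

# partial F test; H0: the smaller (nested) model is adequate
full    <- lm(Price ~ Sqfeet + Floors + Bedrooms + Baths, data = houseprice)
reduced <- lm(Price ~ Sqfeet + Floors + Baths, data = houseprice)
anova(reduced, full)   # dropping one predictor, so the p value equals the t-test p value (0.1871)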
Again, this sounds like a good idea AND AGAIN IT IS WIDELY USED IN REAL LIFE, but it turns out to be a dangerous one! To start, if we have several predictors we might want to eliminate, we immediately face the issue of simultaneous inference.
There are several methods in wide use that are essentially based on this idea, such as forward selection, backward selection and stepwise regression. These are sometimes unavoidable but need to be done with great care!
What we need is a new idea:
Best Subset Regression and Mallows' Cp
We will find ALL possible models and calculate Mallows' Cp statistic for each. The model with the lowest Cp is best.
mallows(Price, houseprice[, -1])
## Number of Variables Cp Sqfeet Floors Bedrooms Baths
## 1 8.83 X
## 2 8.81 X X
## 3 4.85 X X X
## 4 5 X X X X
So the best model uses Sqfeet, Floors and Baths.
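mallows is a course routine; the same all-subsets search can be done with the leaps package, a sketch of which follows (this assumes leaps is installed). As a side note, Cp for the model with all predictors always equals the number of coefficients, here 4 + 1 = 5, which is why the last row above shows Cp = 5.

# all-subsets search with Mallows' Cp via the leaps package (sketch)
library(leaps)
subs <- regsubsets(Price ~ Sqfeet + Floors + Bedrooms + Baths, data = houseprice)
s <- summary(subs)
cbind(s$which, Cp = round(s$cp, 2))   # best model of each size, with its Cp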
To find the model we rerun mlr, now without Bedrooms:
mlr(Price, houseprice[, -c(1,4)])
## The least squares regression equation is:
## Price = -73.93 + 0.078 Sqfeet - 19.68 Floors + 30.579 Baths
## R^2 = 87.7%
Note that the model with all four predictors has Cp=5.0. But Cp is a statistic; its exact value depends on the sample. So, is the model with Sqfeet, Floors and Baths statistically significantly better than the model with all four predictors? We would need a hypothesis test to answer this question, but that is not part of our course.
For more on Mallows' Cp see page 603 of the textbook.
Prediction works just as it did for simple regression. We have the command mlr.predict. Say we want to find a 90% interval estimate for a house that has 2000 sqfeet, one floor and two baths. Then
mlr.predict(Price, houseprice[ ,-c(1, 4)],
newx=c(2000, 1, 2), interval="PI", conf.level=90)
## Sqfeet Floors Baths Fit Lower Upper
## 2000 1 2 122.98 98.07 147.89
and so a 90% prediction interval is ($98,070, $147,890).
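In base R the same interval comes from predict with interval = "prediction"; a sketch:

# 90% prediction interval for a 2000 sqft, 1 floor, 2 bath house
fit <- lm(Price ~ Sqfeet + Floors + Baths, data = houseprice)
predict(fit, newdata = data.frame(Sqfeet = 2000, Floors = 1, Baths = 2),
        interval = "prediction", level = 0.90)   # should agree with the mlr.predict output above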
If we want to do prediction for a number of cases newx has to be a matrix:
newx <- cbind(c(2000,2100,2200), rep(1,3), rep(2,3))
newx
## [,1] [,2] [,3]
## [1,] 2000 1 2
## [2,] 2100 1 2
## [3,] 2200 1 2
mlr.predict(Price, houseprice[ ,-c(1, 4)],
newx=newx, interval="PI")
## Sqfeet Floors Baths Fit Lower Upper
## 2000 1 2 122.98 92.93 153.03
## 2100 1 2 130.75 100.96 160.54
## 2200 1 2 138.52 108.86 168.18
Finally, leaving out newx does the prediction for every case in the data set:
head(mlr.predict(Price, houseprice[ ,-c(1, 4)],
interval="PI"))
## Sqfeet Floors Baths Fit Lower Upper
## 1500.000 1 1 53.54 18.63 88.46
## 1880.952 1 2 113.73 83.23 144.23
## 1976.190 1 2 121.13 91.01 151.25
## 1880.952 1 2 113.73 83.23 144.23
## 1880.952 1 2 113.73 83.23 144.23
## 1976.190 1 2 121.13 91.01 151.25
Our next example uses the airpollution data set. The dependent variable for analysis is age-adjusted mortality (called "Mortality"). The data include variables measuring demographic characteristics of the cities, variables measuring climate characteristics, and variables recording the pollution potential of three different air pollutants.
attach(airpollution)
head(airpollution)
## Mortality JanTemp JulyTemp RelHum Rain Education
## AkronOH 921.87 27 71 59 36 11.4
## Albany-Schenectady-TroyNY 997.87 23 72 57 35 11.0
## AllentownBethlehemPA-NJ 962.35 29 74 54 44 9.8
## AtlantaGA 982.29 45 79 56 47 11.1
## BaltimoreMD 1071.29 35 77 55 43 9.6
## BirminghamAL 1030.38 45 80 54 53 10.2
## PopDensity NonWhite WhiteCollar Pop
## AkronOH 3243 8.8 42.6 660328
## Albany-Schenectady-TroyNY 4281 3.5 50.7 835880
## AllentownBethlehemPA-NJ 4260 0.8 39.4 635481
## AtlantaGA 3125 27.1 50.2 2138231
## BaltimoreMD 6441 24.4 43.7 2199531
## BirminghamAL 3325 38.5 43.1 883946
## Pop.House Income HCPot NOxPot SO2Pot NOx
## AkronOH 3.34 29560 21 15 59 15
## Albany-Schenectady-TroyNY 3.14 31458 8 10 39 10
## AllentownBethlehemPA-NJ 3.21 31856 6 6 33 6
## AtlantaGA 3.41 32452 18 8 24 8
## BaltimoreMD 3.44 32368 43 38 206 38
## BirminghamAL 3.45 27835 30 32 72 32
Next we want to look at the scatterplots and the correlations of Mortality with each predictor. There are 15 predictors, so there are 15 graphs and correlations.
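The plotting loop itself does not appear in these notes; it presumably follows the same pattern as the one used below for the log-transformed columns. A sketch using the course routines splot and cor:

# scatterplot of Mortality vs. each predictor, plus the correlation
for(i in 2:16) {
  splot(Mortality, airpollution[, i], add.line = 1,
        label_x = colnames(airpollution)[i])
  cat("Correlation: ",
      round(cor(Mortality, airpollution[, i]), 3), "\n")
}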
The scatterplots themselves are not reproduced here; the correlations, one per predictor in the order of the data columns, are:
JanTemp: -0.016
JulyTemp: 0.322
RelHum: -0.101
Rain: 0.433
Education: -0.508
PopDensity: 0.252
NonWhite: 0.647
WhiteCollar: -0.289
Pop: Outliers!
Pop.House: 0.368
Income: -0.283
HCPot: Outliers!
NOxPot: Outliers!
SO2Pot: 0.419
NOx: Outliers!
There are problems (outliers) with four of the predictors: Pop, HCPot, NOx, and NOxPot. Let's try the log transform and check those predictors again.
The easiest way to do this is to make a new data set:
detach(airpollution)
newair <- airpollution
newair[ ,c("Pop", "HCPot", "NOx", "NOxPot")] <- log(newair[, c("Pop", "HCPot", "NOx", "NOxPot")])
colnames(newair)[c(10, 13, 14, 16)] <- c("log(Pop)", "log(HCPot)", "log(NOx)", "log(NOxPot)")
attach(newair)
for(i in c(10, 13, 14, 16)) {
splot(Mortality, newair[, i], add.line=1,
label_x = colnames(newair)[i])
cat("Correlation: ",
round(cor(Mortality, newair[, i]), 3))
}
## Correlation: 0.085
## Correlation: 0.125
## Correlation: 0.28
## Correlation: 0.28
So in all cases the log transform worked, and we will use newair from now on.
Let's find the correlations (in absolute value) of the predictors with the response, sorted in decreasing order:
cors <- round(cor(newair), 2)
sort(abs(cors[ ,"Mortality"]), decreasing = TRUE)[-1]
## NonWhite Education Rain SO2Pot Pop.House JulyTemp
## 0.65 0.51 0.43 0.42 0.37 0.32
## WhiteCollar Income log(NOx) log(NOxPot) PopDensity log(HCPot)
## 0.29 0.28 0.28 0.28 0.25 0.13
## RelHum log(Pop) JanTemp
## 0.10 0.09 0.02
Next we look at the correlations between the predictors.
cors[-1, -1]
## JanTemp JulyTemp RelHum Rain Education PopDensity NonWhite
## JanTemp 1.00 0.32 0.09 0.06 0.11 -0.08 0.46
## JulyTemp 0.32 1.00 -0.44 0.47 -0.27 -0.01 0.60
## RelHum 0.09 -0.44 1.00 -0.12 0.19 -0.15 -0.12
## Rain 0.06 0.47 -0.12 1.00 -0.47 0.08 0.30
## Education 0.11 -0.27 0.19 -0.47 1.00 -0.24 -0.21
## PopDensity -0.08 -0.01 -0.15 0.08 -0.24 1.00 -0.01
## NonWhite 0.46 0.60 -0.12 0.30 -0.21 -0.01 1.00
## WhiteCollar 0.21 -0.01 0.01 -0.11 0.49 0.25 -0.06
## log(Pop) 0.32 0.04 -0.02 -0.28 0.27 0.21 0.22
## Pop.House -0.33 0.26 -0.14 0.20 -0.39 -0.17 0.35
## Income 0.20 -0.19 0.13 -0.36 0.51 0.00 -0.10
## log(HCPot) 0.23 -0.41 0.18 -0.48 0.18 0.26 0.15
## log(NOx) 0.18 -0.30 0.10 -0.39 0.03 0.34 0.21
## SO2Pot -0.09 -0.07 -0.12 -0.13 -0.23 0.42 0.16
## log(NOxPot) 0.18 -0.30 0.10 -0.39 0.03 0.34 0.21
## WhiteCollar log(Pop) Pop.House Income log(HCPot) log(NOx)
## JanTemp 0.21 0.32 -0.33 0.20 0.23 0.18
## JulyTemp -0.01 0.04 0.26 -0.19 -0.41 -0.30
## RelHum 0.01 -0.02 -0.14 0.13 0.18 0.10
## Rain -0.11 -0.28 0.20 -0.36 -0.48 -0.39
## Education 0.49 0.27 -0.39 0.51 0.18 0.03
## PopDensity 0.25 0.21 -0.17 0.00 0.26 0.34
## NonWhite -0.06 0.22 0.35 -0.10 0.15 0.21
## WhiteCollar 1.00 0.28 -0.35 0.37 0.16 0.11
## log(Pop) 0.28 1.00 -0.26 0.41 0.48 0.50
## Pop.House -0.35 -0.26 1.00 -0.30 -0.22 -0.12
## Income 0.37 0.41 -0.30 1.00 0.29 0.25
## log(HCPot) 0.16 0.48 -0.22 0.29 1.00 0.94
## log(NOx) 0.11 0.50 -0.12 0.25 0.94 1.00
## SO2Pot -0.06 0.37 -0.01 0.07 0.57 0.68
## log(NOxPot) 0.11 0.50 -0.12 0.25 0.94 1.00
## SO2Pot log(NOxPot)
## JanTemp -0.09 0.18
## JulyTemp -0.07 -0.30
## RelHum -0.12 0.10
## Rain -0.13 -0.39
## Education -0.23 0.03
## PopDensity 0.42 0.34
## NonWhite 0.16 0.21
## WhiteCollar -0.06 0.11
## log(Pop) 0.37 0.50
## Pop.House -0.01 -0.12
## Income 0.07 0.25
## log(HCPot) 0.57 0.94
## log(NOx) 0.68 1.00
## SO2Pot 1.00 0.68
## log(NOxPot) 0.68 1.00
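With 15 predictors it is easy to miss the problematic pairs in a matrix this size. A small base-R sketch that lists every pair of predictors with |r| > 0.9, using the cors matrix computed above:

# find predictor pairs with |r| > 0.9
p   <- cors[-1, -1]                                   # drop the Mortality row/column
idx <- which(abs(p) > 0.9 & upper.tri(p), arr.ind = TRUE)
data.frame(var1 = rownames(p)[idx[, "row"]],
           var2 = colnames(p)[idx[, "col"]],
           r    = p[idx])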
We find that several of the pollution predictors are highly correlated with each other: log(HCPot) has r=0.94 with both log(NOx) and log(NOxPot), and log(NOx) and log(NOxPot) are perfectly correlated (r=1.00).
Because of this, interpreting (understanding) the final model will be difficult.
Using perfectly correlated predictors is not possible, so we eliminate one of them, say log(NOxPot):
detach(newair)
newair <- newair[, -16]   # drop column 16, the one labeled log(NOxPot)
attach(newair)
Next we fit a model with all the predictors and check the assumptions:
mlr(Mortality, newair[, -1])
## The least squares regression equation is:
## Mortality = 1230.17 - 1.885 JanTemp - 1.793 JulyTemp + 0.532 RelHum + 1.414 Rain - 7.569 Education + 0.004 PopDensity + 5.012 NonWhite - 1.796 WhiteCollar + 2.828 log(Pop) - 34.656 Pop.House - 0.001 Income - 22.638 log(HCPot) + 27.848 log(NOx) + 0.091 SO2Pot
## R^2 = 77%
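For reference, a base-R sketch of the same full fit, taking the logs inside the formula; the R^2 should come out around 77%, as above.

# full model in base R, log transforms applied in the formula
fit.full <- lm(Mortality ~ JanTemp + JulyTemp + RelHum + Rain + Education +
                 PopDensity + NonWhite + WhiteCollar + log(Pop) + Pop.House +
                 Income + log(HCPot) + log(NOx) + SO2Pot,
               data = airpollution)
summary(fit.full)$r.squared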
The residual vs. fits plot looks fine, so there is no problem with the equal variance assumption, and the normal plot is ok, so there is no problem with the normality assumption. The model appears fine.
Next we use best subset regression to see whether we can find a model with fewer predictors.
mallows(Mortality, newair[, -1])
## Number of Variables Cp JanTemp JulyTemp RelHum Rain Education
## 1 56.46
## 2 30.59 X
## 3 18.61 X X
## 4 9.07 X X
## 5 4.98 X X X
## 6 4.32 X X
## 7 4.9 X X
## 8 6.05 X X X
## 9 6.98 X X X
## 10 8.14 X X X X
## 11 9.55 X X X X
## 12 11.29 X X X X X
## 13 13.14 X X X X X
## 14 15 X X X X X
## PopDensity NonWhite WhiteCollar log(Pop) Pop.House Income log(HCPot)
## X
## X
## X
## X
## X
## X X X
## X X X X
## X X X X
## X X X X
## X X X X X
## X X X X X
## X X X X X
## X X X X X X
## X X X X X X X
## log(NOx) SO2Pot
##
##
##
## X
## X
## X
## X
## X
## X X
## X
## X X
## X X
## X X
## X X
It suggests a model based on JanTemp, Rain, PopDensity, NonWhite, WhiteCollar and log(NOx), with Mallows' Cp=4.32.
mlr(Mortality,
newair[, c("JanTemp", "Rain", "PopDensity", "NonWhite",
"WhiteCollar", "log(NOx)")])
## The least squares regression equation is:
## Mortality = 944.275 - 1.942 JanTemp + 1.924 Rain + 0.006 PopDensity + 4.194 NonWhite - 2.723 WhiteCollar + 17 log(NOx)
## R^2 = 74.2%
Because the best model still includes one of the pollution variables (log(NOx)), we can conclude that pollution adds to the mortality rate.
And we are done!