Subset Selection and Ridge Regression

Subset Selection

We have previously discussed the issue of subset selection. There we used Mallows' Cp statistic to find the best model. That approach evaluates all possible models, and if there are \(p\) predictors there are \(2^p\) such models. Although there are very fast algorithms available for this, it is not feasible for much more than about 30 predictors. So what do we do if we have more than that?
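For reference, this exhaustive search is implemented in the leaps package. Here is a minimal sketch (not run), where dat stands for any data frame whose first column is the response:

library(leaps)
out <- regsubsets(formula(dat), data=dat, nvmax=ncol(dat)-1) # try all subset sizes
which.min(summary(out)$cp) # the model size with the smallest Cp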

  • Forward/Backward Selection

One idea is as follows:

  1. Fit the model with no predictors.
  2. Find the predictor that improves the model the most (by some criterion, for example the \(F\) statistic).
  3. If this predictor improves the fit in a statistically significant way, add it and go back to step 2.
  4. Otherwise, stop.

There are routines in R to do these steps.
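As an illustration only, here is a minimal sketch of such a loop built around the add1 command discussed below (forward_select is a hypothetical helper, not part of base R; it assumes the response is the first column of dat):

forward_select <- function(dat, alpha=0.05) {
  fit <- lm(reformulate("1", response=names(dat)[1]), data=dat) # no predictors
  repeat {
    tmp <- add1(fit, formula(dat), test="F")
    cand <- tmp[-1, ] # drop the <none> row
    if (nrow(cand) == 0) break # no candidates left
    k <- which.max(cand[["F value"]]) # strongest remaining predictor
    if (cand[["Pr(>F)"]][k] > alpha) break # no significant improvement: stop
    fit <- update(fit, as.formula(paste(".~.+", rownames(cand)[k])))
  }
  fit
}

Calling forward_select(newair) should then retrace the steps we carry out by hand below.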

Example: Pollution and Mortality

First we need to fix the non-normal predictors by log-transforming them:

newair <- airpollution[, -16] # take out NOxPot
newair[, c(10, 13, 14)] <- log(newair[, c(10, 13, 14)]) # log Pop, HCPot, NOx
colnames(newair)[c(10, 13, 14)] <- 
  c("log.Pop", "log.HCPot", "log.NOx")

Next we fit the model with no predictors. Of course this just estimates the mean of Mortality, but we need the corresponding lm object.

fit <- lm(Mortality~1, data=newair) 
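Indeed, its single coefficient is just the sample mean:

c(coef(fit), mean(newair$Mortality)) # the two numbers agree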

How do we decide which predictor (if any) to add to the model? We can use the add1 command and the so-called \(F\) statistic:

add1(fit, formula(newair), test="F")
## Single term additions
## 
## Model:
## Mortality ~ 1
##             Df Sum of Sq    RSS    AIC F value    Pr(>F)
## <none>                   225993 488.79                  
## JanTemp      1        58 225935 490.78  0.0145 0.9045520
## JulyTemp     1     23407 202586 484.34  6.5858 0.0129321
## RelHum       1      2309 223684 490.19  0.5883 0.4462331
## Rain         1     42393 183599 478.54 13.1614 0.0006117
## Education    1     58340 167652 473.17 19.8352 3.991e-05
## PopDensity   1     14365 211627 486.92  3.8691 0.0540559
## NonWhite     1     94473 131520 458.85 40.9440 3.172e-08
## WhiteCollar  1     18920 207072 485.63  5.2081 0.0262337
## log.Pop      1      1646 224347 490.36  0.4182 0.5204471
## Pop.House    1     30608 195385 482.21  8.9292 0.0041348
## Income       1     18138 207855 485.86  4.9739 0.0296867
## log.HCPot    1      3553 222440 489.86  0.9103 0.3440525
## log.NOx      1     17696 208296 485.98  4.8425 0.0318330
## SO2Pot       1     39698 186295 479.40 12.1462 0.0009533

The predictor with the highest F statistic (40.9) is NonWhite, and it is highly statistically significant (p < 0.001), so we add it to the model:

fit <- update(fit, .~.+NonWhite)
coef(fit)
## (Intercept)    NonWhite 
##  887.901783    4.485521

Here and in what follows I use the \(F\) statistic as the criterion. There are others, some included in the add1 routine, such as AIC (Akaike's information criterion). Which criterion to use is a rather tricky question.
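For instance, leaving out the test argument makes add1 rank the candidates by AIC alone; a quick sketch (output not shown):

tmp.aic <- add1(fit, formula(newair)) # no test argument: AIC only
tmp.aic[order(tmp.aic$AIC), ][1:4, ] # the four best candidates by AIC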

Next:

tmp <- add1(fit, formula(newair), test="F")
k <- which.max(tmp[[5]][-1]) # index of the largest F value (skipping the <none> row)
cat(rownames(tmp)[k+1], ", F = ", tmp[[5]][k+1], ", p value = ", tmp[[6]][k+1],"\n")
## Education , F =  18.66899 , p value =  6.428038e-05
fit <- update(fit, .~.+Education)
tmp <- add1(fit, formula(newair), test="F")
k <- which.max(tmp[[5]][-1])
cat(rownames(tmp)[k+1], ", F = ", tmp[[5]][k+1], ", p value = ", tmp[[6]][k+1],"\n")
## JanTemp , F =  11.04183 , p value =  0.001588636
fit <- update(fit, .~.+JanTemp)
tmp <- add1(fit, formula(newair), test="F")
k <- which.max(tmp[[5]][-1])
cat(rownames(tmp)[k+1], ", F = ", tmp[[5]][k+1], ", p value = ", tmp[[6]][k+1],"\n")
## SO2Pot , F =  7.428167 , p value =  0.008637386
fit <- update(fit, .~.+SO2Pot)
tmp <- add1(fit, formula(newair), test="F")
k <- which.max(tmp[[5]][-1])
cat(rownames(tmp)[k+1], ", F = ", tmp[[5]][k+1], ", p value = ", tmp[[6]][k+1],"\n")
## Rain , F =  5.21367 , p value =  0.02644668
fit <- update(fit, .~.+Rain)
tmp <- add1(fit, formula(newair), test="F")
k <- which.max(tmp[[5]][-1])
cat(rownames(tmp)[k+1], ", F = ", tmp[[5]][k+1], ", p value = ", tmp[[6]][k+1],"\n")
## log.NOx , F =  4.872501 , p value =  0.03172464
fit <- update(fit, .~.+log.NOx)
tmp <- add1(fit, formula(newair), test="F")
k <- which.max(tmp[[5]][-1])
cat(rownames(tmp)[k+1], ", F = ", tmp[[5]][k+1], ", p value = ", tmp[[6]][k+1],"\n")
## WhiteCollar , F =  1.959068 , p value =  0.1676667

The next candidate, WhiteCollar, is not statistically significant (p = 0.17), so we stop.

Notice that this is not the same model we found using Mallows' Cp, which used JanTemp, Rain, PopDensity, NonWhite, WhiteCollar and log.NOx.

An alternative to forward selection is its reverse, backward selection: we start with the full model and remove predictors until only significant ones are left. A loop analogous to the forward one is sketched below, after which we work through the backward solution step by step.
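As before, this is only an illustrative sketch (backward_select is a hypothetical helper; it assumes the response is the first column of dat):

backward_select <- function(dat, alpha=0.05) {
  fit <- lm(formula(dat), data=dat) # start from the full model
  repeat {
    tmp <- drop1(fit, test="F")
    cand <- tmp[-1, ] # drop the <none> row
    if (nrow(cand) == 0) break # nothing left to remove
    k <- which.min(cand[["F value"]]) # weakest remaining predictor
    if (cand[["Pr(>F)"]][k] < alpha) break # everything left is significant
    fit <- update(fit, as.formula(paste(".~.-", rownames(cand)[k])))
  }
  fit
}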

fit <- lm(Mortality~., data=newair)
drop1(fit, test="F")
## Single term deletions
## 
## Model:
## Mortality ~ JanTemp + JulyTemp + RelHum + Rain + Education + 
##     PopDensity + NonWhite + WhiteCollar + log.Pop + Pop.House + 
##     Income + log.HCPot + log.NOx + SO2Pot
##             Df Sum of Sq   RSS    AIC F value    Pr(>F)
## <none>                   51920 430.02                  
## JanTemp      1      7767 59687 436.24  6.5821   0.01379
## JulyTemp     1       946 52867 429.08  0.8018   0.37542
## RelHum       1       325 52246 428.38  0.2757   0.60215
## Rain         1      7070 58990 435.55  5.9912   0.01844
## Education    1       853 52773 428.98  0.7228   0.39984
## PopDensity   1      1112 53032 429.27  0.9422   0.33701
## NonWhite     1     31909 83830 456.28 27.0417 4.965e-06
## WhiteCollar  1      2637 54558 430.94  2.2348   0.14207
## log.Pop      1       162 52082 428.20  0.1369   0.71313
## Pop.House    1       991 52911 429.13  0.8394   0.36456
## Income       1       245 52165 428.29  0.2076   0.65087
## log.HCPot    1      2636 54557 430.94  2.2342   0.14213
## log.NOx      1      4750 56670 433.18  4.0253   0.05099
## SO2Pot       1       690 52611 428.80  0.5850   0.44843

The predictor with the smallest F value is log.Pop (p = 0.71), so we drop it:

fit <- update(fit, .~.-log.Pop)
tmp <- drop1(fit, test="F")
k <- which.min(tmp[[5]][-1]) # index of the smallest F value (skipping the <none> row)
cat(rownames(tmp)[k+1], ", F = ", tmp[[5]][k+1], ", p value = ", tmp[[6]][k+1],"\n")
## Income , F =  0.1531485 , p value =  0.6973914
fit <- update(fit, .~.-Income)
tmp <- drop1(fit, test="F")
k <- which.min(tmp[[5]][-1])
cat(rownames(tmp)[k+1], ", F = ", tmp[[5]][k+1], ", p value = ", tmp[[6]][k+1],"\n")
## RelHum , F =  0.272645 , p value =  0.6040685
fit <- update(fit, .~.-RelHum)
tmp <- drop1(fit, test="F")
k <- which.min(tmp[[5]][-1])
cat(rownames(tmp)[k+1], ", F = ", tmp[[5]][k+1], ", p value = ", tmp[[6]][k+1],"\n")
## SO2Pot , F =  0.6219361 , p value =  0.4342887
fit <- update(fit, .~.-SO2Pot)
tmp <- drop1(fit, test="F")
k <- which.min(tmp[[5]][-1])
cat(rownames(tmp)[k+1], ", F = ", tmp[[5]][k+1], ", p value = ", tmp[[6]][k+1],"\n")
## JulyTemp , F =  1.110323 , p value =  0.2972872
fit <- update(fit, .~.-JulyTemp)
tmp <- drop1(fit, test="F")
k <- which.min(tmp[[5]][-1])
cat(rownames(tmp)[k+1], ", F = ", tmp[[5]][k+1], ", p value = ", tmp[[6]][k+1],"\n")
## PopDensity , F =  0.9241217 , p value =  0.3411151
fit <- update(fit, .~.-PopDensity)
tmp <- drop1(fit, test="F")
k <- which.min(tmp[[5]][-1])
cat(rownames(tmp)[k+1], ", F = ", tmp[[5]][k+1], ", p value = ", tmp[[6]][k+1],"\n")
## log.HCPot , F =  1.562185 , p value =  0.2171643
fit <- update(fit, .~.-log.HCPot)
tmp <- drop1(fit, test="F")
k <- which.min(tmp[[5]][-1])
cat(rownames(tmp)[k+1], ", F = ", tmp[[5]][k+1], ", p value = ", tmp[[6]][k+1],"\n")
## Pop.House , F =  1.678198 , p value =  0.2009972
fit <- update(fit, .~.-Pop.House)
tmp <- drop1(fit, test="F")
k <- which.min(tmp[[5]][-1])
cat(rownames(tmp)[k+1], ", F = ", tmp[[5]][k+1], ", p value = ", tmp[[6]][k+1],"\n")
## WhiteCollar , F =  1.928731 , p value =  0.1708173
fit <- update(fit, .~.-WhiteCollar)
tmp <- drop1(fit, test="F")
k <- which.min(tmp[[5]][-1])
cat(rownames(tmp)[k+1], ", F = ", tmp[[5]][k+1], ", p value = ", tmp[[6]][k+1],"\n")
## Education , F =  6.21418 , p value =  0.01583229

Now even the smallest F value (for Education) has a p value below 0.05, so we stop.

This results in a model with predictors

rownames(tmp)[-1]
## [1] "JanTemp"   "Rain"      "Education" "NonWhite"  "log.NOx"

which is the same as neither the best subset model nor the forward selection model.

  • Stepwise Selection

Here in each step we either add or drop a variable. The step command does this for us. Notice that it uses AIC by default.

fit <- lm(Mortality~., data=newair)
step(fit)
## Start:  AIC=430.02
## Mortality ~ JanTemp + JulyTemp + RelHum + Rain + Education + 
##     PopDensity + NonWhite + WhiteCollar + log.Pop + Pop.House + 
##     Income + log.HCPot + log.NOx + SO2Pot
## 
##               Df Sum of Sq   RSS    AIC
## - log.Pop      1       162 52082 428.20
## - Income       1       245 52165 428.29
## - RelHum       1       325 52246 428.38
## - SO2Pot       1       690 52611 428.80
## - Education    1       853 52773 428.98
## - JulyTemp     1       946 52867 429.08
## - Pop.House    1       991 52911 429.13
## - PopDensity   1      1112 53032 429.27
## <none>                     51920 430.02
## - log.HCPot    1      2636 54557 430.94
## - WhiteCollar  1      2637 54558 430.94
## - log.NOx      1      4750 56670 433.18
## - Rain         1      7070 58990 435.55
## - JanTemp      1      7767 59687 436.24
## - NonWhite     1     31909 83830 456.28
## 
## Step:  AIC=428.2
## Mortality ~ JanTemp + JulyTemp + RelHum + Rain + Education + 
##     PopDensity + NonWhite + WhiteCollar + Pop.House + Income + 
##     log.HCPot + log.NOx + SO2Pot
## 
##               Df Sum of Sq   RSS    AIC
## - Income       1       177 52259 426.40
## - RelHum       1       324 52406 426.57
## - Education    1       755 52837 427.05
## - SO2Pot       1       810 52892 427.11
## - JulyTemp     1       865 52947 427.17
## - Pop.House    1      1098 53180 427.43
## - PopDensity   1      1206 53288 427.55
## <none>                     52082 428.20
## - WhiteCollar  1      2622 54704 429.10
## - log.HCPot    1      2731 54813 429.21
## - log.NOx      1      5135 57217 431.75
## - Rain         1      6912 58994 433.55
## - JanTemp      1      7637 59719 434.27
## - NonWhite     1     32637 84719 454.90
## 
## Step:  AIC=426.4
## Mortality ~ JanTemp + JulyTemp + RelHum + Rain + Education + 
##     PopDensity + NonWhite + WhiteCollar + Pop.House + log.HCPot + 
##     log.NOx + SO2Pot
## 
##               Df Sum of Sq   RSS    AIC
## - RelHum       1       310 52569 424.75
## - SO2Pot       1       740 52999 425.23
## - JulyTemp     1       848 53107 425.35
## - Education    1      1110 53370 425.64
## - Pop.House    1      1128 53387 425.66
## - PopDensity   1      1198 53457 425.74
## <none>                     52259 426.40
## - log.HCPot    1      2631 54890 427.30
## - WhiteCollar  1      2825 55084 427.51
## - log.NOx      1      4995 57254 429.79
## - Rain         1      7196 59455 432.01
## - JanTemp      1      8281 60540 433.08
## - NonWhite     1     33147 85406 453.38
## 
## Step:  AIC=424.75
## Mortality ~ JanTemp + JulyTemp + Rain + Education + PopDensity + 
##     NonWhite + WhiteCollar + Pop.House + log.HCPot + log.NOx + 
##     SO2Pot
## 
##               Df Sum of Sq   RSS    AIC
## - SO2Pot       1       696 53265 423.52
## - Education    1      1066 53635 423.93
## - PopDensity   1      1116 53685 423.99
## - Pop.House    1      1128 53697 424.00
## - JulyTemp     1      1635 54204 424.55
## <none>                     52569 424.75
## - log.HCPot    1      2631 55200 425.63
## - WhiteCollar  1      2844 55413 425.86
## - log.NOx      1      4934 57503 428.04
## - Rain         1      7628 60197 430.74
## - JanTemp      1      7983 60552 431.09
## - NonWhite     1     34608 87177 452.59
## 
## Step:  AIC=423.52
## Mortality ~ JanTemp + JulyTemp + Rain + Education + PopDensity + 
##     NonWhite + WhiteCollar + Pop.House + log.HCPot + log.NOx
## 
##               Df Sum of Sq   RSS    AIC
## - JulyTemp     1      1232 54497 422.87
## - PopDensity   1      1331 54595 422.98
## - Education    1      1350 54615 423.00
## - Pop.House    1      1540 54804 423.21
## <none>                     53265 423.52
## - log.HCPot    1      2651 55916 424.39
## - WhiteCollar  1      3186 56451 424.95
## - log.NOx      1      7406 60671 429.21
## - Rain         1      7719 60984 429.51
## - JanTemp      1     11176 64441 432.76
## - NonWhite     1     34745 88009 451.15
## 
## Step:  AIC=422.87
## Mortality ~ JanTemp + Rain + Education + PopDensity + NonWhite + 
##     WhiteCollar + Pop.House + log.HCPot + log.NOx
## 
##               Df Sum of Sq   RSS    AIC
## - PopDensity   1      1028 55525 421.98
## - Education    1      1146 55643 422.10
## - Pop.House    1      1442 55939 422.41
## - log.HCPot    1      1609 56106 422.59
## <none>                     54497 422.87
## - WhiteCollar  1      3737 58234 424.79
## - log.NOx      1      6565 61062 427.58
## - Rain         1      8311 62808 429.25
## - JanTemp      1     13524 68021 433.95
## - NonWhite     1     41036 95533 453.99
## 
## Step:  AIC=421.98
## Mortality ~ JanTemp + Rain + Education + NonWhite + WhiteCollar + 
##     Pop.House + log.HCPot + log.NOx
## 
##               Df Sum of Sq   RSS    AIC
## - log.HCPot    1      1735 57259 421.79
## <none>                     55525 421.98
## - Pop.House    1      2373 57898 422.44
## - Education    1      2558 58082 422.63
## - WhiteCollar  1      2763 58288 422.84
## - log.NOx      1      7827 63351 427.76
## - Rain         1      8886 64410 428.73
## - JanTemp      1     16942 72466 435.69
## - NonWhite     1     41873 97398 453.13
## 
## Step:  AIC=421.79
## Mortality ~ JanTemp + Rain + Education + NonWhite + WhiteCollar + 
##     Pop.House + log.NOx
## 
##               Df Sum of Sq   RSS    AIC
## - Pop.House    1      1884 59143 421.70
## <none>                     57259 421.79
## - WhiteCollar  1      2609 59868 422.42
## - Education    1      3574 60834 423.36
## - Rain         1     11398 68658 430.50
## - log.NOx      1     16778 74037 434.95
## - JanTemp      1     18282 75541 436.14
## - NonWhite     1     41479 98739 451.94
## 
## Step:  AIC=421.7
## Mortality ~ JanTemp + Rain + Education + NonWhite + WhiteCollar + 
##     log.NOx
## 
##               Df Sum of Sq    RSS    AIC
## <none>                      59143 421.70
## - WhiteCollar  1      2194  61337 421.85
## - Education    1      2621  61764 422.26
## - Rain         1     13263  72406 431.64
## - JanTemp      1     17593  76736 435.07
## - log.NOx      1     20607  79750 437.34
## - NonWhite     1     48049 107192 454.79
## 
## Call:
## lm(formula = Mortality ~ JanTemp + Rain + Education + NonWhite + 
##     WhiteCollar + log.NOx, data = newair)
## 
## Coefficients:
## (Intercept)      JanTemp         Rain    Education     NonWhite  
##    1031.949       -2.023        1.812      -10.746        4.040  
## WhiteCollar      log.NOx  
##      -1.451       19.248
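If we only want the final model, the printed trace can be suppressed with trace=0; the direction argument ("both", "backward" or "forward"; here the default "both" applies) selects the type of search:

fit.step <- step(fit, trace=0) # same search, nothing printed
coef(fit.step)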

This seems the easiest to use (certainly in terms of what we have to type into R), but it is important to understand that none of these methods works all the time. For each of them there are examples where they lead to quite bad final models.

There has been a lot of research on the relative merits of these methods. In fact, many statisticians advise against their use altogether. As an alternative we can consider

Ridge Regression

One problem with the above methods is that they are all or nothing: a predictor is either in the final model or not. Ridge regression takes a different approach: each variable gets a “weight” determining how much it contributes to the final model.

Recall the least squares regression method: we minimize the least squares criterion

\[ \sum_{i=1}^n \left( y_i - \beta_0 -\sum_{j=1}^p \beta_jx_{ij} \right)^2 \]

In ridge regression we use the criterion

\[ \sum_{i=1}^n \left( y_i - \beta_0 -\sum_{j=1}^p \beta_jx_{ij}\right)^2 + \lambda \sum_{j=1}^p \beta_j^2 \]

What does this do? The term \(\sum_{j=1}^p \beta_j^2\) will depend mainly on the largest \(\beta\)’s. Because this term is added to the criterion and we are minimizing the whole expression, we essentially penalize large \(\beta\)’s. Overall the coefficients are shrunk towards 0. For this reason ridge regression is called a shrinkage method. Such methods have become quite popular in many areas of Statistics.

In the literature such methods are also referred to as penalized likelihood methods.

\(\lambda\) is a parameter that controls the amount of shrinkage: if \(\lambda=0\) we are back at ordinary least squares, and as \(\lambda\) grows the coefficients are pulled ever closer to 0.
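For a fixed \(\lambda\) the ridge criterion has a closed-form minimizer: with a centered response and standardized predictors, \(\hat{\beta} = (X^TX+\lambda I)^{-1}X^Ty\). A minimal sketch, with \(\lambda=10\) picked arbitrarily for illustration:

X <- scale(data.matrix(newair[, -1])) # standardized predictors
y <- newair$Mortality - mean(newair$Mortality) # centered response
lambda <- 10 # arbitrary illustrative value
beta.ridge <- solve(t(X) %*% X + lambda*diag(ncol(X)), t(X) %*% y)
round(beta.ridge[, 1], 3)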

How do we fit such a model?

library(ridge)
fit <- linearRidge(Mortality~., data=newair)
summary(fit)
## 
## Call:
## linearRidge(formula = Mortality ~ ., data = newair)
## 
## 
## Coefficients:
##               Estimate Scaled estimate Std. Error (scaled)
## (Intercept)  8.750e+02              NA                  NA
## JanTemp     -9.412e-02      -7.277e+00           1.082e+01
## JulyTemp     6.604e-01       2.314e+01           9.865e+00
## RelHum      -4.164e-02      -1.706e+00           1.161e+01
## Rain         4.772e-01       4.206e+01           1.056e+01
## Education   -6.277e+00      -4.067e+01           1.023e+01
## PopDensity   2.059e-03       2.261e+01           1.121e+01
## NonWhite     9.271e-01       6.353e+01           1.025e+01
## WhiteCollar -6.699e-01      -2.586e+01           1.113e+01
## log.Pop      1.466e+00       9.127e+00           1.057e+01
## Pop.House    2.177e+01       3.032e+01           1.094e+01
## Income      -6.439e-04      -2.194e+01           1.096e+01
## log.HCPot    1.458e+00       1.253e+01           8.524e+00
## log.NOx      3.090e+00       2.796e+01           8.551e+00
## SO2Pot       7.871e-02       3.810e+01           1.024e+01
##             t value (scaled) Pr(>|t|)
## (Intercept)               NA       NA
## JanTemp                0.672 0.501310
## JulyTemp               2.346 0.018970
## RelHum                 0.147 0.883210
## Rain                   3.982 6.84e-05
## Education              3.976 7.01e-05
## PopDensity             2.016 0.043760
## NonWhite               6.198 5.73e-10
## WhiteCollar            2.325 0.020082
## log.Pop                0.864 0.387804
## Pop.House              2.770 0.005599
## Income                 2.001 0.045437
## log.HCPot              1.470 0.141584
## log.NOx                3.270 0.001077
## SO2Pot                 3.720 0.000199
## 
## Ridge parameter: 2.861264, chosen automatically, computed using 1 PCs
## 
## Degrees of freedom: model 2.995 , variance 1.031 , residual 4.959

Lasso

This is similar to ridge regression but it uses

\[ \sum_{i=1}^n \left( y_i - \beta_0 -\sum_{j=1}^p \beta_jx_{ij}\right)^2 + \lambda \sum_{j=1}^p |\beta_j| \]

In modern terminology, the Lasso uses an \(L^1\) penalty whereas ridge regression uses \(L^2\).
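To see why the \(L^1\) penalty can set coefficients exactly to 0, consider the one-dimensional case: for a single coefficient with ordinary least squares estimate \(b\), minimizing \((b-\beta)^2+\lambda|\beta|\) yields the soft-thresholding solution

\[ \hat{\beta} = \text{sign}(b)\left(|b|-\lambda/2\right)_+ \]

so the lasso estimate is exactly 0 whenever \(|b| \le \lambda/2\), while the corresponding ridge solution \(b/(1+\lambda)\) shrinks but never reaches 0.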

The Lasso can be fit with the glmnet package; cv.glmnet chooses \(\lambda\) by cross-validation:

library(glmnet)
X <- data.matrix(newair[, -1])
y <- newair$Mortality
fit <- cv.glmnet(X, y, standardize=TRUE,
                type.measure="mse", nfolds = 5, alpha=1) # alpha=1 is the lasso penalty (alpha=0 would be ridge)
plot(fit) # cross-validated MSE as a function of log(lambda)

cf <- as.numeric(coef(fit, s=fit$lambda.1se)) # lambda.1se: largest lambda within 1 SE of the CV minimum
names(cf) <- c("Intercept", colnames(X))
cf
##    Intercept      JanTemp     JulyTemp       RelHum         Rain 
## 1032.9753025    0.0000000    0.0000000    0.0000000    0.3963679 
##    Education   PopDensity     NonWhite  WhiteCollar      log.Pop 
##  -13.3991939    0.0000000    2.6816158    0.0000000    0.0000000 
##    Pop.House       Income    log.HCPot      log.NOx       SO2Pot 
##    0.0000000    0.0000000    0.0000000    0.0000000    0.1467875

One advantage of the lasso is that it can yield coefficients that are exactly 0, and clearly any predictor whose coefficient is 0 can be dropped:

cf[abs(cf)>0]
##    Intercept         Rain    Education     NonWhite       SO2Pot 
## 1032.9753025    0.3963679  -13.3991939    2.6816158    0.1467875
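Here we used fit$lambda.1se. cv.glmnet also reports fit$lambda.min, the value of \(\lambda\) that minimizes the cross-validated error, which usually retains more predictors:

cf.min <- as.numeric(coef(fit, s=fit$lambda.min))
names(cf.min) <- c("Intercept", colnames(X))
cf.min[abs(cf.min) > 0] # typically a larger set than with lambda.1se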

As a very general guideline: if the goal is subset selection, use the lasso; if the goal is prediction, use ridge regression.