Generalized Additive Models

For a linear regression we have a dependent variable \(Y\) and a set of predictors \(x_1, \dots, x_n\) and a model of the form \[ Y = \alpha + \sum_{j=1}^n \beta_j x_j + \epsilon \] Generalized additive models extend this in two ways: first, we replace the linear terms \(\beta_j x_j\) by non-linear functions, to get \[ Y = \alpha + \sum_{j=1}^n f_j(x_j;\beta_j) + \epsilon \]

Second, we can take the same step as before in going from linear models to generalized linear models, to fit problems where Y is a categorical variable.
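For example, for a binary response we can use the logit link of logistic regression and model \[ \log \frac{P(Y=1)}{P(Y=0)} = \alpha + \sum_{j=1}^n f_j(x_j;\beta_j) \]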

A special case we have already discussed is where the functions \(f_j\) are polynomials.
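As a quick reminder, such a model can be fit with poly(); here is a minimal sketch, where y, x and df are hypothetical stand-ins for a response, a predictor and a data frame:

# additive model with a cubic polynomial f(x) = b1*x + b2*x^2 + b3*x^3
fit.poly <- lm(y ~ poly(x, 3), data = df)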

We can also "mix" linear and generalized additive models. Consider \[ Y = \alpha + \beta_1 x_1+ f(x_2;\beta_2) + \epsilon \] Here we have a model linear in \(x_1\) and additive in \(x_2\). Such a model is called "semiparametric".
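The mgcv package used below can fit such semiparametric models directly; a minimal sketch, with y, x1, x2 and df again hypothetical:

library(mgcv)
# x1 enters linearly, x2 through a spline
fit.semi <- gam(y ~ x1 + s(x2), data = df)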

Example: Oil in Rocks

We have measurements on four cross-sections of each of 12 oil-bearing rocks. The measurements are the end product of a complex image analysis and represent the total area, total perimeter and a measure of 'roundness' of the pores in the rock cross-section. The response is perm, the permeability of the rock.

kable.nice(head(rock.oil))
area     peri     shape  perm
4990  2791.90 0.0903296   6.3
7002  3892.60 0.1486220   6.3
7558  3930.66 0.1833120   6.3
7352  3869.32 0.1170630   6.3
7943  3948.54 0.1224170  17.1
7979  4010.15 0.1670450  17.1
ggplot(data=rock.oil, aes("", perm)) +
  geom_boxplot()

This looks a bit skewed, so we should try a log transform:

ggplot(data=rock.oil, aes("", log(perm))) +
  geom_boxplot()

and that looks better, so we apply the transformation:

rock.oil$perm <- log(rock.oil$perm)
colnames(rock.oil)[4] <- "log.perm"

Next the predictors:

library(grid) # for pushViewport and viewport layouts
pushViewport(viewport(layout = grid.layout(2, 2)))
print(ggplot(data=rock.oil, aes(area, log.perm)) +
           geom_point() + 
           geom_smooth(method = "lm", se=FALSE),
    vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=rock.oil, aes(peri, log.perm)) +
           geom_point() + 
           geom_smooth(method = "lm", se=FALSE),
    vp=viewport(layout.pos.row=1, layout.pos.col=2))
print(ggplot(data=rock.oil, aes(shape, log.perm)) +
           geom_point() + 
           geom_smooth(method = "lm", se=FALSE),
    vp=viewport(layout.pos.row=2, layout.pos.col=1))

We begin with a linear model:

fit.lin <- lm(log.perm~., data=rock.oil)
df <- data.frame(Residuals=resid(fit.lin), 
            Fits = fitted(fit.lin))
ggplot(data=df, aes(Fits, Residuals)) +
            geom_point() +
            geom_hline(yintercept = 0)

and that is not so bad: there is no obvious pattern in the residuals.

Next let's fit the generalized additive model:

library(mgcv)
fit.gam <- gam(log.perm ~ s(area) + s(peri) + s(shape),
               data=rock.oil)

Notice the terms s(), which mean that these predictors enter the model through smoothing splines.
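s() accepts optional arguments that control the smoother; purely as an illustration (the defaults are fine here), one could set the basis dimension k and the basis type bs:

# "cr" requests a cubic regression spline basis of dimension up to k
fit.gam2 <- gam(log.perm ~ s(area, k = 10, bs = "cr") + s(peri) + s(shape),
                data = rock.oil)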

Is this model better than the simple linear one? We can compare the two using anova:

anova(fit.lin, fit.gam)
## Analysis of Variance Table
## 
## Model 1: log.perm ~ area + peri + shape
## Model 2: log.perm ~ s(area) + s(peri) + s(shape)
##   Res.Df    RSS      Df Sum of Sq      F Pr(>F)
## 1 44.000 31.949                                
## 2 43.598 31.230 0.40237   0.71914 2.4951  0.125

It appears the more complicated model is not significantly better than the linear one (p=0.125).

What does this new model look like? In

par(mfrow=c(2, 2))
plot(fit.gam, se = TRUE)

we see the fitted smooths with standard error bands, and they do look fairly linear.
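Another way to check this (an optional step) is the effective degrees of freedom reported by summary(); an edf close to 1 for a term means its smooth is essentially a straight line.

summary(fit.gam) # edf near 1 suggests a linear effect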

Example: Kyphosis

This data set is about Kyphosis, a spinal deformity in children that occurs after certain surgeries on the spine.

The variables are:

  1. Kyphosis: a factor indicating whether kyphosis is absent or present after the operation.
  2. Age: age of the child in months.
  3. Number: the number of vertebrae involved in the spinal operation.
  4. Start: the beginning of the range of the vertebrae involved in the spinal operation.

kable.nice(head(kyphosis))
Kyphosis  Age  Number  Start
absent     71       3      5
absent    158       3     14
present   128       4      5
absent      2       5      1
absent      1       4     15
absent      1       2     16

The goal is to predict whether a child will develop kyphosis. So this is a binary outcome, and we will use logistic regression.

Let's begin with some box plots:

pushViewport(viewport(layout = grid.layout(2, 2)))
print(ggplot(data=kyphosis, aes(Kyphosis, Age)) +
           geom_boxplot(),
    vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=kyphosis, aes(Kyphosis, Number)) +
           geom_boxplot(),    
      vp=viewport(layout.pos.row=1, layout.pos.col=2))
print(ggplot(data=kyphosis, aes(Kyphosis, Start)) +
           geom_boxplot(),    
      vp=viewport(layout.pos.row=2, layout.pos.col=1))

so it seems all predictors are useful.

fit.glm <- glm(Kyphosis ~ ., family = binomial, 
               data = kyphosis)
summary(fit.glm)
## 
## Call:
## glm(formula = Kyphosis ~ ., family = binomial, data = kyphosis)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3124  -0.5484  -0.3632  -0.1659   2.1613  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.036934   1.449575  -1.405  0.15996
## Age          0.010930   0.006446   1.696  0.08996
## Number       0.410601   0.224861   1.826  0.06785
## Start       -0.206510   0.067699  -3.050  0.00229
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 83.234  on 80  degrees of freedom
## Residual deviance: 61.380  on 77  degrees of freedom
## AIC: 69.38
## 
## Number of Fisher Scoring iterations: 5

which suggests that only Start is strongly predictive.

It is possible to show that Number is not very useful here (using anova), and we will then continue with Age and Start.
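A minimal sketch of that check, as a likelihood-ratio test of the full model fit.glm against one without Number (fit.nonumber is a name introduced here):

fit.nonumber <- glm(Kyphosis ~ Age + Start, family = binomial,
                    data = kyphosis)
anova(fit.nonumber, fit.glm, test = "Chisq")

With that settled, we fit the generalized additive model: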

fit.sa.gam <- gam(Kyphosis ~ s(Age) + s(Start), 
                 family = binomial, data = kyphosis)
par(mfrow = c(1, 2))
plot(fit.sa.gam, se = TRUE)

From this it seems a model quadratic in Age might work. For Start we see that its spline appears piecewise linear: flat up to about 12, then with a negative slope. This also makes sense given the background of the data, because values of Start up to 12 correspond to the thoracic region of the spine and values greater than 12 belong to the lumbar region. We will therefore try to fit a model of the form

\[ f(x) = a+b(x-12)I_{[12, \infty)}(x) \] Notice that this model fits a continuous function. In R we can do this by including a term I((Start-12)*(Start>12)). The I() is needed so R does not interpret the '*' as meaning interaction. Comparing this with the gam model below, we see that this model is as good as the generalized additive one.

fit.sa.glm <- glm(Kyphosis ~ poly(Age, 2) + 
                    I((Start - 12) * (Start > 12)), 
                  family = binomial, data = kyphosis)
anova(fit.sa.gam, fit.sa.glm)
## Analysis of Deviance Table
## 
## Model 1: Kyphosis ~ s(Age) + s(Start)
## Model 2: Kyphosis ~ poly(Age, 2) + I((Start - 12) * (Start > 12))
##   Resid. Df Resid. Dev      Df Deviance
## 1    74.498     52.383                 
## 2    81.000     51.953 -6.5018   0.4301

What does this new model look like?

x <- seq(1, 200, length = 100) # grid of ages in months
y <- c(5, 10, 13, 14)          # a few representative Start values
xy <- expand.grid(x, y)
df <- data.frame(Age = xy[, 1], Start = xy[, 2])
df$Prob <- predict(fit.sa.glm, df, type = "response")
ggplot(df, aes(Age, Prob)) +
  geom_line() +
  facet_wrap(~Start)

where we can see that the highest risk of kyphosis is for children around age 100 months (about 8 years), and that the risk diminishes as Start increases.