For a linear regression we have a dependent variable \(Y\), a set of predictors \(x_1, \dots, x_n\) and a model of the form \[ Y = \alpha + \sum_{j=1}^n \beta_j x_j + \epsilon \] Generalized additive models extend this in two ways: first we replace the linear terms \(\beta_j x_j\) by (possibly) non-linear functions, to get \[ Y = \alpha + \sum_{j=1}^n f(x_j;\beta_j) + \epsilon \]
Second, we can take the same step as before in going from linear models to generalized linear models, so we can also fit problems where Y is a categorical variable.
A special case we have already discussed is where the functions are polynomials.
We can also "mix" linear and generalized additive models. Consider \[ Y = \alpha + \beta_1 x_1 + f(x_2;\beta_2) + \epsilon \] Here we have a model that is linear in \(x_1\) and additive in \(x_2\). Such a model is called "semiparametric".
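To make this concrete, here is a minimal sketch of how such a semiparametric model can be fit with the mgcv package, which we will also use below (the data is simulated and all object names are made up for illustration):

library(mgcv)
set.seed(111)
x1 <- runif(200)
x2 <- runif(200)
y <- 1 + 2*x1 + sin(2*pi*x2) + rnorm(200, sd = 0.3)
# x1 enters linearly, x2 through a spline term s()
fit.semi <- gam(y ~ x1 + s(x2))

summary(fit.semi) then reports a parametric part (the coefficient of x1) and a smooth part (the spline in x2).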
As a first example, we have measurements on four cross-sections of each of 12 oil-bearing rocks. The measurements are the end product of a complex image analysis and represent the total area, total perimeter and a measure of "roundness" of the pores in the rock cross-section. The goal is to model the permeability (perm) of the rock.
kable.nice(head(rock.oil))
area | peri | shape | perm |
---|---|---|---|
4990 | 2791.90 | 0.0903296 | 6.3 |
7002 | 3892.60 | 0.1486220 | 6.3 |
7558 | 3930.66 | 0.1833120 | 6.3 |
7352 | 3869.32 | 0.1170630 | 6.3 |
7943 | 3948.54 | 0.1224170 | 17.1 |
7979 | 4010.15 | 0.1670450 | 17.1 |
Let's start by looking at the distribution of the response perm:
ggplot(data=rock.oil, aes("", perm)) +
  geom_boxplot()
This looks quite skewed, so we should try a log transform:
ggplot(data=rock.oil, aes("", log(perm))) +
geom_boxplot()
and that looks better, so we transform the response and rename the column accordingly:
rock.oil$perm <- log(rock.oil$perm)
colnames(rock.oil)[4] <- "log.perm"
Next we look at the predictors, each plotted against log.perm:
library(grid)  # for pushViewport / viewport layouts
pushViewport(viewport(layout = grid.layout(2, 2)))
print(ggplot(data=rock.oil, aes(area, log.perm)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE),
vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=rock.oil, aes(peri, log.perm)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE),
vp=viewport(layout.pos.row=1, layout.pos.col=2))
print(ggplot(data=rock.oil, aes(shape, log.perm)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE),
vp=viewport(layout.pos.row=2, layout.pos.col=1))
We begin with a linear model:
fit.lin <- lm(log.perm~., data=rock.oil)
df <- data.frame(Residuals=resid(fit.lin),
Fits = fitted(fit.lin))
ggplot(data=df, aes(Fits, Residuals)) +
geom_point() +
geom_hline(yintercept = 0)
and the residual vs. fits plot does not look too bad.
Next let's fit the generalized additive model:
library(mgcv)
fit.gam <- gam(log.perm ~ s(area) + s(peri) + s(shape),
data=rock.oil)
Notice the terms s(), which indicate that these predictors enter the model through splines.
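The smooths can also be tuned if needed. Just as an illustration (the defaults are fine here), one can set the basis dimension of a spline with the argument k and use REML instead of the default GCV for smoothness selection:

fit.gam.reml <- gam(log.perm ~ s(area, k = 20) + s(peri) + s(shape),
                    method = "REML", data = rock.oil)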
Is this model better than the simple linear one? We can compare the two with an ANOVA:
anova(fit.lin, fit.gam)
## Analysis of Variance Table
##
## Model 1: log.perm ~ area + peri + shape
## Model 2: log.perm ~ s(area) + s(peri) + s(shape)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 44.000 31.949
## 2 43.598 31.230 0.40237 0.71914 2.4951 0.125
It appears the more complicated model is not significantly better than the linear one (p = 0.125).
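As an additional quick check (not part of the anova above) we could also compare the two fits by AIC, where a smaller value indicates a better trade-off between fit and complexity:

AIC(fit.lin, fit.gam)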
What is this new model? The following plots show the fitted smooth terms (with standard error bands):
par(mfrow=c(2, 2))
plot(fit.gam, se = TRUE)
These do look fairly linear, in line with the result of the ANOVA.
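Another way to see this is through the effective degrees of freedom (edf) that summary() reports for each smooth term; an edf close to 1 means that smooth is essentially a straight line:

summary(fit.gam)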
The next data set is about kyphosis, a spinal deformity in children that sometimes occurs after corrective surgery on the spine.
The variables are Kyphosis (absent or present after the operation), Age (in months), Number (the number of vertebrae involved) and Start (the number of the first, i.e. topmost, vertebra operated on):
kable.nice(head(kyphosis))
Kyphosis | Age | Number | Start |
---|---|---|---|
absent | 71 | 3 | 5 |
absent | 158 | 3 | 14 |
present | 128 | 4 | 5 |
absent | 2 | 5 | 1 |
absent | 1 | 4 | 15 |
absent | 1 | 2 | 16 |
The goal is to predict whether a child will develop kyphosis, so this is a binary outcome and we will use logistic regression.
Let's begin with some box plots:
pushViewport(viewport(layout = grid.layout(2, 2)))
print(ggplot(data=kyphosis, aes(Kyphosis, Age)) +
geom_boxplot(),
vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=kyphosis, aes(Kyphosis, Number)) +
geom_boxplot(),
vp=viewport(layout.pos.row=1, layout.pos.col=2))
print(ggplot(data=kyphosis, aes(Kyphosis, Start)) +
geom_boxplot(),
vp=viewport(layout.pos.row=2, layout.pos.col=1))
So it seems all three predictors are useful. We start with a standard logistic regression model:
fit.glm <- glm(Kyphosis ~ ., family = binomial,
data = kyphosis)
summary(fit.glm)
##
## Call:
## glm(formula = Kyphosis ~ ., family = binomial, data = kyphosis)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3124 -0.5484 -0.3632 -0.1659 2.1613
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.036934 1.449575 -1.405 0.15996
## Age 0.010930 0.006446 1.696 0.08996
## Number 0.410601 0.224861 1.826 0.06785
## Start -0.206510 0.067699 -3.050 0.00229
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 83.234 on 80 degrees of freedom
## Residual deviance: 61.380 on 77 degrees of freedom
## AIC: 69.38
##
## Number of Fisher Scoring iterations: 5
which suggests that only Start is strongly predictive.
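We can check whether Number adds anything by fitting the model without it and comparing the two logistic regressions with a likelihood ratio (chi-square) test (a sketch; the object name fit.glm.red is introduced here for illustration):

# reduced model without Number
fit.glm.red <- glm(Kyphosis ~ Age + Start, family = binomial,
                   data = kyphosis)
anova(fit.glm.red, fit.glm, test = "Chisq")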
This anova shows that Number is not very useful here, and we will continue with Age and Start:
fit.sa.gam <- gam(Kyphosis ~ s(Age) + s(Start),
family = binomial, data = kyphosis)
par(mfrow = c(1, 2))
plot(fit.sa.gam, se = TRUE)
From this it seems a model quadratic in Age might work. For Start we see that its spline appears piecewise linear: flat up to about 12 and then with a negative slope. This also makes sense from the background of the data, because values of Start up to 12 correspond to the thoracic region of the spine and values greater than 12 belong to the lumbar region. We will therefore try to fit a model of the form
\[ f(x) = a + b(x-12)I_{[12, \infty)}(x) \] Notice that this defines a continuous function of x. In R we can fit such a term by including I((Start - 12) * (Start > 12)) in the formula; the I() is needed so R does not interpret the * as an interaction. Comparing this with the gam model we see that this model is as good as the generalized additive one.
fit.sa.glm <- glm(Kyphosis ~ poly(Age, 2) +
I((Start - 12) * (Start > 12)),
family = binomial, data = kyphosis)
anova(fit.sa.gam, fit.sa.glm)
## Analysis of Deviance Table
##
## Model 1: Kyphosis ~ s(Age) + s(Start)
## Model 2: Kyphosis ~ poly(Age, 2) + I((Start - 12) * (Start > 12))
## Resid. Df Resid. Dev Df Deviance
## 1 74.498 52.383
## 2 81.000 51.953 -6.5018 0.4301
What does this new model look like?
x <- seq(1, 200, length = 100)  # grid of Age values (in months)
y <- c(5, 10, 13, 14)  # a few representative values of Start
xy <- expand.grid(x, y)
df <- data.frame(Age=xy[, 1], Start=xy[, 2])
df$Prob <- predict(fit.sa.glm, df, type = "response")  # predicted probabilities
ggplot(df, aes(Age, Prob)) +
geom_line() +
facet_wrap(~Start)
where we can see that the risk of kyphosis is highest for children around 100 months (about 8 years) of age, and that it decreases the larger Start is.