Logistic Regression - Generalized Linear Models

In all the examples so far we had a quantitative response. In this section we will study the case of a categorical response.

Example: Challenger shuttle disaster

We begin with a very famous dataset from the Challenger shuttle disaster. On Jan 28, 1986, at 11:38 am EST, the space shuttle Challenger was launched from Cape Canaveral, Florida. The mission ended 73 seconds later when the Challenger exploded. All 7 crew members were killed.

[Video: Challenger disaster]

What happened?

Hot propellant gases flew past the aft joint of the right solid rocket booster, burning through two rubber O-rings. An investigation ensued into the reliability of the shuttle’s propulsion system. The explosion was eventually traced to the failure of one of the three field joints on one of the two solid booster rockets. Each of these six field joints includes two O-rings, designated as primary and secondary, which fail when phenomena called erosion and blowby both occur.

The night before the launch a decision had to be made regarding launch safety. The discussion among engineers and managers leading to this decision included concern that the probability of failure of the O-rings depended on the temperature t at launch, which was forecast to be 31 degrees F. There are strong engineering reasons based on the composition of O-rings to support the judgment that failure probability may rise monotonically as temperature drops.

The discussion centered on the following data from the previous 23 shuttle launches:

kable.nice(head(shuttle))
Temp  NumFail  Failure
  66        0        0
  70        1        1
  69        0        0
  68        0        0
  67        0        0
  72        0        0
ggplot(data=shuttle, aes(Temp, NumFail)) +
  geom_point() 

there seems to be a tendency for failures at lower temperatures.

The variable Failure is an indicator of failure or not:

plt <- ggplot(data=shuttle, aes(Temp, Failure)) +
  geom_jitter(height = 0) 
plt

Again we want to use regression to study the relationship between temperature and failure. But now the response is categorical, so x and y are no longer measured on the same scale.

The way to connect them is to predict the probability of failure, which again is a quantitative variable. This is done as follows: we have responses \(y_1, ..., y_n\) which can be modeled as \(Y_i \sim Ber(\pi_i)\). The \(\pi_i\) are related to the predictor variable \(x\) via the link function

\[ \log \left(\frac{\pi(x)}{1-\pi(x)} \right) =\alpha + \beta x \]

this is called the logit link function. There are others as well.
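For instance, here is a minimal sketch of two alternatives available in R's binomial family, the probit and the complementary log-log links; the value 2 below is just an arbitrary illustration of how each inverse link maps the linear predictor to a probability:

binomial(link = "logit")$linkinv(2)    # exp(2)/(1+exp(2)), about 0.881
binomial(link = "probit")$linkinv(2)   # pnorm(2), about 0.977
binomial(link = "cloglog")$linkinv(2)  # 1-exp(-exp(2)), about 0.999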

We can invert the logit function:

\[ \pi(x)=\frac{e^{\alpha + \beta x}}{1+e^{\alpha + \beta x}} \]

notice that this rules out \(\pi(x)=0\) or \(1\). There are other link functions that don’t do that.

How do we fit such a model, that is, find \(\alpha\) and \(\beta\)? For linear regression we used the method of least squares, which was possible because we could directly compare x and y. This is not the case here because \(\pi\) and x have different forms, which is why we needed a link function. Instead one uses maximum likelihood for the estimation. In R this is done using the command glm with the argument family=binomial:

fit <- glm(Failure~Temp, 
           family=binomial,
           data=shuttle)
fit
## 
## Call:  glm(formula = Failure ~ Temp, family = binomial, data = shuttle)
## 
## Coefficients:
## (Intercept)         Temp  
##     15.0429      -0.2322  
## 
## Degrees of Freedom: 22 Total (i.e. Null);  21 Residual
## Null Deviance:       28.27 
## Residual Deviance: 20.32     AIC: 24.32
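As a sanity check, here is a minimal sketch that fits the same model by maximizing the Bernoulli log-likelihood directly with optim (the function loglik is ours, purely for illustration):

loglik <- function(beta, x, y) {
  # log-likelihood of independent Ber(pi_i) responses under the logit link
  p <- plogis(beta[1] + beta[2]*x)
  sum(dbinom(y, 1, p, log=TRUE))
}
round(optim(c(0, 0),
            function(b) -loglik(b, shuttle$Temp, shuttle$Failure),
            method="BFGS", control=list(maxit=1000))$par, 3)
# should come out close to the glm coefficients 15.043 and -0.232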

Let’s see what this looks like

x <- seq(30, 80, 1)
df <- data.frame(x=x, 
                 y=predict(fit, data.frame(Temp=x), 
                                type="response"))
plt +
  geom_line(data=df, aes(x, y),
            color="blue", size=1.2)

we see that at a launch temperature of 32F (close to the forecast of 31F) the failure probability is essentially 1.
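We can confirm this directly from the fitted coefficients by plugging them into the inverse logit:

# P(failure) at 32F, computed from the fitted coefficients
eta <- coef(fit)[1] + coef(fit)[2]*32
as.numeric(exp(eta)/(1+exp(eta)))
# about 0.9995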

What would be a \(95\%\) confidence interval for the probability at 32F?

tmp <- predict(fit, data.frame(Temp=32), 
     type="response", se.fit=TRUE)
round(tmp$fit +c(-1, 1)*qnorm(0.975)*tmp$se.fit, 3)
## [1] 0.996 1.003

but there is something silly about this interval: it goes beyond 1! This is a consequence of using normal theory intervals on the probability scale. A better solution is to find the interval on the link (logit) scale, where the normal approximation is more reasonable, and then transform the endpoints back:

tmp <- predict(fit, data.frame(Temp=32), 
     type="link", se.fit=TRUE)
e <- tmp$fit
r <- tmp$se.fit
e
##        1 
## 7.613694
r
## [1] 3.933421
cr <- qnorm(0.975)
round(c(exp(e-cr*r)/(1+exp(e-cr*r)),
        exp(e+cr*r)/(1+exp(e+cr*r))), 3)
##     1     1 
## 0.476 1.000

but this has a much lower (likely too low) lower limit. This is not surprising: 32F is far below all the temperatures in the data, so we are extrapolating and the uncertainty is large.

Example: Graduated or not from UPR

Let’s say we want to predict whether or not somebody will graduate from UPR as a function of their high school GPA. We will give them six years to do so, so we will consider all applicants from 2003-2007:

dta <- upr[upr$Year<=2007, 
           c("Highschool.GPA", "Graduated")]
dta$GradInd <- ifelse(dta$Graduated=="Si", 1, 0)
plt <- ggplot(dta, aes(Highschool.GPA, GradInd)) +
  geom_jitter(width=0, height=0.1)
plt

fit <- glm(GradInd~Highschool.GPA, 
           family=binomial,
           data=dta)
fit
## 
## Call:  glm(formula = GradInd ~ Highschool.GPA, family = binomial, data = dta)
## 
## Coefficients:
##    (Intercept)  Highschool.GPA  
##         -6.252           1.680  
## 
## Degrees of Freedom: 11409 Total (i.e. Null);  11408 Residual
## Null Deviance:       15790 
## Residual Deviance: 14960     AIC: 14960
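A convenient way to read the slope is on the odds scale (a small side calculation using the fit above):

# each additional GPA point multiplies the odds of graduating by exp(slope)
round(exp(coef(fit)["Highschool.GPA"]), 2)
# about 5.4

Now let's draw the fitted curve over the data: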
x <- seq(2, 4, length=100)
df <- data.frame(x=x, 
         y=predict(fit, data.frame(Highschool.GPA=x), 
                                type="response"))
plt +
  geom_line(data=df, aes(x, y),
            color="blue", size=1.2)

so the probability of a successful graduation does increase with high school GPA, but even at a GPA of 4.0 it only reaches about 62%.
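To check that number (a quick sketch using the fitted model):

# predicted graduation probability for a high school GPA of 4.0
round(predict(fit, data.frame(Highschool.GPA=4), type="response"), 3)
# about 0.62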

Let’s find a pointwise confidence band for the success probability:

tmp <- predict(fit, data.frame(Highschool.GPA=x), 
     type="link", se.fit=TRUE)
e <- tmp$fit
r <- tmp$se.fit
cr <- qnorm(0.975)
ymin <- exp(e-cr*r)/(1+exp(e-cr*r))
ymax <- exp(e+cr*r)/(1+exp(e+cr*r))
df1 <- data.frame(x=x, ymin=ymin, ymax=ymax)
ggplot(dta, aes(Highschool.GPA, GradInd)) +
 geom_line(data=df, aes(x, y),
            color="blue", size=1.2) + 
 geom_ribbon(data=df1, 
              aes(x=x, ymin=ymin, ymax=ymax),
              alpha=0.2, inherit.aes = FALSE)
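Since intervals like this come up repeatedly, here is one way to wrap the calculation into a small helper function (a sketch; the name logit.ci is ours, not a standard R function):

logit.ci <- function(fit, newdata, level=0.95) {
  # confidence limits computed on the link scale, then mapped back with the
  # inverse logit (plogis), so they always stay between 0 and 1
  tmp <- predict(fit, newdata, type="link", se.fit=TRUE)
  cr <- qnorm(1-(1-level)/2)
  data.frame(fit = plogis(tmp$fit),
             lower = plogis(tmp$fit - cr*tmp$se.fit),
             upper = plogis(tmp$fit + cr*tmp$se.fit))
}
round(logit.ci(fit, data.frame(Highschool.GPA=c(2.5, 3, 3.5, 4))), 3)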

Example: Warp Breaks

The data set gives the results of an experiment to determine the effect of wool type (A or B) and tension (low, medium or high) on the number of warp breaks per loom. Data was collected for nine looms for each combination of settings.

kable.nice(head(warpbreaks))
breaks  wool  tension
    26     A        L
    30     A        L
    54     A        L
    25     A        L
    70     A        L
    52     A        L

we want to build a model relating the wool type and tension to the number of breaks.

What distribution might be appropriate for breaks? First, let’s have a look at them:

bw <- diff(range(warpbreaks$breaks))/20 
ggplot(warpbreaks, aes(x=breaks)) +
  geom_histogram(color = "black", 
    fill = "white", 
    binwidth = bw) + 
    labs(x = "x", y = "Breaks")

our data is counts with a bit of skew to the right. This is typical for data from a Poisson distribution.

Here is another argument in favor of a Poisson: Each loom could be considered as a series of small intervals. We then would have a large number of such intervals, each of which has a small probability of a break. The total number of breaks would be the sum of the breaks in each interval, and therefore would be Binomial. But in this case the Poisson approximation to the Binomial would be very good.
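A quick numerical illustration of that approximation (the values n = 1000 intervals and per-interval probability 0.03 are made up for illustration):

# Binomial(n=1000, p=0.03) vs its Poisson(lambda = n*p = 30) approximation
round(rbind(binomial = dbinom(25:30, 1000, 0.03),
            poisson  = dpois(25:30, 30)), 4)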

Again we want to use regression to relate type and tension to breaks. In the case of a Poisson response variable the link function is given by the logarithm.
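That is, if the counts are modeled as \(Y_i \sim Pois(\lambda_i)\), and writing \(x_{i1}, ..., x_{ik}\) for the predictor variables (here the dummy variables for wool and tension), the model is

\[ \log \left( \lambda_i \right) = \alpha + \beta_1 x_{i1} + ... + \beta_k x_{ik} \]

which guarantees \(\lambda_i > 0\), just as the logit link guaranteed \(0 < \pi(x) < 1\). In R we again use glm, now with family=poisson: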

fit <- glm(breaks~wool*tension, 
           data=warpbreaks, 
           family=poisson) 
summary(fit)
## 
## Call:
## glm(formula = breaks ~ wool * tension, family = poisson, data = warpbreaks)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3383  -1.4844  -0.1291   1.1725   3.5153  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)     3.79674    0.04994  76.030  < 2e-16
## woolB          -0.45663    0.08019  -5.694 1.24e-08
## tensionM       -0.61868    0.08440  -7.330 2.30e-13
## tensionH       -0.59580    0.08378  -7.112 1.15e-12
## woolB:tensionM  0.63818    0.12215   5.224 1.75e-07
## woolB:tensionH  0.18836    0.12990   1.450    0.147
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 297.37  on 53  degrees of freedom
## Residual deviance: 182.31  on 48  degrees of freedom
## AIC: 468.97
## 
## Number of Fisher Scoring iterations: 4

and we see that all terms except one interaction term (woolB:tensionH) are statistically significant.
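As with the logistic model, the coefficients are easier to interpret after exponentiating, where they act multiplicatively on the expected number of breaks (a small side calculation using the fit above):

round(exp(coef(fit)), 3)
# for example exp(-0.457) = 0.633, so at the reference (low) tension
# wool B has about 37% fewer expected breaks than wool A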


Let’s do our own little study of Poisson regression. First we generate some data:

x <- 1:100/50
df <- data.frame(x=x, y=rpois(100, 10*x))
plt <- ggplot(data=df, aes(x, y)) +
  geom_point() 
plt

fit <- glm(y~x, 
           data=df, 
           family=poisson) 
summary(fit)
## 
## Call:
## glm(formula = y ~ x, family = poisson, data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6425  -0.9848  -0.1348   0.5833   2.9190  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.03973    0.08816   11.79   <2e-16
## x            1.05278    0.06143   17.14   <2e-16
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 472.36  on 99  degrees of freedom
## Residual deviance: 146.59  on 98  degrees of freedom
## AIC: 524.89
## 
## Number of Fisher Scoring iterations: 5
df1 <- df
df1$y <- predict(fit, type="response")
plt + 
  geom_line(data=df1, aes(x, y), color="blue", size=1.2)

and that looks quite good!
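Since we simulated the data ourselves we know the true mean curve, \(10x\), so as one more (optional) check we can overlay it on the fitted curve (df2 is a new data frame created just for this comparison):

# add the true mean of the simulated data (red, dashed) for comparison
df2 <- data.frame(x=x, y=10*x)
plt + 
  geom_line(data=df1, aes(x, y), color="blue", size=1.2) +
  geom_line(data=df2, aes(x, y), color="red",
            linetype=2, size=1.2)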