Logistic Regression

Example

We begin with a very famous data set from the Challenger shuttle disaster. On January 28, 1986, at 11:38 am EST, the space shuttle Challenger was launched from Cape Canaveral, Florida. The mission ended 73 seconds later when the Challenger exploded. All 7 crew members were killed.

What happened?

Hot propellant gases flew past the aft joint of the right solid rocket booster, burning through two rubber O-rings. An investigation ensued into the reliability of the shuttle’s propulsion system. The explosion was eventually traced to the failure of one of the three field joints on one of the two solid booster rockets. Each of these six field joints includes two O-rings, designated as primary and secondary, which fail when phenomena called erosion and blowby both occur.

The night before the launch a decision had to be made regarding launch safety. The discussion among engineers and managers leading to this decision included concern that the probability of failure of the O-rings depended on the temperature t at launch, which was forecast to be 31 degrees F. There are strong engineering reasons based on the composition of O-rings to support the judgment that failure probability may rise monotonically as temperature drops.

The discussion centered on the following data from the previous 23 shuttle launches:

kable.nice(head(shuttle))
  Temp NumFail Failure
1   66       0       0
2   70       1       1
3   69       0       0
4   68       0       0
5   67       0       0
6   72       0       0
ggplot(data=shuttle, aes(Temp, NumFail)) +
  geom_point()

there seems to be a tendency for failures at lower temperatures.

The variable Failure is an indicator of failure or not:

plt <- ggplot(data=shuttle, aes(Temp, Failure)) +
  geom_jitter(height = 0) 
plt

So again we have a model of the form \(y_i=\beta_0+\beta_1 x_i +\epsilon_i\), but now \(y_1,..,y_n\) are Bernoulli rv’s. Therefore

\[ \begin{aligned} &E[y_i] = P(y_i=1) = p_i = 1-P(y_i=0)\\ &var(y_i) =p_i(1-p_i) \end{aligned} \] Note that therefore

\[var(y_i)=p_i(1-p_i)=E[y_i](1-E[y_i])=(\beta_0+\beta_1 x_i)(1-\beta_0-\beta_1 x_i)\] and we also have unequal variances. This means that the usual least squares estimators are no longer optimal.

Another issue is that, as always, \(0\le p_i\le 1\), but the least squares estimate \(\hat{p}_i=\hat{\beta}_0+\hat{\beta}_1 x_i\) need not be. What is needed is a model that always yields values in [0,1], that is, a model of the form

\[l(p_i) = \beta_0+\beta_1 x_i +\epsilon_i\]

where l is a link function. One popular choice is the logit transform \(l(x)=\log(\frac{x}{1-x})\), which leads to logistic regression.
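As a small numerical illustration (not in the original notes), the logit maps probabilities in (0,1) onto the whole real line, and its inverse maps back:

logit <- function(p) log(p/(1 - p))        # the logit transform
inv.logit <- function(x) exp(x)/(1 + exp(x))  # its inverse
logit(c(0.01, 0.5, 0.99))    # roughly -4.6, 0, 4.6
inv.logit(logit(0.25))       # 0.25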

Note

\[ \begin{aligned} &\log(\frac{p_i}{1-p_i}) = \beta_0+\beta_1 x_i\\ &\frac{p_i}{1-p_i} = \exp \left\{\beta_0+\beta_1 x_i \right\} \\ &p_i = \frac{\exp \left\{\beta_0+\beta_1 x_i \right\}}{1+\exp \left\{\beta_0+\beta_1 x_i \right\}} = \frac{1}{1+\exp \left\{-\beta_0-\beta_1 x_i \right\}}\\ \end{aligned} \]

Here is what this looks like:

curve(exp(x)/(1+exp(x)), -5, 5, ylab="inverse logit", col="blue", lwd=2)

Because the observed data are discrete and the predicted values are continuous, the residuals \(y_i-\hat{y}_i\) are no longer a sensible criterion, and least squares cannot be used for estimation. Instead one generally uses maximum likelihood:

\[ \begin{aligned} &L(\beta_0, \beta_1) =\prod_{i=1}^n f(y_i;\beta_0, \beta_1)=\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i} \\ &\log L(\beta_0, \beta_1) =\sum_i \left[ y_i\log p_i+(1-y_i)\log(1-p_i) \right]= \\ &\sum_i \left[ y_i\log \frac{\exp \left\{\beta_0+\beta_1 x_i \right\}}{1+\exp \left\{\beta_0+\beta_1 x_i \right\}}+(1-y_i)\log \frac{1}{1+\exp \left\{\beta_0+\beta_1 x_i \right\}} \right] = \\ &\sum_i \left[ y_i(\beta_0+\beta_1 x_i)-y_i\log(1+\exp \left\{\beta_0+\beta_1 x_i\right\})-(1-y_i)\log(1+\exp \left\{\beta_0+\beta_1 x_i\right\}) \right] = \\ &\sum_i y_i(\beta_0+\beta_1 x_i)-\sum_i\log(1+\exp \left\{\beta_0+\beta_1 x_i\right\}) \end{aligned} \]

\[ \begin{aligned} &d \log L(\beta_0, \beta_1)/d \beta_0 = \sum_i y_i-\sum_i \frac{\exp \left\{\beta_0+\beta_1 x_i\right\}}{1+\exp \left\{\beta_0+\beta_1 x_i\right\}}=0\\ &d \log L(\beta_0, \beta_1)/d \beta_1 = \sum_i y_ix_i-\sum_i \frac{\exp \left\{\beta_0+\beta_1 x_i\right\}x_i}{1+\exp \left\{\beta_0+\beta_1 x_i\right\}} = 0\\ \end{aligned} \]

and these equations have to be solved numerically.
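Here is a minimal sketch of solving them by Newton-Raphson (essentially what glm does via iteratively reweighted least squares), using simulated data; the data and variable names here are made up and not part of the shuttle analysis:

# Newton-Raphson / Fisher scoring for the two score equations above
set.seed(111)
x <- runif(100, 50, 80)                        # made-up temperatures
y <- rbinom(100, 1, 1/(1+exp(-(10-0.15*x))))   # made-up 0/1 responses
X <- cbind(1, x)                               # design matrix
beta <- c(0, 0)                                # starting values
for(k in 1:25) {
  eta <- as.vector(X %*% beta)
  p.hat <- 1/(1 + exp(-eta))
  score <- t(X) %*% (y - p.hat)                # the two equations above
  W <- diag(p.hat*(1 - p.hat))
  beta <- beta + solve(t(X) %*% W %*% X, score)   # Newton step
}
c(beta)    # should agree with coef(glm(y ~ x, family=binomial))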

Clearly, if there is more than one predictor, the model becomes

\[\log(\frac{p_i}{1-p_i})=\pmb{x_i'\beta}\]
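As an illustration of the multiple-predictor case, here is a minimal sketch on simulated data (x1, x2 and the coefficients are made up, not from any data set in these notes):

# logistic regression with two (made-up) predictors
set.seed(222)
x1 <- rnorm(200)
x2 <- rnorm(200)
y <- rbinom(200, 1, 1/(1 + exp(-(-1 + 2*x1 - x2))))
glm(y ~ x1 + x2, family=binomial)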

Example

For the Challenger disaster data we find

# negative log-likelihood from the derivation above (optim minimizes)
ll <- function(beta) 
  -(sum(shuttle$Failure*(beta[1]+beta[2]*shuttle$Temp))-
   sum(log(1+exp(beta[1]+beta[2]*shuttle$Temp))))
fit <- optim(c(0, 0), ll)
fit
## $par
## [1] 15.0453570 -0.2321977
## 
## $value
## [1] 10.1576
## 
## $counts
## function gradient 
##       99       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
x=seq(30, 90, length=100)
y=exp(fit$par[1]+fit$par[2]*x)/(1+exp(fit$par[1]+fit$par[2]*x))
df=data.frame(x=x, y=y)
plt + 
  geom_line(aes(x,y), data=df, col="blue", size=1.2)

so at the forecast launch temperature of 31 degrees F the estimated probability of a failure was essentially 1.
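That value can be checked directly from the coefficients found by optim (a quick check, not in the original notes):

# estimated failure probability at the forecast temperature of 31F
eta <- fit$par[1] + fit$par[2]*31
exp(eta)/(1 + exp(eta))
## roughly 0.9996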

Using R we can fit such a logistic regression model with

fit <- glm(Failure~Temp, 
           family=binomial,
           data=shuttle)
fit
## 
## Call:  glm(formula = Failure ~ Temp, family = binomial, data = shuttle)
## 
## Coefficients:
## (Intercept)         Temp  
##     15.0429      -0.2322  
## 
## Degrees of Freedom: 22 Total (i.e. Null);  21 Residual
## Null Deviance:       28.27 
## Residual Deviance: 20.32     AIC: 24.32

What would be a 95% confidence interval for the probability at 32F?

tmp <- predict(fit, data.frame(Temp=32), 
     type="response", se.fit=TRUE)
round(tmp$fit +c(-1, 1)*qnorm(0.975)*tmp$se.fit, 3)
## [1] 0.996 1.003

but there is something silly about this interval: it goes beyond 1! This is a consequence of using normal theory intervals. Here is a better solution:

tmp <- predict(fit, data.frame(Temp=32), 
     type="link", se.fit=TRUE)
e <- tmp$fit
r <- tmp$se.fit
cr <- qnorm(0.975)
round(c(exp(e-cr*r)/(1+exp(e-cr*r)),
        exp(e+cr*r)/(1+exp(e+cr*r))), 3)
##     1     1 
## 0.476 1.000

but this has a much lower (likely too low) lower limit.

Example

UPR Admissions data

Let’s say we want to predict whether or not somebody will graduate from UPR as a function of their Highschool GPA. We will give them six years to do so, so we will consider all applicants from 2003-2007:

dta <- upr[upr$Year<=2007, 
           c("Highschool.GPA", "Graduated")]
dta$GradInd <- ifelse(dta$Graduated=="Si", 1, 0)
plt <- ggplot(dta, aes(Highschool.GPA, GradInd)) +
  geom_jitter(width=0, height=0.1)
plt

fit <- glm(GradInd~Highschool.GPA, 
           family=binomial,
           data=dta)
fit
## 
## Call:  glm(formula = GradInd ~ Highschool.GPA, family = binomial, data = dta)
## 
## Coefficients:
##    (Intercept)  Highschool.GPA  
##         -6.252           1.680  
## 
## Degrees of Freedom: 11409 Total (i.e. Null);  11408 Residual
## Null Deviance:       15790 
## Residual Deviance: 14960     AIC: 14960
x <- seq(2, 4, length=100)
df <- data.frame(x=x, 
         y=predict(fit, data.frame(Highschool.GPA=x), 
                                type="response"))
plt +
  geom_line(data=df, aes(x, y),
            color="blue", size=1.2)

so the probability of a successful graduation does increase, but only to about 62%.
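The 62% figure can be checked directly from the fitted model (a quick check, not in the original notes):

# predicted graduation probability for a 4.0 high school GPA
round(predict(fit, data.frame(Highschool.GPA=4), type="response"), 3)
## roughly 0.61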

Let’s find a pointwise confidence band for the success probability:

tmp <- predict(fit, data.frame(Highschool.GPA=x), 
     type="link", se.fit=TRUE)
e <- tmp$fit
r <- tmp$se.fit
cr <- qnorm(0.975)
ymin <- exp(e-cr*r)/(1+exp(e-cr*r))
ymax <- exp(e+cr*r)/(1+exp(e+cr*r))
df1 <- data.frame(x=x, ymin=ymin, ymax=ymax)
ggplot(dta, aes(Highschool.GPA, GradInd)) +
 geom_line(data=df, aes(x, y),
            color="blue", size=1.2) + 
 geom_ribbon(data=df1, 
              aes(x=x, ymin=ymin, ymax=ymax),
              alpha=0.2, inherit.aes = FALSE)

Poisson Regression

Example

Warp Breaks: the data set gives the results of an experiment to determine the effect of wool type (A or B) and tension (low, medium or high) on the number of warp breaks per loom. Data was collected for nine looms for each combination of settings.

kable.nice(head(warpbreaks))
  breaks wool tension
1     26    A       L
2     30    A       L
3     54    A       L
4     25    A       L
5     70    A       L
6     52    A       L

we want to build a model relating the wool type and tension to the number of breaks.

What distribution might be appropriate for breaks? First, let’s have a look at them:

bw <- diff(range(warpbreaks$breaks))/20 
ggplot(warpbreaks, aes(x=breaks)) +
  geom_histogram(color = "black", 
    fill = "white", 
    binwidth = bw) + 
    labs(x = "Breaks", y = "Counts")

our data is counts with a bit of skew to the right. This is typical for data from a Poisson distribution.

Here is another argument in favor of a Poisson: Each loom could be considered as a series of small intervals. We then would have a large number of such intervals, each of which has a small probability of a break. The total number of breaks would be the sum of the breaks in each interval, and therefore would be Binomial. But in this case the Poisson approximation to the Binomial would be very good.
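A quick numerical check of that approximation (the values of n and p here are made up for illustration):

# Binomial(n, p) with n large and p small versus Poisson(n*p)
n <- 500; p <- 0.06
k <- 20:30
round(rbind(Binomial = dbinom(k, n, p),
            Poisson  = dpois(k, n*p)), 4)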

Again we want to use regression to relate type and tension to breaks. In the case of a Poisson response variable the link function is given by the logarithm.

fit <- glm(breaks~wool*tension, 
           data=warpbreaks, 
           family=poisson) 
summary(fit)
## 
## Call:
## glm(formula = breaks ~ wool * tension, family = poisson, data = warpbreaks)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3383  -1.4844  -0.1291   1.1725   3.5153  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)     3.79674    0.04994  76.030  < 2e-16
## woolB          -0.45663    0.08019  -5.694 1.24e-08
## tensionM       -0.61868    0.08440  -7.330 2.30e-13
## tensionH       -0.59580    0.08378  -7.112 1.15e-12
## woolB:tensionM  0.63818    0.12215   5.224 1.75e-07
## woolB:tensionH  0.18836    0.12990   1.450    0.147
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 297.37  on 53  degrees of freedom
## Residual deviance: 182.31  on 48  degrees of freedom
## AIC: 468.97
## 
## Number of Fisher Scoring iterations: 4

and we see that all terms except one interaction term are statistically significant.
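One could also test the interaction as a whole with a likelihood-ratio test; a sketch (not done in the original notes):

# compare the additive model to the model with the wool:tension interaction
fit.add <- glm(breaks ~ wool + tension, data=warpbreaks, family=poisson)
anova(fit.add, fit, test="Chisq")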


Let’s do our own little study of Poisson regression. First we generate some data:

x <- 1:100/50
df <- data.frame(x=x, y=rpois(100, 10*x))
plt <- ggplot(data=df, aes(x, y)) +
  geom_point() 
plt

fit <- glm(y~x, 
           data=df, 
           family=poisson) 
summary(fit)
## 
## Call:
## glm(formula = y ~ x, family = poisson, data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6425  -0.9848  -0.1348   0.5833   2.9190  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.03973    0.08816   11.79   <2e-16
## x            1.05278    0.06143   17.14   <2e-16
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 472.36  on 99  degrees of freedom
## Residual deviance: 146.59  on 98  degrees of freedom
## AIC: 524.89
## 
## Number of Fisher Scoring iterations: 5
df1 <- df
df1$y <- predict(fit, type="response")
plt + 
  geom_line(data=df1, aes(x, y), color="blue", size=1.2)

and that looks quite good!

Generalized Linear Models

Logistic and Poisson regression are examples of Generalized Linear Models, which can be characterized as follows: we have

  1. Independent random variables \(y_1,..,y_n\) with \(E[y_i]=\mu_i\) and density function from an exponential family.

  2. A linear predictor \(\pmb{x_i'\beta}\)

  3. A link function \(g(\mu_i)=\pmb{x_i'\beta}\)
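For instance, logistic regression is the special case of Bernoulli responses with the logit link, and Poisson regression the case of Poisson responses with the log link:

\[g(\mu_i)=\log \frac{\mu_i}{1-\mu_i} \quad \text{(logistic)}, \qquad g(\mu_i)=\log \mu_i \quad \text{(Poisson)}\]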

Definition

A distribution is said to belong to the exponential family if its density can be written as

\[ f(\pmb{x};\pmb{\theta}) =h(\pmb{x})\exp \left\{ \pmb{\theta}' T(\pmb{x}) -A(\pmb{\theta}) \right\} \] where

  • \(\theta\) is a vector of parameters
  • \(T(x)\) is a vector of sufficient statistics
  • A is a function of \(\theta\) alone and h is a function of x alone

we have

\[ \begin{aligned} &\int f(x;\theta) dx = \\ &\int h(x)\exp \left\{ \theta^T T(x) -A(\theta) \right\} dx = \\ & \exp \left\{ -A(\theta) \right\} \int h(x)\exp \left\{ \theta^T T(x) \right\} dx = 1\\ \end{aligned} \]

so

\[ A(\theta) = \log \left[ \int h(x)\exp \left\{ \theta^T T(x) \right\} dx\right] \]

Examples

  • Bernoulli

\[ \begin{aligned} & f(x;p) = p^x (1-p)^{1-x}=\\ & \exp \left\{ x \log p + (1-x) \log (1-p) \right\} = \\ & \exp \left\{ x (\log p - \log (1-p)) + \log (1-p) \right\} =\\ & \exp \left\{ x \log \frac{p}{1-p} + \log (1-p) \right\}=\\ &\exp \left\{ x\theta - \log(1+e^\theta)\right\} \end{aligned} \] so \[ \begin{aligned} & \theta = \log \frac{p}{1-p}\\ & h(x) = 1 \\ & T(x) = x\\ &A(\theta) = \log(1+e^\theta) \end{aligned} \]

because

\[ \begin{aligned} &\theta =\log \frac{p}{1-p} \\ &e^\theta = \frac{p}{1-p} \\ &p = \frac{e^\theta}{1+e^\theta}\\ &1-p = \frac{1}{1+e^\theta}\\ &\log(1-p) = -\log(1+e^\theta)\\ \end{aligned} \]

  • Normal

\[ \begin{aligned} &\frac{1}{\sqrt{2\pi \sigma^2}} \exp \left\{ -\frac1{2 \sigma^2} \left( x-\mu \right)^2 \right\} = \\ & \frac{1}{\sqrt{2\pi }}\exp \left\{- \frac1{2 \sigma^2} \left( x^2-2x \mu + \mu^2 \right)- \log \sigma \right \} = \\ &\frac{1}{\sqrt{2\pi }}\exp \left\{ -\frac{x^2}{2 \sigma^2} + \frac{x \mu}{ \sigma^2} - \frac{\mu^2}{2 \sigma^2} - \log \sigma \right \} \end{aligned} \] so

\[ \begin{aligned} & \theta = (\mu/\sigma^2, -1/(2\sigma^2))^T\\ & h(x) = \frac{1}{\sqrt{2\pi }} \\ & T(x) = (x, x^2)^T\\ &A(\theta) = \frac{\mu^2}{2 \sigma^2} + \log \sigma = \\ &-\theta_1^2/(4\theta_2)-\frac12 \log (-2 \theta_2) \end{aligned} \]

The likelihood function is given by

\[L(\pmb{y};\pmb{\theta}) =\prod_i h(y_i)\exp \left\{ \pmb{\theta}' T(y_i) -A(\pmb{\theta}) \right\}=\] \[\left[\prod_i h(y_i)\right]\exp \left\{ \pmb{\theta}' \sum_i T(y_i) -nA(\pmb{\theta}) \right\}\] and so

\[\log L(\pmb{y};\pmb{\theta})=K+ \pmb{\theta}'\sum_i T(y_i) -nA(\pmb{\theta})\] where \(K=\sum_i \log h(y_i)\) does not depend on \(\pmb{\theta}\). For the exponential family we have

\[E[y_i]=\mu_i=A'(\theta)\]
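As a quick check of this identity in the Bernoulli case above, with the canonical parameter \(\theta=\log\frac{p}{1-p}\):

\[A(\theta)=\log(1+e^\theta), \qquad A'(\theta)=\frac{e^\theta}{1+e^\theta}=p=E[y_i]\]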