Models with Categorical Predictors

Example: Environmental, Safety and Health Attitudes

We have data on the environment, safety and health (ES&H) attitudes of the employees of a laboratory. Each employee filled out a questionnaire, which was collated into an average score from 1 (bad) to 10 (good). We also have the length of service of the employee and their gender.

kable.nice(head(esh))
ES.H  Yrs.Serv  Sex
 7.6         5  Female
 9.0        30  Female
 8.0        12  Female
 6.8         7  Female
 7.4         7  Female
 9.8        27  Female

One of the predictor variables (Sex) is categorical. When a categorical variable is coded as numbers for use in a regression model, the coded variable is often referred to as a dummy variable.

Let’s start by looking at each predictor separately.

  • Yrs.Serv is quantitative, so we do a scatterplot:
attach(esh)
ggplot(data=esh, aes(Yrs.Serv, ES.H)) + 
    geom_point() +
  geom_smooth(method = "lm", se=FALSE)

  • Sex is categorical, so we do a boxplot:
ggplot(data=esh, aes(Sex, ES.H)) + 
    geom_boxplot()

The values in Sex (Male, Female) are text, but in a regression everything needs to be numeric. So in order to use Sex in a regression model we first have to code it as numbers, for example Female=0 and Male=1:

SexCode <- rep(0, length(Sex))
SexCode[Sex=="Male"] <- 1
esh1 <- data.frame(ESH=esh$ES.H, 
                   YrsServ=esh$Yrs.Serv, 
                   SexCode=SexCode)
fit <- lm(ESH~., data=esh1)
summary(fit)
## 
## Call:
## lm(formula = ESH ~ ., data = esh1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.23832 -0.49061 -0.05023  0.49141  1.49221 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  7.03542    0.35862  19.618 4.10e-13
## YrsServ      0.09695    0.02228   4.351 0.000435
## SexCode     -2.59099    0.36058  -7.186 1.52e-06
## 
## Residual standard error: 0.7861 on 17 degrees of freedom
## Multiple R-squared:  0.8394, Adjusted R-squared:  0.8205 
## F-statistic: 44.44 on 2 and 17 DF,  p-value: 1.771e-07
  library(grid)  # needed for the viewport layout below
  pushViewport(viewport(layout = grid.layout(1, 2)))
  df <- data.frame(Residuals=resid(fit), 
            Fits = fitted(fit))
  print(ggplot(data=df, aes(sample=Residuals)) +
           geom_qq() + geom_qq_line(),
    vp=viewport(layout.pos.row=1, layout.pos.col=1))
  print(ggplot(data=df, aes(Fits, Residuals)) +
            geom_point() +
            geom_hline(yintercept = 0),
    vp=viewport(layout.pos.row=1, layout.pos.col=2))        

The residual vs. fits plot and the normal plot look good, so this appears to be a good model.


Or is it?

Let's do the following: what would the equation look like if we knew the person was female (or male)? Well:

\[ \begin{aligned} &\text{Female ES.H} = \\ &7.035 + 0.097 \text{Yrs.Serv} - 2.591 \cdot 0 = \\ &7.035 + \mathbf{0.097} \text{Yrs.Serv} \\ \end{aligned} \]

\[ \begin{aligned} & \text{Male ES.H} = \\ & 7.035 + 0.097 \text{Yrs.Serv} - 2.591 \cdot 1 = \\ & 4.444 + \mathbf{0.097} \text{Yrs.Serv} \\ \end{aligned} \]

Notice that both equations have the same slope, so we have parallel lines.

Note that such a model is often called an additive model, similar to an ANOVA without interaction.

What does this look like? Here it is:

ggplot(data=esh, aes(Yrs.Serv, ES.H, color=Sex)) +
         geom_point() +
         scale_color_manual(values=c("red", "blue")) +
         geom_abline(intercept = c(7.035, 4.444), 
                     slope = c(0.097, 0.097), 
                     color=c("red", "blue")) 

Now a model with parallel lines may or may not make sense for our data. The problem is that the way we used the categorical variable (simply coding it as numbers and adding it to the model) will always result in parallel lines, whether that is appropriate or not!

Is there a way to check whether this is ok here? Yes: we need a version of the residual vs. fits plot that identifies the points by Sex. If the model is good, this plot should also show no pattern.

df$Sex <- esh$Sex  # add Sex to the residuals/fits data frame so we can color by it
ggplot(data=df, aes(Fits, Residuals, color=Sex)) +
  geom_point() +
  geom_smooth(method = "lm", se=FALSE) +
  geom_hline(yintercept = 0)

and as we can see there is a definite pattern in the colors.


So, how do we get away from parallel lines? We can do this by adding the product variable Yrs.Serv*SexCode to the model:

esh1$prod <- esh1$YrsServ*esh1$SexCode
fit.prod <- lm(ESH~., data=esh1)
summary(fit.prod)
## 
## Call:
## lm(formula = ESH ~ ., data = esh1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0280 -0.4430 -0.1206  0.3965  1.3300 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  7.32289    0.39824  18.388 3.48e-12
## YrsServ      0.07216    0.02736   2.637   0.0179
## SexCode     -3.20289    0.54300  -5.898 2.25e-05
## prod         0.06534    0.04443   1.470   0.1608
## 
## Residual standard error: 0.7605 on 16 degrees of freedom
## Multiple R-squared:  0.8585, Adjusted R-squared:  0.832 
## F-statistic: 32.37 on 3 and 16 DF,  p-value: 5.004e-07

and now:

\[ \begin{aligned} &\text{Female ES.H} =\\ &7.323 + 0.072 \text{Yrs.Serv} - 3.203 \cdot 0 +0.065 \cdot \text{Yrs.Serv*0}=\\ &7.323 + 0.072 \text{Yrs.Serv} \end{aligned} \]

\[ \begin{aligned} &\text{Male ES.H} =\\ &7.323 + 0.072 \text{Yrs.Serv} - 3.203 \cdot 1 + 0.065 \cdot \text{Yrs.Serv*1}=\\ &4.120 + 0.138 \text{Yrs.Serv} \end{aligned} \]

and so this fits two separate lines.

ggplot(data=esh, aes(Yrs.Serv, ES.H, color=Sex)) +
  geom_point() +
  geom_smooth(method = "lm", se=FALSE)

Now the residual vs. fits plot looks like this:

df <- data.frame(Residuals=resid(fit.prod), 
            Fits = fitted(fit.prod),
            Sex = esh$Sex)
ggplot(data=df, aes(Fits, Residuals, color=Sex)) +
  geom_point() +
  geom_smooth(method = "lm", se=FALSE) +
  geom_hline(yintercept = 0)

Note that you can get the same two equations by splitting the data set into two parts (the scores and years of service of the females, and those of the males) and fitting a simple regression to each:

round(coef(lm(ES.H[Sex=="Female"]~Yrs.Serv[Sex=="Female"])), 3)
##               (Intercept) Yrs.Serv[Sex == "Female"] 
##                     7.323                     0.072
round(coef(lm(ES.H[Sex=="Male"]~Yrs.Serv[Sex=="Male"])), 3)
##             (Intercept) Yrs.Serv[Sex == "Male"] 
##                   4.120                   0.137

Doing one multiple regression has some advantages, though. For example, you get one R2 for the whole problem rather than one for each part. Moreover, this single R2 will usually be higher than either of the two separate ones.
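
We can check this here; a small sketch (not part of the original code) comparing the three R2 values:

# R2 of the two separate simple regressions vs the single multiple regression
summary(lm(ES.H[Sex=="Female"] ~ Yrs.Serv[Sex=="Female"]))$r.squared
summary(lm(ES.H[Sex=="Male"] ~ Yrs.Serv[Sex=="Male"]))$r.squared
summary(fit.prod)$r.squared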

Above we fitted the independent lines model by explicitly calculating the product term. A better way is to do this:

esh2 <- esh
esh2$Sex <- SexCode
fit.prod <- lm(ES.H~.^2, data=esh2)
round(coef(fit.prod), 3)
##  (Intercept)     Yrs.Serv          Sex Yrs.Serv:Sex 
##        7.323        0.072       -3.203        0.065
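
As an aside, we could also have left Sex as text and let R do the coding automatically: R turns a text variable into a factor and uses so-called treatment contrasts. A minimal sketch (not needed for the analysis):

# Female is the reference level here, so SexMale plays the role of SexCode and
# the coefficients match fit.prod above
fit.auto <- lm(ES.H ~ Yrs.Serv*Sex, data=esh)
round(coef(fit.auto), 3)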

So now we have two models:

  • parallel lines: ES.H = 7.035 + 0.097 Yrs.Serv - 2.591 Sex

R2 = 83.9%

  • separate lines: ES.H = 7.323 + 0.072 Yrs.Serv - 3.203 SexCode + 0.065 Yrs.Serv*SexCode

R2 = 85.85%

Clearly the second one has a higher R2. But the first one is a special case of the second (the models are nested), so the model with parallel lines can never have an R2 higher than the model with separate lines; it will usually be a bit lower.

Of course the parallel lines model has two terms while the other one has three, so it is simpler, and we would prefer the parallel lines model, if possible.

What we want to know is whether the model with two separate lines is statistically significantly better than the model with parallel lines. So we need a hypothesis test with:

H0: the two separate lines model is NOT statistically significantly better than the parallel lines model.

Ha: the two separate lines model is statistically significantly better than the parallel lines model.

Notice that the parallel lines model is a special case of the separate lines model (the models are nested), so we can use the F test for the interaction term in the anova table to decide which one we need:

anova(fit.prod)
## Analysis of Variance Table
## 
## Response: ES.H
##              Df Sum Sq Mean Sq F value    Pr(>F)
## Yrs.Serv      1 23.009  23.009 39.7826 1.043e-05
## Sex           1 31.905  31.905 55.1642 1.434e-06
## Yrs.Serv:Sex  1  1.251   1.251  2.1623    0.1608
## Residuals    16  9.254   0.578

gives a p-value of 0.1608 > 0.05, so we fail to reject the null hypothesis: the parallel lines model is just as good as the model with separate lines.

Prediction

Let's find 95% prediction intervals for female employees with 0, 1, 2, ..., 10 years of service, using the parallel lines model:

fit <- lm(ES.H~., data=esh2)
nw <- data.frame(Yrs.Serv=0:10, Sex=rep(0, 11))
round(predict(fit, nw, interval="prediction"), 2)
##     fit  lwr  upr
## 1  7.04 5.21 8.86
## 2  7.13 5.32 8.94
## 3  7.23 5.43 9.03
## 4  7.33 5.54 9.11
## 5  7.42 5.65 9.20
## 6  7.52 5.75 9.29
## 7  7.62 5.86 9.38
## 8  7.71 5.96 9.47
## 9  7.81 6.06 9.56
## 10 7.91 6.16 9.65
## 11 8.00 6.26 9.75
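
The corresponding intervals for male employees would simply set Sex to 1 in the new data; a sketch (output not shown):

nw.male <- data.frame(Yrs.Serv=0:10, Sex=rep(1, 11))
round(predict(fit, nw.male, interval="prediction"), 2)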

Lines and Interaction

Above we explained the problem of using categorical predictors in a regression model in terms of parallel lines vs. two separate lines. But in fact this is another example of the issue of interaction, or more generally of a relationship between the predictors. Parallel lines are ok if the categorical and the continuous predictors are essentially independent. Terms such as Yrs.Serv*SexCode are therefore also called interaction terms.

For your purposes in this class (and later when doing work such as this), simply remember to include product terms when you have categorical predictors. Then you can test whether those terms are really needed, and drop them if they are not; a schematic of this workflow is sketched below.
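
Schematically the workflow looks like this; a sketch on simulated data (the names dat, x, group and y are made up for illustration):

set.seed(1)
dat <- data.frame(x = rep(1:10, 2),
                  group = rep(c("A", "B"), each = 10))
dat$y <- 5 + 0.5*dat$x + 2*(dat$group=="B") + rnorm(20, 0, 0.5)
fit.add <- lm(y ~ x + group, data = dat)   # additive model: parallel lines
fit.int <- lm(y ~ x*group, data = dat)     # with interaction: separate lines
anova(fit.add, fit.int)                    # is the interaction term needed?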

Example: Sales of Shoes

The number of shoes sold by year and type.

kable.nice(head(shoesales))
Sales  Year  Type
 1539     1  Mens
12984     1  Kids
25809     1  Ladies
 5742     2  Mens
30058     2  Kids
34764     2  Ladies

Let’s have a look at the data:

ggplot(data=shoesales, aes(Year, Sales, color=Type)) +
  geom_point() +
  geom_smooth(method = "lm", se=FALSE)

We want to find a model for predicting Sales from Year and Type. Again Type is a categorical variable and so we need to code it. The most obvious thing to do would be to code:

  • Mens= 0
  • Kids= 1
  • Ladies = 2

but that is dangerous. Unlike the values of a categorical variable, numbers always have an order and a size. So by coding in this way we are saying that Mens comes before Kids. Worse, we are saying that the “distance” from Mens to Kids is the same as the “distance” from Kids to Ladies!

Whether this matters or not depends on the specific problem. There is, however, a way to include such a variable without introducing order or size:

d1 <- rep(0, nrow(shoesales))
d1[shoesales$Type=="Kids"] <- 1
d2 <- rep(0, nrow(shoesales))
d2[shoesales$Type=="Ladies"] <- 1

Notice that by knowing d1 and d2 we know exactly what the type is:

  • d1=0, d2=0 → Mens
  • d1=1, d2=0 → Kids
  • d1=0, d2=1 → Ladies

so we have not lost any information, but we have also not introduced any order or size!
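
As an aside, R's factor machinery would generate two dummy columns just like these automatically. A small sketch (not needed for what follows):

# model.matrix shows the columns R would create for Type; alphabetically Kids is
# the reference level, so the baseline differs from our hand coding (Mens), but
# the idea is exactly the same
head(model.matrix(~Type, data=shoesales))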

Now

df <- shoesales[, 1:2] 
df$d1 <- d1
df$d2 <- d2
fit <- lm(Sales~., data=df)
summary(fit)
## 
## Call:
## lm(formula = Sales ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12963.7  -3433.5   -469.7   3349.1  22146.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1429.1     1917.8   0.745    0.459
## Year          1551.6      115.4  13.440  < 2e-16
## d1           12774.5     1875.7   6.811 3.74e-09
## d2           26986.9     1875.7  14.388  < 2e-16
## 
## Residual standard error: 6361 on 65 degrees of freedom
## Multiple R-squared:  0.8565, Adjusted R-squared:  0.8498 
## F-statistic: 129.3 on 3 and 65 DF,  p-value: < 2.2e-16

This is of course an additive model, so again we should worry about interaction. Now we have two dummy variables, so we need to add two product terms, Year:d1 and Year:d2. (The product d1:d2 is always 0, since no shoe is both Kids and Ladies, so we leave it out.)

fit.prod <- lm(Sales~.^2-d1:d2, data=df)
summary(fit.prod)
## 
## Call:
## lm(formula = Sales ~ .^2 - d1:d2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11588.7  -3433.0   -256.7   2947.3  16121.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   7000.5     2443.6   2.865 0.005662
## Year          1087.3      178.2   6.101 7.14e-08
## d1            8862.2     3455.7   2.565 0.012728
## d2           14185.1     3455.7   4.105 0.000119
## Year:d1        326.0      252.0   1.294 0.200541
## Year:d2       1066.8      252.0   4.233 7.64e-05
## 
## Residual standard error: 5669 on 63 degrees of freedom
## Multiple R-squared:  0.8895, Adjusted R-squared:  0.8807 
## F-statistic: 101.4 on 5 and 63 DF,  p-value: < 2.2e-16

And again we can test whether the product terms are needed:

anova(fit.prod, fit)  
## Analysis of Variance Table
## 
## Model 1: Sales ~ (Year + d1 + d2)^2 - d1:d2
## Model 2: Sales ~ Year + d1 + d2
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)
## 1     63 2024940688                               
## 2     65 2629827256 -2 -604886567 9.4096 0.0002656

and we find that here the interaction is needed (p = 0.0003).
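
Since the interaction is needed, each type gets its own line. As a small sketch (not in the original), we can read the three fitted lines off the coefficients of fit.prod:

cf <- coef(fit.prod)
lines.by.type <- rbind(
  Mens   = c(cf["(Intercept)"],            cf["Year"]),
  Kids   = c(cf["(Intercept)"] + cf["d1"], cf["Year"] + cf["Year:d1"]),
  Ladies = c(cf["(Intercept)"] + cf["d2"], cf["Year"] + cf["Year:d2"])
)
colnames(lines.by.type) <- c("Intercept", "Slope")
round(lines.by.type, 1)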

Example: Headache and Pain Reliever

A pharmaceutical company set up an experiment in which patients with a common type of headache were treated with a new analgesic, or pain reliever. The analgesic was given to each patient in one of four dosage levels: 2, 5, 7 or 10 grams. Then the time until noticeable relief was recorded in minutes. In addition the sex (coded as Female=0 and Male=1) and the blood pressure of each patient were recorded. The blood pressure groups were formed by comparing each patient's diastolic and systolic pressure readings with historical data. Based on this comparison each patient was assigned to one of three groups: low (0.25), medium (0.5) or high (0.75), according to the respective quantiles of the historical data.

head(headache)
##   Time Dose Sex BP.Quan
## 1   35    2   0    0.25
## 2   43    2   0    0.50
## 3   55    2   0    0.75
## 4   47    2   1    0.25
## 5   43    2   1    0.50
## 6   57    2   1    0.75

Here Sex and BP.Quan are already coded. BP.Quan is an interesting case: although it is categorical, it does have an ordering and even a little bit of “size”.
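
As an aside, if we did want to treat BP.Quan as purely categorical we could simply turn it into a factor, in which case R would create two dummy variables for it behind the scenes. A minimal sketch (done on a copy, so nothing below is affected):

headache.alt <- headache
headache.alt$BP.Quan <- factor(headache.alt$BP.Quan)
round(coef(lm(Time ~ Dose + Sex + BP.Quan, data=headache.alt)), 2)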

We want to determine the optimal dosage for each patient, possibly depending on sex and blood pressure.

pushViewport(viewport(layout = grid.layout(1, 2)))
print(ggplot(data=headache, 
             aes(Dose, Time, color=factor(Sex))) +
        geom_point() +
        geom_smooth(method = "lm", se=FALSE) +
        theme(legend.position="none") +
        labs(title="Sex"),
  vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=headache, 
             aes(Dose, Time, color=factor(BP.Quan))) +
        geom_point() +
        geom_smooth(method = "lm", se=FALSE) +
        theme(legend.position="none") +
        labs(title="BP.Quan"),
  vp=viewport(layout.pos.row=1, layout.pos.col=2))      

Let’s start by fitting a linear model on Dose alone:

fit <- lm(Time~Dose, data=headache)
df <- data.frame(Residuals=resid(fit), 
            Fits = fitted(fit))
ggplot(data=df, aes(Fits, Residuals)) +
            geom_point() +
            geom_hline(yintercept = 0)

There is a bit of a pattern here, so let’s try a quadratic model. In this example we will eventually need the actual equations, so we won’t use poly:

headache$Dose2 <- headache$Dose^2
fit <- lm(Time~Dose+Dose2, data=headache)
df <- data.frame(Residuals=resid(fit), 
            Fits = fitted(fit))
ggplot(data=df, aes(Fits, Residuals)) +
            geom_point() +
            geom_hline(yintercept = 0)

and that looks better.
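
As an aside, a quick sketch of the poly version: it fits exactly the same curve, but its coefficients are for orthogonal polynomials and cannot be read off as a simple equation in Dose, which is why we avoid it here.

fit.poly <- lm(Time ~ poly(Dose, 2), data=headache)
round(coef(fit.poly), 3)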

Now we will include the other two variables. One interesting question is whether BP.Quan should be treated as quantitative or categorical (in which case we would turn it into two dummy variables, as in the aside above). The answer is not clear-cut, and we will treat it as quantitative here. So

fit <- lm(Time~(Dose+Sex+BP.Quan)^3+Dose2, data=headache)
pushViewport(viewport(layout = grid.layout(1, 2)))
df <- data.frame(Residuals=resid(fit), 
            Fits = fitted(fit))
print(ggplot(data=df, aes(sample=Residuals)) +
           geom_qq() + geom_qq_line(),
    vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(data=df, aes(Fits, Residuals)) +
            geom_point() +
            geom_hline(yintercept = 0),
    vp=viewport(layout.pos.row=1, layout.pos.col=2))        

summary(fit)
## 
## Call:
## lm(formula = Time ~ (Dose + Sex + BP.Quan)^3 + Dose2, data = headache)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9706 -2.4191 -0.1397  2.7665  7.5490 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)       41.2173     8.2720   4.983 0.000164
## Dose              -7.0353     1.8218  -3.862 0.001536
## Sex                4.3137    10.7536   0.401 0.693972
## BP.Quan           48.3235    14.0798   3.432 0.003705
## Dose2              0.5111     0.1184   4.316 0.000612
## Dose:Sex           1.0588     1.6120   0.657 0.521244
## Dose:BP.Quan      -7.4706     2.1106  -3.539 0.002973
## Sex:BP.Quan       -9.2941    19.9118  -0.467 0.647376
## Dose:Sex:BP.Quan  -0.1176     2.9849  -0.039 0.969080
## 
## Residual standard error: 4.351 on 15 degrees of freedom
## Multiple R-squared:  0.9418, Adjusted R-squared:  0.9108 
## F-statistic: 30.35 on 8 and 15 DF,  p-value: 6.567e-08

We see that the three-way interaction Dose:Sex:BP.Quan is not statistically significant (p=0.969), so we drop it:

fit <- lm(Time~(Dose+Sex+BP.Quan)^2+Dose2, data=headache)
summary(fit)
## 
## Call:
## lm(formula = Time ~ (Dose + Sex + BP.Quan)^2 + Dose2, data = headache)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.971 -2.449 -0.125  2.763  7.549 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   41.0408     6.7350   6.094 1.55e-05
## Dose          -7.0059     1.6092  -4.354 0.000493
## Sex            4.6667     5.7655   0.809 0.430149
## BP.Quan       48.6765    10.5207   4.627 0.000280
## Dose2          0.5111     0.1147   4.457 0.000397
## Dose:Sex       1.0000     0.5900   1.695 0.109448
## Dose:BP.Quan  -7.5294     1.4451  -5.210 8.59e-05
## Sex:BP.Quan  -10.0000     8.4265  -1.187 0.252656
## 
## Residual standard error: 4.213 on 16 degrees of freedom
## Multiple R-squared:  0.9418, Adjusted R-squared:  0.9164 
## F-statistic:    37 on 7 and 16 DF,  p-value: 1.022e-08

Again, two of the interactions (Dose:Sex and Sex:BP.Quan) are not significant, so we drop them as well:

fit <- lm(Time~.+ Dose:BP.Quan, data=headache)
summary(fit)
## 
## Call:
## lm(formula = Time ~ . + Dose:BP.Quan, data = headache)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4706 -2.1795  0.2083  2.7819  9.5490 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   40.5408     6.5253   6.213 7.31e-06
## Dose          -6.5059     1.6792  -3.874 0.001111
## Sex            5.6667     1.8258   3.104 0.006130
## BP.Quan       43.6765    10.2329   4.268 0.000463
## Dose2          0.5111     0.1217   4.199 0.000539
## Dose:BP.Quan  -7.5294     1.5340  -4.908 0.000113
## 
## Residual standard error: 4.472 on 18 degrees of freedom
## Multiple R-squared:  0.9262, Adjusted R-squared:  0.9058 
## F-statistic: 45.21 on 5 and 18 DF,  p-value: 1.437e-09

and now all terms are significant.
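
Written out with rounded coefficients, the fitted model is

\[ \widehat{\text{Time}} = 40.54 - 6.51 \text{ Dose} + 5.67 \text{ Sex} + 43.68 \text{ BP.Quan} + 0.51 \text{ Dose}^2 - 7.53 \text{ Dose} \cdot \text{BP.Quan} \]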

What does all of this look like?

x <- seq(2, 12, length=100)
y <- 0:1
z <- c(0.25, 0.5, 0.75)
xy <- expand.grid(x, y, z)
df <- data.frame(Dose=xy[, 1], Dose2=xy[, 1]^2,
                 Sex=xy[, 2], BP.Quan=xy[, 3])
df$Time <- predict(fit, df)
ggplot(data=df, aes(Dose, Time, color=factor(Sex))) +
  geom_line() +
  facet_wrap(~factor(BP.Quan)) +
  labs(color="Gender")
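
The optimal dosage is the dose at which a fitted curve reaches its minimum, that is the shortest time to relief. A small sketch (not part of the original code) that reads these minima off the prediction grid df created above:

# for each blood pressure group find the dose in the grid with the smallest
# predicted time; Sex does not matter here because there is no Dose:Sex term
sapply(c(0.25, 0.5, 0.75), function(bp) {
  tmp <- df[df$BP.Quan == bp & df$Sex == 0, ]
  tmp$Dose[which.min(tmp$Time)]
})

This gives doses of roughly 8, 10 and 12 for the low, medium and high blood pressure groups.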

So we can give the following advice:

  • the same dosage will work for men and women (there is no Dose:Sex term, so only the overall level differs between the sexes)
  • for people with low blood pressure give a dose of about 8
  • for people with medium blood pressure give a dose of about 10
  • for people with high blood pressure give a dose of about 12 (slightly beyond the largest dose used in the experiment)