Multiple Predictors

ANOVA

Case Study: Gasoline Type and Mileage

In an experiment to study gas mileage, four different blends of gasoline are tested in each of three makes of automobiles. The cars are driven a fixed distance to determine the mpg (miles per gallon). The experiment is repeated three times for each blend-automobile combination. (Taken from Lyman Ott)

Note that the interest here is in the different gasoline blends; automobile is a blocking variable, so this is a randomized block design.

Gasoline is recorded as numbers, but these are just codes for the different blends, so it is a categorical variable, or factor.

gasoline$Gasoline <- factor(gasoline$Gasoline,
                            levels = c(1, 3, 2, 4),
                            ordered = TRUE)
gasoline <- as.data.frame(gasoline)
attach(gasoline)
kable.nice(head(gasoline), do.row.names = FALSE)
MPG Gasoline Automobile
22.7 1 A
22.4 1 A
22.9 1 A
21.5 2 A
21.8 2 A
21.6 2 A

Here is an interesting calculation:

kable.nice(table(Gasoline, Automobile))
A B C
1 3 3 3
3 3 3 3
2 3 3 3
4 3 3 3

This shows us two things:

  1. we have repeated measurements (several observations per factor-level combination)

  2. we have a balanced design (the same number of repetitions in each factor-level combination)

This second feature used to be quite important because the calculations for a balanced design are much simpler. Nowadays, with fast computers, this is no longer an issue. There are still good reasons to aim for a balanced design when planning an experiment, though!
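If you want to check this programmatically, base R has the replications() function; a minimal sketch, assuming the gasoline data frame from above:

# number of replicates for each term in the model
replications(MPG ~ Gasoline * Automobile, data = gasoline)
# replications() returns a list (instead of a vector) if the design is unbalanced,
# so this is TRUE exactly when the design is balanced
!is.list(replications(MPG ~ Gasoline * Automobile, data = gasoline))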

ggplot(gasoline, aes(Gasoline, MPG)) +
  geom_boxplot()

ggplot(gasoline, aes(Automobile, MPG)) +
  geom_boxplot()

The boxplots suggest a difference between blends but not between automobiles.

The summary statistics are

means <- round(tapply(MPG, Gasoline, mean), 1)
sds <- round(tapply(MPG, Gasoline, sd), 2)
kable.nice(data.frame(Mean=means, 
                      Std=sds))
Mean Std
1 22.8 0.36
3 21.9 0.25
2 21.2 0.37
4 20.5 0.36

Interaction:

In the case of two or more predictors (factors) we also need to worry about relationships between them. This is called interaction.

One way to study this issue is as follows:

  • find the means of MPG for each factor-level combination
  • plot these points with one factor on the x axis
  • connect the points according to the other factor
ggplot(data = gasoline,
       aes(Automobile, MPG,
           colour = Gasoline, group=Gasoline)) +
      stat_summary(fun = mean, geom = "point") +
      stat_summary(fun = mean, geom = "line")

If there is no interaction, the line segments should be fairly parallel. That is the case here, so there is no indication of an interaction.
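The same kind of plot can also be drawn with the base R function interaction.plot; a minimal sketch, using the attached gasoline variables:

# mean MPG for each Gasoline blend, traced across the Automobiles
interaction.plot(x.factor = Automobile, trace.factor = Gasoline, response = MPG)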

We have repeated measurements (3 per factor-level combination), so we can also test for this:

aov(MPG~Gasoline*Automobile, data=gasoline)
## Call:
##    aov(formula = MPG ~ Gasoline * Automobile, data = gasoline)
## 
## Terms:
##                  Gasoline Automobile Gasoline:Automobile Residuals
## Sum of Squares  25.405278   0.526667            0.908889  2.246667
## Deg. of Freedom         3          2                   6        24
## 
## Residual standard error: 0.3059593
## Estimated effects may be unbalanced
  1. Parameters of interest: Interaction
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. Type I error probability \(\alpha\)=0.05
  5. Null hypothesis H0 : no interaction
  6. Alternative hypothesis Ha: some interaction
  7. p value = 0.1854
  8. 0.1854 > 0.05, there is no evidence of interaction.
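Note that the p-value in step 7 does not appear in the printed aov object above; it comes from the ANOVA table, which (for the same model) can be printed with summary():

# F tests and p-values for both factors and their interaction
summary(aov(MPG ~ Gasoline * Automobile, data = gasoline))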

So we will now proceed without the interaction term:

aov(MPG~Gasoline+Automobile, data=gasoline)
## Call:
##    aov(formula = MPG ~ Gasoline + Automobile, data = gasoline)
## 
## Terms:
##                  Gasoline Automobile Residuals
## Sum of Squares  25.405278   0.526667  3.155556
## Deg. of Freedom         3          2        30
## 
## Residual standard error: 0.3243227
## Estimated effects may be unbalanced

The assumptions are the same as those of one-way ANOVA; they are fine here.

Now let’s test for the factors:

Test for Factor Gasoline:

  1. Parameters of interest: means of gasoline groups
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. Type I error probability \(\alpha\) = 0.05
  5. Null hypothesis H0 : \(\mu_1 = \dots = \mu_4\) (Gasoline groups have the same means)
  6. Alternative hypothesis Ha: \(\mu_i \ne \mu_j\) for some \(i \ne j\) (some Gasoline groups have different means)
  7. p value=0.000
  8. 0.000<0.05, there is some evidence of differences in gasoline blends

A test for the factor Automobile is not really needed because it is a blocking variable.

Notice that with the interaction included the p-value for Automobile was 0.08, whereas without the interaction it is 0.1. One advantage of being able to fit an additive model is that it often makes the conclusions stronger.
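Again, the p-values quoted for Gasoline and Automobile come from the ANOVA table of the additive fit, which can be printed the same way:

summary(aov(MPG ~ Gasoline + Automobile, data = gasoline))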

Regression

Case Study: House Prices

Prices of residences located 30 miles south of a large metropolitan area, together with several possible predictor variables.

attach(houseprice)
kable.nice(houseprice, do.row.names = FALSE)
Price Sqfeet Floors Bedrooms Baths
69.0 1500.000 1 2 1.0
118.5 1880.952 1 2 2.0
104.0 1976.190 1 3 2.0
116.5 1880.952 1 3 2.0
121.5 1880.952 1 3 2.0
125.0 1976.190 1 3 2.0
128.0 2357.143 2 3 2.5
129.9 2166.667 1 3 1.7
133.0 2166.667 2 3 2.5
135.0 2166.667 2 3 2.5
137.5 2357.143 2 3 2.5
139.9 2166.667 1 3 2.0
143.9 2261.905 2 3 2.5
147.9 2547.619 2 3 2.5
154.9 2357.143 2 3 2.5
160.0 2738.095 2 3 2.0
169.0 2357.143 1 3 2.0
169.9 2642.857 1 3 2.0
125.0 2166.667 1 4 2.0
134.9 2166.667 1 4 2.0
139.9 2547.619 1 4 2.0
147.0 2642.857 1 4 2.0
159.0 2261.905 1 4 2.0
169.9 2547.619 2 4 3.0
178.9 2738.095 1 4 2.0
194.5 2833.333 2 4 3.0
219.9 2928.571 1 4 2.5
269.0 3309.524 2 4 3.0

Let’s go through the list of predictors one by one:

ggplot(houseprice, aes(Sqfeet, Price)) +
  geom_point() +
  geom_smooth(method = "lm", se=FALSE)

cor(Price, Sqfeet)
## [1] 0.9152079
ggplot(houseprice, aes(factor(Floors), Price)) +
  geom_boxplot() +
  labs(x="Floors")

cor(Price, Floors)
## [1] 0.2910762

Note that we used a boxplot here although Floors is a quantitative predictor. If the predictor takes only a few distinct values (2 here!), this is often the better choice.

ggplot(houseprice, aes(factor(Bedrooms), Price)) +
  geom_boxplot() +
  labs(x="Bedrooms")

cor(Price, Bedrooms)
## [1] 0.6045036
ggplot(houseprice, aes(factor(Baths), Price)) +
  geom_boxplot() +
 labs(x="Baths")

cor(Price, Baths)
## [1] 0.6525626
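Instead of finding these correlations one at a time, they can also be computed in a single call (a small sketch, using the houseprice data frame):

# correlation of Price with each of the four predictors
round(cor(houseprice)["Price", -1], 3)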

Now to run the regression:

fit <- lm(Price~., data=houseprice) 
summary(fit)
## 
## Call:
## lm(formula = Price ~ ., data = houseprice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.018  -5.943   1.860   5.947  30.955 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67.61984   17.70818  -3.819 0.000882
## Sqfeet        0.08571    0.01076   7.966 4.62e-08
## Floors      -26.49306    9.48952  -2.792 0.010363
## Bedrooms     -9.28622    6.82985  -1.360 0.187121
## Baths        37.38067   12.26436   3.048 0.005709
## 
## Residual standard error: 13.71 on 23 degrees of freedom
## Multiple R-squared:  0.8862, Adjusted R-squared:  0.8665 
## F-statistic:  44.8 on 4 and 23 DF,  p-value: 1.558e-10

Notice that there is something very strange about this model: the coefficients of Floors and Bedrooms are negative, even though both variables have a positive correlation with Price!

round(cor(houseprice[,-1]), 3)
##          Sqfeet Floors Bedrooms Baths
## Sqfeet    1.000  0.370    0.652 0.628
## Floors    0.370  1.000   -0.018 0.743
## Bedrooms  0.652 -0.018    1.000 0.415
## Baths     0.628  0.743    0.415 1.000

The highest correlation between predictors is r=0.743 (Floors-Baths)

Variable Selection

Generally, the fewer predictors we have in a model, the easier it is to understand and use, so

can we eliminate any of our predictors without making the model (stat. signif.) worse?

There are several things one can think of:

Choose based on R2

but we already know this will always lead to the model with all predictors, for the same reason that a cubic model always has an R2 at least as high as the quadratic model.

Note:

Price by Sqfeet, Floors and Bedrooms: R2=80.1%
Price by Floors, Bedrooms and Baths: R2=68.4%
Price by Sqfeet, Bedrooms and Baths: R2=83.5%
Price by Sqfeet, Floors, Bedrooms and Baths: R2=88.2%

so the model with all 4 has a higher R2 than any of the models with just 3,

but this will always be so, even if one of the predictors is completely useless.
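These R2 values can be reproduced by fitting each submodel and extracting R2; a short sketch (the helper function r2 is just for illustration):

r2 <- function(frm) round(100*summary(lm(frm, data = houseprice))$r.squared, 1)
r2(Price ~ Sqfeet + Floors + Bedrooms)
r2(Price ~ Floors + Bedrooms + Baths)
r2(Price ~ Sqfeet + Bedrooms + Baths)
r2(Price ~ Sqfeet + Floors + Bedrooms + Baths)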

Choose based on Hypothesis Tests

In the output of the fit object above we see that the p-value for Bedrooms is 0.187121 > 0.05, so we eliminate Bedrooms.

This sounds like a good idea AND IT IS WIDELY USED IN REAL LIFE, but it turns out to be a bad one! The reason why is a bit hard to explain, though.

What we need is a new idea:

Best Subset Regression and Mallows' Cp

We will find ALL possible models and calculate Mallows' Cp statistic for each. The model with the lowest Cp is best.
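For a model with \(p\) coefficients (including the intercept), one common form of the statistic is

\[ C_p = \frac{SSE_p}{\hat{\sigma}^2} - n + 2p \]

where \(SSE_p\) is the error sum of squares of the submodel and \(\hat{\sigma}^2\) is the mean squared error of the full model. By construction the full model has \(C_p = p\), which is why the model with all four predictors has \(C_p = 5\) in the output below.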

library(leaps)
leaps(houseprice[, -1], Price, method="Cp", nbest=1)
## $which
##      1     2     3     4
## 1 TRUE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE  TRUE
## 3 TRUE  TRUE FALSE  TRUE
## 4 TRUE  TRUE  TRUE  TRUE
## 
## $label
## [1] "(Intercept)" "1"           "2"           "3"           "4"          
## 
## $size
## [1] 2 3 4 5
## 
## $Cp
## [1] 8.834171 8.812489 4.848657 5.000000

So the best model has Cp=4.85 and uses Sqfeet, Floors and Baths.
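To see the variable names directly in this output, the columns of $which can be labeled (a small sketch, reusing the call above):

out <- leaps(houseprice[, -1], Price, method = "Cp", nbest = 1)
# the columns of $which are in the same order as the columns of houseprice[, -1]
colnames(out$which) <- colnames(houseprice)[-1]
out$which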

To find the model we rerun lm, now without Bedrooms:

fit <- lm(Price~. - Bedrooms, data=houseprice) 
summary(fit)
## 
## Call:
## lm(formula = Price ~ . - Bedrooms, data = houseprice)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.939  -7.114   2.115   5.660  33.351 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -73.92954   17.38881  -4.252 0.000279
## Sqfeet        0.07772    0.00917   8.475 1.12e-08
## Floors      -19.67975    8.19979  -2.400 0.024508
## Baths        30.57922   11.39408   2.684 0.012980
## 
## Residual standard error: 13.95 on 24 degrees of freedom
## Multiple R-squared:  0.8771, Adjusted R-squared:  0.8617 
## F-statistic: 57.09 on 3 and 24 DF,  p-value: 4.505e-11