Two Categorical Predictors - One Quantitative Response

Case Study: Testing Hearing Aids

Reference: Loven, Faith. (1981). A Study of the Interlist Equivalency of the CID W-22 Word List Presented in Quiet and in Noise. Unpublished MS Thesis, University of Iowa.

Description: Percent of a Standard 50-word list heard correctly in the presence of background noise. 24 subjects with normal hearing listened to standard audiology tapes of English words at low volume with a noisy background. They repeated the words and were scored correct or incorrect in their perception of the words. The order of list presentation was randomized.

The word lists are standard audiology tools for assessing hearing. They are calibrated to be equally difficult to perceive. However, the original calibration was performed with normal-hearing subjects and no noise background. The experimenter wished to determine whether the lists were still equally difficult to understand in the presence of a noisy background.

head(hearingaid)
##   Subject List Score
## 1       1    1    28
## 2       2    1    24
## 3       3    1    32
## 4       4    1    30
## 5       5    1    34
## 6       6    1    30

Notice that the values in both Subject and List are NOT numbers but labels, so both of them are categorical!

As long as we have one quantitative response (Score) and all the predictors (factors) are categorical (Subject, List), this is still an ANOVA problem, now called a twoway ANOVA. More specifically, this is a randomized block design with List as the factor and Subject as a blocking variable.
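The twoway routine used below is part of the course package; it presumably wraps base R's aov. As a sketch (with simulated scores standing in for the hearingaid data), a randomized block fit looks like this:

```r
# Sketch: randomized block ANOVA with base R's aov(); the data here are
# simulated stand-ins for the hearingaid set, not the real scores.
set.seed(1)
d <- expand.grid(Subject = factor(1:24), List = factor(1:4))
subj_eff <- rnorm(24, sd = 5)                  # each subject hears differently
list_eff <- c(3, 0, -4, -4)                    # hypothetical list effects
d$Score <- 30 + subj_eff[as.integer(d$Subject)] +
  list_eff[as.integer(d$List)] + rnorm(nrow(d), sd = 2)
fit <- aov(Score ~ List + Subject, data = d)   # block enters as an additive term
summary(fit)                                   # F tests for List and for Subject
```

With only one observation per Subject-List cell there is no replication, which is why an interaction term cannot be fit, just as the twoway output below says.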

attach(hearingaid) 
bplot(Score, List, new_order = "Size")

bplot(Score, Subject, new_order = "Size")

The summary statistics are:

stat.table(Score, List, Sort = TRUE) 
##   Sample Size Mean Standard Deviation
## 3          24 25.2                8.3
## 4          24 25.6                7.8
## 2          24 29.7                8.1
## 1          24 32.8                7.4

Because Subject is the blocking variable, one would normally not include a table of summary statistics for it.

Now for the test, or rather tests, because in general we can test for both Subject and List. The routine we will use is called twoway:

twoway(Score, List, Subject) 

## No repeated measurement! Interaction term can not be included 
## List  p = 0.000 
## Subject  p = 0.000

So we have two tests, one for List and one for Subject. However, only the one for List is of interest:

  1. Parameters of interest: List group means
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. Type I error probability \(\alpha=0.05\)
  5. Null hypothesis H0: \(\mu_1 = \dots = \mu_4\) (List groups have the same means)
  6. Alternative hypothesis Ha: \(\mu_i \ne \mu_j\) (at least two List groups have different means)
  7. p value = 0.000
  8. 0.000 < 0.05, so there is some evidence that the group means are not all the same, that is, the List means differ.

As always we need to check the assumptions. The normal plot of residuals looks fine.

Next the equal variance. In a oneway ANOVA we could just find the group standard deviations and compare them. Now (and in general whenever there is more than one factor) this is no longer a good idea, mainly because there are too many factor-level combinations (4×24 = 96 here) and not enough observations for each (one here). Instead we will do the same as in the regression case, namely check the residuals vs. fits plot for a change in spread from left to right.

Again, everything looks fine.
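For any fitted aov object the two diagnostic plots just described can be produced directly; a minimal sketch on toy data (the plotting commands are base R, the data are made up):

```r
# Assumption checks for an ANOVA fit (toy data, base R graphics):
set.seed(10)
d <- data.frame(y = rnorm(40), g = factor(rep(1:4, each = 10)))
fit <- aov(y ~ g, data = d)
qqnorm(residuals(fit)); qqline(residuals(fit))   # normal plot of residuals
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals") # look for changing spread
abline(h = 0, lty = 2)
```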

Notice that the ANOVA table also has the test for the Subject means. This is not very interesting: the boxplot already makes it clear that different subjects have very different hearing abilities. If that were not so, we would eliminate Subject and run a oneway ANOVA. Because we now have two factors, we need to worry about an additional problem, namely whether or not there is a relationship between the two factors. This is called

Interaction

To do so we will check the interaction plot

An interaction plot looks as follows:

Here the line segments are almost parallel. This implies that for any value of factor A, going from one value of B to the next adds the same amount to the response. So if we go from B=1 to B=2 both lines move up by about 2.0, and if we go from B=2 to B=3 both lines move down by about 0.75.

Because of this we call such a model additive

Now consider the following interaction plot:

Here as we go from B=2 to B=3 the line goes up by 4 if A=1, but it goes down by 0.5 if A=2.

Here is another way of understanding the difference: Say you are told that you have an additive model and the following information:

           Factor 1
Factor 2   Low   High
In         2.3    3.1
Out        2.7      ?

Can we make a guess for the response if Factor 1 = “high” and Factor 2 = “out”? We see that with Factor 1 = “low”, going from “in” to “out” raises the response by 0.4 (= 2.7 − 2.3). In an additive model the response must rise by the same amount at Factor 1 = “high”, so the missing value should be 3.5 (= 3.1 + 0.4).

But if there were interaction there would be no way to make any guess at all!
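The fill-in computation above can be written out as a small sketch:

```r
# The additive fill-in: under additivity the in-to-out change at "Low"
# must equal the in-to-out change at "High".
tab <- matrix(c(2.3, 3.1,
                2.7,  NA), nrow = 2, byrow = TRUE,
              dimnames = list(c("In", "Out"), c("Low", "High")))
tab["Out", "High"] <- tab["In", "High"] + (tab["Out", "Low"] - tab["In", "Low"])
tab["Out", "High"]  # 3.5
```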

Deciding from the graph whether or not there is interaction is not always easy. Here are four interaction plots from a simulated data set, all guaranteed NOT to have any interaction:

Matters are made worse because in ANOVA problems we often have very small data sets, so these graphs vary a great deal from sample to sample.

So it would be nice if we could actually test for interaction, but that requires repeated measurements.

In the hearing aid data we only have one observation for each combination of Subject and List, so we need to decide on the basis of the interaction plot:

iplot(Score, List, Subject )

There seems to be interaction between Lists and Subjects.

Finally, it would of course be interesting to study just which lists are different, that is we could do a multiple comparison:

tukey(Score, List, Subject, which="first")
## No repeated measurement!
## Interaction term can not be included
## 
## Groups that are statistically significantly different:
##   Groups p.value
## 1    1-3       0
## 2    1-4       0

So List 1 is statistically significantly different from Lists 3 and 4.

No other differences are statistically significant.

Because Subject is only a blocking variable we would not normally do a multiple comparison for it. If we wanted to, we would use the command:

tukey(Score, List, Subject, which="second")
## No repeated measurement!
## Interaction term can not be included
## 
## Groups that are statistically significantly different:
##    Groups p.value
## 1   15-21  0.0394
## 2   14-21  0.0063
## 3    9-21  0.0042
## 4   19-21  0.0012
## 5   24-21  0.0000
## 6   14-18  0.0394
## 7    9-18  0.0279
## 8   19-18  0.0093
## 9   24-18  0.0019
## 10   19-4  0.0195
## 11   24-4  0.0042
## 12  19-20  0.0394
## 13  24-20  0.0093
## 14  24-22  0.0135

Case Study: Gasoline Type and Mileage

In an experiment to study gas mileage, four different blends of gasoline are tested in each of three makes of automobiles. The cars are driven a fixed distance to determine the MPG (miles per gallon). The experiment is repeated three times for each blend-automobile combination. (Taken from Lyman Ott)

Note that the interest here is in different gasoline blends; Automobile is a blocking variable, so this is a randomized block design.

Gasoline is recorded as numbers, but these are just codes for the different blends, so it is a categorical variable, or factor.

attach(gasoline)
head(gasoline)
##    MPG Gasoline Automobile
## 1 22.7        1          A
## 2 22.4        1          A
## 3 22.9        1          A
## 4 21.5        2          A
## 5 21.8        2          A
## 6 21.6        2          A

Here is an interesting calculation:

table(Gasoline, Automobile)
##         Automobile
## Gasoline A B C
##        1 3 3 3
##        2 3 3 3
##        3 3 3 3
##        4 3 3 3

This shows us two things:

  1. we have repeated measurements (several observations per factor-level combination)

  2. we have a balanced design (the same number of repetitions in each factor-level combination)

This second feature used to be quite important because the calculations in a balanced design are much simpler. Nowadays with fast computers this is not important anymore. There are still good reasons why you want to design your experiment to have a balanced design if possible, though!
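Base R's replications() checks balance directly; a sketch with a mocked-up layout standing in for the gasoline data:

```r
# replications() reports the number of observations per term; equal counts
# everywhere means the design is balanced (layout mocked up here).
d <- expand.grid(Gasoline = factor(1:4),
                 Automobile = factor(c("A", "B", "C")),
                 run = 1:3)
replications(~ Gasoline * Automobile, data = d)
```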

bplot(MPG, Gasoline, new_order = "Size")

bplot(MPG, Automobile, new_order = "Size")

The boxplots suggest a difference between blends but not between automobiles.

The summary statistics are

stat.table(MPG, Gasoline, Sort = TRUE)
##   Sample Size Mean Standard Deviation
## 4           9 20.5                0.4
## 2           9 21.2                0.4
## 3           9 21.9                0.2
## 1           9 22.8                0.4
stat.table(MPG, Automobile, Sort = TRUE)
##   Sample Size Mean Standard Deviation
## C          12 21.4                0.9
## B          12 21.7                1.0
## A          12 21.7                0.8

Interaction:

iplot(MPG, Gasoline, Automobile)

Lines are (almost) parallel, so there is no indication of interaction. We have repeated measurements (3 per factor-level combination), so we can test for this:

twoway(MPG, Gasoline, Automobile)

## Gasoline  p = 0.000 
## Automobile  p = 0.0799 
## Interaction p = 0.1854
  1. Parameters of interest: Interaction
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. Type I error probability \(\alpha\)=0.05
  5. Null hypothesis H0 : no interaction
  6. Alternative hypothesis Ha: some interaction
  7. p value = 0.1854
  8. 0.1854 > 0.05, there is no evidence of interaction.
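In base R the same interaction test comes from the Gasoline:Automobile row of an aov fit with a `*` in the formula; a sketch on simulated mileage numbers standing in for the real data:

```r
# Interaction F test with base R: "*" expands to main effects + interaction.
# MPG values simulated as stand-ins for the real gasoline data.
set.seed(2)
d <- expand.grid(Gasoline = factor(1:4),
                 Automobile = factor(c("A", "B", "C")),
                 run = 1:3)
blend_means <- c(22.8, 21.2, 21.9, 20.5)
d$MPG <- blend_means[as.integer(d$Gasoline)] + rnorm(nrow(d), sd = 0.3)
fit <- aov(MPG ~ Gasoline * Automobile, data = d)
summary(fit)  # rows: Gasoline, Automobile, Gasoline:Automobile, Residuals
```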

So we will now proceed without the interaction term:

twoway(MPG, Gasoline, Automobile, with.interaction=FALSE)

## Gasoline  p = 0.000 
## Automobile  p = 0.0987

The plots look fine, so there is no problem with the assumptions.

Now let’s test for the factors:

Test for Factor Gasoline:

  1. Parameters of interest: means of gasoline groups
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. Type I error probability \(\alpha\) = 0.05
  5. Null hypothesis H0 : \(\mu_1 = \dots = \mu_4\) (Gasoline groups have the same means)
  6. Alternative hypothesis Ha: \(\mu_i \ne \mu_j\) (at least two Gasoline groups have different means)
  7. p value=0.000
  8. 0.000<0.05, there is some evidence of differences in gasoline blends

Test for Factor Automobile is not really needed because this is a blocking variable.

Notice that with the interaction included the p-value for Automobile was 0.0799; without it the p-value is 0.0987. One advantage of being able to fit an additive model is that it often makes the conclusions stronger.

Because automobile is not significant and there is no interaction, we can drop automobile from the analysis and treat this as a oneway ANOVA problem:

tukey(MPG, Gasoline)
## Groups that are statistically significantly different:
##   Groups p.value
## 1    2-4       0
## 2    3-4       0
## 3    1-4       0
## 4    3-2       0
## 5    1-2       0
## 6    1-3       0

So all blends are statistically significantly different from one another, with blend 1 having the highest miles per gallon.
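The tukey command presumably corresponds to base R's TukeyHSD; a sketch of the oneway version on simulated blend data:

```r
# Pairwise Tukey comparisons with base R (simulated stand-in data).
set.seed(3)
d <- data.frame(Gasoline = factor(rep(1:4, each = 9)),
                MPG = rep(c(22.8, 21.2, 21.9, 20.5), each = 9) +
                  rnorm(36, sd = 0.3))
fit <- aov(MPG ~ Gasoline, data = d)
TukeyHSD(fit)  # all 6 pairwise differences with adjusted p values
```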

Case Study: Film Thickness in Semiconductor Production

Chemical vapor deposition is a process used in the semiconductor industry to deposit thin films of silicon dioxide and photoresist on substrates of wafers as they are manufactured. The films must be as thin as possible and have a uniform thickness, which is measured by a process called infrared interference. A process engineer wants to evaluate a low-pressure chemical vapor deposition process that reduces costs and increases productivity. The engineer has set up an experiment to study the effect of chamber temperature and pressure on film thickness.

attach(filmcoatings)
filmcoatings
##    Thickness Temperature Pressure
## 1         42         Low      Low
## 2         43         Low      Low
## 3         39         Low      Low
## 4         45         Low      Mid
## 5         43         Low      Mid
## 6         45         Low      Mid
## 7         45         Low     High
## 8         44         Low     High
## 9         47         Low     High
## 10        36         Mid      Low
## 11        34         Mid      Low
## 12        37         Mid      Low
## 13        39         Mid      Mid
## 14        39         Mid      Mid
## 15        37         Mid      Mid
## 16        40         Mid     High
## 17        42         Mid     High
## 18        38         Mid     High
## 19        38        High      Low
## 20        37        High      Low
## 21        37        High      Low
## 22        35        High      Mid
## 23        36        High      Mid
## 24        33        High      Mid
## 25        40        High     High
## 26        41        High     High
## 27        42        High     High
table(Temperature, Pressure)
##            Pressure
## Temperature High Low Mid
##        High    3   3   3
##        Low     3   3   3
##        Mid     3   3   3

So again we have a balanced design with repeated measurements.

bplot(Thickness, Temperature)

Notice that the order of the boxes is strange: High-Low-Mid. This is because R uses alphabetical ordering unless told otherwise. Let’s change that:

Temperature <- change.order(Temperature, c("Low","Mid","High"))
Pressure <- change.order(Pressure, c("Low","Mid","High"))
bplot(Thickness, Temperature)

bplot(Thickness, Pressure)
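change.order is a course routine; in base R the same reordering is done by resetting the factor levels explicitly:

```r
# Base R equivalent of change.order: set the factor levels in the wanted order.
Temperature <- c("Low", "Mid", "High", "Low", "High")        # toy values
Temperature <- factor(Temperature, levels = c("Low", "Mid", "High"))
levels(Temperature)   # now "Low" "Mid" "High", not alphabetical
```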

Unlike in the hearing aid or gasoline experiments, here we are equally interested in both factors. This type of experiment is called a factorial design.

For us there is no practical difference between a randomized block design and a factorial design but the distinction can be important in other analyses.

stat.table(Thickness, Temperature)
##      Sample Size Mean Standard Deviation
## Low            9 43.7                2.3
## Mid            9 38.0                2.3
## High           9 37.7                2.9
stat.table(Thickness, Pressure)
##      Sample Size Mean Standard Deviation
## Low            9 38.1                2.8
## Mid            9 39.1                4.4
## High           9 42.1                2.8

Interaction

iplot(Thickness, Temperature, Pressure)

The lines are not all parallel, so there is likely some interaction. Again we have repeated measurements (3 per factor-level combination), so we can actually test for this:

twoway(Thickness, Temperature, Pressure)

## Temperature  p = 0.000 
## Pressure  p = 0.000 
## Interaction p = 0.0124

The graphs show that there are no problems with the assumptions.

Test for Interaction:

  1. Parameters of interest: Interaction
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. Type I error probability \(\alpha = 0.05\)
  5. Null hypothesis H0 : no interaction
  6. Alternative hypothesis Ha: some interaction
  7. p value = 0.0124
  8. 0.0124<0.05, there is some evidence of interaction

Test for Factor Temperature:

  1. Parameters of interest: means of temperature groups
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. Type I error probability \(\alpha = 0.05\)
  5. Null hypothesis H0 : \(\mu_1 = \mu_2 = \mu_3\) (Temperature groups have the same means)
  6. Alternative hypothesis Ha: \(\mu_i \ne \mu_j\) (Temperature groups have different means)
  7. p value = 0.000
  8. 0.000 < 0.05, there is some evidence of differences in temperature

Test for Factor Pressure:

  1. Parameters of interest: means of pressure groups
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. Type I error probability \(\alpha = 0.05\)
  5. Null hypothesis H0 : \(\mu_1 = \mu_2 = \mu_3\) (Pressure groups have the same means)
  6. Alternative hypothesis Ha: \(\mu_i \ne \mu_j\) (Pressure groups have different means)
  7. p value = 0.000
  8. 0.000<0.05, there is some evidence of differences in pressure

Finally, we need to find the best combination of pressure and temperature. So what we want is a multiple comparison for Temperature and Pressure combined (not either of them alone!). Easily done:

tukey(Thickness, Temperature, Pressure, which="interaction")
## Groups that are statistically significantly different:
##                Groups p.value
## 1    Low:Mid-High:Low  0.0000
## 2   Low:High-High:Low  0.0000
## 3     Low:Low-Mid:Low  0.0041
## 4     Low:Mid-Mid:Low  0.0000
## 5   High:High-Mid:Low  0.0072
## 6    Mid:High-Mid:Low  0.0397
## 7    Low:High-Mid:Low  0.0000
## 8    High:Mid-Low:Low  0.0000
## 9    Low:Mid-High:Mid  0.0000
## 10 High:High-High:Mid  0.0013
## 11  Mid:High-High:Mid  0.0072
## 12  Low:High-High:Mid  0.0000
## 13    Low:Mid-Mid:Mid  0.0023
## 14   Low:High-Mid:Mid  0.0000
## 15   Mid:High-Low:Mid  0.0397
## 16 Low:High-High:High  0.0397
## 17  Low:High-Mid:High  0.0072

This is a bit hard to read. In the past we have used the tapply command to sort the groups by their means. We want to do the same here, but first we need to make a new variable that combines the Temperature and the Pressure:

TP <- paste0(Temperature,"-",Pressure)
TP
##  [1] "Low-Low"   "Low-Low"   "Low-Low"   "Low-Mid"   "Low-Mid"  
##  [6] "Low-Mid"   "Low-High"  "Low-High"  "Low-High"  "Mid-Low"  
## [11] "Mid-Low"   "Mid-Low"   "Mid-Mid"   "Mid-Mid"   "Mid-Mid"  
## [16] "Mid-High"  "Mid-High"  "Mid-High"  "High-Low"  "High-Low" 
## [21] "High-Low"  "High-Mid"  "High-Mid"  "High-Mid"  "High-High"
## [26] "High-High" "High-High"
sort(round(tapply(Thickness,TP,mean), 1))
##  High-Mid   Mid-Low  High-Low   Mid-Mid  Mid-High High-High   Low-Low 
##      34.7      35.7      37.3      38.3      40.0      41.0      41.3 
##   Low-Mid  Low-High 
##      44.3      45.3

So we see that the combination Temperature=High, Pressure=Mid gives the thinnest film and is therefore best.

But tukey tells us that it is not statistically significantly better than any of the next three combinations (Mid-Low, High-Low or Mid-Mid), at least not at these sample sizes.

Remember that we made some new variables. If we are sure we won’t need them anymore we should remove them:

rm(Temperature)
rm(Pressure) 
rm(TP)

A simple idea for solving this problem seems to be the following:

  1. find the best temperature:
sort(round(tapply(Thickness, Temperature, mean), 1))
## High  Mid  Low 
## 37.7 38.0 43.7

so Temperature=High is best

  2. find the best pressure:
sort(round(tapply(Thickness, Pressure, mean), 1))
##  Low  Mid High 
## 38.1 39.1 42.1

so Pressure=Low is best

  3. take the combination: Pressure=Low, Temperature=High should be best! Except it is not: we saw before that Pressure=Mid, Temperature=High is best.

This simple idea does not work because of the presence of interaction.
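A tiny made-up table of cell means shows the failure mode (numbers invented, higher response taken as better):

```r
# Marginally best levels need not combine to the best cell under interaction.
m <- matrix(c(10, 28,
              27,  5), nrow = 2, byrow = TRUE,
            dimnames = list(A = c("a1", "a2"), B = c("b1", "b2")))
rowMeans(m)    # a1 has the better row mean (19 vs 16)
colMeans(m)    # b1 has the better column mean (18.5 vs 16.5)
m["a1", "b1"]  # yet the a1,b1 cell is only 10 ...
max(m)         # ... while the actual best cell, a1,b2, is 28
```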

Case Study: Water Quality and Mining

The effects of mining and rock type on water quality.

attach(mines)
head(mines)
##        Rock    Mine Iron
## 1 Sandstone Unmined 0.20
## 2 Sandstone Unmined 0.25
## 3 Sandstone Unmined 0.04
## 4 Sandstone Unmined 0.06
## 5 Sandstone Unmined 1.20
## 6 Sandstone Unmined 0.30
table(Rock, Mine)
##            Mine
## Rock        Abandoned Reclaimed Unmined
##   Limestone        13        13      13
##   Sandstone        13        13      13
bplot(Iron, Rock, new_order = "Size")

bplot(Iron, Mine)

We have a clear problem with the normality assumption, so we use a log transform:

bplot(log(Iron), Rock, new_order = "Size")

bplot(log(Iron), Mine)

This has solved the problem, so the analysis will be based on log(Iron).
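Why a log transform helps can be sketched with simulated right-skewed data (lognormal, so the log is exactly normal):

```r
# A log transform symmetrizes right-skewed data (simulated illustration).
set.seed(4)
x <- rlnorm(200, meanlog = 0, sdlog = 1)       # strongly right-skewed
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
c(raw = skew(x), logged = skew(log(x)))        # raw skew large, log skew near 0
```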

Summary Statistics

Because we are using a transformation, we will base the tables on the median and IQR:

stat.table(Iron, Rock, Mean=FALSE, Sort = TRUE)
##           Sample Size Median IQR
## Sandstone          39    0.4 1.4
## Limestone          39    1.3 3.6
stat.table(Iron, Mine, Mean=FALSE)
##           Sample Size Median  IQR
## Unmined            26    0.5  2.7
## Reclaimed          26    0.7  0.8
## Abandoned          26    1.6 10.2

Note that the IQRs are very different. This is because the data set has a lot of outliers, which still affect the IQR.

Interaction

iplot(log(Iron), Rock, Mine)

There seems to be some interaction. To confirm this we test for it:

twoway(log(Iron), Rock, Mine)

## Rock  p = 0.0348 
## Mine  p = 0.000 
## Interaction p = 0.000

The assumptions are ok (after the log transform!).

Test for Interaction:

  1. Parameters of interest: Interaction
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. Type I error probability \(\alpha = 0.05\)
  5. Null hypothesis H0 : no interaction
  6. Alternative hypothesis Ha: some interaction
  7. p value = 0.000
  8. 0.000<0.05, there is some evidence of interaction.

Check the assumptions of ANOVA: both plots look ok.

Test for Factor Rock:

  1. Parameters of interest: means of Rock groups
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. Type I error probability \(\alpha = 0.05\)
  5. Null hypothesis H0 : \(\mu_1 = \mu_2\) (Rock groups have the same means)
  6. Alternative hypothesis Ha: \(\mu_1 \ne \mu_2\) (Rock groups have different means)
  7. p value = 0.035
  8. 0.035<0.05, there is some evidence of differences in Rock types.

Test for Factor Mine:

  1. Parameters of interest: means of Mine groups
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. Type I error probability \(\alpha = 0.05\)
  5. Null hypothesis H0 : \(\mu_1 = \mu_2 = \mu_3\) (Mine groups have the same means)
  6. Alternative hypothesis Ha: \(\mu_i \ne \mu_j\) (Mine groups have different means)
  7. p value = 0.000
  8. 0.000<0.05, there is some evidence of differences in Mine types

Multiple Comparison

The main interest is in the mines, so:

tukey(log(Iron), Rock, Mine, which="second")
## Groups that are statistically significantly different:
##                Groups p.value
## 1 Abandoned-Reclaimed       0
## 2   Abandoned-Unmined       0

Interpretation: There is a statistically significant difference between the mean iron content of abandoned mines and the others. The difference between unmined and reclaimed mines is not statistically significant, at least not at these sample sizes.