Non-Normal Residuals, No Equal Variance - Transformations

Categorical - Quantitative

Case Study: Cancer Survival

As we saw before, the boxplot of this data shows some severe outliers:

attach(cancersurvival)
head(cancersurvival)
##   Survival  Cancer
## 1      124 Stomach
## 2       42 Stomach
## 3       25 Stomach
## 4       45 Stomach
## 5      412 Stomach
## 6       51 Stomach
bplot(Survival, Cancer)

These are often an indication that there is a problem with the assumption of normally distributed residuals. In fact, when we run the ANOVA and check the normal plot, we can see that this is the case:

oneway(Survival,Cancer)

## p value of test of equal means: p = 0.000 
## Smallest sd:  209.9    Largest sd : 1239

So, what can we do? One possible solution is to use a log transformation:

bplot(log(Survival), Cancer)

This takes care of most of the outliers.

Outliers often have another effect:

stat.table(Survival, Cancer) 
##          Sample Size   Mean Standard Deviation
## Stomach           13  286.0              346.3
## Bronchus          17  211.6              209.9
## Colon             17  457.4              427.2
## Ovary              6  884.3             1098.6
## Breast            11 1395.9             1239.0

This shows that we also have a problem with the equal variance assumption: smallest stdev = 210, largest stdev = 1239, and 3*210 = 630 < 1239.
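The rule of thumb used here (the largest standard deviation should be no more than three times the smallest) is easy to check directly. Below is a minimal sketch in base R, using the standard deviations from the table above. The helper name `sd_rule_ok` is made up for this illustration and is not one of the course functions.

```r
# Rule of thumb for equal variance: largest sd <= 3 * smallest sd.
# sd_rule_ok is a hypothetical helper, not a course function.
sd_rule_ok <- function(sds, ratio = 3) {
  max(sds) <= ratio * min(sds)
}

# Standard deviations of Survival by Cancer, copied from the table above
sds <- c(Stomach = 346.3, Bronchus = 209.9, Colon = 427.2,
         Ovary = 1098.6, Breast = 1239.0)

3 * min(sds)      # 629.7
max(sds)          # 1239
sd_rule_ok(sds)   # FALSE: the equal variance assumption is in doubt
```

The same helper can be applied to any vector of group standard deviations, for example those of a log-transformed response.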


In this class we will use the log transform only. In real life there are a number of other transforms one can try, such as the square root and the inverse.

Note that sometimes a quantitative variable has some values equal to 0, but log(0) does not exist! In this case use log(x+1). Even worse, sometimes numbers are negative, and again log(negative number) does not exist. In that case use log(x+a), with a chosen so that x+a>0 for all values of x.
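Here is a small sketch of these shifts in base R. The data vectors are invented for illustration, and choosing a = 1 - min(x) is just one convenient way to make every shifted value at least 1; the only requirement is x+a>0.

```r
# Invented example values, not from any dataset in these notes
x0 <- c(0, 2, 15, 140)        # contains a zero
log(x0)                       # first value is -Inf, so the plain log is unusable
log(x0 + 1)                   # shift by 1 before taking the log

xneg <- c(-4, -1, 0, 7, 60)   # contains negative values
a <- 1 - min(xneg)            # one possible choice: min(xneg + a) equals 1
a                             # 5
log(xneg + a)                 # every value is now defined
```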


We already know that outliers have a strong effect on the mean and the standard deviation. It might therefore be better to use a summary table based on the median and the IQR:

stat.table(Survival, Cancer, Mean=FALSE)
##          Sample Size Median   IQR
## Stomach           13    124 350.0
## Bronchus          17    155 173.0
## Colon             17    372 330.0
## Ovary              6    406 799.8
## Breast            11   1166 969.5
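To see how strongly a single large value pulls on the mean, here is a short base R sketch. It uses only the six Stomach survival times printed by head(cancersurvival) above, not the full group of 13, so the numbers differ from those in the tables.

```r
# First six Stomach survival times, as shown by head(cancersurvival)
surv6 <- c(124, 42, 25, 45, 412, 51)

mean(surv6)     # 116.5, pulled up by the value 412
median(surv6)   # 48

# Dropping the largest value changes the mean a lot, the median only a little
mean(surv6[surv6 < 412])     # 57.4
median(surv6[surv6 < 412])   # 45

sd(surv6)    # based on squared distances from the mean, so sensitive to 412
IQR(surv6)   # based on quartiles, so less sensitive
```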

Now we can finish the analysis of this dataset:

oneway(log(Survival),Cancer) 

## p value of test of equal means: p = 0.0041 
## Smallest sd:  1    Largest sd : 1.6
  1. Parameters of interest: group means
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. \(\alpha = 0.05\)
  5. Null hypothesis H0: \(\mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5\) (groups have the same means)
  6. Alternative hypothesis Ha: \(\mu_i \ne \mu_j\) for some i and j (at least two groups have different means)
  7. p value = 0.0041
  8. 0.0041 < 0.05, there is some evidence that the group means are not the same, there are differences in the survival times.

Assumptions:

  a. normal plot of residuals: ok
  b. smallest stdev = 1.0, largest stdev = 1.6, 3*1.0 = 3.0 > 1.6: ok

Notice that the transformation solves both the problem of the non-normal residuals and the problem of unequal variances! This is quite often the case, though not always.
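For readers who want to see the same kind of analysis without the course helper functions, here is a rough base R sketch. It is not the implementation of oneway. It uses simulated, right-skewed data (the group names, sample sizes and parameters are all invented), fits a one-way ANOVA to the log-transformed response with aov, and computes the group standard deviations needed for the rule-of-thumb check.

```r
set.seed(1)  # simulated data, for illustration only

# Three invented groups with log-normal (right-skewed) responses
group <- factor(rep(c("A", "B", "C"), each = 15))
y <- rlnorm(45, meanlog = rep(c(5, 5.5, 6.5), each = 15), sdlog = 1)

# One-way ANOVA on the log scale
fit <- aov(log(y) ~ group)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value

# Equal variance check: group standard deviations on the log scale
sds <- tapply(log(y), group, sd)
sds
max(sds) <= 3 * min(sds)

# Normal plot of the residuals (uncomment to draw):
# qqnorm(residuals(fit)); qqline(residuals(fit))
```

Because the data are simulated, the p value here says nothing about the cancer survival data; the sketch only shows where each number in the write-up comes from.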

Case Study: Capacity of Wells

The specific capacity of wells in the Appalachian mountain region of Pennsylvania has been measured in four rock types (Knopman 1990). The rock types are dolomite, limestone, siliclastic and metamorphic. The capacities are recorded in gal/min/ft.

attach(rocks)
head(rocks)
##      Rocks     Capacity
## 1 Dolomite 132.95355630
## 2 Dolomite   0.03995506
## 3 Dolomite   3.09565649
## 4 Dolomite   9.97418198
## 5 Dolomite   4.61817669
## 6 Dolomite   1.50681778
table(Rocks)
## Rocks
##    Dolomite   Limestone Metamorphic Siliclastic 
##          50          50          50          50
bplot(Capacity, Rocks)

Clearly there are some serious outliers. Let’s try the log transform:

bplot(log(Capacity), Rocks)  

and this looks much better.

Summary Statistics

Because we used a transformation we will use the median and IQR:

stat.table(Capacity, Rocks, Mean=FALSE)
##             Sample Size Median IQR
## Dolomite             50    1.7 9.1
## Limestone            50    0.5 1.9
## Siliclastic          50    0.5 1.2
## Metamorphic          50    0.3 1.0

Note that the estimates of the variation differ by quite a lot (1.0 vs 9.1). This again is due to the fact that we have many outliers in the dataset.

Now the test:

oneway(log(Capacity), Rocks)

## p value of test of equal means: p = 0.0067 
## Smallest sd:  1.4    Largest sd : 2.6
  1. Parameters of interest: group means
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, groups have equal variance
  4. \(\alpha = 0.05\)
  5. H0: \(\mu_1 = \mu_2 = \mu_3 = \mu_4\) (no difference in the mean Capacity for different Rocks)
  6. Ha: \(\mu_i \ne \mu_j\) for some i and j (some differences in the mean Capacity for different Rocks)
  7. p-value = 0.0067
  8. p < \(\alpha\), we reject H0, there are some differences in the mean Capacity for different Rocks.

Assumptions: Normal plot looks ok

smallest stdev of log(Capacity): 1.4, largest stdev: 2.6, 3*1.4 = 4.2 > 2.6, ok

Warning

If we had not done a transformation, the results would have been quite different. For example, the differences between the Rocks would not have been statistically significant (p-value = 0.06).

Quantitative - Quantitative

Case Study: Brain and Body Weight of 62 Mammals

Body weight (in kg) and brain weight (in g) of 62 mammals.

head(brainsize)
##                      Animal body.wt.kg brain.wt.g
## 1          African elephant   6654.000     5712.0
## 2 African giant pouched rat      1.000        6.6
## 3                Arctic Fox      3.385       44.5
## 4    Arctic ground squirrel      0.920        5.7
## 5            Asian elephant   2547.000     4603.0
## 6                    Baboon     10.550      179.5

We have two quantitative variables, so we should start with the scatterplot:

attach(brainsize)
brainsize
##                       Animal body.wt.kg brain.wt.g
## 1           African elephant   6654.000    5712.00
## 2  African giant pouched rat      1.000       6.60
## 3                 Arctic Fox      3.385      44.50
## 4     Arctic ground squirrel      0.920       5.70
## 5             Asian elephant   2547.000    4603.00
## 6                     Baboon     10.550     179.50
## 7              Big brown bat      0.023       0.30
## 8            Brazilian tapir    160.000     169.00
## 9                        Cat      3.300      25.60
## 10                Chimpanzee     52.160     440.00
## 11                Chinchilla      0.425       6.40
## 12                       Cow    465.000     423.00
## 13           Desert hedgehog      0.550       2.40
## 14                    Donkey    187.100     419.00
## 15     Eastern American mole      0.075       1.20
## 16                   Echidna      3.000      25.00
## 17         European hedgehog      0.785       3.50
## 18                    Galago      0.200       5.00
## 19                     Genet      1.410      17.50
## 20           Giant armadillo     60.000      81.00
## 21                   Giraffe    529.000     680.00
## 22                      Goat     27.660     115.00
## 23            Golden hamster      0.120       1.00
## 24                   Gorilla    207.000     406.00
## 25                 Gray seal     85.000     325.00
## 26                 Gray wolf     36.330     119.50
## 27           Ground squirrel      0.101       4.00
## 28                Guinea pig      1.040       5.50
## 29                     Horse    521.000     655.00
## 30                    Jaguar    100.000     157.00
## 31                  Kangaroo     35.000      56.00
## 32 Lesser short-tailed shrew      0.005       0.14
## 33          Little brown bat      0.010       0.25
## 34                       Man     62.000    1320.00
## 35                  Mole rat      0.122       3.00
## 36           Mountain beaver      1.350       8.10
## 37                     Mouse      0.023       0.40
## 38                Musk shrew      0.048       0.33
## 39       N. American opossum      1.700       6.30
## 40     Nine-banded armadillo      3.500      10.80
## 41                     Okapi    250.000     490.00
## 42                Owl monkey      0.480      15.50
## 43              Patas monkey     10.000     115.00
## 44                Phanlanger      1.620      11.40
## 45                       Pig    192.000     180.00
## 46                    Rabbit      2.500      12.10
## 47                   Raccoon      4.288      39.20
## 48                       Rat      0.280       1.90
## 49                   Red fox      4.235      50.40
## 50             Rhesus monkey      6.800     179.00
## 51    Rock hyrax (Hetero. b)      0.750      12.30
## 52 Rock hyrax (Procavia hab)      3.600      21.00
## 53                  Roe deer     83.000      98.20
## 54                     Sheep     55.500     175.00
## 55                Slow loris      1.400      12.50
## 56           Star nosed mole      0.060       1.00
## 57                    Tenrec      0.900       2.60
## 58                Tree hyrax      2.000      12.30
## 59                Tree shrew      0.104       2.50
## 60                    Vervet      4.190      58.00
## 61             Water opossum      3.500       3.90
## 62     Yellow-bellied marmot      4.050      17.00
splot(brain.wt.g, body.wt.kg)

Unfortunately, almost all the “space” in the graph is taken up by a few outliers, and it is not even possible to determine whether there is a relationship between the variables. Drawing the marginal plot shows that the problem is outliers in both variables:

mplot(brain.wt.g, body.wt.kg)

As before we can try and fix this problem by using a log transformation:

mplot(log(brain.wt.g), log(body.wt.kg))

which nicely fixes the problem.


Because now we have two quantitative variables the log transform could be applied to x, to y or to both. In general we might see any of these combinations:

both x and y variables are ok

\(\rightarrow\) no transformations needed


mplot(y, x)

x variable is bad, y variable is ok

\(\rightarrow\) log transform x, leave y alone:

mplot(y, log(x))


mplot(y, x)

y variable is bad, x variable is ok

\(\rightarrow\) log transform y, leave x alone:

mplot(log(y), x)


mplot(y, x)

both x and y variables are bad

\(\rightarrow\) log transform x and y

mplot(log(y), log(x))



It is clear from the scatterplot that we have a strong linear relationship between log(Brain) and log(Body), but if we want to, we can now also find Pearson’s correlation coefficient:

cor(log(body.wt.kg), log(brain.wt.g))
## [1] 0.958817

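Doing so for the original data would have been wrong! Pearson’s correlation coefficient is meant for linear relationships, and it is strongly affected by outliers.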
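The following base R sketch repeats the calculation without attach. The two vectors are typed in from the listing of brainsize above; the names `body_kg` and `brain_g` are chosen for this illustration only. It also counts how many animals fall into the lowest 10% of the body weight axis, which is one way to put a number on the “space” problem seen in the scatterplot.

```r
# Body weight (kg) and brain weight (g), copied from the listing above
body_kg <- c(6654, 1, 3.385, 0.92, 2547, 10.55, 0.023, 160, 3.3, 52.16,
             0.425, 465, 0.55, 187.1, 0.075, 3, 0.785, 0.2, 1.41, 60,
             529, 27.66, 0.12, 207, 85, 36.33, 0.101, 1.04, 521, 100,
             35, 0.005, 0.01, 62, 0.122, 1.35, 0.023, 0.048, 1.7, 3.5,
             250, 0.48, 10, 1.62, 192, 2.5, 4.288, 0.28, 4.235, 6.8,
             0.75, 3.6, 83, 55.5, 1.4, 0.06, 0.9, 2, 0.104, 4.19,
             3.5, 4.05)
brain_g <- c(5712, 6.6, 44.5, 5.7, 4603, 179.5, 0.3, 169, 25.6, 440,
             6.4, 423, 2.4, 419, 1.2, 25, 3.5, 5, 17.5, 81,
             680, 115, 1, 406, 325, 119.5, 4, 5.5, 655, 157,
             56, 0.14, 0.25, 1320, 3, 8.1, 0.4, 0.33, 6.3, 10.8,
             490, 15.5, 115, 11.4, 180, 12.1, 39.2, 1.9, 50.4, 179,
             12.3, 21, 98.2, 175, 12.5, 1, 2.6, 12.3, 2.5, 58,
             3.9, 17)
stopifnot(length(body_kg) == 62, length(brain_g) == 62)

# Number of animals in the lowest 10% of the body weight axis:
# all except the two elephants
sum(body_kg <= 0.1 * max(body_kg))   # 60

# Pearson's correlation coefficient on the log scale
r_log <- cor(log(body_kg), log(brain_g))
r_log
```

If the vectors were copied correctly, `r_log` should agree with the value reported above.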