As we saw before, the boxplot of this data shows some severe outliers:
attach(cancersurvival)
head(cancersurvival)
## Survival Cancer
## 1 124 Stomach
## 2 42 Stomach
## 3 25 Stomach
## 4 45 Stomach
## 5 412 Stomach
## 6 51 Stomach
bplot(Survival, Cancer)
These are often an indication that there is a problem with the assumption of normally distributed residuals. In fact, when we run the ANOVA and check the normal plot, we can see that this is the case:
oneway(Survival,Cancer)
## p value of test of equal means: p = 0.000
## Smallest sd: 209.9 Largest sd : 1239
So, what can we do? One possible solution is to use a log transformation:
bplot(log(Survival), Cancer)
This takes care of most of the outliers.
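The effect of the log transform on outliers can also be checked numerically. The sketch below uses base R's `boxplot.stats`, which flags points beyond the boxplot whiskers; that `bplot` (a course routine) uses the same convention is an assumption. The six values are the Stomach survival times printed by `head(cancersurvival)` above, so this is a small illustration only, not the full analysis.

```r
# First six Stomach survival times, copied from head(cancersurvival) above
survival <- c(124, 42, 25, 45, 412, 51)

# boxplot.stats()$out returns the points that lie beyond the whiskers
out_raw <- boxplot.stats(survival)$out
out_log <- boxplot.stats(log(survival))$out

out_raw          # points flagged on the original scale
length(out_log)  # number of points flagged after the log transform
```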
Outliers often have another effect:
stat.table(Survival, Cancer)
## Sample Size Mean Standard Deviation
## Stomach 13 286.0 346.3
## Bronchus 17 211.6 209.9
## Colon 17 457.4 427.2
## Ovary 6 884.3 1098.6
## Breast 11 1395.9 1239.0
The table shows that we also have a problem with the equal variance assumption: smallest stdev = 210, largest stdev = 1239, and 3*210 = 630 < 1239.
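The rule of thumb used here (the largest standard deviation should be no more than three times the smallest) is easy to check in code. In this minimal sketch, `equal_var_ok` is a hypothetical helper name, and the standard deviations are copied from the table above.

```r
# Standard deviations copied from the stat.table output above
sds <- c(Stomach = 346.3, Bronchus = 209.9, Colon = 427.2,
         Ovary = 1098.6, Breast = 1239.0)

# Rule of thumb from the text: equal variance is acceptable if the
# largest sd is at most 3 times the smallest sd
equal_var_ok <- function(s) max(s) <= 3 * min(s)

3 * min(sds)        # the threshold, about 629.7
equal_var_ok(sds)   # FALSE, because 1239 is larger than the threshold
```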
In this class we will use the log transform only. In real life there are a number of other transforms one can try, such as the square root and the inverse.
Note: sometimes some values of a quantitative variable are 0, but log(0) does not exist. In this case use log(x+1). Even worse, sometimes numbers are negative, and log(negative number) does not exist either. In that case use log(x+a), with a chosen so that all x+a>0.
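A minimal sketch of this shifting idea follows. The helper name `shifted_log` and its default shift `a = 1 - min(x)` are illustrative assumptions, not rules from the course. The default is applied only when some value is zero or negative, and it reduces to log(x+1) when the smallest value is 0.

```r
# Hypothetical helper: shift x so that every value is positive, then take logs.
# The default shift a = 1 - min(x) (used only when min(x) <= 0) is an
# illustrative choice, not a rule from the text.
shifted_log <- function(x, a = NULL) {
  if (is.null(a)) a <- if (min(x) > 0) 0 else 1 - min(x)
  stopifnot(all(x + a > 0))  # the log must exist for every value
  log(x + a)
}

shifted_log(c(0, 3, 10))         # smallest value is 0, so this is log(x + 1)
shifted_log(c(-5, 0, 2))         # shift a = 6, smallest value maps to log(1) = 0
shifted_log(c(0, 3, 10), a = 2)  # an explicitly chosen shift
```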
We already know that outliers have a strong effect on the mean and the standard deviation. It might therefore be better to use a summary table based on the median and the IQR:
stat.table(Survival, Cancer, Mean=FALSE)
## Sample Size Median IQR
## Stomach 13 124 350.0
## Bronchus 17 155 173.0
## Colon 17 372 330.0
## Ovary 6 406 799.8
## Breast 11 1166 969.5
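To see why the median and IQR are less sensitive to outliers than the mean and standard deviation, here is a small base-R sketch. The numbers are made up for illustration (they are not the cancer data), and `stat.table` itself is a course routine whose internals are not shown here.

```r
# Made-up numbers for illustration only: group A contains one extreme value
y <- c(10, 20, 30, 1000, 5, 15, 25, 35)
g <- factor(rep(c("A", "B"), each = 4))

# One row per group, one column per summary statistic
summary_tab <- t(sapply(split(y, g), function(v) {
  c(Mean = mean(v), SD = sd(v), Median = median(v), IQR = IQR(v))
}))
round(summary_tab, 1)
```

In group A the single value 1000 pulls the mean far above the median, while group B, which has no outlier, has a mean and median that agree.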
Now we can finish the analysis of this dataset:
oneway(log(Survival),Cancer)
## p value of test of equal means: p = 0.0041
## Smallest sd: 1 Largest sd : 1.6
Notice that the transformation solves both the problem of the non-normal residuals and the problem of unequal variances! This is quite often the case, though not always.
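For readers without the course routine `oneway`, a rough base-R counterpart can be sketched with `aov`. This is an assumption about what `oneway` does, not its actual implementation, and the data below are simulated rather than the cancer survival data.

```r
set.seed(1)  # simulated data, not the cancer survival data

# Three groups of right-skewed (lognormal) values with different centers
group <- factor(rep(c("A", "B", "C"), each = 15))
y <- rlnorm(45, meanlog = rep(c(5, 5.5, 6.5), each = 15), sdlog = 1)

# One-way ANOVA on the log scale
fit <- aov(log(y) ~ group)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value

# Standard deviations per group on the log scale, for the equal-variance check
sds <- tapply(log(y), group, sd)
max(sds) <= 3 * min(sds)

# Normal plot of the residuals (uncomment to draw)
# qqnorm(residuals(fit)); qqline(residuals(fit))
```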
The specific capacity of wells in the Appalachian mountain region of Pennsylvania has been measured in four rock types (Knopman 1990). The rock types are dolomite, limestone, siliclastic and metamorphic. The capacities are recorded in gal/min/ft.
attach(rocks)
head(rocks)
## Rocks Capacity
## 1 Dolomite 132.95355630
## 2 Dolomite 0.03995506
## 3 Dolomite 3.09565649
## 4 Dolomite 9.97418198
## 5 Dolomite 4.61817669
## 6 Dolomite 1.50681778
table(Rocks)
## Rocks
## Dolomite Limestone Metamorphic Siliclastic
## 50 50 50 50
bplot(Capacity, Rocks)
Clearly there are some serious outliers. Let’s try the log transform:
bplot(log(Capacity), Rocks)
and this looks much better.
Summary Statistics
Because we used a transformation, we will use the median and IQR:
stat.table(Capacity, Rocks, Mean=FALSE)
## Sample Size Median IQR
## Dolomite 50 1.7 9.1
## Limestone 50 0.5 1.9
## Siliclastic 50 0.5 1.2
## Metamorphic 50 0.3 1.0
Note that the estimates of the variation differ by quite a lot (1.0 vs 9.1). This again is due to the fact that we have many outliers in the dataset.
Now the test:
oneway(log(Capacity), Rocks)
## p value of test of equal means: p = 0.0067
## Smallest sd: 1.4 Largest sd : 2.6
Assumptions: Normal plot looks ok
smallest stdev of log(Capacity): 1.4, largest stdev: 2.6, 3*1.4 = 4.2 > 2.6, ok
Warning
If we had not done a transformation, the results would have been quite different. For example, the difference between the rock types would not have been statistically significant (p-value = 0.06).
Brain weight (in g) and body weight (in kg) of 62 mammals.
head(brainsize)
## Animal body.wt.kg brain.wt.g
## 1 African elephant 6654.000 5712.0
## 2 African giant pouched rat 1.000 6.6
## 3 Arctic Fox 3.385 44.5
## 4 Arctic ground squirrel 0.920 5.7
## 5 Asian elephant 2547.000 4603.0
## 6 Baboon 10.550 179.5
We have two quantitative variables, so we should start with the scatterplot:
attach(brainsize)
brainsize
## Animal body.wt.kg brain.wt.g
## 1 African elephant 6654.000 5712.00
## 2 African giant pouched rat 1.000 6.60
## 3 Arctic Fox 3.385 44.50
## 4 Arctic ground squirrel 0.920 5.70
## 5 Asian elephant 2547.000 4603.00
## 6 Baboon 10.550 179.50
## 7 Big brown bat 0.023 0.30
## 8 Brazilian tapir 160.000 169.00
## 9 Cat 3.300 25.60
## 10 Chimpanzee 52.160 440.00
## 11 Chinchilla 0.425 6.40
## 12 Cow 465.000 423.00
## 13 Desert hedgehog 0.550 2.40
## 14 Donkey 187.100 419.00
## 15 Eastern American mole 0.075 1.20
## 16 Echidna 3.000 25.00
## 17 European hedgehog 0.785 3.50
## 18 Galago 0.200 5.00
## 19 Genet 1.410 17.50
## 20 Giant armadillo 60.000 81.00
## 21 Giraffe 529.000 680.00
## 22 Goat 27.660 115.00
## 23 Golden hamster 0.120 1.00
## 24 Gorilla 207.000 406.00
## 25 Gray seal 85.000 325.00
## 26 Gray wolf 36.330 119.50
## 27 Ground squirrel 0.101 4.00
## 28 Guinea pig 1.040 5.50
## 29 Horse 521.000 655.00
## 30 Jaguar 100.000 157.00
## 31 Kangaroo 35.000 56.00
## 32 Lesser short-tailed shrew 0.005 0.14
## 33 Little brown bat 0.010 0.25
## 34 Man 62.000 1320.00
## 35 Mole rat 0.122 3.00
## 36 Mountain beaver 1.350 8.10
## 37 Mouse 0.023 0.40
## 38 Musk shrew 0.048 0.33
## 39 N. American opossum 1.700 6.30
## 40 Nine-banded armadillo 3.500 10.80
## 41 Okapi 250.000 490.00
## 42 Owl monkey 0.480 15.50
## 43 Patas monkey 10.000 115.00
## 44 Phanlanger 1.620 11.40
## 45 Pig 192.000 180.00
## 46 Rabbit 2.500 12.10
## 47 Raccoon 4.288 39.20
## 48 Rat 0.280 1.90
## 49 Red fox 4.235 50.40
## 50 Rhesus monkey 6.800 179.00
## 51 Rock hyrax (Hetero. b) 0.750 12.30
## 52 Rock hyrax (Procavia hab) 3.600 21.00
## 53 Roe deer 83.000 98.20
## 54 Sheep 55.500 175.00
## 55 Slow loris 1.400 12.50
## 56 Star nosed mole 0.060 1.00
## 57 Tenrec 0.900 2.60
## 58 Tree hyrax 2.000 12.30
## 59 Tree shrew 0.104 2.50
## 60 Vervet 4.190 58.00
## 61 Water opossum 3.500 3.90
## 62 Yellow-bellied marmot 4.050 17.00
splot(brain.wt.g, body.wt.kg)
Unfortunately, almost all the “space” in the graph is taken up by a few outliers, so it is not even possible to determine whether there is a relationship between the variables. Drawing the marginal plot shows that the problem is outliers in both variables:
mplot(brain.wt.g, body.wt.kg)
As before we can try and fix this problem by using a log transformation:
mplot(log(brain.wt.g), log(body.wt.kg))
which nicely fixes the problem.
Because we now have two quantitative variables, the log transform could be applied to x, to y, or to both. In general we might see any of these combinations:
mplot(y, x)
both x and y variables are ok
\(\rightarrow\) no transformations needed
mplot(y, x)
x variable is bad, y variable is ok
\(\rightarrow\) log transform x, leave y alone:
mplot(y, log(x))
mplot(y, x)
y variable is bad, x variable is ok
\(\rightarrow\) log transform y, leave x alone:
mplot(log(y), x)
mplot(y, x)
both x and y variables are bad
\(\rightarrow\) log transform x and y
mplot(log(y), log(x))
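The four cases can be collected in one small helper. In this sketch, `log_choice` is a hypothetical name and the values are made up; it only applies the requested transforms and leaves the plotting to the reader.

```r
# Hypothetical helper: return x and y with the requested log transforms applied
log_choice <- function(x, y, log_x = FALSE, log_y = FALSE) {
  data.frame(x = if (log_x) log(x) else x,
             y = if (log_y) log(y) else y)
}

# Made-up values: x has one very large value, y does not
x <- c(1, 2, 4, 8, 1000)
y <- c(3, 5, 4, 6, 9)

d <- log_choice(x, y, log_x = TRUE)  # the "x is bad, y is ok" case
d

# Scatterplot of the transformed data (uncomment to draw)
# plot(d$x, d$y, xlab = "log(x)", ylab = "y")
```

Base R's `plot` also accepts `log = "x"`, `log = "y"` or `log = "xy"`, which draws the original values on logarithmic axes instead of transforming the data first.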
It is clear from the scatterplot that we have a strong linear relationship between log(brain weight) and log(body weight). If we want, we can now also find Pearson’s correlation coefficient:
cor(log(body.wt.kg), log(brain.wt.g))
## [1] 0.958817
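Doing so for the original data would have been wrong, because Pearson’s correlation coefficient is strongly affected by outliers like the ones we saw in the original scatterplot!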
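One way to get a feeling for this is to compute the correlation on both scales, with and without the two most extreme animals. The sketch below uses ten animals copied from the listing above (the two elephants, Man, Cow, Cat, Mouse, Rat, Rabbit, Horse and Big brown bat). The choice of subset is arbitrary, so the numbers will differ from those for the full dataset.

```r
# Ten animals copied from the brainsize listing above
body_kg <- c(6654, 2547, 62, 465, 3.3, 0.023, 0.28, 2.5, 521, 0.023)
brain_g <- c(5712, 4603, 1320, 423, 25.6, 0.4, 1.9, 12.1, 655, 0.3)

# Pearson correlation on the original and on the log scale
r_raw <- cor(body_kg, brain_g)
r_log <- cor(log(body_kg), log(brain_g))

# The same two correlations after dropping the two elephants (first two values)
r_raw_no_eleph <- cor(body_kg[-(1:2)], brain_g[-(1:2)])
r_log_no_eleph <- cor(log(body_kg[-(1:2)]), log(brain_g[-(1:2)]))

round(c(raw = r_raw, raw_no_elephants = r_raw_no_eleph,
        log = r_log, log_no_elephants = r_log_no_eleph), 3)
```

Comparing the first pair of numbers with the second pair shows how much each version of the coefficient depends on just two observations.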