Many classic methods of analysis (t tests, F tests, etc.) assume that the data come from a normal distribution. What can we do if that assumption fails? There are two ways to proceed:
A data transformation is any mathematical function applied to the data. Sometimes such a transformation can be used to make the distribution more normal:
Consider the data set brainsize, which has the body weights (in kg) and the brain weights (in grams) of 62 mammals:
kable.nice(brainsize)
Animal | body.wt.kg | brain.wt.g |
---|---|---|
African elephant | 6654.000 | 5712.00 |
African giant pouched rat | 1.000 | 6.60 |
Arctic Fox | 3.385 | 44.50 |
Arctic ground squirrel | 0.920 | 5.70 |
Asian elephant | 2547.000 | 4603.00 |
Baboon | 10.550 | 179.50 |
Big brown bat | 0.023 | 0.30 |
Brazilian tapir | 160.000 | 169.00 |
Cat | 3.300 | 25.60 |
Chimpanzee | 52.160 | 440.00 |
Chinchilla | 0.425 | 6.40 |
Cow | 465.000 | 423.00 |
Desert hedgehog | 0.550 | 2.40 |
Donkey | 187.100 | 419.00 |
Eastern American mole | 0.075 | 1.20 |
Echidna | 3.000 | 25.00 |
European hedgehog | 0.785 | 3.50 |
Galago | 0.200 | 5.00 |
Genet | 1.410 | 17.50 |
Giant armadillo | 60.000 | 81.00 |
Giraffe | 529.000 | 680.00 |
Goat | 27.660 | 115.00 |
Golden hamster | 0.120 | 1.00 |
Gorilla | 207.000 | 406.00 |
Gray seal | 85.000 | 325.00 |
Gray wolf | 36.330 | 119.50 |
Ground squirrel | 0.101 | 4.00 |
Guinea pig | 1.040 | 5.50 |
Horse | 521.000 | 655.00 |
Jaguar | 100.000 | 157.00 |
Kangaroo | 35.000 | 56.00 |
Lesser short-tailed shrew | 0.005 | 0.14 |
Little brown bat | 0.010 | 0.25 |
Man | 62.000 | 1320.00 |
Mole rat | 0.122 | 3.00 |
Mountain beaver | 1.350 | 8.10 |
Mouse | 0.023 | 0.40 |
Musk shrew | 0.048 | 0.33 |
N. American opossum | 1.700 | 6.30 |
Nine-banded armadillo | 3.500 | 10.80 |
Okapi | 250.000 | 490.00 |
Owl monkey | 0.480 | 15.50 |
Patas monkey | 10.000 | 115.00 |
Phanlanger | 1.620 | 11.40 |
Pig | 192.000 | 180.00 |
Rabbit | 2.500 | 12.10 |
Raccoon | 4.288 | 39.20 |
Rat | 0.280 | 1.90 |
Red fox | 4.235 | 50.40 |
Rhesus monkey | 6.800 | 179.00 |
Rock hyrax (Hetero. b) | 0.750 | 12.30 |
Rock hyrax (Procavia hab) | 3.600 | 21.00 |
Roe deer | 83.000 | 98.20 |
Sheep | 55.500 | 175.00 |
Slow loris | 1.400 | 12.50 |
Star nosed mole | 0.060 | 1.00 |
Tenrec | 0.900 | 2.60 |
Tree hyrax | 2.000 | 12.30 |
Tree shrew | 0.104 | 2.50 |
Vervet | 4.190 | 58.00 |
Water opossum | 3.500 | 3.90 |
Yellow-bellied marmot | 4.050 | 17.00 |
Let’s say we want to find a \(95\%\) confidence interval for the mean body weight. If we want to use the t.test method we need to check normality:
library(grid)
plt1 <- ggplot(data=brainsize, aes(sample=body.wt.kg)) +
  geom_qq() + geom_qq_line()
plt2 <- ggplot(data=brainsize, aes(1, body.wt.kg)) +
  geom_boxplot()
pushViewport(viewport(layout = grid.layout(1, 2)))
print(plt2,
  vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plt1,
  vp=viewport(layout.pos.row=1, layout.pos.col=2))
and this is clearly non-normal. However
brainsize$logbody <- log(brainsize$body.wt.kg)
plt1 <- ggplot(data=brainsize, aes(sample=logbody)) +
  geom_qq() + geom_qq_line()
plt2 <- ggplot(data=brainsize, aes(1, logbody)) +
  geom_boxplot()
pushViewport(viewport(layout = grid.layout(1, 2)))
print(plt2,
  vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plt1,
  vp=viewport(layout.pos.row=1, layout.pos.col=2))
shows that \(\log\)(body.wt.kg) is at least approximately normally distributed. So now
round(as.numeric(t.test(brainsize$logbody)$conf.int), 3)
## [1] 0.567 2.163
but this is a confidence interval for the mean of \(\log\)(body.wt.kg); we want one for the mean of body.wt.kg. Now
\[ 1-\alpha=P \left( L( \log X ) < \theta < U(\log X) \right) =\\ P \left( \exp \{ L( \log X )\} < \exp \{\theta\} < \exp \{U(\log X)\} \right) \\ \] and so the question is whether
\[ E[X] = \mu = \exp \{\theta\} = \exp \{E[\log X]\} \] and in general the answer is no.
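In fact, a standard result for the lognormal distribution makes the discrepancy explicit (this is a general fact, not something computed from the brainsize data): if \(\log X \sim N(\theta, \sigma^2)\), then

\[ E[X] = \exp \{ \theta + \sigma^2/2 \} > \exp \{\theta\} = \exp \{E[\log X]\} \]

so exponentiating the endpoints yields a confidence interval for \(e^\theta\), which is the median of \(X\), not its mean.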
Above we used a log transform. In principle any function might work, and there is even a way to pick the best from a list: In 1964 Box and Cox suggested a family of transformations of the form
\[ T_{\lambda }(x)=\left\{ \begin{array}{cc} \frac{x^{\lambda }-1}{\lambda } & \lambda \neq 0 \\ \log x & \lambda =0 \end{array} \right. \] Notice that this is continuous in \(\lambda\): writing \(x^{\lambda}=e^{\lambda \log x}\) and applying l'Hospital's rule in \(\lambda\) gives \(\lim_{\lambda \rightarrow 0} T_\lambda (x)=\lim_{\lambda \rightarrow 0} (\log x) e^{\lambda \log x}=\log x=T_0(x)\). Up to shift and scale the family includes \(1/x\), \(\sqrt{x}\), \(x^k\) etc. To pick the best use
library(MASS)
fit <- lm(body.wt.kg ~ 1, data=brainsize)
boxcox(fit, lambda=seq(-0.25, 0.25, 0.01))
the vertical lines give a confidence interval for \(\lambda\), and any value inside this interval is acceptable. It seems that for our data \(\lambda=0\), that is the log transform, is indeed appropriate.
Note that the boxcox command requires a fit object generated by lm or aov, but a single vector can be handled as above by fitting an intercept-only model.
Name | First Nobel Prize | Second Nobel Prize |
---|---|---|
Linus Pauling | Chemistry 1954 (Studies Of Molecular Structure And The Chemical Bond) | Peace 1962 (Fight Against Atomic Testing) |
Frederick Sanger | Chemistry 1958 (Determining Structure Of Insulin Molecule) | Chemistry 1980 (Biochemical Studies Of Nucleic Acids) |
Marie Sklodowska Curie | Physics 1903 (Research On Radiation Phenomena) | Chemistry 1911 (Discovery Of The Elements Radium And Polonium) |
John Bardeen | Physics 1956 (Development Of The Transistor Effect) | Physics 1972 (Theory Of Superconductivity) |
Patients with advanced cancers of the stomach, bronchus, colon, ovary or breast were treated with ascorbate. The purpose of the study was to determine if patient survival differed with respect to the organ affected by the cancer.
Cameron, E. and Pauling, L. (1978). Supplemental ascorbate in the supportive treatment of cancer: re-evaluation of prolongation of survival times in terminal human cancer. Proceedings of the National Academy of Sciences USA, 75, 4538-4542.
head(cancersurvival)
## Survival Cancer
## 1 124 Stomach
## 2 42 Stomach
## 3 25 Stomach
## 4 45 Stomach
## 5 412 Stomach
## 6 51 Stomach
table(cancersurvival$Cancer)
##
## Breast Bronchus Colon Ovary Stomach
## 11 17 17 6 13
Let’s order the cancers by means:
mn <- sort(tapply(cancersurvival$Survival,
cancersurvival$Cancer, mean))
cancersurvival$Cancer <- factor(cancersurvival$Cancer,
levels = unique(names(mn)),
ordered = TRUE)
ggplot(data=cancersurvival , aes(Cancer, Survival)) +
geom_boxplot()
There are a number of outliers, so we need a transformation. Let's see:
fit <- aov(Survival~Cancer, data=cancersurvival)
boxcox(fit, lambda=seq(-0.5, 0.5, 0.1))
so again the log transform should work:
cancersurvival$logsurvival <- log(cancersurvival$Survival)
ggplot(data=cancersurvival , aes(Cancer, logsurvival)) +
geom_boxplot()
For the table of summary statistics we show the sample sizes together with the means and standard deviations of the survival times in each group. Note that with outliers present one might instead prefer the median and IQR of the original data, or the mean and standard deviation of the log-transformed data:
out <- matrix(0, 5, 3)
colnames(out) <- c("n", "Mean", "SD")
rownames(out) <- as.character(levels(cancersurvival$Cancer))
out[, 1] <- tapply(cancersurvival$Survival,
  cancersurvival$Cancer, length)
out[, 2] <- round(tapply(cancersurvival$Survival,
  cancersurvival$Cancer, mean), 2)
out[, 3] <- round(tapply(cancersurvival$Survival,
  cancersurvival$Cancer, sd), 2)
kable.nice(out)
Cancer | n | Mean | SD |
---|---|---|---|
Bronchus | 17 | 211.59 | 209.86 |
Stomach | 13 | 286.00 | 346.31 |
Colon | 17 | 457.41 | 427.17 |
Ovary | 6 | 884.33 | 1098.58 |
Breast | 11 | 1395.91 | 1238.97 |
Now for the test:
fit <- aov(logsurvival~Cancer, data=cancersurvival)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## Cancer 4 24.49 6.122 4.286 0.00412
## Residuals 59 84.27 1.428
The p-value is 0.0041, so we reject the null hypothesis: there are differences in the mean log survival times.
Check the assumptions:
df <- data.frame(Residuals=resid(fit))
ggplot(data=df, aes(sample=Residuals)) +
geom_qq() + geom_qq_line()
ok
round(sort(tapply(cancersurvival$logsurvival,
cancersurvival$Cancer, mean))[c(1, 5)], 1)
## Bronchus Breast
## 5.0 6.6
so the mean log survival ranges from 5.0 (Bronchus) to 6.6 (Breast).
and pairwise comparisons:
tuk <- TukeyHSD(fit)
names(tuk[[1]][tuk[[1]][,4]<0.05, 4])
## [1] "Breast-Bronchus" "Breast-Stomach"
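Because the ANOVA was run on the log scale, a Tukey difference \(d\) in mean log survival corresponds to a multiplicative factor \(e^d\) between typical survival times on the original scale. A minimal sketch of this back-transform, using simulated data (the full cancersurvival data set is not reproduced here):

```r
# simulated toy data (NOT the cancersurvival data): two groups whose
# log survival times are normal with different means
set.seed(1)
df <- data.frame(
  y = exp(rnorm(30, rep(c(5, 6), each = 15), 1)),
  grp = factor(rep(c("a", "b"), each = 15)))
fit.toy <- aov(log(y) ~ grp, data = df)
tuk.toy <- TukeyHSD(fit.toy)
# back-transform: estimated ratio of typical survival times, b vs a
exp(tuk.toy$grp[, "diff"])
```

The same idea applies to the Tukey confidence limits: exponentiating lwr and upr gives an interval for the ratio.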
A different way to proceed is to use a method that does not require a normal distribution. These are the so-called nonparametric methods. A number of them have been developed over the years as alternatives to the standard normal-based methods.
In general nonparametric methods are based on the ranks of the observations and focus on the median instead of the mean.
The data were collected by Herman Callaert at Hasselt University in Belgium. The euro coins were borrowed at a local bank. Two assistants, Sofie Bogaerts and Saskia Litiere, weighed the coins one by one, in laboratory conditions, on a Sartorius BP 310 weighing scale.
Say we are told that a one euro coin is supposed to weigh 7.5 grams. Do the data support that claim?
kable.nice(head(euros))
Weight | Roll |
---|---|
7.512 | 1 |
7.502 | 1 |
7.461 | 1 |
7.562 | 1 |
7.528 | 1 |
7.459 | 1 |
ggplot(euros, aes(y=Weight, x="")) +
  geom_boxplot() +
  labs(x="")
The boxplot of Weight shows severe outliers, so the usual one-sample t test won't work. Unfortunately the log transformation does not help here either. This is not a surprise, by the way, because the outliers are on both sides of the box.
The test that works here is the Wilcoxon Signed Rank Test.
The details are
wilcox.test(euros$Weight, mu=7.5)
##
## Wilcoxon signed rank test with continuity correction
##
## data: euros$Weight
## V = 1595000, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 7.5
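To see what the statistic V reported above actually is, here is a minimal sketch on a small made-up sample (not the euros data), assuming no ties: take the deviations from \(\mu\), rank their absolute values, and sum the ranks that belong to positive deviations.

```r
# hypothetical toy weights, chosen so that no ties occur
x <- c(7.51, 7.48, 7.53, 7.46, 7.55)
mu <- 7.5
d <- x - mu
# rank the absolute deviations; V is the sum of the ranks of the
# positive deviations
V <- sum(rank(abs(d))[d > 0])
V
# wilcox.test reports the same statistic
as.numeric(wilcox.test(x, mu = mu)$statistic)
```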
Actually, in this data set we could still have used the usual one-sample t test (also with a p-value that is essentially 0) because we have a very large sample (n=2000). In general, though, it is never clear exactly how large a sample needs to be to "overcome" some outliers, so these nonparametric tests are always a safe alternative.
If using the t test is sometimes wrong but the Wilcoxon Signed Rank Test always works, why not just always use this test and be safe? The answer is that the t test has higher power:
mu <- seq(0, 1.5, length=100)
pw <- matrix(0, 100, 2)
colnames(pw) <- c("t test", "Wilcoxon")
B <- 10000
for(i in 1:100) {
for(j in 1:B) {
x <- rnorm(10, mu[i])
pval <- t.test(x, mu=0)$p.value
if(pval<0.05) pw[i, 1] <- pw[i, 1]+1
pval <- wilcox.test(x, mu=0)$p.value
if(pval<0.05) pw[i, 2] <- pw[i, 2]+1
}
}
pw <- 100*pw/B
df <- data.frame(
Mean=c(mu, mu),
Power=c(pw[, 1], pw[, 2]),
Method=rep(c("t test", "Wilcoxon"), each=100))
ggplot(df, aes(Mean, Power, color=Method)) +
geom_line()
In real life the power of the nonparametric tests is often almost as high as the power of the standard tests, so they should always be used if there is a question about the normal assumption.
If we wanted a 90% confidence interval for the median we could use
round(as.numeric(
wilcox.test(euros$Weight,
conf.int=TRUE,
conf.level=0.9)$conf.int), 3)
## [1] 7.520 7.522
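The point estimate this interval is centered on is the Hodges-Lehmann estimator: the median of all pairwise (Walsh) averages \((x_i+x_j)/2\), \(i \le j\). A toy sketch with made-up numbers (not the euros data):

```r
x <- c(7.51, 7.48, 7.53, 7.46, 7.55)
# all Walsh averages (x[i] + x[j])/2 with i <= j
walsh <- outer(x, x, "+")/2
hl <- median(walsh[lower.tri(walsh, diag = TRUE)])
hl
# compare: wilcox.test(x, conf.int = TRUE)$estimate reports this
# value as the "(pseudo)median"
```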
just as with the t.test command, the wilcox.test command can also be used to compare two populations, although it compares their locations (medians) rather than their means:
Do the weights of the coins in rolls 7 and 8 differ in location?
x <- euros$Weight[euros$Roll==7]
y <- euros$Weight[euros$Roll==8]
wilcox.test(x, y)
##
## Wilcoxon rank sum test with continuity correction
##
## data: x and y
## W = 35619, p-value = 0.00684
## alternative hypothesis: true location shift is not equal to 0
Say we want to know whether the coins in the 8 different rolls have the same typical (median) weight. The nonparametric alternative to the oneway ANOVA is the Kruskal-Wallis test:
kruskal.test(Weight~factor(Roll), data=euros)
##
## Kruskal-Wallis rank sum test
##
## data: Weight by factor(Roll)
## Kruskal-Wallis chi-squared = 97.5, df = 7, p-value < 2.2e-16
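Like the Wilcoxon tests, the Kruskal-Wallis test works on the ranks of the pooled data. A minimal sketch of the statistic on made-up numbers (no ties), using the textbook formula \(H=\frac{12}{N(N+1)}\sum_i n_i\left(\bar{R}_i-\frac{N+1}{2}\right)^2\):

```r
# toy data (not the euros data), two groups, no ties
x <- c(1.2, 3.4, 2.2, 5.1, 4.4, 6.0)
g <- factor(c("a", "a", "a", "b", "b", "b"))
r <- rank(x)                   # ranks of the pooled sample
N <- length(x)
Rbar <- tapply(r, g, mean)     # average rank in each group
n <- tapply(r, g, length)
H <- 12/(N*(N+1)) * sum(n * (Rbar - (N+1)/2)^2)
H
as.numeric(kruskal.test(x, g)$statistic)  # same value when no ties
```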
A US company manufactures equipment that is used in the production of semiconductors. The firm is considering a costly redesign that will improve the performance of its equipment. The performance is characterized as mean time between failures (MTBF). Most of the company's customers are in the USA, Europe and Japan, and there is anecdotal evidence that the Japanese customers typically get better performance than the users in the USA and Europe.
kable.nice(head(culture))
Country | MTBF |
---|---|
USA | 120.5 |
USA | 127.1 |
USA | 128.1 |
USA | 129.7 |
USA | 130.8 |
USA | 132.4 |
table(culture$Country)
##
## Europe Japan USA
## 15 12 20
ggplot(culture, aes(Country, MTBF)) +
geom_boxplot()
There is a problem with the normal assumption. We can try to fix this with the log transform, but again this does not work.
Because none of the transformations worked we will use the nonparametric Kruskal-Wallis test:
kruskal.test(MTBF~factor(Country), data=culture)
##
## Kruskal-Wallis rank sum test
##
## data: MTBF by factor(Country)
## Kruskal-Wallis chi-squared = 13.806, df = 2, p-value = 0.001005
If we had just done the ANOVA, Country would not have been statistically significant (p-value = 0.098), but if you remember to check the normal plot you will see that there is a problem with that analysis.