Transformations and Nonparametric Methods

Many classic methods of analysis (t tests, F tests, etc.) assume that the data come from a normal distribution. What can we do if that assumption fails? There are two ways to proceed:

Transformations

A data transformation is any mathematical function applied to the data. Sometimes such a transformation can be used to make the distribution more normal:

Example: Body and Brain Weight

Consider the data set brainsize, which has the weights of the body (in kg) and of the brain (in grams) of 62 mammals:

kable.nice(brainsize)
Animal body.wt.kg brain.wt.g
African elephant 6654.000 5712.00
African giant pouched rat 1.000 6.60
Arctic Fox 3.385 44.50
Arctic ground squirrel 0.920 5.70
Asian elephant 2547.000 4603.00
Baboon 10.550 179.50
Big brown bat 0.023 0.30
Brazilian tapir 160.000 169.00
Cat 3.300 25.60
Chimpanzee 52.160 440.00
Chinchilla 0.425 6.40
Cow 465.000 423.00
Desert hedgehog 0.550 2.40
Donkey 187.100 419.00
Eastern American mole 0.075 1.20
Echidna 3.000 25.00
European hedgehog 0.785 3.50
Galago 0.200 5.00
Genet 1.410 17.50
Giant armadillo 60.000 81.00
Giraffe 529.000 680.00
Goat 27.660 115.00
Golden hamster 0.120 1.00
Gorilla 207.000 406.00
Gray seal 85.000 325.00
Gray wolf 36.330 119.50
Ground squirrel 0.101 4.00
Guinea pig 1.040 5.50
Horse 521.000 655.00
Jaguar 100.000 157.00
Kangaroo 35.000 56.00
Lesser short-tailed shrew 0.005 0.14
Little brown bat 0.010 0.25
Man 62.000 1320.00
Mole rat 0.122 3.00
Mountain beaver 1.350 8.10
Mouse 0.023 0.40
Musk shrew 0.048 0.33
N. American opossum 1.700 6.30
Nine-banded armadillo 3.500 10.80
Okapi 250.000 490.00
Owl monkey 0.480 15.50
Patas monkey 10.000 115.00
Phanlanger 1.620 11.40
Pig 192.000 180.00
Rabbit 2.500 12.10
Raccoon 4.288 39.20
Rat 0.280 1.90
Red fox 4.235 50.40
Rhesus monkey 6.800 179.00
Rock hyrax (Hetero. b) 0.750 12.30
Rock hyrax (Procavia hab) 3.600 21.00
Roe deer 83.000 98.20
Sheep 55.500 175.00
Slow loris 1.400 12.50
Star nosed mole 0.060 1.00
Tenrec 0.900 2.60
Tree hyrax 2.000 12.30
Tree shrew 0.104 2.50
Vervet 4.190 58.00
Water opossum 3.500 3.90
Yellow-bellied marmot 4.050 17.00

Let’s say we want to find a \(95\%\) confidence interval for the mean body weight. If we want to use the t.test method we need to check normality:

library(grid)  # for pushViewport/viewport
plt1 <- ggplot(data=brainsize, aes(sample=body.wt.kg)) +
  geom_qq() + geom_qq_line()
plt2 <- ggplot(data=brainsize, aes(1, body.wt.kg)) +
  geom_boxplot()
pushViewport(viewport(layout = grid.layout(1, 2)))
print(plt2,
  vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plt1,
  vp=viewport(layout.pos.row=1, layout.pos.col=2))        

and this is clearly non-normal. However

brainsize$logbody <- log(brainsize$body.wt.kg)
plt1 <- ggplot(data=brainsize, aes(sample=logbody)) +
  geom_qq() + geom_qq_line()
plt2 <- ggplot(data=brainsize, aes(1, logbody)) +
  geom_boxplot()
pushViewport(viewport(layout = grid.layout(1, 2)))
print(plt2,
  vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plt1,
  vp=viewport(layout.pos.row=1, layout.pos.col=2))        

shows that \(\log\)(body.wt.kg) is indeed approximately normally distributed. So now

round(as.numeric(t.test(brainsize$logbody)$conf.int), 3)
## [1] 0.567 2.163

but this is a confidence interval for the mean of \(\log\)(body.wt.kg); we want one for body.wt.kg. Now

\[ 1-\alpha=P \left( L( \log X ) < \theta < U(\log X) \right) =\\ P \left( \exp \{ L( \log X )\} < \exp \{\theta\} < \exp \{U(\log X)\} \right) \] and so the question is whether

\[ E[X] = \mu = \exp \{\theta\} = \exp \{E[\log X]\} \] and in general the answer is no.
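In fact, if \(\log X\) is normal with mean \(\theta\) and variance \(\sigma^2\), then \(E[X]=\exp\{\theta+\sigma^2/2\}>\exp\{\theta\}\) by Jensen's inequality, while \(\exp\{\theta\}\) is the *median* of \(X\). So exponentiating the interval above gives a confidence interval for the median, not the mean. A small simulation with made-up parameters illustrates the gap:

```r
# log X ~ N(1, 1), so the mean of X is exp(1 + 1/2) while the median is exp(1)
set.seed(111)
x <- exp(rnorm(1e5, mean=1, sd=1))
exp(mean(log(x)))   # estimates the median exp(1), about 2.7
mean(x)             # estimates the mean exp(1.5), about 4.5
```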

Box-Cox Transforms

Above we used a log transform. In principle any function might work, and there is even a way to pick the best one: in 1964 Box and Cox suggested a family of transformations of the form

\[ T_{\lambda }(x)=\left\{ \begin{array}{cc} \frac{x^{\lambda }-1}{\lambda } & \lambda \neq 0 \\ \log x & \lambda =0 \end{array} \right. \] Notice that this is continuous in \(\lambda\): \(\lim_{\lambda \rightarrow 0} T_\lambda (x)=T_0(x)\) and it includes \(1/x, \sqrt{x}, x^k\) etc. To pick the best use
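The continuity in \(\lambda\) is easy to verify: writing \(x^{\lambda}=e^{\lambda \log x}\) and expanding the exponential,

\[ \frac{x^{\lambda }-1}{\lambda }=\frac{e^{\lambda \log x}-1}{\lambda }=\frac{\lambda \log x+O(\lambda ^{2})}{\lambda }\rightarrow \log x \text{ as } \lambda \rightarrow 0 \]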

library(MASS)
fit <- lm(body.wt.kg ~ 1, data=brainsize)  # intercept-only model
boxcox(fit, lambda=seq(-0.25, 0.25, 0.01))

the vertical lines give a confidence interval for \(\lambda\), and any value inside the interval is acceptable. It seems that for our data \(\lambda=0\), that is the log transform, is indeed appropriate.

Note that the boxcox command requires a fit object generated by lm or aov, but as above it can also be used for a single vector by fitting an intercept-only model.
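Rather than reading the interval off the plot, one can extract it numerically: with plotit=FALSE, boxcox returns the \(\lambda\) grid and the profile log-likelihood, and the confidence interval consists of all \(\lambda\) whose log-likelihood lies within \(\chi^2_{1,\,0.95}/2\) of the maximum. A sketch with simulated lognormal data (so the true \(\lambda\) is 0):

```r
library(MASS)
set.seed(17)
x <- exp(rnorm(200))   # simulated data; the true lambda is 0
bc <- boxcox(lm(x ~ 1), lambda=seq(-0.5, 0.5, 0.01), plotit=FALSE)
# keep all lambda within qchisq(0.95, 1)/2 of the maximum log-likelihood
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1)/2])
ci   # typically an interval containing 0
```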

Example: Cancer Survival

Name First Nobel Prize Second Nobel Prize
Linus Pauling Chemistry 1954 (Studies Of Molecular Structure And The Chemical Bond) Peace 1962 (Fight Against Atomic Testing)
Frederick Sanger Chemistry 1958 (Determining Structure Of Insulin Molecule) Chemistry 1980 (Biochemical Studies Of Nucleic Acids)
Marie Sklodowska Curie Physics 1903 (Discovery Of Radioactivity In Uranium) Chemistry 1911 (Discovery Of The Elements Radium And Polonium)
John Bardeen Physics 1952 (Development Of The Transistor Effect) Physics 1972 (Theory Of Superconductivity)

Linus Pauling, one of only four two-time Nobel laureates, was a co-author of the following study:

Patients with advanced cancers of the stomach, bronchus, colon, ovary or breast were treated with ascorbate. The purpose of the study was to determine if patient survival differed with respect to the organ affected by the cancer.

Cameron, E. and Pauling, L. (1978) Supplemental ascorbate in the supportive treatment of cancer: re-evaluation of prolongation of survival times in terminal human cancer. Proceedings of the National Academy of Sciences USA, 75, 4538-4542.

head(cancersurvival)
##   Survival  Cancer
## 1      124 Stomach
## 2       42 Stomach
## 3       25 Stomach
## 4       45 Stomach
## 5      412 Stomach
## 6       51 Stomach
table(cancersurvival$Cancer)
## 
##   Breast Bronchus    Colon    Ovary  Stomach 
##       11       17       17        6       13

Let’s order the cancers by means:

mn <- sort(tapply(cancersurvival$Survival,
                  cancersurvival$Cancer, mean))
cancersurvival$Cancer <- factor(cancersurvival$Cancer,
                      levels = names(mn),
                      ordered = TRUE)
ggplot(data=cancersurvival , aes(Cancer, Survival)) +
  geom_boxplot()

there are a number of outliers, so we need a transformation. Let’s see:

fit <- aov(Survival~Cancer, data=cancersurvival)
boxcox(fit, lambda=seq(-0.5, 0.5, 0.1))

so again a log transform can work

cancersurvival$logsurvival <- log(cancersurvival$Survival)
ggplot(data=cancersurvival , aes(Cancer, logsurvival)) +
  geom_boxplot()

The table of summary statistics can be done in two ways: based on the original data and using median and IQR or the transformed data using mean and standard deviation.

out <- matrix(0, 5, 3)
colnames(out) <- c("n", "Median", "IQR")
rownames(out) <- as.character(levels(cancersurvival$Cancer))
out[, 1] <- tapply(cancersurvival$Survival, 
                   cancersurvival$Cancer, length)
out[, 2] <- round(tapply(cancersurvival$Survival, 
                   cancersurvival$Cancer, median), 2)
out[, 3] <- round(tapply(cancersurvival$Survival, 
                   cancersurvival$Cancer, IQR), 2)
kable.nice(out)
n Median IQR
Bronchus 17 211.59 209.86
Stomach 13 286.00 346.31
Colon 17 457.41 427.17
Ovary 6 884.33 1098.58
Breast 11 1395.91 1238.97

Now for the test:

fit <- aov(logsurvival~Cancer, data=cancersurvival)
summary(fit)
##             Df Sum Sq Mean Sq F value  Pr(>F)
## Cancer       4  24.49   6.122   4.286 0.00412
## Residuals   59  84.27   1.428

p-value = 0.0041 < 0.05, so we reject the null hypothesis: there are differences in the mean log survival times.

check assumptions:

df <- data.frame(Residuals=resid(fit))
ggplot(data=df, aes(sample=Residuals)) +
           geom_qq() + geom_qq_line()       

ok

round(sort(tapply(cancersurvival$logsurvival,
                  cancersurvival$Cancer, mean))[c(1, 5)], 1)
## Bronchus   Breast 
##      5.0      6.6

ok

and pairwise comparisons:

tuk <- TukeyHSD(fit)
names(tuk[[1]][tuk[[1]][,4]<0.05, 4])
## [1] "Breast-Bronchus" "Breast-Stomach"

A different way to proceed is to use a method that does not require a normal distribution. These are so-called nonparametric methods; a number of them have been developed over the years as alternatives to the standard normal-based methods.

In general, nonparametric methods are based on the ranks of the observations and focus on the median instead of the mean.

Alternative to 1 Sample t

Example: Euro Coins

The data were collected by Herman Callaert at Hasselt University in Belgium. The euro coins were borrowed at a local bank. Two assistants, Sofie Bogaerts and Saskia Litiere, weighed the coins one by one, under laboratory conditions, on a weighing scale of the type Sartorius BP 310.

Say we are told that a one euro coin is supposed to weigh 7.5 grams. Do the data support that claim?

kable.nice(head(euros))
Weight Roll
7.512 1
7.502 1
7.461 1
7.562 1
7.528 1
7.459 1
ggplot(euros, aes(x="", y=Weight)) +
  geom_boxplot() +
  labs(x="")

The boxplot of Weight shows severe outliers, so the usual 1 sample t test won’t work. Unfortunately the log transformation does not work here either. This is not a surprise, by the way, because the outliers are on both sides of the box.

The test that works here is the Wilcoxon Signed Rank Test.

The details are

wilcox.test(euros$Weight, mu=7.5)  
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  euros$Weight
## V = 1595000, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 7.5
  1. Parameter of interest: 1 median
  2. Method of analysis: Wilcoxon Signed Rank test
  3. Assumptions of Method: none
  4. \(\alpha\) = 0.05
  5. Null hypothesis H0: M=7.5 (median weight is 7.5 grams)
  6. Alternative hypothesis Ha: M \(\ne\) 7.5 (median weight is not 7.5 grams)
  7. p value = 0.000
  8. 0.000<0.05, so we reject H0, it seems the median weight is not 7.5 grams.
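The statistic V reported above is simply the sum of the ranks of the absolute deviations \(|x_i - m_0|\) that belong to positive deviations. A hand computation on a tiny made-up sample shows the idea:

```r
x <- c(7.1, 7.35, 7.6, 7.8, 8.0, 8.3)   # made-up weights
m0 <- 7.5
d <- x - m0
# rank the absolute deviations, then add up the ranks of the positive ones
V <- sum(rank(abs(d))[d > 0])
V                                        # 15
unname(wilcox.test(x, mu=m0)$statistic)  # same: 15
```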

Actually, in this data set we could still have used the usual 1-sample t test (also with a p-value of 0.000) because we have a very large sample (n=2000), but in general it is never clear exactly how large a sample needs to be to “overcome” some outliers, so these non-parametric tests are always a safe alternative.

Why not always use the non-parametric test?

If using the t test is sometimes wrong but the Wilcoxon Signed Rank test always works, why not just always use this test and be safe? The answer is that when the data really do come from a normal distribution, the t test has greater power:

mu <- seq(0, 1.5, length=100)
pw <- matrix(0, 100, 2)
colnames(pw) <- c("t test", "Wilcoxon")
B <- 10000
for(i in 1:100) {
  for(j in 1:B) {
    x <- rnorm(10, mu[i])
    pval <- t.test(x, mu=0)$p.value
    if(pval<0.05) pw[i, 1] <- pw[i, 1]+1
    pval <- wilcox.test(x, mu=0)$p.value
    if(pval<0.05) pw[i, 2] <- pw[i, 2]+1
  }
}
pw <- 100*pw/B
df <- data.frame(
        Mean=c(mu, mu),
        Power=c(pw[, 1], pw[, 2]),
        Method=rep(c("t test", "Wilcoxon"), each=100))
ggplot(df, aes(Mean, Power, color=Method)) +
  geom_line()

In real life the power of the nonparametric tests is often almost as high as the power of the standard tests, so they should always be used if there is a question about the normal assumption.

If we wanted a 90% confidence interval for the median (strictly speaking, wilcox.test computes one for the pseudomedian) we could use

round(as.numeric(
  wilcox.test(euros$Weight, 
            conf.int=TRUE, 
            conf.level=0.9)$conf.int), 3)  
## [1] 7.520 7.522

Alternative to two sample t

Just as with the t.test command, the wilcox.test command can also be used to compare two populations; it tests for a difference in location (medians) rather than means:

Example: Euro Coins

Are the weights of the coins in rolls 7 and 8 different?

x <- euros$Weight[euros$Roll==7]
y <- euros$Weight[euros$Roll==8]
wilcox.test(x, y)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  x and y
## W = 35619, p-value = 0.00684
## alternative hypothesis: true location shift is not equal to 0
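The statistic W reported above is the sum of the ranks of the first sample in the combined sample, minus its smallest possible value \(n_1(n_1+1)/2\). A hand computation with tiny made-up samples:

```r
x <- c(7.48, 7.51, 7.53)        # made-up weights, group 1
y <- c(7.50, 7.55, 7.58)        # made-up weights, group 2
r <- rank(c(x, y))              # ranks in the combined sample
W <- sum(r[1:3]) - 3*(3+1)/2    # rank sum of x minus n1(n1+1)/2
W                               # 2
unname(wilcox.test(x, y)$statistic)  # same: 2
```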

Alternative to ANOVA

Example: Euro Coins

Say we want to know whether the coins in the 8 different rolls have the same median weight. The non-parametric alternative to the oneway ANOVA is the Kruskal-Wallis test:

kruskal.test(Weight~factor(Roll), data=euros) 
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Weight by factor(Roll)
## Kruskal-Wallis chi-squared = 97.5, df = 7, p-value < 2.2e-16
  1. Parameters of interest: medians
  2. Method of analysis: Kruskal-Wallis
  3. Assumptions of Method: none
  4. \(\alpha\) = 0.05
  5. Null hypothesis H0: M1=..=M8 (group medians are the same)
  6. Alternative hypothesis Ha: Mi \(\ne\) Mj for some i, j(group medians are not the same)
  7. p value = 0.00
  8. 0.00 < 0.05, so we reject H0, it seems the group medians are not the same
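The Kruskal-Wallis statistic is computed from the group rank sums \(R_i\) in the combined sample: \(H=\frac{12}{N(N+1)}\sum R_i^2/n_i - 3(N+1)\). A hand computation with made-up data:

```r
v <- c(1.2, 3.4, 2.2,  5.1, 4.4, 6.0,  2.8, 3.9, 7.2)   # made-up data
g <- rep(c("A", "B", "C"), each=3)                       # three groups
r <- rank(v)                                             # ranks in combined sample
N <- length(v)
H <- 12/(N*(N+1)) * sum(tapply(r, g, sum)^2/tapply(r, g, length)) - 3*(N+1)
round(H, 4)                                              # 4.6222
unname(kruskal.test(v, factor(g))$statistic)             # same (no ties here)
```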

Example: Cultural Differences in Equipment Use

A US company manufactures equipment that is used in the production of semiconductors. The firm is considering a costly redesign that will improve the performance of its equipment. The performance is characterized as mean time between failures (MTBF). Most of the company's customers are in the USA, Europe and Japan, and there is anecdotal evidence that the Japanese customers typically get better performance than the users in the USA and Europe.

kable.nice(head(culture))
Country MTBF
USA 120.5
USA 127.1
USA 128.1
USA 129.7
USA 130.8
USA 132.4
table(culture$Country)
## 
## Europe  Japan    USA 
##     15     12     20
ggplot(culture, aes(Country, MTBF)) +
  geom_boxplot()

There is a problem with the normal assumption. We can try to fix this with the log transform, but again this does not work.

Because none of the transformations worked we will use the non-parametric Kruskal-Wallis test:

kruskal.test(MTBF~factor(Country), data=culture)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  MTBF by factor(Country)
## Kruskal-Wallis chi-squared = 13.806, df = 2, p-value = 0.001005
  1. Parameters of interest: medians
  2. Method of analysis: Kruskal-Wallis
  3. Assumptions of Method: none
  4. \(\alpha = 0.05\)
  5. Null hypothesis H0: M1 = M2 = M3 (group medians are the same)
  6. Alternative hypothesis Ha: Mi \(\ne\) Mj for some i, j (group medians are not the same)
  7. p value = 0.001
  8. 0.001 < 0.05, so we reject H0, it seems the group medians are not the same, the MTBF is different in different countries

If we had just done the ANOVA, Country would not have been statistically significant (p-value = 0.098), but if you remember to check the normal plot you will see that there is a problem with that analysis.