Many classic methods of analysis (t tests, F tests, etc.) assume that the data come from a normal distribution. What can we do if that assumption fails? There are two ways to proceed:
A data transformation is any mathematical function applied to the data. Sometimes such a transformation can be used to make the distribution more normal:
Consider the data set brainsize, which has the body weights (in kg) and the brain weights (in grams) of 62 mammals:
kable.nice(brainsize)
Animal | body.wt.kg | brain.wt.g |
---|---|---|
African elephant | 6654.000 | 5712.00 |
African giant pouched rat | 1.000 | 6.60 |
Arctic Fox | 3.385 | 44.50 |
Arctic ground squirrel | 0.920 | 5.70 |
Asian elephant | 2547.000 | 4603.00 |
Baboon | 10.550 | 179.50 |
Big brown bat | 0.023 | 0.30 |
Brazilian tapir | 160.000 | 169.00 |
Cat | 3.300 | 25.60 |
Chimpanzee | 52.160 | 440.00 |
Chinchilla | 0.425 | 6.40 |
Cow | 465.000 | 423.00 |
Desert hedgehog | 0.550 | 2.40 |
Donkey | 187.100 | 419.00 |
Eastern American mole | 0.075 | 1.20 |
Echidna | 3.000 | 25.00 |
European hedgehog | 0.785 | 3.50 |
Galago | 0.200 | 5.00 |
Genet | 1.410 | 17.50 |
Giant armadillo | 60.000 | 81.00 |
Giraffe | 529.000 | 680.00 |
Goat | 27.660 | 115.00 |
Golden hamster | 0.120 | 1.00 |
Gorilla | 207.000 | 406.00 |
Gray seal | 85.000 | 325.00 |
Gray wolf | 36.330 | 119.50 |
Ground squirrel | 0.101 | 4.00 |
Guinea pig | 1.040 | 5.50 |
Horse | 521.000 | 655.00 |
Jaguar | 100.000 | 157.00 |
Kangaroo | 35.000 | 56.00 |
Lesser short-tailed shrew | 0.005 | 0.14 |
Little brown bat | 0.010 | 0.25 |
Man | 62.000 | 1320.00 |
Mole rat | 0.122 | 3.00 |
Mountain beaver | 1.350 | 8.10 |
Mouse | 0.023 | 0.40 |
Musk shrew | 0.048 | 0.33 |
N. American opossum | 1.700 | 6.30 |
Nine-banded armadillo | 3.500 | 10.80 |
Okapi | 250.000 | 490.00 |
Owl monkey | 0.480 | 15.50 |
Patas monkey | 10.000 | 115.00 |
Phanlanger | 1.620 | 11.40 |
Pig | 192.000 | 180.00 |
Rabbit | 2.500 | 12.10 |
Raccoon | 4.288 | 39.20 |
Rat | 0.280 | 1.90 |
Red fox | 4.235 | 50.40 |
Rhesus monkey | 6.800 | 179.00 |
Rock hyrax (Hetero. b) | 0.750 | 12.30 |
Rock hyrax (Procavia hab) | 3.600 | 21.00 |
Roe deer | 83.000 | 98.20 |
Sheep | 55.500 | 175.00 |
Slow loris | 1.400 | 12.50 |
Star nosed mole | 0.060 | 1.00 |
Tenrec | 0.900 | 2.60 |
Tree hyrax | 2.000 | 12.30 |
Tree shrew | 0.104 | 2.50 |
Vervet | 4.190 | 58.00 |
Water opossum | 3.500 | 3.90 |
Yellow-bellied marmot | 4.050 | 17.00 |
Let’s say we want to find a \(95\%\) confidence interval for the mean body weight. If we want to use the t.test method we need to check normality:
library(grid)
plt1 <- ggplot(data=brainsize, aes(sample=body.wt.kg)) +
  geom_qq() + geom_qq_line()
plt2 <- ggplot(data=brainsize, aes(1, body.wt.kg)) +
  geom_boxplot()
pushViewport(viewport(layout = grid.layout(1, 2)))
print(plt2,
  vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plt1,
  vp=viewport(layout.pos.row=1, layout.pos.col=2))
and this is clearly non-normal. However
brainsize$logbody <- log(brainsize$body.wt.kg)
plt1 <- ggplot(data=brainsize, aes(sample=logbody)) +
  geom_qq() + geom_qq_line()
plt2 <- ggplot(data=brainsize, aes(1, logbody)) +
  geom_boxplot()
pushViewport(viewport(layout = grid.layout(1, 2)))
print(plt2,
  vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plt1,
  vp=viewport(layout.pos.row=1, layout.pos.col=2))
shows that \(\log\)(body.wt.kg) is at least approximately normally distributed. So now
round(as.numeric(t.test(brainsize$logbody)$conf.int), 3)
## [1] 0.567 2.163
but this is a confidence interval for the mean of \(\log\)(body.wt.kg); we want one for the mean of body.wt.kg. Now
\[ 1-\alpha=P \left( L( \log X ) < \theta < U(\log X) \right) =\\ P \left( \exp \{ L( \log X )\} < \exp \{\theta\} < \exp \{U(\log X)\} \right) \\ \] and so the question is whether
\[ E[X] = \mu = \exp \{\theta\} = \exp \{E[\log X]\} \] and in general the answer is no.
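In fact, a standard result for the lognormal distribution makes the discrepancy explicit (this is a general fact, not something computed from the brainsize data): if \(\log X \sim N(\theta, \sigma^2)\), then

\[ E[X] = \exp \{ \theta + \sigma^2/2 \} > \exp \{\theta\} = \exp \{E[\log X]\} \]

so exponentiating the endpoints yields a confidence interval for \(e^\theta\), which is the median of \(X\), not its mean.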
Above we used a log transform. In principle any function might work, and there is even a way to pick the best from a list: In 1964 Box and Cox suggested a family of transformations of the form
\[ T_{\lambda }(x)=\left\{ \begin{array}{cc} \frac{x^{\lambda }-1}{\lambda } & \lambda \neq 0 \\ \log x & \lambda =0 \end{array} \right. \] Notice that this is continuous in \(\lambda\): writing \(x^{\lambda}=e^{\lambda \log x}\) and applying l'Hospital's rule in \(\lambda\) gives \(\lim_{\lambda \rightarrow 0} T_\lambda (x)=\lim_{\lambda \rightarrow 0} (\log x) e^{\lambda \log x}=\log x=T_0(x)\). Up to shift and scale the family includes \(1/x\), \(\sqrt{x}\), \(x^k\) etc. To pick the best use
library(MASS)
fit <- lm(body.wt.kg ~ 1, data=brainsize)
boxcox(fit, lambda=seq(-0.25, 0.25, 0.01))
the vertical lines give a confidence interval for \(\lambda\), and any value inside this interval is acceptable. It seems that for our data \(\lambda=0\), that is the log transform, is indeed appropriate.
Note that the boxcox command requires a fit object generated by lm or aov, but a single vector can be handled as above by fitting an intercept-only model.
Name | First Nobel Prize | Second Nobel Prize |
---|---|---|
Linus Pauling | Chemistry 1954 (Studies Of Molecular Structure And The Chemical Bond) | Peace 1962 (Fight Against Atomic Testing) |
Frederick Sanger | Chemistry 1958 (Determining Structure Of Insulin Molecule) | Chemistry 1980 (Biochemical Studies Of Nucleic Acids) |
Marie Sklodowska Curie | Physics 1903 (Research On Radiation Phenomena) | Chemistry 1911 (Discovery Of The Elements Radium And Polonium) |
John Bardeen | Physics 1956 (Development Of The Transistor Effect) | Physics 1972 (Theory Of Superconductivity) |
Patients with advanced cancers of the stomach, bronchus, colon, ovary or breast were treated with ascorbate. The purpose of the study was to determine if patient survival differed with respect to the organ affected by the cancer.
Cameron, E. and Pauling, L. (1978). Supplemental ascorbate in the supportive treatment of cancer: re-evaluation of prolongation of survival times in terminal human cancer. Proceedings of the National Academy of Sciences USA, 75, 4538-4542.
head(cancersurvival)
## Survival Cancer
## 1 124 Stomach
## 2 42 Stomach
## 3 25 Stomach
## 4 45 Stomach
## 5 412 Stomach
## 6 51 Stomach
table(cancersurvival$Cancer)
##
## Breast Bronchus Colon Ovary Stomach
## 11 17 17 6 13
Let’s order the cancers by means:
mn <- sort(tapply(cancersurvival$Survival,
cancersurvival$Cancer, mean))
cancersurvival$Cancer <- factor(cancersurvival$Cancer,
levels = unique(names(mn)),
ordered = TRUE)
ggplot(data=cancersurvival , aes(Cancer, Survival)) +
geom_boxplot()
There are a number of outliers, so we need a transformation. Let's see:
fit <- aov(Survival~Cancer, data=cancersurvival)
boxcox(fit, lambda=seq(-0.5, 0.5, 0.1))
so again the log transform should work:
cancersurvival$logsurvival <- log(cancersurvival$Survival)
ggplot(data=cancersurvival , aes(Cancer, logsurvival)) +
geom_boxplot()
For the table of summary statistics we show the sample sizes together with the means and standard deviations of the survival times in each group. Note that with outliers present one might instead prefer the median and IQR of the original data, or the mean and standard deviation of the log-transformed data:
out <- matrix(0, 5, 3)
colnames(out) <- c("n", "Mean", "SD")
rownames(out) <- as.character(levels(cancersurvival$Cancer))
out[, 1] <- tapply(cancersurvival$Survival,
  cancersurvival$Cancer, length)
out[, 2] <- round(tapply(cancersurvival$Survival,
  cancersurvival$Cancer, mean), 2)
out[, 3] <- round(tapply(cancersurvival$Survival,
  cancersurvival$Cancer, sd), 2)
kable.nice(out)
Cancer | n | Mean | SD |
---|---|---|---|
Bronchus | 17 | 211.59 | 209.86 |
Stomach | 13 | 286.00 | 346.31 |
Colon | 17 | 457.41 | 427.17 |
Ovary | 6 | 884.33 | 1098.58 |
Breast | 11 | 1395.91 | 1238.97 |
Now for the test:
fit <- aov(logsurvival~Cancer, data=cancersurvival)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## Cancer 4 24.49 6.122 4.286 0.00412
## Residuals 59 84.27 1.428
The p-value is 0.0041, so we reject the null hypothesis: there are differences in the mean log survival times.
Check the assumptions:
df <- data.frame(Residuals=resid(fit))
ggplot(data=df, aes(sample=Residuals)) +
geom_qq() + geom_qq_line()
ok
round(sort(tapply(cancersurvival$logsurvival,
cancersurvival$Cancer, mean))[c(1, 5)], 1)
## Bronchus Breast
## 5.0 6.6
so the mean log survival ranges from 5.0 (Bronchus) to 6.6 (Breast).
and pairwise comparisons:
tuk <- TukeyHSD(fit)
names(tuk[[1]][tuk[[1]][,4]<0.05, 4])
## [1] "Breast-Bronchus" "Breast-Stomach"
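Because the ANOVA was run on the log scale, a Tukey difference \(d\) in mean log survival corresponds to a multiplicative factor \(e^d\) between typical survival times on the original scale. A minimal sketch of this back-transform, using simulated data (the full cancersurvival data set is not reproduced here):

```r
# simulated toy data (NOT the cancersurvival data): two groups whose
# log survival times are normal with different means
set.seed(1)
df <- data.frame(
  y = exp(rnorm(30, rep(c(5, 6), each = 15), 1)),
  grp = factor(rep(c("a", "b"), each = 15)))
fit.toy <- aov(log(y) ~ grp, data = df)
tuk.toy <- TukeyHSD(fit.toy)
# back-transform: estimated ratio of typical survival times, b vs a
exp(tuk.toy$grp[, "diff"])
```

The same idea applies to the Tukey confidence limits: exponentiating lwr and upr gives an interval for the ratio.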
A different way to proceed is to use a method that does not require a normal distribution. These are the so-called nonparametric methods. A number of them have been developed over the years as alternatives to the standard normal-based methods.
In general nonparametric methods are based on the ranks of the observations and focus on the median instead of the mean.
The data were collected by Herman Callaert at Hasselt University in Belgium. The euro coins were borrowed at a local bank. Two assistants, Sofie Bogaerts and Saskia Litiere, weighed the coins one by one, in laboratory conditions, on a Sartorius BP 310 weighing scale.
Say we are told that a one euro coin is supposed to weigh 7.5 grams. Do the data support that claim?
kable.nice(head(euros))
Weight | Roll |
---|---|
7.512 | 1 |
7.502 | 1 |
7.461 | 1 |
7.562 | 1 |
7.528 | 1 |
7.459 | 1 |
ggplot(euros, aes(y=Weight, x="")) +
  geom_boxplot() +
  labs(x="")
The boxplot of Weight shows severe outliers, so the usual one-sample t test won't work. Unfortunately the log transformation does not help here either. This is not a surprise, by the way, because the outliers are on both sides of the box.
The test that works here is the Wilcoxon Signed Rank Test.
The details are
wilcox.test(euros$Weight, mu=7.5)
##
## Wilcoxon signed rank test with continuity correction
##
## data: euros$Weight
## V = 1595000, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 7.5
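To see what the statistic V reported above actually is, here is a minimal sketch on a small made-up sample (not the euros data), assuming no ties: take the deviations from \(\mu\), rank their absolute values, and sum the ranks that belong to positive deviations.

```r
# hypothetical toy weights, chosen so that no ties occur
x <- c(7.51, 7.48, 7.53, 7.46, 7.55)
mu <- 7.5
d <- x - mu
# rank the absolute deviations; V is the sum of the ranks of the
# positive deviations
V <- sum(rank(abs(d))[d > 0])
V
# wilcox.test reports the same statistic
as.numeric(wilcox.test(x, mu = mu)$statistic)
```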
Actually, in this data set we could still have used the usual one-sample t test (also with a p-value that is essentially 0) because we have a very large sample (n=2000). In general, though, it is never clear exactly how large a sample needs to be to "overcome" some outliers, so these nonparametric tests are always a safe alternative.
If using the t test is sometimes wrong but the Wilcoxon Signed Rank Test always works, why not just always use this test and be safe? The answer is that the t test has higher power:
mu <- seq(0, 1.5, length=100)
pw <- matrix(0, 100, 2)
colnames(pw) <- c("t test", "Wilcoxon")
B <- 10000
for(i in 1:100) {
for(j in 1:B) {
x <- rnorm(10, mu[i])
pval <- t.test(x, mu=0)$p.value
if(pval<0.05) pw[i, 1] <- pw[i, 1]+1
pval <- wilcox.test(x, mu=0)$p.value
if(pval<0.05) pw[i, 2] <- pw[i, 2]+1
}
}
pw <- 100*pw/B
df <- data.frame(
Mean=c(mu, mu),
Power=c(pw[, 1], pw[, 2]),
Method=rep(c("t test", "Wilcoxon"), each=100))
ggplot(df, aes(Mean, Power, color=Method)) +
geom_line()
In real life the power of the nonparametric tests is often almost as high as the power of the standard tests, so they should always be used if there is a question about the normal assumption.
If we wanted a 90% confidence interval for the median we could use
round(as.numeric(
wilcox.test(euros$Weight,
conf.int=TRUE,
conf.level=0.9)$conf.int), 3)
## [1] 7.520 7.522
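The point estimate this interval is centered on is the Hodges-Lehmann estimator: the median of all pairwise (Walsh) averages \((x_i+x_j)/2\), \(i \le j\). A toy sketch with made-up numbers (not the euros data):

```r
x <- c(7.51, 7.48, 7.53, 7.46, 7.55)
# all Walsh averages (x[i] + x[j])/2 with i <= j
walsh <- outer(x, x, "+")/2
hl <- median(walsh[lower.tri(walsh, diag = TRUE)])
hl
# compare: wilcox.test(x, conf.int = TRUE)$estimate reports this
# value as the "(pseudo)median"
```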
just as with the t.test command, the wilcox.test command can also be used to compare two populations, although it compares their locations (medians) rather than their means:
Do the weights of the coins in rolls 7 and 8 differ in location?
x <- euros$Weight[euros$Roll==7]
y <- euros$Weight[euros$Roll==8]
wilcox.test(x, y)
##
## Wilcoxon rank sum test with continuity correction
##
## data: x and y
## W = 35619, p-value = 0.00684
## alternative hypothesis: true location shift is not equal to 0
Say we want to know whether the coins in the 8 different rolls have the same typical (median) weight. The nonparametric alternative to the oneway ANOVA is the Kruskal-Wallis test:
kruskal.test(Weight~factor(Roll), data=euros)
##
## Kruskal-Wallis rank sum test
##
## data: Weight by factor(Roll)
## Kruskal-Wallis chi-squared = 97.5, df = 7, p-value < 2.2e-16
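Like the Wilcoxon tests, the Kruskal-Wallis test works on the ranks of the pooled data. A minimal sketch of the statistic on made-up numbers (no ties), using the textbook formula \(H=\frac{12}{N(N+1)}\sum_i n_i\left(\bar{R}_i-\frac{N+1}{2}\right)^2\):

```r
# toy data (not the euros data), two groups, no ties
x <- c(1.2, 3.4, 2.2, 5.1, 4.4, 6.0)
g <- factor(c("a", "a", "a", "b", "b", "b"))
r <- rank(x)                   # ranks of the pooled sample
N <- length(x)
Rbar <- tapply(r, g, mean)     # average rank in each group
n <- tapply(r, g, length)
H <- 12/(N*(N+1)) * sum(n * (Rbar - (N+1)/2)^2)
H
as.numeric(kruskal.test(x, g)$statistic)  # same value when no ties
```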
A US company manufactures equipment that is used in the production of semiconductors. The firm is considering a costly redesign that will improve the performance of its equipment. The performance is characterized as mean time between failures (MTBF). Most of the company's customers are in the USA, Europe and Japan, and there is anecdotal evidence that the Japanese customers typically get better performance than the users in the USA and Europe.
kable.nice(head(culture))
Country | MTBF |
---|---|
USA | 120.5 |
USA | 127.1 |
USA | 128.1 |
USA | 129.7 |
USA | 130.8 |
USA | 132.4 |
table(culture$Country)
##
## Europe Japan USA
## 15 12 20
ggplot(culture, aes(Country, MTBF)) +
geom_boxplot()
There is a problem with the normal assumption. We can try to fix this with the log transform, but again this does not work.
Because none of the transformations worked we will use the nonparametric Kruskal-Wallis test:
kruskal.test(MTBF~factor(Country), data=culture)
##
## Kruskal-Wallis rank sum test
##
## data: MTBF by factor(Country)
## Kruskal-Wallis chi-squared = 13.806, df = 2, p-value = 0.001005
If we had just done the ANOVA, Country would not have been statistically significant (p-value = 0.098), but if you remember to check the normal plot you will see that there is a problem with that analysis.