Data set: euros
Say we are told that a one euro coin is supposed to weigh 7.5 grams. Does the data support that claim?
The boxplot of Weight shows severe outliers, so the usual 1 sample t test won’t work. Unfortunately the log transformation does not work here either:
attach(euros)
head(euros)
## Weight Roll
## 1 7.512 1
## 2 7.502 1
## 3 7.461 1
## 4 7.562 1
## 5 7.528 1
## 6 7.459 1
bplot(Weight)
bplot(log(Weight))
This is not a surprise, by the way, because the outliers are on both sides of the box.
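To see why, here is a small sketch in base R. The data are simulated (a hypothetical stand-in for the euros data, not the real weights), and count_outliers is a helper made up for this illustration. It counts the points beyond the boxplot whiskers on each side, before and after taking logs:

```r
# Hypothetical stand-in for the euros data: weights near 7.5 g with heavy tails
set.seed(1)
w <- 7.5 + 0.03 * rt(2000, df = 3)

# Helper (made up for this sketch): count boxplot outliers on each side of the box,
# using the usual 1.5 * IQR rule as implemented in boxplot.stats
count_outliers <- function(x) {
  out <- boxplot.stats(x)$out
  c(low = sum(out < median(x)), high = sum(out > median(x)))
}

out_raw <- count_outliers(w)       # outliers below and above the box
out_log <- count_outliers(log(w))  # same counts after the log transform
rbind(raw = out_raw, log = out_log)
```

The log transform pulls in large values more than small ones, so it can help with outliers on the right, but it does nothing for outliers on the left.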
So, what now? For this situation we have a set of methods called non-parametric, which make much weaker assumptions than the standard methods; in particular they do not require the normal distribution that the 1 sample t test assumes. The test that works here is the Wilcoxon Signed Rank Test.
The details are
one.sample.wilcoxon(Weight, med.null=7.5)
## p value of test H0: median=7.5 vs. Ha: median <> 7.5: 0.000
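The one.sample.wilcoxon routine is part of the course software. If it is not available, base R has wilcox.test in the stats package, which also runs a Wilcoxon signed rank test when given a single sample. A minimal sketch, again with simulated weights as a hypothetical stand-in for the real data:

```r
# Hypothetical weights: true median 7.52, heavy tails on both sides
set.seed(2)
w <- 7.52 + 0.03 * rt(2000, df = 3)

# Two-sided Wilcoxon signed rank test of H0: median = 7.5
res <- wilcox.test(w, mu = 7.5)
res$p.value
```

The argument mu plays the role of med.null in the course routine.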
Actually, in this data set we could still have used the usual 1-sample t test (also with a p-value of 0.000) because we have a very large sample (n=2000), but in general it is never clear exactly how large a sample needs to be to “overcome” some outliers, so these non-parametric tests are always a safe alternative.
If using the t test is sometimes wrong but the Wilcoxon Signed Rank test always works, why not just always use this test and be safe? The answer is that the t test has larger power when the normal assumption holds.
In real life the power of the nonparametric tests is often almost as high as the power of the standard tests, so they should always be used if there is a question about the normal assumption.
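We can get a feeling for the power of the two tests with a small simulation. This is only a sketch under arbitrary assumptions (sample size 30, a shift of 0.5, 1000 simulated samples per setting); power_sim is a helper written for this illustration, and it uses the base R routines t.test and wilcox.test instead of the course routines:

```r
set.seed(3)

# Estimate power by simulation: the fraction of simulated samples in which
# H0: center = 0 is rejected, when the true center is `shift`
power_sim <- function(rdata, n = 30, shift = 0.5, B = 1000, alpha = 0.05) {
  reject_t <- reject_w <- logical(B)
  for (i in seq_len(B)) {
    x <- rdata(n) + shift
    reject_t[i] <- t.test(x, mu = 0)$p.value < alpha
    reject_w[i] <- wilcox.test(x, mu = 0)$p.value < alpha
  }
  c(t = mean(reject_t), wilcoxon = mean(reject_w))
}

p_norm  <- power_sim(rnorm)                       # normal data
p_heavy <- power_sim(function(n) rt(n, df = 2))   # heavy-tailed data with outliers
rbind(normal = p_norm, heavy_tailed = p_heavy)
```

For normal data the two estimates should come out close to each other, while for the heavy-tailed data the Wilcoxon test should do clearly better.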
The arguments of the one.sample.wilcoxon routine are the same as those of the one.sample.t command. For example, if we wanted to test
H0: median = 7.5 vs. Ha: median > 7.5
we could run
one.sample.wilcoxon(Weight, med.null=7.5, alternative = "greater")
## p value of test H0: median=7.5 vs. Ha: median > 7.5: 0.000
If we wanted a 90% confidence interval for the median we could use
one.sample.wilcoxon(Weight, conf.level=90, ndigit=4)
## A 90% confidence interval for the population
## median is (7.5195, 7.5224)
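For comparison, here is how the one-sided test and the confidence interval could be done with wilcox.test in base R. This is a sketch on simulated weights (a hypothetical stand-in for the real data). Note that in wilcox.test the confidence level is given as a proportion (0.90), not as a percentage as in the course routine, and that the interval is for the pseudomedian, which agrees with the median if the distribution is symmetric:

```r
# Hypothetical weights: true median 7.52
set.seed(4)
w <- 7.52 + 0.03 * rt(2000, df = 3)

# One-sided test of H0: median = 7.5 vs. Ha: median > 7.5
p_greater <- wilcox.test(w, mu = 7.5, alternative = "greater")$p.value
p_greater

# 90% confidence interval (conf.level is a proportion here)
ci <- wilcox.test(w, conf.int = TRUE, conf.level = 0.90)$conf.int
round(ci, 4)
```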
Say we want to know whether the coins in the 8 different rolls have the same average weight. The non-parametric alternative to the oneway ANOVA is the Kruskal-Wallis test:
kruskalwallis(Weight, Roll)
## p value of test of equal means: p = 0.000
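Base R has this test as well, as kruskal.test in the stats package. A minimal sketch on simulated data (8 hypothetical rolls of 250 coins each, with slightly different centers; none of these numbers come from the real data set):

```r
set.seed(5)

# Hypothetical data: 8 rolls of 250 coins, centers differ slightly between rolls
roll <- factor(rep(1:8, each = 250))
weight <- 7.52 + 0.005 * (as.integer(roll) - 4.5) / 3.5 + 0.03 * rt(2000, df = 3)

# Kruskal-Wallis test, response on the left and grouping factor on the right
kw <- kruskal.test(weight ~ roll)
kw$p.value
```

With 8 groups the test statistic is compared to a chi-square distribution with 7 degrees of freedom.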
A US company manufactures equipment that is used in the production of semiconductors. The firm is considering a costly redesign that will improve the performance of its equipment. The performance is characterized as mean time between failures (MTBF). Most of the company's customers are in the USA, Europe and Japan, and there is anecdotal evidence that the Japanese customers typically get better performance than the users in the USA and Europe.
Data: MTBF for randomly selected users in the USA, Europe and Japan.
Data set: culture
attach(culture)
head(culture)
## Country MTBF
## 1 USA 120.5
## 2 USA 127.1
## 3 USA 128.1
## 4 USA 129.7
## 5 USA 130.8
## 6 USA 132.4
table(Country)
## Country
## Europe Japan USA
## 15 12 20
bplot(MTBF, Country)
There is a problem with the normal assumption. We can try to fix this with the log transform, but as we can see this again does not work:
bplot(log(MTBF), Country)
Summary Statistics
Because we could not find a transformation that works, we will use the median and IQR:
stat.table(MTBF, Country, Mean=FALSE)
## Sample Size Median IQR
## USA 20 138.0 16.7
## Europe 15 140.0 24.4
## Japan 12 165.7 27.8
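A table like this can also be put together in base R with tapply. The sketch below uses simulated numbers (hypothetical data with the same group sizes as above, not the real MTBF values), so the medians and IQRs it prints will differ from the table above:

```r
set.seed(6)

# Hypothetical data with the same group sizes as the culture data
country <- factor(rep(c("USA", "Europe", "Japan"), times = c(20, 15, 12)),
                  levels = c("USA", "Europe", "Japan"))
mtbf <- round(140 + 15 * rt(47, df = 3), 1)

# Sample size, median and IQR for each group
tab <- data.frame(
  n      = as.vector(table(country)),
  median = as.vector(tapply(mtbf, country, median)),
  iqr    = as.vector(tapply(mtbf, country, IQR)),
  row.names = levels(country)
)
tab
```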
Because none of the transformations worked we will use the non-parametric Kruskal-Wallis test:
kruskalwallis(MTBF, Country)
## p value of test of equal means: p = 0.00100488597092579
If we had just done the ANOVA, Country would not have been statistically significant (p-value = 0.098), but if you remember to check the normal plot you will see that there is a problem with this analysis.
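The same effect is easy to reproduce with a small made-up data set (entirely hypothetical numbers, chosen for illustration). Group C is clearly shifted upwards, but a single extreme outlier in group A inflates the variance so much that the ANOVA misses the difference, while the Kruskal-Wallis test, which only uses the ranks, still finds it. The sketch uses the base R routines aov and kruskal.test:

```r
# Made-up data: group C is shifted, group A contains one extreme outlier
y <- c(10, 11, 12, 13, 14, 200,   # A, with outlier
       11, 12, 13, 14, 15, 16,    # B
       20, 21, 22, 23, 24, 25)    # C
g <- factor(rep(c("A", "B", "C"), each = 6))

# p value of the oneway ANOVA F test
p_anova <- summary(aov(y ~ g))[[1]][["Pr(>F)"]][1]

# p value of the Kruskal-Wallis test
p_kw <- kruskal.test(y ~ g)$p.value

c(anova = p_anova, kruskal = p_kw)
```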
If the transformations fail in a regression problem things become very tricky, and far beyond the scope of this class. Talk to a professional!