In this section we will discuss some of the standard (frequentist) methods in Statistics.
The basic R command for inference for a population mean is t.test.
Chasnoff and others obtained several measures and responses for newborn babies whose mothers were classified by degree of cocaine use.
The study was conducted in the Perinatal Center for Chemical Dependence at Northwestern University Medical School. The measurement given here is the length of the newborn.
Source: Cocaine abuse during pregnancy: correlation between prenatal care and perinatal outcome
Authors: SN MacGregor, LG Keith, JA Bachicha, and IJ Chasnoff
Obstetrics and Gynecology 1989;74:882-885
Let’s ignore the drug status for the moment and find a \(90\%\) confidence interval for the length of a newborn baby:
t.test(mothers$Length)
##
## One Sample t-test
##
## data: mothers$Length
## t = 141.83, df = 93, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 48.85519 50.24269
## sample estimates:
## mean of x
## 49.54894
the t.test command does a number of different things, if we want the interval alone we can use
round(as.numeric(t.test(mothers$Length)$conf.int), 2)
## [1] 48.86 50.24
The assumptions for this method are:
Let’s check with the normal probability plot:
df <- data.frame(x=mothers$Length)
ggplot(df, aes(sample=x)) +
stat_qq() + stat_qq_line()
This is fine.
Some Monarch butterflies fly early in the day, others somewhat later. After the flight they have to rest for a short period. It has been theorized that the resting period (RIP) of butterflies flying early in the morning is shorter because this is a thermoregulatory mechanism, and it is cooler in the mornings. The mean RIP of all Monarch butterflies is 133 sec. Test the theory at the 10% level.
Research by Anson Lui, Resting period of early and late flying Monarch butterflies Danaeus plexippus, 1997
t.test(butterflies$RIP.sec.,
mu=133,
alternative = "less")$p.value
## [1] 0.05583963
Checking the assumption:
df <- data.frame(x=butterflies$RIP.sec.)
ggplot(df, aes(sample=x)) +
stat_qq() + stat_qq_line()
looks good.
The power of a test is the probability to correctly reject the null if the null is false. The power will always depend on an assumed value for the mean.
Let’s say the true mean resting period is 120.7 seconds. What was the probability that Anson’s experiment would have detected this? That is, what is
\[ P_{\mu=120.7}(\text{reject null}) \]
First we need to find the critical region of this test, so we know what reject null actually means. Now if the test is done at the 5% level we reject the null if the p value is less than 0.05. How can we find out for what value of the sample mean this will happen? Let’s do the following:
generate data from a normal distribution with mean \(\mu\), standard deviation as in our data and the same number of observations
find the p value and check whether it is < 0.05
repeat many times and find average.
use trial and error on \(\mu\) until the probability is (about) 0.05:
mean.M <- function(M, true.mu=133) {
B <- 10000
pvals <- rep(0, B)
for(i in 1:B) {
x <- rnorm(length(butterflies$RIP.sec.), M,
sd(butterflies$RIP.sec.))
pvals[i] <- t.test(x, mu=true.mu,
alternative = "less")$p.value
}
1-sum(pvals<0.05)/B
}
mean.M(120)
## [1] 0.2399
mean.M(110)
## [1] 0.0066
mean.M(115)
## [1] 0.0532
just about right! We now know that “reject null” means \(\bar{X}<115\).
Now we turn this around: if the true mean where 120.5, what is the probability that we would reject the null, that is get a sample mean of 115 or less? Actually, we can again use the same routine:
mean.M(115, true.mu = 120.7)
## [1] 0.724
Of course here this can also be done analytically: \[ \begin{aligned} &P_{\mu=133}(\text{ reject null }) = \\ &P_{\mu=133}(\bar{X}< \text{crit}) = 0.05 \\ \end{aligned} \] Now \(\bar{X} \sim N(133, s/\sqrt{40})\), so
crit <- qnorm(0.05, 133, sd(butterflies$RIP.sec.)/sqrt(40))
crit
## [1] 124.0558
(Actually the distribution is t, not a normal, but we will ignore this here)
and now
\[ \begin{aligned} &P_{\mu=120.7}(\text{ reject null }) = \\ &P_{\mu=120.7}P(\bar{X}<124.06) =\\ \end{aligned} \]
pnorm(124.06, 120.7, sd(butterflies$RIP.sec.)/sqrt(40))
## [1] 0.7316819
and of course we get the same answer.
There is also an R command to do this:
power.t.test(n=40,
delta=133-120.7,
sd=sd(butterflies$RIP.sec.),
alternative = "one.sided",
type = "one.sample")
##
## One-sample t test power calculation
##
## n = 40
## delta = 12.3
## sd = 34.39108
## sig.level = 0.05
## power = 0.7182354
## alternative = one.sided
Usually one wants to study the power for a whole range of values. This is done by drawing a power curve:
x <- seq(110, 133, 0.1)
y <- x
for(i in seq_along(x))
y[i] <- power.t.test(n=40,
delta=133-x[i],
sd=sd(butterflies$RIP.sec.),
alternative = "one.sided",
type = "one.sample")$power
df <- data.frame(Mean=x, Power=y)
ggplot(df, aes(Mean, Power)) +
geom_line(col="blue", size=1.2)
so, if the true mean resting period is 120.7 seconds the power is 72%. What sample size would we need to have a power of 95%
round(power.t.test(power=0.95,
delta=133-120.7,
sd=sd(butterflies$RIP.sec.),
alternative = "one.sided",
type = "one.sample")$n)
## [1] 86
The sample size issue also arises when we want to find a confidence interval. Here the number that corresponds to the power is the error E, that is half the length of the interval.
Analytically we find
\[ \begin{aligned} &\bar{X} \pm z_{\alpha/2}s/\sqrt{n} \\ &E= z_{\alpha/2}s/\sqrt{n}\\ &n = (\frac{z_{\alpha/2}s}{E})^2 \\ \end{aligned} \] let’s see
I <- t.test(butterflies$RIP.sec.)$conf.int
diff(I)/2
## [1] 10.9988
so a 95% confidence interval has an error of 11. If we wanted an error of 5 we would need a sample size of
round((qnorm(0.975)*sd(butterflies$RIP.sec.)/5)^2)
## [1] 182
The R routine for inference for a proportion (or a probability or a percentage) is binom.test. This implements a method by Clopper and Pearson (1934). This method is exact and has no assumptions.
Note The formula discussed in many introductory statistic courses for the confidence interval is
\[ \hat p \pm z_{\alpha/2}\sqrt{\frac{\hat p (1-\hat p)}{n} } \]
where \(\hat p\) is the proportion of success. This leads to confidence intervals that are now known to be quite wrong, and so this method should not be used anymore. The same is true for the corresponding hypothesis test. This method (actually a slight improvement due to Wilson (1927)) is implemented in R by prop.test.
The South African Jon Kerrich spent some time in a German prisoner of war camp during world war I. He used his time to flip a coin 10000 times, resulting in 5067 heads.
Test at the 5% level of significance whether 5067 heads in 10000 flips are compatible with a fair coin.
binom.test(x = 5067, n = 10000)$p.value
## [1] 0.1835155
Say some polling institute wants to conduct a poll for the next election for president. They will then find a 95% confidence interval and they want this interval to have an error of 3 percentage points (aka \(\pm 0.03\)). What sample size do they need?
In American politics the two parties are always very close, so in a poll with n people about n/2 will vote for one or the other party. Let’s do a little trial and error:
n <- 100
diff(as.numeric(binom.test(n/2, n)$conf.int)/2)
## [1] 0.1016789
Now that is to large, so
n <- 200
diff(as.numeric(binom.test(n/2, n)$conf.int)/2)
## [1] 0.07134157
n <- 400
diff(as.numeric(binom.test(n/2, n)$conf.int)/2)
## [1] 0.05009211
n <- 800
diff(as.numeric(binom.test(n/2, n)$conf.int)/2)
## [1] 0.03521797
n <- 1200
diff(as.numeric(binom.test(n/2, n)$conf.int)/2)
## [1] 0.02867679
n <- 1100
diff(as.numeric(binom.test(n/2, n)$conf.int)/2)
## [1] 0.02996843
n <- 1050
diff(as.numeric(binom.test(n/2, n)$conf.int)/2)
## [1] 0.03068294
Although the confidence interval using the formula
\[ \hat p \pm z_{\alpha/2}\sqrt{\frac{\hat p (1-\hat p)}{n} } \]
should not be used for calculating an interval, it it useful when considering the sample size question: above we assume \(\hat{p} \approx 1/2\), so
\[ \begin{aligned} &E = z_{\alpha/2}\sqrt{\frac{\hat p (1-\hat p)}{n}} \\ &n = \frac{z_{\alpha/2}^2\hat p (1-\hat p)}{E^2} = \\ &\frac{1.96^2 \frac12 (1-\frac12)}{0.03^2}= 1067\\ \end{aligned} \] There is a nice feature of this formula: note that \(\hat p (1-\hat p)<\frac14\) for all \(\hat{p}\), so no matter what the actual value of \(\pi\) is, a sample of size \(n=(\frac{z_{\alpha/2}}{2E})^2\) will always be enough. If we need a \(95\%\) interval, then \(z_{\alpha/2}=1.96 \approx 2\), and we have \(n=1/E^2\).
There is something quite remarkable about these formulas!
If the goal is to do a hypothesis test there is a library we can use called pwr:
library(pwr)
pwr.p.test(ES.h(0.505, 0.5), n=10000)
##
## proportion power calculation for binomial distribution (arcsine transformation)
##
## h = 0.01000017
## n = 10000
## sig.level = 0.05
## power = 0.1700792
## alternative = two.sided
pwr.p.test(ES.h(0.505, 0.5), power=0.9)$n
## [1] 105070.7
Note R has the routine power.prop.test built in, but it is for the two-sample proportion problem.
ggplot(data=mothers, aes(Status, Length)) +
geom_boxplot()
Let’s say we want to find a \(90\%\) confidence interval for the difference in lengths of babies in Drug Free and Throughout groups:
round(as.numeric(
t.test(mothers$Length[mothers$Status=="Drug Free"],
mothers$Length[mothers$Status=="Throughout"],
conf.level=0.9)$conf.int
), 2)
## [1] 1.83 4.37
There is another version of this problem, called the paired data problem. Say we have data on particiants in a weight loss program, and for each we know the weight before and after the program. In this case one calculates the difference and uses the one sample methods.
In a study 300 men and 200 women were asked whether they had been dining out during the last week. 187 of the men and, 112 of the women said yes.
Test at the \(5\%\) level whether there is a stat. significant difference between the genders.
round(prop.test(c(187, 112), c(300, 200))$p.value, 3)
## [1] 0.186
this test is based on the normal approximation, so it has the same issues as the one-sample prop test. As an alternative we can use Fisher’s exact test:
fisher.test(cbind(c(187, 300-187), c(112, 200-112)))
##
## Fisher's Exact Test for Count Data
##
## data: cbind(c(187, 300 - 187), c(112, 200 - 112))
## p-value = 0.1636
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.888221 1.901356
## sample estimates:
## odds ratio
## 1.299544
There is an important difference between these tests: the first one assumes that we randomly selected 500 people. 300 turned out to be male, 187 of whom said they had been out dining during the last week. In Fisher’s exact test the 300 is assumed to have been fixed before the experiment (aka it is not random).
What are the correlations between the various variables?
head(upr, 2)
## ID.Code Year Gender Program.Code Highschool.GPA Aptitud.Verbal
## 1 00C2B4EF77 2005 M 502 3.97 647
## 2 00D66CF1BF 2003 M 502 3.80 597
## Aptitud.Matem Aprov.Ingles Aprov.Matem Aprov.Espanol IGS Freshmen.GPA
## 1 621 626 672 551 342 3.67
## 2 726 618 718 575 343 2.75
## Graduated Year.Grad. Grad..GPA Class.Facultad
## 1 Si 2012 3.33 INGE
## 2 No NA NA INGE
Let’s take out the those variables that are either not numerical or not useful for prediction:
x <- upr[, -c(1, 2, 3, 4, 13, 14, 16)]
head(x, 2)
## Highschool.GPA Aptitud.Verbal Aptitud.Matem Aprov.Ingles Aprov.Matem
## 1 3.97 647 621 626 672
## 2 3.80 597 726 618 718
## Aprov.Espanol IGS Freshmen.GPA Grad..GPA
## 1 551 342 3.67 3.33
## 2 575 343 2.75 NA
round(cor(x, use = "complete.obs") ,3)
## Highschool.GPA Aptitud.Verbal Aptitud.Matem Aprov.Ingles
## Highschool.GPA 1.000 0.205 0.163 0.056
## Aptitud.Verbal 0.205 1.000 0.497 0.519
## Aptitud.Matem 0.163 0.497 1.000 0.474
## Aprov.Ingles 0.056 0.519 0.474 1.000
## Aprov.Matem 0.213 0.515 0.821 0.510
## Aprov.Espanol 0.268 0.604 0.408 0.439
## IGS 0.666 0.734 0.769 0.462
## Freshmen.GPA 0.395 0.339 0.290 0.270
## Grad..GPA 0.385 0.349 0.316 0.280
## Aprov.Matem Aprov.Espanol IGS Freshmen.GPA Grad..GPA
## Highschool.GPA 0.213 0.268 0.666 0.395 0.385
## Aptitud.Verbal 0.515 0.604 0.734 0.339 0.349
## Aptitud.Matem 0.821 0.408 0.769 0.290 0.316
## Aprov.Ingles 0.510 0.439 0.462 0.270 0.280
## Aprov.Matem 1.000 0.421 0.712 0.326 0.351
## Aprov.Espanol 0.421 1.000 0.569 0.350 0.372
## IGS 0.712 0.569 1.000 0.474 0.485
## Freshmen.GPA 0.326 0.350 0.474 1.000 0.750
## Grad..GPA 0.351 0.372 0.485 0.750 1.000
In 1970, Congress instituted a random selection process for the military draft. All 366 possible birth dates were placed in plastic capsules in a rotating drum and were selected one by one. The first date drawn from the drum received draft number one and eligible men born on that date were drafted first. In a truly random lottery there should be no relationship between the date and the draft number.
Question: was the draft was really random?
Here we have two quantitative variables, so we start with the scatter plot:
ggplot(draft, aes(Draft.Number, Day.of.Year)) +
geom_point() +
labs(x="Day of Year", y="Draft Number")
and this does not look like there is a problem with independence.
However:
cor.test(draft$Draft.Number, draft$Day.of.Year)$p.value
## [1] 1.263829e-05
not so often needed but sometimes useful, this command also finds confidence intervals for correlations:
round(as.numeric(cor.test(draft$Draft.Number, draft$Day.of.Year)$conf.int), 3)
## [1] -0.321 -0.126
Data is from O’Carroll PW, Alkon E, Weiss B. Drowning mortality in Los Angeles County, 1976 to 1984, JAMA, 1988 Jul 15;260(3):380-3.
Drowning is the fourth leading cause of unintentional injury death in Los Angeles County. They examined data collected by the Los Angeles County Coroner’s Office on drownings that occurred in the county from 1976 through 1984. There were 1587 drownings (1130 males and 457 females) during this nine-year period
kable.nice(drownings)
Male | Female | |
---|---|---|
Private Swimming Pool | 488 | 219 |
Bathtub | 115 | 132 |
Ocean | 231 | 40 |
Freshwater bodies | 155 | 19 |
Hottubs | 16 | 15 |
Reservoirs | 32 | 2 |
Other Pools | 46 | 14 |
Pails, basins, toilets | 7 | 4 |
Other | 40 | 12 |
Here we have two categorical variables (Method of Drowning and Gender), both categorical. We want to know whether the variables are independent.
Because the gender totals are so different, it makes good sense to look at percentages instead:
drownings$'Male(%)' <-
round(drownings$Male/sum(drownings$Male)*100, 1)
drownings$'Female(%)' <-
round(drownings$Female/sum(drownings$Female)*100, 1)
kable.nice(drownings[, 3:4])
Male(%) | Female(%) | |
---|---|---|
Private Swimming Pool | 43.2 | 47.9 |
Bathtub | 10.2 | 28.9 |
Ocean | 20.4 | 8.8 |
Freshwater bodies | 13.7 | 4.2 |
Hottubs | 1.4 | 3.3 |
Reservoirs | 2.8 | 0.4 |
Other Pools | 4.1 | 3.1 |
Pails, basins, toilets | 0.6 | 0.9 |
Other | 3.5 | 2.6 |
Two events A and B are independent if and only if \(P(A \cap B)=P(A)P(B)\). In a contingency table this implies the following. Let’s add the row and column totals to the table:
x <- as.data.frame(drownings[, 1:2])
x$Total <- apply(x, 1, sum)
x <- rbind(x, Total=apply(x, 2, sum))
kable.nice(x)
Male | Female | Total | |
---|---|---|---|
Private Swimming Pool | 488 | 219 | 707 |
Bathtub | 115 | 132 | 247 |
Ocean | 231 | 40 | 271 |
Freshwater bodies | 155 | 19 | 174 |
Hottubs | 16 | 15 | 31 |
Reservoirs | 32 | 2 | 34 |
Other Pools | 46 | 14 | 60 |
Pails, basins, toilets | 7 | 4 | 11 |
Other | 40 | 12 | 52 |
Total | 1130 | 457 | 1587 |
so if the variables are independent we should have (for example)
P(Male and Private Simming Pool)=P(Male)*P(Private Simming Pool)
But
P(Male and Private Simming Pool)=488/1587
P(Male)=1130/1587
P(Private Simming Pool)=707/1587
\(488/1587 \approx 1130/1587\times 707/1587\)
\(488 \approx 1587 (1130/1587\times 707/1587)\)
Now we call the data the “Observed” and the numbers on the right the Expected, because this is what we would expect if the null hypothesis is true. Finally we need a way to measure how far the O’s are from the E’s. This is done with the famous formula
\[ \chi^2 = \sum \frac{(O-E)^2}{E} \]
and it can be shown that under some mild conditions \(\chi^2\) has a \(\chi^2\) distribution with \((m-1)(k-1)\) degrees of freedom, where m and k are the number of categories in each variable.
So here we find:
E <- cbind(x[1:9, 3]*x[10, 1]/x[10, 3],
x[1:9, 3]*x[10, 2]/x[10, 3])
chi <- sum((x[1:9, 1:2]-E)^2/E)
c(chi, 1-pchisq(chi, (8-1)*(2-1)))
## [1] 144.4828 0.0000
This is called Pearson’s chi square test of independence. It is done with the command chisq.test and it has the assumption of no expected counts less than 5.
chisq.test(drownings[ 1:2])
##
## Pearson's Chi-squared test
##
## data: drownings[1:2]
## X-squared = 144.48, df = 8, p-value < 2.2e-16
Let’s see whether there is a problem with the assumptions:
round(chisq.test(drownings[, 1:2])$expected, 1)
## Male Female
## Private Swimming Pool 503.4 203.6
## Bathtub 175.9 71.1
## Ocean 193.0 78.0
## Freshwater bodies 123.9 50.1
## Hottubs 22.1 8.9
## Reservoirs 24.2 9.8
## Other Pools 42.7 17.3
## Pails, basins, toilets 7.8 3.2
## Other 37.0 15.0
and we see that the expected counts of Pails, basins, toilets and Female is 3.2. In real life this would be considered ok, but it would also be easy to fix:
newmale <- c(drownings[1:7, 1], 7+40)
newfemale <- c(drownings[1:7, 2], 4+12)
newdrown <- cbind(newmale, newfemale)
newdrown
## newmale newfemale
## [1,] 488 219
## [2,] 115 132
## [3,] 231 40
## [4,] 155 19
## [5,] 16 15
## [6,] 32 2
## [7,] 46 14
## [8,] 47 16
out <- chisq.test(newdrown)
round(out$expected, 1)
## newmale newfemale
## [1,] 503.4 203.6
## [2,] 175.9 71.1
## [3,] 193.0 78.0
## [4,] 123.9 50.1
## [5,] 22.1 8.9
## [6,] 24.2 9.8
## [7,] 42.7 17.3
## [8,] 44.9 18.1
round(out$p.value, 4)
## [1] 0