Basic Inferences

In this section we will discuss some of the standard (frequentist) methods in Statistics.

Inference for a Population Mean

The basic R command for inference for a population mean is t.test.

Confidence Intervals

Example: Mothers Cocain Use and Babies Health

Chasnoff and others obtained several measures and responses for newborn babies whose mothers were classified by degree of cocaine use.

The study was conducted in the Perinatal Center for Chemical Dependence at Northwestern University Medical School. The measurement given here is the length of the newborn.

Source: Cocaine abuse during pregnancy: correlation between prenatal care and perinatal outcome
Authors: SN MacGregor, LG Keith, JA Bachicha, and IJ Chasnoff
Obstetrics and Gynecology 1989;74:882-885

Let’s ignore the drug status for the moment and find a \(90\%\) confidence interval for the length of a newborn baby:

t.test(mothers$Length)

## 
##  One Sample t-test
## 
## data:  mothers$Length
## t = 141.83, df = 93, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  48.85519 50.24269
## sample estimates:
## mean of x 
##  49.54894

the t.test command does a number of different things, if we want the interval alone we can use

round(as.numeric(t.test(mothers$Length)$conf.int), 2)

## [1] 48.86 50.24

The assumptions for this method are:

data comes from a normal distribution
or data set is large enough

Let’s check with the normal probability plot:

df <- data.frame(x=mothers$Length)
ggplot(df, aes(sample=x)) +
  stat_qq() + stat_qq_line()

This is fine.

Hypothesis Testing

Example: Resting Period of Monarch Butterflies

Some Monarch butterflies fly early in the day, others somewhat later. After the flight they have to rest for a short period. It has been theorized that the resting period (RIP) of butterflies flying early in the morning is shorter because this is a thermoregulatory mechanism, and it is cooler in the mornings. The mean RIP of all Monarch butterflies is 133 sec. Test the theory at the 10% level.

Research by Anson Lui, Resting period of early and late flying Monarch butterflies Danaeus plexippus, 1997

Parameter: mean \(\mu\)
Method: 1-sample t
Assumptions: normal data or large sample
\(\alpha = 0.1\)
\(H_0: \mu =133\) (RIP is the same for early morning flying butterflies as all others)
\(H_0: \mu <133\) (RIP is the shorter for early morning flying butterflies)

t.test(butterflies$RIP.sec., 
       mu=133, 
       alternative = "less")$p.value

## [1] 0.05583963

\(p = 0.0558 < \alpha = 0.1\), so we reject the null hypothesis
It appears the resting time is somewhat shorter, but the conclusion is not a strong one.

Checking the assumption:

df <- data.frame(x=butterflies$RIP.sec.)
ggplot(df, aes(sample=x)) +
  stat_qq() + stat_qq_line()

looks good.

Power Calculations

The power of a test is the probability to correctly reject the null if the null is false. The power will always depend on an assumed value for the mean.

Let’s say the true mean resting period is 120.7 seconds. What was the probability that Anson’s experiment would have detected this? That is, what is

\[ P_{\mu=120.7}(\text{reject null}) \]

First we need to find the critical region of this test, so we know what reject null actually means. Now if the test is done at the 5% level we reject the null if the p value is less than 0.05. How can we find out for what value of the sample mean this will happen? Let’s do the following:

generate data from a normal distribution with mean \(\mu\), standard deviation as in our data and the same number of observations
find the p value and check whether it is < 0.05
repeat many times and find average.
use trial and error on \(\mu\) until the probability is (about) 0.05:

mean.M <- function(M, true.mu=133) {
  B <- 10000
  pvals <- rep(0, B)
  for(i in 1:B) {
    x <- rnorm(length(butterflies$RIP.sec.), M, 
               sd(butterflies$RIP.sec.))
    pvals[i] <- t.test(x, mu=true.mu, 
       alternative = "less")$p.value
  }
  1-sum(pvals<0.05)/B
}
mean.M(120)

## [1] 0.2399

mean.M(110)

## [1] 0.0066

mean.M(115)

## [1] 0.0532

just about right! We now know that “reject null” means \(\bar{X}<115\).

Now we turn this around: if the true mean where 120.5, what is the probability that we would reject the null, that is get a sample mean of 115 or less? Actually, we can again use the same routine:

mean.M(115, true.mu = 120.7)

## [1] 0.724

Of course here this can also be done analytically: \[ \begin{aligned} &P_{\mu=133}(\text{ reject null }) = \\ &P_{\mu=133}(\bar{X}< \text{crit}) = 0.05 \\ \end{aligned} \] Now \(\bar{X} \sim N(133, s/\sqrt{40})\), so

crit <- qnorm(0.05, 133, sd(butterflies$RIP.sec.)/sqrt(40))
crit

## [1] 124.0558

(Actually the distribution is t, not a normal, but we will ignore this here)

and now

\[ \begin{aligned} &P_{\mu=120.7}(\text{ reject null }) = \\ &P_{\mu=120.7}P(\bar{X}<124.06) =\\ \end{aligned} \]

pnorm(124.06, 120.7, sd(butterflies$RIP.sec.)/sqrt(40))

## [1] 0.7316819

and of course we get the same answer.

There is also an R command to do this:

power.t.test(n=40, 
             delta=133-120.7, 
             sd=sd(butterflies$RIP.sec.),
             alternative = "one.sided", 
             type = "one.sample")

## 
##      One-sample t test power calculation 
## 
##               n = 40
##           delta = 12.3
##              sd = 34.39108
##       sig.level = 0.05
##           power = 0.7182354
##     alternative = one.sided

Usually one wants to study the power for a whole range of values. This is done by drawing a power curve:

x <- seq(110, 133, 0.1)
y <- x
for(i in seq_along(x))
  y[i] <- power.t.test(n=40, 
             delta=133-x[i], 
             sd=sd(butterflies$RIP.sec.),
             alternative = "one.sided", 
             type = "one.sample")$power
df <- data.frame(Mean=x, Power=y)
ggplot(df, aes(Mean, Power)) +
  geom_line(col="blue", size=1.2)

Sample Size

so, if the true mean resting period is 120.7 seconds the power is 72%. What sample size would we need to have a power of 95%

round(power.t.test(power=0.95, 
             delta=133-120.7, 
             sd=sd(butterflies$RIP.sec.),
             alternative = "one.sided", 
             type = "one.sample")$n)

## [1] 86

The sample size issue also arises when we want to find a confidence interval. Here the number that corresponds to the power is the error E, that is half the length of the interval.

Analytically we find

\[ \begin{aligned} &\bar{X} \pm z_{\alpha/2}s/\sqrt{n} \\ &E= z_{\alpha/2}s/\sqrt{n}\\ &n = (\frac{z_{\alpha/2}s}{E})^2 \\ \end{aligned} \] let’s see

I <- t.test(butterflies$RIP.sec.)$conf.int
diff(I)/2

## [1] 10.9988

so a 95% confidence interval has an error of 11. If we wanted an error of 5 we would need a sample size of

round((qnorm(0.975)*sd(butterflies$RIP.sec.)/5)^2)

## [1] 182

Inference for a Population Proportion

The R routine for inference for a proportion (or a probability or a percentage) is binom.test. This implements a method by Clopper and Pearson (1934). This method is exact and has no assumptions.

Note The formula discussed in many introductory statistic courses for the confidence interval is

\[ \hat p \pm z_{\alpha/2}\sqrt{\frac{\hat p (1-\hat p)}{n} } \]

where \(\hat p\) is the proportion of success. This leads to confidence intervals that are now known to be quite wrong, and so this method should not be used anymore. The same is true for the corresponding hypothesis test. This method (actually a slight improvement due to Wilson (1927)) is implemented in R by prop.test.

Example: Jon Kerrichs Coin

The South African Jon Kerrich spent some time in a German prisoner of war camp during world war I. He used his time to flip a coin 10000 times, resulting in 5067 heads.

Test at the 5% level of significance whether 5067 heads in 10000 flips are compatible with a fair coin.

Parameter: proportion \(\pi\)
Method: exact binomial
Assumptions: None
\(\alpha = 0.05\)
\(H_0: \pi = 0.5\) (50% of flips result in “Heads”, coin is fair)
\(H_a: \pi \ne 0.5\) (coin is not fair)

binom.test(x = 5067, n = 10000)$p.value

## [1] 0.1835155

\(p = 0.1835 > \alpha=0.05\), so we fail to reject the null hypothesis.
it appears Jon Kerrich’s coin was indeed fair.

Example: Sample Size for Polling

Say some polling institute wants to conduct a poll for the next election for president. They will then find a 95% confidence interval and they want this interval to have an error of 3 percentage points (aka \(\pm 0.03\)). What sample size do they need?

In American politics the two parties are always very close, so in a poll with n people about n/2 will vote for one or the other party. Let’s do a little trial and error:

n <- 100
diff(as.numeric(binom.test(n/2, n)$conf.int)/2)

## [1] 0.1016789

Now that is to large, so

n <- 200
diff(as.numeric(binom.test(n/2, n)$conf.int)/2)

## [1] 0.07134157

n <- 400
diff(as.numeric(binom.test(n/2, n)$conf.int)/2)

## [1] 0.05009211

n <- 800
diff(as.numeric(binom.test(n/2, n)$conf.int)/2)

## [1] 0.03521797

n <- 1200
diff(as.numeric(binom.test(n/2, n)$conf.int)/2)

## [1] 0.02867679

n <- 1100
diff(as.numeric(binom.test(n/2, n)$conf.int)/2)

## [1] 0.02996843

n <- 1050
diff(as.numeric(binom.test(n/2, n)$conf.int)/2)

## [1] 0.03068294

Although the confidence interval using the formula

\[ \hat p \pm z_{\alpha/2}\sqrt{\frac{\hat p (1-\hat p)}{n} } \]

should not be used for calculating an interval, it it useful when considering the sample size question: above we assume \(\hat{p} \approx 1/2\), so

\[ \begin{aligned} &E = z_{\alpha/2}\sqrt{\frac{\hat p (1-\hat p)}{n}} \\ &n = \frac{z_{\alpha/2}^2\hat p (1-\hat p)}{E^2} = \\ &\frac{1.96^2 \frac12 (1-\frac12)}{0.03^2}= 1067\\ \end{aligned} \] There is a nice feature of this formula: note that \(\hat p (1-\hat p)<\frac14\) for all \(\hat{p}\), so no matter what the actual value of \(\pi\) is, a sample of size \(n=(\frac{z_{\alpha/2}}{2E})^2\) will always be enough. If we need a \(95\%\) interval, then \(z_{\alpha/2}=1.96 \approx 2\), and we have \(n=1/E^2\).

There is something quite remarkable about these formulas!

Example Jon Kerrich’s coin

If the goal is to do a hypothesis test there is a library we can use called pwr:

if the true probability of the coin to come up heads was 0.505, what was the power of the test?

library(pwr)
pwr.p.test(ES.h(0.505, 0.5), n=10000)

## 
##      proportion power calculation for binomial distribution (arcsine transformation) 
## 
##               h = 0.01000017
##               n = 10000
##       sig.level = 0.05
##           power = 0.1700792
##     alternative = two.sided

how many flips would have been needed for him to detect that such a coin was unfair with a probability of \(90\%\)?

pwr.p.test(ES.h(0.505, 0.5), power=0.9)$n

## [1] 105070.7

Note R has the routine power.prop.test built in, but it is for the two-sample proportion problem.

Two Means - 2 Sample Problem

Example Mothers and Cocaine Use.

ggplot(data=mothers, aes(Status, Length)) +
  geom_boxplot()

Let’s say we want to find a \(90\%\) confidence interval for the difference in lengths of babies in Drug Free and Throughout groups:

round(as.numeric(
  t.test(mothers$Length[mothers$Status=="Drug Free"],
       mothers$Length[mothers$Status=="Throughout"],
       conf.level=0.9)$conf.int
), 2)

## [1] 1.83 4.37

There is another version of this problem, called the paired data problem. Say we have data on particiants in a weight loss program, and for each we know the weight before and after the program. In this case one calculates the difference and uses the one sample methods.

Two Proportions

In a study 300 men and 200 women were asked whether they had been dining out during the last week. 187 of the men and, 112 of the women said yes.

Test at the \(5\%\) level whether there is a stat. significant difference between the genders.

round(prop.test(c(187, 112), c(300, 200))$p.value, 3)

## [1] 0.186

this test is based on the normal approximation, so it has the same issues as the one-sample prop test. As an alternative we can use Fisher’s exact test:

fisher.test(cbind(c(187, 300-187), c(112, 200-112)))

## 
##  Fisher's Exact Test for Count Data
## 
## data:  cbind(c(187, 300 - 187), c(112, 200 - 112))
## p-value = 0.1636
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.888221 1.901356
## sample estimates:
## odds ratio 
##   1.299544

There is an important difference between these tests: the first one assumes that we randomly selected 500 people. 300 turned out to be male, 187 of whom said they had been out dining during the last week. In Fisher’s exact test the 300 is assumed to have been fixed before the experiment (aka it is not random).

Correlation

Example UPR Admissions data

What are the correlations between the various variables?

head(upr, 2)

##      ID.Code Year Gender Program.Code Highschool.GPA Aptitud.Verbal
## 1 00C2B4EF77 2005      M          502           3.97            647
## 2 00D66CF1BF 2003      M          502           3.80            597
##   Aptitud.Matem Aprov.Ingles Aprov.Matem Aprov.Espanol IGS Freshmen.GPA
## 1           621          626         672           551 342         3.67
## 2           726          618         718           575 343         2.75
##   Graduated Year.Grad. Grad..GPA Class.Facultad
## 1        Si       2012      3.33           INGE
## 2        No         NA        NA           INGE

Let’s take out the those variables that are either not numerical or not useful for prediction:

x <- upr[, -c(1, 2, 3, 4, 13, 14, 16)]
head(x, 2)

##   Highschool.GPA Aptitud.Verbal Aptitud.Matem Aprov.Ingles Aprov.Matem
## 1           3.97            647           621          626         672
## 2           3.80            597           726          618         718
##   Aprov.Espanol IGS Freshmen.GPA Grad..GPA
## 1           551 342         3.67      3.33
## 2           575 343         2.75        NA

round(cor(x, use = "complete.obs") ,3)

##                Highschool.GPA Aptitud.Verbal Aptitud.Matem Aprov.Ingles
## Highschool.GPA          1.000          0.205         0.163        0.056
## Aptitud.Verbal          0.205          1.000         0.497        0.519
## Aptitud.Matem           0.163          0.497         1.000        0.474
## Aprov.Ingles            0.056          0.519         0.474        1.000
## Aprov.Matem             0.213          0.515         0.821        0.510
## Aprov.Espanol           0.268          0.604         0.408        0.439
## IGS                     0.666          0.734         0.769        0.462
## Freshmen.GPA            0.395          0.339         0.290        0.270
## Grad..GPA               0.385          0.349         0.316        0.280
##                Aprov.Matem Aprov.Espanol   IGS Freshmen.GPA Grad..GPA
## Highschool.GPA       0.213         0.268 0.666        0.395     0.385
## Aptitud.Verbal       0.515         0.604 0.734        0.339     0.349
## Aptitud.Matem        0.821         0.408 0.769        0.290     0.316
## Aprov.Ingles         0.510         0.439 0.462        0.270     0.280
## Aprov.Matem          1.000         0.421 0.712        0.326     0.351
## Aprov.Espanol        0.421         1.000 0.569        0.350     0.372
## IGS                  0.712         0.569 1.000        0.474     0.485
## Freshmen.GPA         0.326         0.350 0.474        1.000     0.750
## Grad..GPA            0.351         0.372 0.485        0.750     1.000

Example: The 1970’s Military Draft

In 1970, Congress instituted a random selection process for the military draft. All 366 possible birth dates were placed in plastic capsules in a rotating drum and were selected one by one. The first date drawn from the drum received draft number one and eligible men born on that date were drafted first. In a truly random lottery there should be no relationship between the date and the draft number.

Question: was the draft was really random?

Here we have two quantitative variables, so we start with the scatter plot:

ggplot(draft, aes(Draft.Number, Day.of.Year)) +
     geom_point() +
     labs(x="Day of Year", y="Draft Number")

and this does not look like there is a problem with independence.

However:

Parameter: Pearson’s correlation coefficient \(\rho\)
Method: Test for Pearson’s correlation coefficient \(\rho\)
Assumptions: relationship is linear and data comes from a bivariate normal.
\(\alpha = 0.05\)
\(H_0: \rho =0\) (no relationship between Day of Year and Draft Number)
\(H_a: \rho \ne 0\) (some relationship between Day of Year and Draft Number)

cor.test(draft$Draft.Number, draft$Day.of.Year)$p.value

## [1] 1.263829e-05

\(p=0.0000 <\alpha = 0.05\), so we reject the null hypothesis,
There is a statistically significant relationship between Day of Year and Draft Number.

not so often needed but sometimes useful, this command also finds confidence intervals for correlations:

round(as.numeric(cor.test(draft$Draft.Number, draft$Day.of.Year)$conf.int), 3)

## [1] -0.321 -0.126

Categorical Data Analysis - Test for Independence

Example: Drownings in Los Angeles

Data is from O’Carroll PW, Alkon E, Weiss B. Drowning mortality in Los Angeles County, 1976 to 1984, JAMA, 1988 Jul 15;260(3):380-3.

Drowning is the fourth leading cause of unintentional injury death in Los Angeles County. They examined data collected by the Los Angeles County Coroner’s Office on drownings that occurred in the county from 1976 through 1984. There were 1587 drownings (1130 males and 457 females) during this nine-year period

kable.nice(drownings)

	Male	Female
Private Swimming Pool	488	219
Bathtub	115	132
Ocean	231	40
Freshwater bodies	155	19
Hottubs	16	15
Reservoirs	32	2
Other Pools	46	14
Pails, basins, toilets	7	4
Other	40	12

Here we have two categorical variables (Method of Drowning and Gender), both categorical. We want to know whether the variables are independent.

Because the gender totals are so different, it makes good sense to look at percentages instead:

drownings$'Male(%)' <- 
  round(drownings$Male/sum(drownings$Male)*100, 1)
drownings$'Female(%)' <- 
  round(drownings$Female/sum(drownings$Female)*100, 1)
kable.nice(drownings[, 3:4])

	Male(%)	Female(%)
Private Swimming Pool	43.2	47.9
Bathtub	10.2	28.9
Ocean	20.4	8.8
Freshwater bodies	13.7	4.2
Hottubs	1.4	3.3
Reservoirs	2.8	0.4
Other Pools	4.1	3.1
Pails, basins, toilets	0.6	0.9
Other	3.5	2.6

Two events A and B are independent if and only if \(P(A \cap B)=P(A)P(B)\). In a contingency table this implies the following. Let’s add the row and column totals to the table:

x <- as.data.frame(drownings[, 1:2])
x$Total <- apply(x, 1, sum)
x <- rbind(x, Total=apply(x, 2, sum))
kable.nice(x)

	Male	Female	Total
Private Swimming Pool	488	219	707
Bathtub	115	132	247
Ocean	231	40	271
Freshwater bodies	155	19	174
Hottubs	16	15	31
Reservoirs	32	2	34
Other Pools	46	14	60
Pails, basins, toilets	7	4	11
Other	40	12	52
Total	1130	457	1587

so if the variables are independent we should have (for example)

P(Male and Private Simming Pool)=P(Male)*P(Private Simming Pool)

But

P(Male and Private Simming Pool)=488/1587
P(Male)=1130/1587
P(Private Simming Pool)=707/1587

\(488/1587 \approx 1130/1587\times 707/1587\)

\(488 \approx 1587 (1130/1587\times 707/1587)\)

Now we call the data the “Observed” and the numbers on the right the Expected, because this is what we would expect if the null hypothesis is true. Finally we need a way to measure how far the O’s are from the E’s. This is done with the famous formula

\[ \chi^2 = \sum \frac{(O-E)^2}{E} \]

and it can be shown that under some mild conditions \(\chi^2\) has a \(\chi^2\) distribution with \((m-1)(k-1)\) degrees of freedom, where m and k are the number of categories in each variable.

So here we find:

E <- cbind(x[1:9, 3]*x[10, 1]/x[10, 3],
           x[1:9, 3]*x[10, 2]/x[10, 3])
chi <- sum((x[1:9, 1:2]-E)^2/E)
c(chi, 1-pchisq(chi, (8-1)*(2-1)))

## [1] 144.4828   0.0000

This is called Pearson’s chi square test of independence. It is done with the command chisq.test and it has the assumption of no expected counts less than 5.

Parameters of interest: measure of association
Method of analysis: chi-square test of independence
Assumptions of Method: all expected counts greater than 5
Type I error probability \(\alpha\)=0.05
H₀: Classifications are independent = there is no difference in the method of drowning between men and women.
H_a: Classifications are dependent = there is some difference in the method of drowning between men and women.

chisq.test(drownings[ 1:2])

## 
##  Pearson's Chi-squared test
## 
## data:  drownings[1:2]
## X-squared = 144.48, df = 8, p-value < 2.2e-16

p = 0.000 < \(\alpha\)=0.05, we reject the null hypothesis, there is a statistically significant difference between men and women and where they drown.

Let’s see whether there is a problem with the assumptions:

round(chisq.test(drownings[, 1:2])$expected, 1)

##                         Male Female
## Private Swimming Pool  503.4  203.6
## Bathtub                175.9   71.1
## Ocean                  193.0   78.0
## Freshwater bodies      123.9   50.1
## Hottubs                 22.1    8.9
## Reservoirs              24.2    9.8
## Other Pools             42.7   17.3
## Pails, basins, toilets   7.8    3.2
## Other                   37.0   15.0

and we see that the expected counts of Pails, basins, toilets and Female is 3.2. In real life this would be considered ok, but it would also be easy to fix:

newmale <- c(drownings[1:7, 1], 7+40)
newfemale <- c(drownings[1:7, 2], 4+12)
newdrown <- cbind(newmale, newfemale)
newdrown

##      newmale newfemale
## [1,]     488       219
## [2,]     115       132
## [3,]     231        40
## [4,]     155        19
## [5,]      16        15
## [6,]      32         2
## [7,]      46        14
## [8,]      47        16

out <- chisq.test(newdrown) 
round(out$expected, 1)

##      newmale newfemale
## [1,]   503.4     203.6
## [2,]   175.9      71.1
## [3,]   193.0      78.0
## [4,]   123.9      50.1
## [5,]    22.1       8.9
## [6,]    24.2       9.8
## [7,]    42.7      17.3
## [8,]    44.9      18.1

round(out$p.value, 4)

## [1] 0