Hypothesis Testing Basics

Basic Idea

For a more detailed discussion of issues arising in hypothesis testing see my page at academic.uprm.edu/wrolke/esma3101/hyptest.html.

For a talk I gave in the Department seminar on the controversy of hypothesis testing see academic.uprm.edu/wrolke/research/WhatiswrongwithHT.pdf

A hypothesis is a statement about a population parameter. In its most general form it is as follows: we have data \(x_1, .., x_n\) from some density \(f(x|\theta)\). We want to test

\[ H_0: \theta \in \Theta_0 \text{ vs }H_a: \theta \notin \Theta_0 \] for some subset of the parameter space \(\Theta_0\).

Example (5.1.1)

\(X\sim \text{Ber}(p)\), \(\Theta=[0,1]\), \(\Theta_0=\{0.5\}\), so we are testing whether \(p=0.5\).

Example (5.1.2)

\(X\sim N(\mu,\sigma)\), \(\Theta =\left\{(x,y): x \in R, y>0\right\}\), \(\Theta_0 =\left\{(x,y): x>100, y>0\right\}\), so we are testing whether \(\mu>100\).


In addition to the null hypothesis we usually (but not always) also write down the alternative hypothesis \(H_a\), usually (but not always) the complement of \(\Theta_0\). So a hypothesis test makes a choice between \(H_0\) and \(H_a\).

A hypothesis that “fixes” the parameter (\(\theta=\theta_0\)) is called simple; otherwise it is called composite (for example \(\theta>\theta_0\)).


A complete hypothesis test should have all of the following parts:

  1. Parameter

  2. Method

  3. Assumptions

  4. Type I error probability \(\alpha\)

  5. Null hypothesis \(H_0\)

  6. Alternative hypothesis \(H_a\)

  7. Test statistic

  8. Rejection region

  9. Conclusion

Example (5.1.3)

Over the last five years the average score in the final exam of a course was 73 points. This semester a class with 27 students used a new textbook, and the mean score in the final was 78.2 points with a standard deviation of 7.1.

Question: did the class using the new textbook do (statistically significantly) better?

For this specific example the complete hypothesis test might look as follows:

  1. Parameter: mean

  2. Method: one-sample t

  3. Assumptions: normal data or large sample

  4. \(\alpha = 0.05\)

  5. \(H_0: \mu = 73\)

  6. \(H_a: \mu > 73\)

  7. Test statistic: \(T = \sqrt{n}\frac{\bar{x}-\mu_0}{s}=3.81\)

  8. Rejection region: reject \(H_0\) if \(T>qt(1-0.05,26) = 1.706\)

  9. Conclusion: \(T = 3.81 > 1.706\), so we reject the null hypothesis; it appears that the mean score in the final is really higher.
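The same calculation is easy to reproduce in R from the summary statistics alone. This is just a sketch of steps 7 to 9; the variable names are mine and not part of the example:

n <- 27; xbar <- 78.2; s <- 7.1; mu0 <- 73; alpha <- 0.05
TS <- sqrt(n)*(xbar-mu0)/s    # test statistic, about 3.81
crit <- qt(1-alpha, n-1)      # critical value, about 1.71
TS > crit                     # TRUE, so we reject the null hypothesis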


In the 9 parts of a hypothesis test, the first 6 (at least in theory) should be done before looking at the data. The following is not allowed: say we did a study of students at the Colegio. We asked them many questions. Afterwards we computed correlation coefficients for all the pairs of variables and found a high correlation between “Income” and “GPA”. Then we carried out a hypothesis test \(H_0: \rho=0\) vs \(H_a: \rho \ne 0\).

The problem here is that this hypothesis test was suggested to us by the data, but (most standard) hypothesis tests only work as advertised if the hypotheses are formulated without consideration of the data.

Going back to our example of the new textbook, we have the following:

Correct: we pick \(H_a: \mu >73\) because we want to prove that the new textbook works better than the old one.

Wrong: we pick \(H_a: \mu >73\) because the sample mean score was 78.2, so if anything the new scores are higher than the old ones.

Type I and Type II errors

When we carry out a hypothesis test in the end we always face one of the following situations:

\[ \begin{array}{ccc} \hline &\text{State of} &\text{Nature}\\ \hline &H_0\text{ is true} &H_0\text{ is false}\\ \hline \text{accept }H_0&\text{OK}&\text{ type II error}\\ \text{reject }H_0&\text{ type I error}&\text{OK} \\ \hline \end{array} \]

In statistics, when we do a hypothesis test we decide ahead of time what type I error probability \(\alpha\) we are willing to accept, and then live with whatever the type II error probability \(\beta\) turns out to be. Generally, if you make \(\alpha\) smaller, thereby reducing the probability of falsely rejecting a true null hypothesis, you make \(\beta\) larger, that is, you increase the probability of falsely accepting a wrong null hypothesis. The only way to make both \(\alpha\) and \(\beta\) smaller is to increase the sample size n.
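Here is a small numerical illustration of this trade-off, using the one-sided test for the mean of a normal distribution with known standard deviation 1 (the same setup as Example (5.1.5) below) and a true mean of 1; the function name beta_prob is mine:

# type II error probability of the test that rejects H0: mu=0 when sqrt(n)*xbar > qnorm(1-alpha)
beta_prob <- function(alpha, n, mu1=1)
  pnorm(qnorm(1-alpha) - sqrt(n)*mu1)
beta_prob(0.05, 10)   # about 0.065
beta_prob(0.01, 10)   # smaller alpha: beta grows to about 0.20
beta_prob(0.05, 20)   # larger n: beta drops to about 0.002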

How do you choose \(\alpha\)? In practice this is a very difficult question. What you need to consider are the consequences of the type I and the type II errors.

Many fields such as psychology, biology etc. have developed standards over the years. The most common one is \(\alpha = 0.05\), and we will use this if nothing else is said.

p-value

In real life a frequentist hypothesis test is usually done by computing the p-value, that is, the probability of observing the data, or something even more extreme, given that the null hypothesis is true.

Example (5.1.4)

p=P(mean score on final exam > 78.2 | \(\mu\) = 73)

1-pnorm(78.2, 73, 7.1/sqrt(27))
## [1] 7.072106e-05

Then the decision is made as follows:

  • \(p < \alpha \rightarrow\) reject \(H_0\)
  • \(p \ge \alpha \rightarrow\) fail to reject \(H_0\)

The advantage of the p-value approach is that in addition to the decision on whether or not to reject the null hypothesis, it also gives us some idea of how close a decision it was. If \(p=0.035<\alpha=0.05\) it was a close decision; if \(p=0.0001<\alpha=0.05\) it was not.

The p-value depends on the observed sample, which is a random variable, so it in turn is a random variable. What is its distribution?

Example (5.1.5)

Say \(X\sim N(\mu,1)\) and we want to test

\[H_0: \mu=0 \text{ vs }H_1: \mu>0\]

Note: this is in fact a general example for data from a normal distribution, because if we have a sample \(X_1,..,X_n\) and want to do inference for \(\mu\), we immediately go to \(\bar{X}\sim N(\mu,\sigma/\sqrt{n})\).

Clearly we should reject the null hypothesis if \(x\) is large, so the rejection region is of the form \(\left\{x : x>c \right\}\).

What is c? It depends on the type I error probability \(\alpha\):

\[ \begin{aligned} &\alpha =P\left(\text{reject }H_0\vert H_0\text{ true} \right) =\\ &P(X>c\vert \mu=0) = \\ &1-P(X<c\vert \mu=0) \\ &P(X<c\vert \mu=0) = 1-\alpha\\ &c=qnorm(1-\alpha)=:z_{\alpha} \end{aligned} \]
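As a quick sanity check, here is a small simulation (my own, with the number of repetitions chosen arbitrarily) verifying that this choice of c gives the desired type I error probability:

B <- 1e5
x <- rnorm(B)              # B observations generated under the null hypothesis, mu = 0
mean(x > qnorm(1-0.05))    # proportion of rejections, should be close to alpha = 0.05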

Now let \(Y\sim N(0,1)\), independent of \(X\), and assume we observe \(X=x\). Then

\[\text{p value } = P(Y>x\vert\mu=0) = 1-\Phi(x)=:p\] p is a number calculated from the data; it is a statistic, and therefore it is random and has a probability distribution. Let \(F_p\) be this distribution, then we find

\[ \begin{aligned} &F_p(t) =P(p<t) =\\ &P\left(1-\Phi(X)<t\right)=\\ &P\left(\Phi(X)>1-t\right) =\\ &P\left(X>\Phi^{-1}(1-t)\right) =\\ &1-P\left(X<\Phi^{-1}(1-t)\right)=\\ &1-\Phi\left(\Phi^{-1}(1-t)\right) =\\ &1-(1-t)=t \end{aligned} \]

and so \(p\sim U[0,1]\).

So if the null hypothesis is true the distribution of the p-value is uniform [0,1]. Notice that in this derivation we made no use of the fact that \(\Phi\) is a normal cdf, except that \(\Phi^{-1}\) exists. So this turns out to be true in general for all continuous distributions.

Note that it is not exactly true for discrete distributions, because then \(P(p=p_0)\ne 0\), but it is usually almost true.
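As a small illustration of the discrete case (my own example): let \(X\sim \text{Bin}(20, 0.5)\) and test \(H_0:\pi=0.5\) vs \(H_a:\pi>0.5\) with p-value \(P(X\ge x_{obs})\). The p-value can take only 21 distinct values, and under the null \(P(p\le 0.05)\) comes out well below 0.05:

x <- 0:20
pvals <- 1-pbinom(x-1, 20, 0.5)   # p value if we observe X = x
probs <- dbinom(x, 20, 0.5)       # probability of each x under the null
sum(probs[pvals <= 0.05])         # P(p value <= 0.05), about 0.021 rather than 0.05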

Let’s do a simulation to see how the p values look when the null is false:

# ggplot2 (for the histograms) and grid (for the 2x2 layout) need to be
# loaded, if they are not already
library(ggplot2)
library(grid)
bw <- 1/50
pushViewport(viewport(layout = grid.layout(2, 2)))
# p values of simulated data sets with true mean 0 (null hypothesis is true)
df <- data.frame(pvalue=1-pnorm(rnorm(1000)))
print(ggplot(df, aes(pvalue)) +
         geom_histogram(aes(y = ..density..),
            color = "black", fill = "white", binwidth = bw) +
        labs(title=expression(mu~"= 0.0")),
  vp=viewport(layout.pos.row=1, layout.pos.col=1))
# the same for true means 0.5, 1.0 and 2.0 (null hypothesis is false)
df <- data.frame(pvalue=1-pnorm(rnorm(1000, 0.5)))
print(ggplot(df, aes(pvalue)) +
         geom_histogram(aes(y = ..density..),
            color = "black", fill = "white", binwidth = bw)+
        labs(title=expression(mu~"= 0.5")),
  vp=viewport(layout.pos.row=1, layout.pos.col=2))
df <- data.frame(pvalue=1-pnorm(rnorm(1000, 1)))
print(ggplot(df, aes(pvalue)) +
         geom_histogram(aes(y = ..density..),
            color = "black", fill = "white", binwidth = bw)+
        labs(title=expression(mu~"= 1.0")),
  vp=viewport(layout.pos.row=2, layout.pos.col=1))
df <- data.frame(pvalue=1-pnorm(rnorm(1000, 2)))
print(ggplot(df, aes(pvalue)) +
         geom_histogram(aes(y = ..density..),
            color = "black", fill = "white", binwidth = bw)+
        labs(title=expression(mu~"= 2.0")),
  vp=viewport(layout.pos.row=2, layout.pos.col=2))

Or we can just calculate it. Say the true mean is \(\mu_1\), and denote by \(\Phi(\cdot;\mu)\) the cdf of a normal with mean \(\mu\) (and standard deviation 1). Then

\[ \begin{aligned} &P(p(Y)<t) = \\ &P(1-\Phi(Y;\mu_0) < t) = \\ &P(\Phi(Y;\mu_0)>1-t) = \\ &P(Y > \Phi^{-1}(1-t;\mu_0)) = \\ &1-P(Y< \Phi^{-1}(1-t;\mu_0)) = \\ &1-\Phi(\Phi^{-1}(1-t;\mu_0);\mu_1) \end{aligned} \] This is the cdf; for the density we would need to differentiate this expression, or we can use numerical differentiation:

# cdf of the p value when the true mean is mu (and the null mean is 0)
ppval <- function(t, mu)
  1-pnorm(qnorm(1-t)-mu)
t <- seq(0.01, 1, length=250)
# density of the p value via numerical (forward difference) differentiation
dpval <- function(t, mu, h=10^-6)
  (ppval(t+h, mu=mu)-ppval(t, mu=mu))/h
df1 <- data.frame(p=c(t, t, t), 
        y=c(dpval(t, 1), dpval(t, 2), dpval(t, 3)),
        mu=factor(rep(1:3, each=250)))    
ggplot(df1, aes(p, y, color=mu)) +
  geom_line(size=1.2)
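In this case the differentiation can also be done in closed form: with \(\mu_0=0\) the chain rule gives the density \(f_p(t)=\phi\left(\Phi^{-1}(1-t)-\mu_1\right)/\phi\left(\Phi^{-1}(1-t)\right)\), where \(\phi\) is the standard normal density. A quick check that this agrees with the numerical derivative above (the function name dpval_exact is mine):

dpval_exact <- function(t, mu)
  dnorm(qnorm(1-t)-mu)/dnorm(qnorm(1-t))
c(dpval(0.1, 1), dpval_exact(0.1, 1))   # the two values should (essentially) agree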

Bayesian Hypothesis Testing

As always in Bayesian statistics we need a prior. In the case of hypothesis testing this becomes especially tricky. To begin with, if we wanted to test the hypothesis \(H_0: \theta=\theta_0\) we would need to start with a prior that puts some probability on the point \(\left\{\theta_0\right\}\), otherwise the hypothesis will always be rejected. If we do that we can simply compute \(P(H_0\) is true | data), and if this probability is smaller than some threshold (similar to the type I error probability) we reject the null hypothesis.
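Concretely, if the prior puts probability \(\lambda\) on \(\left\{\theta_0\right\}\) and spreads the remaining \(1-\lambda\) over the alternative according to a density \(g\), Bayes' theorem gives

\[
P(H_0 \text{ is true}|\pmb{x}) = \frac{\lambda f(\pmb{x}|\theta_0)}{\lambda f(\pmb{x}|\theta_0)+(1-\lambda)\int f(\pmb{x}|\theta)g(\theta)d\theta}
\]

where \(f(\pmb{x}|\theta)\) is the likelihood of the data. We will use exactly this setup in the Lindley paradox example below.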

Instead of the probability P(\(H_0\) is true | data) we often compute the Bayes factor, given as follows: say \(X_1, .., X_n \sim f(x|\theta)\) and \(\theta\sim g\), then the posterior density is

\[g(\theta | \pmb{x} ) \propto L(\theta)g(\theta)\]

The belief about \(H_0\) before the experiment is described by the prior odds ratio (here \(\Theta_1\) denotes the complement of \(\Theta_0\))

\[ \frac{P(\theta \in \Theta_0)}{P(\theta \in \Theta_1)} \]

and belief about \(H_0\) after the experiment is described by the posterior odds ratio \[ \frac{P(\theta \in \Theta_0|\pmb{x})}{P(\theta \in \Theta_1|\pmb{x})} \]

The Bayes factor is then the ratio of the posterior to the prior odds ratios (a ratio of ratios).
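In symbols, with \(\pmb{x}\) the observed data,

\[
B = \frac{P(\theta \in \Theta_0|\pmb{x})/P(\theta \in \Theta_1|\pmb{x})}{P(\theta \in \Theta_0)/P(\theta \in \Theta_1)}
\]

so values of B much larger than 1 favor \(H_0\) and values much smaller than 1 favor \(H_a\).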

(Jeffreys-)Lindley Paradox

Say we have \(X_1,..,X_n \sim N(\mu,1)\) and we want to test \(H_0:\mu=0\) vs \(H_a: \mu \ne 0\). Specifically, say we have n=10 and \(\bar{x}=0.75\), then the p-value is

n <- 10; xbar <- 0.75
2*(1-pnorm(xbar, 0, 1/sqrt(n)))
## [1] 0.01770607

and so we would reject the null hypothesis at the 5% level.

Now for a Bayesian analysis. As a prior let’s use the following: with probability \(\lambda\) the null is true. Otherwise \(\mu\sim N(0, 10)\). We can find the posterior probability that the null is true via simulation:

JL <- function(B=1e6, n=10, xbar=0.75, lambda=1/2) {
  # draw mu from the prior: with probability lambda mu = 0, otherwise mu ~ N(0, 10)
  mu <- c(rep(0, lambda*B), rnorm((1-lambda)*B, 0, 10))
  # draw the sample mean for each mu
  x <- rnorm(B, mu, 1/sqrt(n))
  # keep the runs whose (rounded) sample mean matches the observed one
  mu <- mu[round(x, 2)==xbar]
  # proportion of those runs in which the null hypothesis was true
  sum(mu==0)/length(mu)
}
JL()
## [1] 0.6247849

and so there is a (slight) preference for the null!
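The simulation only approximates the posterior probability: it conditions on the rounded sample mean and has Monte Carlo error. Using the mixture-prior formula above, and the fact that under the alternative \(\bar{x}\sim N(0,\sqrt{10^2+1/n})\), we can also compute it exactly (the function name JLexact is mine):

JLexact <- function(n=10, xbar=0.75, lambda=1/2, tau=10) {
  f0 <- dnorm(xbar, 0, 1/sqrt(n))           # density of xbar under the null, mu = 0
  f1 <- dnorm(xbar, 0, sqrt(tau^2+1/n))     # marginal density of xbar under mu ~ N(0, tau)
  lambda*f0/(lambda*f0 + (1-lambda)*f1)
}
JLexact()   # about 0.66; the simulation estimate differs a little due to Monte Carlo error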

So the answers from a Frequentist and from a Bayesian analysis differ. This is often called the (Jeffreys-)Lindley paradox.

Example (5.1.6)

Here is another example, due to Spanos (2013), from high energy physics that has been cited in the literature: We have a very large number of collisions, n=527135, which are either of type A or type B. We have k=106298 type A collisions. Theory suggests P(A)=0.2. So we want to test

\[H_0:\pi=0.2 \text{ vs }H_1:\pi\ne 0.2\]

  • Frequentist: with such large numbers we can use a test based on the central limit theorem:

\[ \begin{aligned} &p=2P\left(\frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}>\frac{k/n-0.2}{\sqrt{0.2\times0.8/n}}\right) = \\ &2P(Z>2.999) = 0.0027 \end{aligned} \] and so we have strong evidence that the null is false.
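This calculation is easy to verify in R (a sketch; the variable names are mine):

n <- 527135; k <- 106298
z <- (k/n-0.2)/sqrt(0.2*0.8/n)   # standardized test statistic, about 3.0
2*(1-pnorm(z))                   # two-sided p value, about 0.0027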

  • Bayesian

We use the following prior: \(P(H_0)=1/2\), and the other 1/2 is spread evenly over [0,1]. So we have

\[ \begin{aligned} &P(k|H_0) ={{n}\choose{k}}0.2^k0.8^{n-k} \\ &P(k|H_1) =\int_0^1 {{n}\choose{k}}p^k(1-p)^{n-k}dp =\\ &\int_0^1 \frac{n!}{(n-k)!k!}p^k(1-p)^{n-k}dp = \\ &\frac1{(n+1)}\int_0^1 \frac{\Gamma(n+2)}{\Gamma(n-k+1)\Gamma(k+1)}p^{k+1-1}(1-p)^{n-k+1-1}dp = \\ &\frac1{n+1} \end{aligned} \]

so the Bayes factor (the ratio of the posterior odds to the prior odds) is

\[B=\frac{P(H_0|k)/P(H_1|k)}{P(H_0)/P(H_1)}=\frac{P(k|H_0)}{P(k|H_1)}\]

n <- 527135; k <- 106298
p0 <- dbinom(k, n, 0.2)   # P(k|H0)
p1 <- 1/(n+1)             # P(k|H1)
(p0*0.5)/(p1*0.5)         # the prior probabilities 0.5 cancel
## [1] 8.114854

A Bayes factor of 8.1 would be considered some evidence in favor of the null. So again we have a disagreement!


The paradox is often used as an indictment of Frequentist statistics: Bayes and Frequentist disagree, Bayes is right, so Frequentist is wrong!

But who’s to say Bayes is right?

To start, Frequentist statistics and Bayesian statistics focus on different ideas, so there is no reason why they should agree (although we hope they often do).

Often whether or not there is a paradox depends on \(\lambda\):

JL(lambda=0.3)
## [1] 0.4255319

and now the alternative has the higher posterior probability!

Also, it sometimes goes away when the sample size grows:

n <- 20
2*(1-pnorm(xbar, 0, 1/sqrt(n)))
## [1] 0.0007962302
JL(n=n)
## [1] 0.1691542

but there are examples where that doesn’t happen.