Hypothesis Testing

First off, hypothesis testing is a rather complicated business. We will here discuss just one method for developing hypothesis tests. Also, there are many issues involved in testing that are not of a mathematical nature. Unless you have previously taken a course on Statistics, I highly recommend that you read the discussion in ESMA 3101 - Hypothesis Testing.

We recently discussed the issue of people misinterpreting confidence intervals. The same is true to an even greater extent for hypothesis testing. In fact, hypothesis testing is quite controversial in some applied fields. For a discussion of the reasons why, see a talk that I gave in the Math department seminar some time ago entitled What’s wrong with Hypothesis Testing?

General Problem Statement

A hypothesis test is a general methodology for checking whether some data is compatible with some theory. In many (but not all) cases this amounts to a test for some parameter(s).

We have the following general setup: we have data \(x_1, .., x_n\) from some density \(f(x|\theta)\). We want to test

\[ H_0: \theta \in \Theta_0 \text{ vs }H_a: \theta \notin \Theta_0 \] for some subset of the parameter space \(\Theta_0\).

Example: Coin Tossing

We flip a coin 1000 times and get 549 heads. Is this a fair coin?

Here the parameter of interest \(\theta\) is the probability of heads in one flip of this coin. Each flip is a Bernoulli trial, so we have

\[ f(x|\theta)=\theta^x(1-\theta)^{1-x}, x=0,1 \]

A fair coin has \(\theta=0.5\), so \(\Theta_0=\left\{0.5\right\}\) and we can write the hypotheses as

\[ H_0: \theta =0.5 \text{ vs }H_a: \theta \ne 0.5 \]

Of course this is a standard statistics problem, with a standard solution. Let’s also consider two examples where the correct method is not obvious:

Example: Two Poisson means

A medical researcher carries out the following experiment: each day he gives 60 fruit flies either poison A or poison B. In the evening he counts how many flies are dead. He finds:

kable.nice(cbind(poisons[1:5, ], poisons[56:60, ]))
   Dead Poison   Dead Poison
1    14      A     14      B
2    10      A      7      B
3    10      A     12      B
4    10      A      9      B
5     8      A     13      B

He wants to know whether the two poisons are equally effective.

Strictly speaking the number of dead flies follows a Binomial distribution with n trials and success probability \(\pi\), but because n is large and \(\pi\) is small it is OK to treat them as Poisson rv’s with rates \(\lambda_A\) and \(\lambda_B\). Then the hypotheses are

\[ H_0: \lambda_A = \lambda_B \text{ vs }H_a: \lambda_A \ne \lambda_B \] so here \(\Theta_0=\left\{(x,y):0\le x=y \right\}\)
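
As a quick check that the Poisson approximation is reasonable here, we can compare the two sets of probabilities directly. A minimal sketch, assuming for illustration a rate of about 10 dead flies per day (roughly what the table above suggests):

# Binomial(60, 10/60) vs Poisson(10) probabilities, side by side
k <- 0:20
round(cbind(k=k, Binomial=dbinom(k, 60, 10/60), Poisson=dpois(k, 10)), 3)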

Example: Beta parameters

Below is a sample from the Beta\((\alpha, \beta)\) distribution and we wish to test

\[ H_0: \alpha = \beta \text{ vs }H_a: \alpha > \beta \] so here we have \(\Theta_0=\left\{(x,y):0\le x = y \right\}\)

kable.nice(matrix(beta.sample[c(1:10, 191:200)], ncol=4))

Critical Region of a Test (CR)

This is the set of possible observations \((x_1, .., x_n)\) such that, if this were the observed data, we would reject the null hypothesis.

Example: Coin Tossing

We want to test

\[ H_0: \theta =0.5 \text{ vs }H_a: \theta \ne 0.5 \]

The mle of \(\pi\) is \(\hat{p} = x/n\) (here 549/1000 = 0.549). Under \(H_0\), \(\hat{p}\) should be close to 0.5, so a sensible critical region would be of the form

\[ |\hat{p}-0.5|>c \] for some number c; c is often called the critical value.

Example: Poisson means

The mle of a Poisson rate \(\lambda\) is the sample mean, so a test could be based on

\[ |\bar{X}_A - \bar{X}_B|>c \]

Example: Beta parameters

Let \(\hat{\alpha}, \hat{\beta}\) be the mle’s of \(\alpha\) and \(\beta\); then a critical region could be

\[ \hat{\alpha} - \hat{\beta} >c \]

Type I error, level of test \(\alpha\)

In the examples above we have chosen a general shape for the CR. But we still need to find the critical value c. This is done by first choosing \(\alpha\), the probability of the type I error, which is the error of rejecting \(H_0\) even though it is true.

Example: Coin Tossing

From the central limit theorem we know that if \(X_1,..,X_n \sim Ber(\pi)\), then approximately

\[ \sqrt{n}\frac{\hat{p}-\pi}{\sqrt{\pi(1-\pi)}} \sim N(0, 1) \]

so we find

\[ \begin{aligned} & \alpha = P_{\pi=0.5}(\text{reject }H_0) = \\ &P_{\pi=0.5}(|\hat{p}-\pi|>c )= \\ &1-P_{\pi=0.5}(|\hat{p}-\pi|\le c) = \\ & 1-P_{\pi=0.5}(-c \le \hat{p}-\pi \le c) = \\ & 1-P_{\pi=0.5}\left(-\frac{\sqrt{n}c}{\sqrt{\pi(1-\pi)}} \le \sqrt{n}\frac{\hat{p}-\pi}{\sqrt{\pi(1-\pi)}} \le \frac{\sqrt{n}c}{\sqrt{\pi(1-\pi)}}\right) = \\ &1-\left(2\Phi\left(\frac{\sqrt{n}c}{\sqrt{\pi(1-\pi)}}\right)-1\right)\\ &\Phi\left(\frac{\sqrt{n}c}{\sqrt{\pi(1-\pi)}}\right)=1-\alpha/2 \\ &\frac{\sqrt{n}c}{\sqrt{\pi(1-\pi)}}=z_{\alpha/2} \\ &c=z_{\alpha/2}\sqrt{\pi(1-\pi)/n} \end{aligned} \]

If we use \(\alpha=0.05\) we find

cc <- qnorm(1-0.05/2)*sqrt(0.5*(1-0.5)/1000)  # c = z_(alpha/2)*sqrt(pi(1-pi)/n), alpha=0.05
round(cc, 3)
## [1] 0.031

and so we would reject the null if

\[ \begin{aligned} &|\hat{p}-0.5|>0.031 \\ &x/n<0.5-0.031 \text{ or } x/n>0.5+0.031 \\ &x<469 \text{ or } x>531 \end{aligned} \] We got x=549, so we would indeed reject the null.

Example: Poisson means

A CR could be constructed by noting that according to the central limit theorem the sample means have approximate normal distributions.
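
Here is a minimal sketch of what that might look like, using the pooled sample mean as a plug-in estimate of the common rate under \(H_0\) (poisons is the data frame shown above; the cutoff below is just an illustration):

x <- poisons$Dead[poisons$Poison=="A"]
y <- poisons$Dead[poisons$Poison=="B"]
n <- length(x)
lambda.hat <- mean(c(x, y))          # plug-in estimate of the common rate under H0
# under H0, Xbar_A - Xbar_B is approximately N(0, sqrt(2*lambda/n))
cc <- qnorm(1-0.05/2)*sqrt(2*lambda.hat/n)
abs(mean(x) - mean(y)) > cc          # TRUE would mean: reject H0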

Example: Beta parameters

We don’t even know how to find the mle’s analytically, so this approach won’t work here.

The p value

The idea of the p value is as follows: if we repeated the exact same experiment, how likely would it be to observe an outcome as extreme as, or more extreme than, the one we just saw, assuming the null hypothesis is true?

If this probability is small (say \(< \alpha\)), then either we just observed something very rare, or our assumption that the null hypothesis is true is wrong, and we should reject it!

Example: Coin Tossing

We got 549 heads in 1000 flips, so the p value would be the probability to flip a fair coin 1000 times and get 549 or more heads. Actually it would be \(2P_{\pi=0.5}(X>549)\) because under our alternative too few heads would also result in a rejection of the null.

round(2*(1-pbinom(549, 1000, 0.5)), 4)
## [1] 0.0017

\(0.0017<0.05\), and so we do reject the null

Power of a Test

The power of a test is the probability to reject the null when it is false.

Of course if our alternative is (say) \(H_a: \pi \ne 0.5\), then the power of a test is actually a function of the possible parameter values under the alternative hypothesis.

Example: Coin Tossing

Our test is: reject \(H_0\) if \(x<469 \text{ or } x>531\). So the power of the test is the probability to do so, computed as a function of the true value of \(\pi\):

cr <- c(0:468, 532:1000)   # all counts x that lead to rejection
p <- seq(0.4, 0.6, length=250)
power <- p
for(i in 1:250) {
  power[i] <- sum(dbinom(cr, 1000, p[i]))   # P(reject H0) if the true probability is p[i]
}
ggplot(data.frame(p=p, y=power),
       aes(p, y)) +
  geom_line(color="blue") +
  labs(x=expression(pi), y="Power")

So we see that if \(\pi\) differs from 0.5 by more than 0.05, we will almost certainly be able to detect this with 1000 flips.

Also note that the power curve includes the case \(\pi=0.5\), the null hypothesis. Of course if we test at the 5% level, the power there is 0.05. Actually, because the data comes from a discrete distribution, it is \(\le 0.05\).
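
We can verify this by adding up the binomial probabilities over the critical region (a quick check; the sum should come out a little below 0.05):

cr <- c(0:468, 532:1000)    # the critical region from above
sum(dbinom(cr, 1000, 0.5))  # true type I error probability at pi = 0.5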


The power of a test has many uses:

  • decide whether an experiment is worth doing: if it has low power, it might be a waste of time.
  • find what sample size would be required to have a reasonably powerful test (see the sketch below).
  • decide which of several methods is best, i.e. which has the highest power.
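
As an illustration of the second point, here is a minimal sketch that finds, by brute force, roughly how many flips would be needed so that the 5% level test derived above detects \(\pi=0.55\) with power of at least 90%. The alternative 0.55, the 90% target and the function name power.b are all just illustrative choices:

power.b <- function(n, p1=0.55, alpha=0.05) {
  cc <- qnorm(1-alpha/2)*sqrt(0.25/n)       # critical value from the CLT, as derived above
  # counts x with x/n < 0.5-cc or x/n > 0.5+cc lead to rejection
  cr <- c(0:(ceiling(n*(0.5-cc))-1), (floor(n*(0.5+cc))+1):n)
  sum(dbinom(cr, n, p1))                    # exact power at pi = p1
}
n <- 100
while(power.b(n) < 0.9) n <- n + 10
n                                           # smallest n (in steps of 10) with power >= 90%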

Next we will discuss a general method for deriving hypothesis tests called the

Likelihood Ratio test

The likelihood ratio test statistic is defined by

\[ \Lambda(\boldsymbol{x}) = \frac{ \max_{\Theta_0} L(\theta)}{\max L(\theta)} \] where L is the likelihood function.

From the definition it is clear that \(0 \le \Lambda \le 1\).

In the denominator the maximum is taken over all values of \(\theta\), so this is just like finding the maximum likelihood estimator!

Let’s find out what we have in the three examples:

Example: Coin Tossing

we have previously found

\[ L(\theta) = \theta^y(1-\theta)^{n-y} \] where y was the number of successes and n the number of trials. Also, here \(\Theta_0=\left\{0.5\right\}\), so in the numerator the maximum is taken over just one value. The mle is \(y/n\) so

\[ \begin{aligned} &\Lambda(\boldsymbol{x}) =\frac{0.5^y(1-0.5)^{n-y}}{(y/n)^y(1-(y/n))^{n-y}} = \\ & \left(\frac{n}{2y}\right)^y \left(\frac{n}{2(n-y)}\right)^{n-y} \\ \end{aligned} \]

Note that under the null \(\pi=0.5\), so \(y \approx n/2\), which means \(n/(2y) \approx 1\), \(n/(2(n-y)) \approx 1\) and so \(\Lambda \approx 1\). Therefore a sensible test rejects the null if \(\Lambda\) is much smaller than 1.
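
Just to get a sense of the magnitude: for our data (y = 549, n = 1000) we can evaluate \(\Lambda\) directly. A quick sketch, done on the log scale to avoid numerical problems:

n <- 1000; y <- 549
log.Lambda <- y*log(n/(2*y)) + (n-y)*log(n/(2*(n-y)))
exp(log.Lambda)   # much smaller than 1 (it equals exp(-9.62/2), about 0.008)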

Example: Poisson rates

Here we find

\[ f(x_1,..,x_n,y_1,..,y_n;\lambda_A, \lambda_B) = \prod_i \frac{\lambda_A^{x_i}}{x_i!}e^{-\lambda_A}\prod_i \frac{\lambda_B^{y_i}}{y_i!}e^{-\lambda_B} \]

\[ \begin{aligned} &l(\lambda_A, \lambda_B) = \log f = \\ &\sum x_i \log \lambda_A - n\lambda_A + \sum y_i \log \lambda_B - n\lambda_B +K \\ &\frac{dl}{d\lambda_A} = \frac{\sum x_i}{\lambda_A} -n = 0 \end{aligned} \]

and so \(\hat{\lambda}_A=\bar{X}\). Clearly also \(\hat{\lambda}_B=\bar{Y}\).

Under H0 \(\lambda_A=\lambda_B=:\lambda\), and we find

\[ l(\lambda) = \sum (x_i+y_i) \log \lambda - 2n\lambda \] and so the value that maximizes the numerator is \(\hat{\hat{\lambda}} = \frac{\sum (x_i+y_i)}{2n}\)
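
In R these three estimates are just sample means (a quick sketch, using the poisons data frame from above):

x <- poisons$Dead[poisons$Poison=="A"]
y <- poisons$Dead[poisons$Poison=="B"]
c(lambda.A = mean(x), lambda.B = mean(y), lambda.null = mean(c(x, y)))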

Example: Beta parameters

the beta density is given by

\[ f(x;\alpha,\beta)=\frac{\Gamma (\alpha+\beta) }{\Gamma (\alpha) \Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1} \] so the log likelihood function is given by

\[ l(\alpha,\beta)= n\log \Gamma (\alpha+\beta) - n\log \Gamma (\alpha) - n\log \Gamma (\beta) +\\ (\alpha-1) \sum \log x_i + (\beta -1) \sum \log(1-x_i) \]

Now setting the derivatives equal to zero does not yield equations that can be solved analytically, so we need to use a numerical method:

beta.mle <- function(x) {
  # negative log-likelihood of the full model (alpha and beta free)
  log.like.denom <- function(par) {
    -sum(log(dbeta(x, par[1], par[2])))
  } 
  mle <- optim(c(1, 1), log.like.denom)$par
  # negative log-likelihood under H0: alpha = beta
  log.like.num <- function(par) {
    -sum(log(dbeta(x, par, par)))
  } 
  mle.null <- optim(1, log.like.num)$par
  c(mle, mle.null)
}
beta.mle(beta.sample)
## [1] 2.097394 1.934793 1.998989

Wilks’ Theorem

The usefulness of the likelihood ratio test statistic comes from the following famous theorem due to Wilks (1938):

Under some regularity conditions we find that, if the null hypothesis is true, then for large samples approximately

\[ -2 \log \Lambda(\boldsymbol{X}) \sim \chi^2(p) \]

where p is the difference between the number of free parameters in the model and the number of free parameters under the null hypothesis.

Therefore we reject the null at the \(\alpha\) level of significance if

\[ -2 \log \Lambda(\boldsymbol{X}) > \text{qchisq}(1-\alpha, p) \]

Notice that it is usually more convenient to calculate

\[ \begin{aligned} &-2 \log \Lambda(\boldsymbol{X}) = \\ & -2 \log \left\{ \frac{L(\hat{\hat{\theta}})}{L(\hat{\theta})} \right\} = \\ & 2 \left\{ l(\hat{\theta}) - l(\hat{\hat{\theta}})\right\} \\ \end{aligned} \] where \(\hat{\theta}\) is the (unrestricted) mle and \(\hat{\hat{\theta}}\) is the mle under the restriction \(\theta \in \Theta_0\).

We said before that \(0 \le \Lambda \le 1\), so \(-2\log \Lambda\ge0\). Also, under \(H_0\), \(\Lambda \approx 1\), so \(-2 \log \Lambda \approx 0\). Therefore it makes sense to reject the null if \(-2 \log \Lambda\) is large. Now we know exactly how large, namely \(\text{qchisq}(1-\alpha, p)\).
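
We can check Wilks’ theorem with a small simulation: generate many data sets under the null hypothesis of a fair coin, compute \(2 \left\{ l(\hat{\theta}) - l(\hat{\hat{\theta}})\right\}\) for each, and compare the quantiles to those of a \(\chi^2(1)\) distribution. A minimal sketch; the two rows should match reasonably well:

B <- 10000
y <- rbinom(B, 1000, 0.5)    # B data sets generated under H0
# 2*(l(theta.hat) - l(theta.hat.hat)); the binomial coefficient cancels in the difference
lrt <- 2*(dbinom(y, 1000, y/1000, log=TRUE) - dbinom(y, 1000, 0.5, log=TRUE))
round(rbind(simulated = quantile(lrt, c(0.5, 0.9, 0.95, 0.99)),
            chisq.1 = qchisq(c(0.5, 0.9, 0.95, 0.99), 1)), 2)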

Example: Coin Tossing

\[ \begin{aligned} &l(\theta)= y\log \theta +(n-y)\log(1-\theta)\\ &-2 \log \Lambda(\boldsymbol{x}) = \\ &2 \left\{ l(\hat{\theta}) - l(\hat{\hat{\theta}})\right\}=\\ & 2 \left\{ y\log (y/n) +(n-y)\log(1-y/n) - \left[ y\log 0.5 +(n-y)\log(1-0.5) \right] \right\}=\\ & 2 \left\{ y\log \left( \frac{2y}{n} \right) + (n-y) \log \left( \frac{2(n-y)}{n} \right) \right\} \end{aligned} \]

n <- 1000
y <- 549
lrt <- 2*(y*log(2*y/n)+(n-y)*log(2*(n-y)/n))
lrt
## [1] 9.619432
qchisq(1-0.05, 1)
## [1] 3.841459

and so we reject the null hypothesis; this does not seem to be a fair coin.

We can also easily find the p value of the test:

1-pchisq(lrt, 1)
## [1] 0.001925293

Example: Poisson rates

Again we could do the math to find \(-2 \log \Lambda\), but we can also just do it numerically:

x <- poisons$Dead[poisons$Poison=="A"]
y <- poisons$Dead[poisons$Poison=="B"]
lrt <- 2*( sum(log(dpois(x, mean(x)))) +
           sum(log(dpois(y, mean(y)))) -
          sum(log(dpois(c(x, y), mean(c(x, y)))))   
)
lrt
## [1] 0.4525616
1-pchisq(lrt, 1)
## [1] 0.501121

and so we fail to reject the null hypothesis at the \(5\%\) level; there is no evidence that the two poisons differ in effectiveness.

Example: Beta parameters

p <- beta.mle(beta.sample)
p
## [1] 2.097394 1.934793 1.998989
lrt <- 2*( 
  sum(log(dbeta(beta.sample, p[1], p[2]))) -
  sum(log(dbeta(beta.sample, p[3], p[3])))
)
lrt
## [1] 1.681222
1-pchisq(lrt, 1)
## [1] 0.1947622

and here we fail to reject the null hypothesis.

Confidence Intervals vs Hypothesis Tests

At first these appear to be two entirely different topics. They are however very closely related. In general any hypothesis test can be turned into (“inverted”) a confidence interval and vice versa!

Example

\(X \sim N(\mu, \sigma)\), \(\sigma\) known. Then a \((1-\alpha)100\%\) confidence interval for \(\mu\) is given by

\(x \pm z_{\alpha/2}\sigma\)

Let’s say we want to test

\(H_0: \mu=\mu_0 \text{ vs. }H_A:\mu \ne \mu_0\)

and we will reject the null if we observe an x such that the above confidence interval does not include \(\mu_0\). So we have the rejection region

\(\text{CR} = \left\{x: x-z_{\alpha/2}\sigma>\mu_0 \text{ or }x+z_{\alpha/2}\sigma<\mu_0 \right\}\)

Now \[ \begin{aligned} &P(\text{reject }H_0 | H_0 \text{ true}) = \\ &P_{\mu=\mu_0}( X-z_{\alpha/2}\sigma>\mu_0 \text{ or }X+z_{\alpha/2}\sigma<\mu_0) = \\ &1-P_{\mu=\mu_0}( X-z_{\alpha/2}\sigma<\mu_0 < X+z_{\alpha/2}\sigma) = \\ &1-P_{\mu=\mu_0}( -z_{\alpha/2}\sigma<X-\mu_0 < z_{\alpha/2}\sigma) = \\ &1-P_{\mu=\mu_0}( -z_{\alpha/2}< \frac{X-\mu_0}{\sigma} < z_{\alpha/2}) = \\ &1-(2\Phi(z_{\alpha/2})-1) = 2-2(1-\alpha/2)=\alpha \end{aligned} \] and so we have a level \(\alpha\) test.
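
A small simulation confirms the level (a quick sketch; the values of \(\mu_0\), \(\sigma\) and the number of runs are arbitrary):

mu0 <- 5; sigma <- 2; alpha <- 0.05
B <- 1e5
x <- rnorm(B, mu0, sigma)                         # data generated under H0
reject <- abs(x - mu0) > qnorm(1-alpha/2)*sigma   # mu0 falls outside x +/- z*sigma
mean(reject)                                      # should be close to alpha = 0.05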

In fact a general strategy for finding confidence intervals is to first find a corresponding hypothesis test (which is often easier), and then to invert the test.

Example

Let’s look at another example to illustrate the important point here. Say we have \(X_1, .., X_n \sim Ber(\pi)\) and we want to find a \(100(1-\alpha)\%\) CI for \(\pi\). To do so we first need to find a test for \[ H_0: \pi=\pi_0 \text{ vs. } H_a: \pi \ne \pi_0 \] As we saw before, a test with a rejection region of the form \(|x-n \pi_0|>c\), where x is the number of successes, is reasonable.

Let’s find c:

\[ \begin{aligned} &\alpha = P_{\pi=\pi_0}(\text{reject }H_0 ) =\\ &P_{\pi=\pi_0}(|X-n \pi_0|>c ) = \\ &1-P_{\pi=\pi_0}(n \pi_0-c \le X \le n \pi_0+c) = \\ &1-\left[\text{pbinom}(n\pi_0+c,n,\pi_0)-\text{pbinom}(n\pi_0-c-1,n,\pi_0)\right] \end{aligned} \] Because the Binomial is a discrete random variable it won’t be possible to find a c such that we have equality. Instead we will find the smallest c such that

\(\text{pbinom}(n\pi_0+c,n,\pi_0)-\text{pbinom}(n\pi_0-c-1,n,\pi_0)\ge 1-\alpha\):

find.A <- function(p, n=100, alpha=0.05) {
  # find the smallest critical value cv such that the
  # acceptance region has probability of at least 1-alpha
  k <- 0
  repeat {
    k <- k + 1
    if(pbinom(n*p+k, n, p) - 
       pbinom(n*p-k-1, n, p) > 1 - alpha) 
            break
  }
  cv <- k
  x <- 0:n
  y <- pbinom(n*p+x, n, p) - pbinom(n*p-x-1, n, p)
  reject <- ifelse(x<n*p-cv | x>n*p+cv, TRUE, FALSE)
  data.frame(x, y, p=rep(p, n+1), reject)
}
df <- find.A(0.4)
ggplot(data=df[1:20, ], aes(x, y)) +
  geom_point() +
  geom_vline(xintercept = 10) +
  geom_hline(yintercept = 0.95)

So for \(\pi_0=0.4\) and \(n=100\) we find \(c=10\). Therefore the test is as follows: reject the null hypothesis if \(|x-40|>10\). This is the same as \(x<30\) or \(x>50\).

Next we define the acceptance region of a test as the complement of the rejection region.

We can illustrate the acceptance region of our test as follows:

ggplot(data=df, aes(x, p, color=reject)) +
  geom_point() + ylim(c(0,1)) +
  ylab(expression(pi))

This plots a dot for each possible observation x (0 to n), in red if observing this value leads to accepting the null hypothesis and in blue if it leads to rejecting \(H_0\).

Let’s do this now for other values of \(\pi\) as well:

p <- seq(0.01, 0.99, length=50)
df <- find.A(p[1])
for(i in 2:50)
  df <- rbind(df, find.A(p[i]))
ggplot(data=df, aes(x, p, color=reject)) +
  geom_point() + ylim(c(0,1))  +
  ylab(expression(pi)) +
  theme(legend.position="none")

This graph tells us everything about the test (for a fixed n).

For example, say we wish to test

\(H_0: \pi=0.25\) and we observe \(x=31\), then

ggplot(data=df, aes(x, p, color=reject)) +
  geom_point() + 
  geom_hline(yintercept = 0.25, size=2) +
  geom_vline(xintercept = 31, size=2) +
  theme(legend.position="none")

shows that we should accept the null hypothesis because the intersection of the two lines is in the red (acceptance) region.

The idea of inverting the test is now very simple: for a fixed (observed) x=x0, what values of p lead to accepting the null hypothesis?

x0 <- 31
z <- rep(FALSE, dim(df)[1])
z[df$x==x0 & !df$reject] <- TRUE
p0 <- range(df$p[z])
ggplot(data=df, aes(x, p, color=reject)) +
  geom_point() + 
  geom_vline(xintercept = 31) +
  geom_segment(x=x0, y=p0[1],xend=x0, yend=p0[2], 
               size=2, color="black") +
  theme(legend.position="none")

p0
## [1] 0.23 0.41

Here is the same graph as before, but now we find those values of \(\pi\) which lead to accepting the null hypothesis. This gives us the \(95\%\) confidence interval of (0.23, 0.41).

In essence we have the following:

  • to do a hypothesis test scan across the graph horizontally

  • to find a confidence interval scan across the graph vertically.

By the way, this is in essence a well-known method for finding confidence intervals for a Binomial parameter. It is called the Clopper-Pearson method and was invented in 1934. It is implemented in R in the binom.test routine:

round(binom.test(31, 100)$conf.int, 3)
## [1] 0.221 0.410
## attr(,"conf.level")
## [1] 0.95