Evaluating Hypothesis Tests

The Power of a Test

In a hypothesis test the type I error probability \(\alpha\) is defined by

\[\alpha = P(\text{reject }H_0\vert H_0 \text { is true})\]

and is chosen by the analyst at the beginning of the test. On the other hand the type II error probability \(\beta\) is defined by

\[\beta = P(\text{accept }H_0\vert H_0 \text { is false})\]

Example (5.2.1)

Say we have \(X_1, ..., X_n\sim Ber(p)\) and we want to test \(H_0: p=0.5\) vs \(H_a: p=0.6\).

Now if the null hypothesis is wrong we would expect more than 50% successes, so it seems reasonable to reject the null hypothesis if there are many successes, more than should occur if p=0.5. This suggests using a test with the rejection region

\[\{S > cv\}\]

where S is the number of successes. Here cv is some threshold value (57? 58?) and is usually called the critical value. It is found as follows:

S is the number of successes in n independent Bernoulli trials with success probability p, so \(S\sim Bin(n,p)\). Now

\[ \begin{aligned} &\alpha = P(S > cv | p=0.5) = \\ &1 - P(S \le cv | p=0.5) = \\ &1 - pbinom(cv,n,0.5)\\ &\text{ } \\ &1-\alpha = pbinom(cv,n,0.5) \\ &cv = qbinom(1-\alpha,n,0.5) \end{aligned} \]

As a numerical example say \(\alpha=0.05\) and n=100, then

qbinom(1-0.05, 100, 0.5)
## [1] 58

Now for the type II error probability \(\beta\) we have to calculate the probability of failing to reject the null hypothesis when the alternative is true, that is when p=0.6:

\[ \beta = P( S \le cv | p=0.6) = \\ pbinom(cv,n,0.6) = \\ pbinom(qbinom(1-\alpha,n,0.5),n,0.6) \]

round(pbinom(qbinom(1-0.05, 100, 0.5), 100, 0.6), 4)
## [1] 0.3775
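
These two error probabilities can also be checked by simulation; here is a minimal sketch (the seed and the number of runs B are arbitrary choices):

```r
# Monte Carlo check of alpha and beta for the test "reject if S > cv"
set.seed(111)
n <- 100; alpha <- 0.05; B <- 1e5
cv <- qbinom(1 - alpha, n, 0.5)      # critical value (58 here)
S0 <- rbinom(B, n, 0.5)              # number of successes under H0: p = 0.5
S1 <- rbinom(B, n, 0.6)              # number of successes under Ha: p = 0.6
mean(S0 > cv)    # simulated type I error, at most about 0.05
mean(S1 <= cv)   # simulated type II error, close to 0.3775
```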

Example (5.2.2)

Say we have \(X_1, ..., X_n\sim Ber(p)\) and now we want to test

\[H_0: p=0.5 \text{ vs }H_a: p>0.5\]

Notice that the alternative hypothesis does not play a role in the calculation of the critical value, so again we have

\[cv = qbinom(1-\alpha,n,0.5)\]

but when we want to find \(\beta\) we have a problem: we don’t know what p is. What we can do is find \(\beta\) as a function of p:

\[ \begin{aligned} &\beta(p) = P(S \le cv | p) = \\ &pbinom(cv,n,p) = \\ &pbinom(qbinom(1-\alpha,n,0.5),n,p) \end{aligned} \]


In real life we usually calculate the power of the test, defined by

\[\text{Pow}(p)=1-\beta(p)\]

It has two advantages:

  1. it gives the probability of correctly rejecting a false null hypothesis

  2. \(\text{Pow}(p_0)=\alpha\)

The power curve for this test is drawn here

fun <- function(p)
  1-pbinom(qbinom(1-0.05, 100, 0.5), 100, p)
ggcurve(fun=fun, A=0.5, B=0.7)
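
ggcurve is a plotting helper used throughout these notes; the power function itself can of course be evaluated directly. Note that because S is discrete \(\text{Pow}(p_0)\) comes out slightly below \(\alpha\) rather than exactly equal to it:

```r
# evaluate the power function 1 - beta(p) at a few values of p
pow <- function(p, n = 100, alpha = 0.05)
  1 - pbinom(qbinom(1 - alpha, n, 0.5), n, p)
round(pow(c(0.5, 0.55, 0.6, 0.65, 0.7)), 4)
# pow(0.5) is just under 0.05, and the power rises toward 1 as p grows
```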

Example (5.2.3)

Say we have \(X_1, ..., X_n\sim N(\mu,\sigma)\), \(\sigma\) known, and we want to test

\[H_0: \mu=\mu_0\text{ vs }H_a: \mu\ne\mu_0\]

Again \(\bar{x}\) is the mle, and a reasonable test statistic is given by

\[ Z=\sqrt{n}\frac{\bar{X} -\mu_0}{\sigma} \sim N(0,1) \]

so a test might use the rejection region \(\{|Z|>cv\}\):

\[ \begin{aligned} &\alpha=P\left(|Z|>cv\vert\mu_0\right) = \\ &1-P\left(-cv<Z<cv\vert\mu_0\right) = \\ &1-\left(2\Phi(cv)-1\right) = \\ &2(1-\Phi(cv))\\ &cv=\Phi^{-1}(1-\alpha/2) \end{aligned} \]

and now

\[ \begin{aligned} &\beta(\mu_1) = P\left(|Z|<cv\vert\mu_1\right) = \\ &P\left(-cv<Z<cv\vert\mu_1\right) = \\ &P\left(-cv<\sqrt{n}\frac{\bar{X} -\mu_0}{\sigma}<cv\vert\mu_1\right) = \\ &P\left(-cv<\sqrt{n}\frac{\bar{X} -\mu_1+\mu_1-\mu_0}{\sigma}<cv\vert\mu_1\right) = \\ &P\left(-cv<\sqrt{n}\frac{\bar{X} -\mu_1}{\sigma}+\sqrt{n}\frac{\mu_1-\mu_0}{\sigma}<cv\vert\mu_1\right) = \\ &P\left(-cv-\sqrt{n}\frac{\mu_1-\mu_0}{\sigma}<\sqrt{n}\frac{\bar{X} -\mu_1}{\sigma}<cv-\sqrt{n}\frac{\mu_1-\mu_0}{\sigma}\vert\mu_1\right) = \\ &\Phi\left(cv-\sqrt{n}\frac{\mu_1-\mu_0}{\sigma}\right)-\Phi\left(-cv-\sqrt{n}\frac{\mu_1-\mu_0}{\sigma}\right) \end{aligned} \]

The power curve for this test looks like this

power.mean <- function(mu, mu0=0, sd=1, n=25, alpha=0.05) {
  cv <- qnorm(1-alpha/2)
  1-(pnorm(cv-sqrt(n)*abs(mu-mu0)/sd) - 
  pnorm(-cv-sqrt(n)*abs(mu-mu0)/sd))  
}
ggcurve(fun=power.mean, A=-1, B=1)
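
As a sanity check, the closed-form expression for \(\beta(\mu_1)\) can be compared to a simulation at a single alternative; a sketch with the arbitrary choices \(\mu_0=0\), \(\mu_1=0.5\), \(\sigma=1\), \(n=25\):

```r
# compare the beta(mu_1) formula for the two-sided z test with simulation
set.seed(111)
mu0 <- 0; mu1 <- 0.5; sd <- 1; n <- 25; alpha <- 0.05; B <- 1e5
cv <- qnorm(1 - alpha/2)
beta.theory <- pnorm(cv - sqrt(n)*(mu1 - mu0)/sd) -
               pnorm(-cv - sqrt(n)*(mu1 - mu0)/sd)
xbar <- rowMeans(matrix(rnorm(B*n, mu1, sd), ncol = n))
Z <- sqrt(n)*(xbar - mu0)/sd        # test statistic under the alternative
beta.sim <- mean(abs(Z) < cv)       # proportion of failures to reject
round(c(beta.theory, beta.sim), 3)  # the two should agree to ~2 decimals
```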

Example (5.2.4)

Again we have \(X_1, .., X_n\sim N(\mu,\sigma)\), \(\sigma\) known, and now we want to test

\[H_0: \mu=\mu_0\text{ vs }H_a: \mu\ne\mu_0\]

but this time we will use the median M as an estimator of \(\mu\).

Again a reasonable rejection region is \(\{|M-\mu_0|>cv\}\).

If n is odd we have \(M=X_{[(n+1)/2]}\), the \((n+1)/2\)-th order statistic of \(X_1, ..., X_n\), so

\[f_M(x\vert\mu) = \frac{n+1}2{{n}\choose {(n+1)/2}}\phi(x\vert\mu)\Phi(x\vert\mu)^{(n-1)/2}(1-\Phi(x\vert\mu))^{(n-1)/2}\]

where \(\phi(x|\mu)\) and \(\Phi(x|\mu)\) are the density and cdf of normal rv’s with mean \(\mu\) (and sd \(\sigma\)).

Let’s see what the density looks like for n=99, and compare it to that of the mean:

dmedian <-  function(x, mu=0, sd=1, n) {
  (n+1)/2*choose(n, (n+1)/2)*dnorm(x, mu, sd) *  
    pnorm(x, mu, sd)^((n-1)/2) * 
    (1-pnorm(x, mu, sd))^((n-1)/2)
}
x <- seq(-0.3, 0.3, length=250)
df <- data.frame(x=c(x, x),
    y=c(dnorm(x, sd=1/sqrt(99)), dmedian(x, n=99)),
    Method=rep(c("Mean", "Median"), each=250)             )
ggplot(data=df, aes(x, y, color=Method)) +
  geom_line() 

Now

\[ \begin{aligned} &\alpha=P\left(|M-\mu_0|>cv\vert\mu_0\right) = \\ &1-P\left(-cv<M-\mu_0<cv\vert\mu_0\right) = \\ &1-P\left(-cv+\mu_0<M<cv+\mu_0\vert\mu_0\right) = \\ &1-\int_{-cv+\mu_0}^{cv+\mu_0} f_M(x\vert\mu_0)dx \end{aligned} \]

and cv is the solution of this equation, which of course cannot be found analytically. Instead we can find it numerically, using the integrate function in R:

  1. set cv=0
  2. set cv=cv+0.01
  3. find a=integrate(dmedian,mu0-cv,mu0+cv)$value
  4. if \(a>1-\alpha\), done, otherwise go back to 2

Finally

\[\beta(\mu_1)=P(\mu_0-cv<M<\mu_0+cv | \mu_1)\]

which we can again find using the integrate function.

power.median <- function(mu, mu0=0, sd=1, n=25, alpha=0.05) {
  cv <- 0
  repeat {
    cv <- cv+0.01
    a <- integrate(dmedian, -cv+mu0, cv+mu0, 
                   mu=mu0, sd=sd, n=n)$value
    if(a>1-alpha) break
  }
  y <- 0*mu
  for(i in seq_along(mu)) 
    y[i] <- 1-integrate(dmedian, mu0-cv, mu0+cv, 
               mu=mu[i], sd=sd, n=n)$value
  y  
}
x <- seq(-1, 1, length=250)
y1 <- power.mean(x)
y2 <- power.median(x)  
df <- data.frame(x=c(x, x),
                 y=c(y1, y2),
        which=rep(c("Mean", "Median"), each=250))             
ggplot(data=df, aes(x, y, color=which)) +
  geom_line() 

Example (5.2.5)

Again we have \(X_1, .., X_n\sim N(\mu,\sigma)\), \(\sigma\) known, and again we want to test

\[H_0: \mu=\mu_0\text{ vs }H_a: \mu\ne\mu_0\]

If we are worried about possible outliers we might decide to use a trimmed mean as our estimator: the \(100p\%\) trimmed mean is defined by

\[T_p=\frac1{n-2\left \lfloor{np}\right \rfloor}\sum_{i=\left \lfloor{np}\right \rfloor+1}^{n-\left \lfloor{np}\right \rfloor} x_{[i]}\]

In other words, to find the \(100p\%\) trimmed mean eliminate the \(100p\%\) smallest and largest observations and find the mean of the rest.

Note: the mean is the \(0\%\) trimmed mean and the median is (essentially) the \(50\%\) trimmed mean. In R use the mean(x, trim=p) function.
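
A quick illustration with made-up data containing one outlier shows why trimming helps:

```r
# the trimmed mean and median resist an outlier that drags the mean up
x <- c(2.1, 2.3, 2.4, 2.5, 2.6, 2.8, 2.9, 3.0, 3.1, 25)
mean(x)               # 4.87, badly distorted by the outlier 25
mean(x, trim = 0.25)  # 2.7, the 25% trimmed mean ignores it
median(x)             # 2.7, the median as well
```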

Again a reasonable test has rejection region \(\{|T_p-\mu_0|>cv\}\). But what is the distribution of \(T_p\)? It cannot be found analytically for general p, so either we do some heavy math every time we want a different p, or we need a different solution. Here is one based on simulation:

to find cv:

  1. generate \(Y_1, .., Y_n\sim N(\mu_0,\sigma)\), calculate \(T_p\), call it \(T_p(1)\)
  2. repeat 1. many times, say 10000 times
  3. Find cv such that \(100\alpha\%\) of the \(|T_p-\mu_0|\)’s are greater than cv

to find \(\beta(\mu_1)\):

  1. generate \(Y_1, .., Y_n\sim N(\mu_1,\sigma)\), calculate \(T^{*}_p\), call it \(T^{*}_p(1)\)
  2. repeat 1) many times, say 10000 times
  3. Find \(\beta(\mu_1)\) as the proportion of \(T^{*}_p\) such that \(|T^{*}_p-\mu_0|>cv\).
power.trim <- function(mu, p=0.25, mu0=0, sd=1, 
                       n=25, alpha=0.05, B=10000) {
  Tp <- matrix(rnorm(B*n, mu0, sd), ncol=n)
  Z <- apply(Tp, 1, mean, trim = p)
  cv <- quantile(abs(Z), 1-alpha)
  out <- 0*mu
  for(i in seq_along(mu)) {
    Tp <- matrix(rnorm(B*n, mu[i], sd), ncol=n)
    Z <- abs(apply(Tp, 1, mean, trim = p)-mu0)
    out[i] <- sum(Z > cv)/B
  }
  out
}
df <- data.frame(x=c(x, x, x),
         y=c(y1, y2, power.trim(x)),
        which=rep(c("Mean", "Median", "Trim"), each=250))             
ggplot(data=df, aes(x, y, color=which)) +
  geom_line() 

As one would expect, the power of the \(25\%\) trimmed mean test lies between those of the mean and the median.

Example (5.2.6)

Say \(X_1, ..., X_n\sim Ber(p)\), and we want to test

\[H_0: p=p_0\text{ vs. }H_a:p>p_0\]

As above, a reasonable test can be based on the rejection region \(\{\bar{x} > cv\}\), which is equivalent to \(\{\sum x_i\ge k\}\) for some integer k. Say for example n=10, \(p_0=0.5\) and \(\alpha=0.1\). Then

sum(dbinom(10, 10, 0.5))
## [1] 0.0009765625
sum(dbinom(9:10, 10, 0.5))
## [1] 0.01074219
sum(dbinom(8:10, 10, 0.5))
## [1] 0.0546875
sum(dbinom(7:10, 10, 0.5))
## [1] 0.171875

so for k=8 \(P(\text{reject }H_0|H_0\text{ is true})<\alpha\) and for k=7 \(P(\text{reject }H_0|H_0\text{ is true})>\alpha\).

Because of the discreteness of the random variable it is not actually possible to find a cv such that \(P(\text{reject }H_0|H_0\text{ is true})=\alpha\). In this case we use

\[\min\{k: P(\text{reject }H_0|H_0\text{ is true})<\alpha\}\]

or k=8.

There is a way to achieve \(\alpha\) exactly: if we observe \(x>7\) we reject the null, and if we observe \(x<7\) we fail to reject it. If we observe \(x=7\) we flip a coin that gives success with probability \((0.1-0.0547)/0.1172=0.3867\), and we reject the null only on a success. It is easy to see that this test has exactly 0.1 as its type I error rate: \(0.0547+0.3867\times 0.1172=0.1\).

Such tests are called randomized tests. They play some role in the theory of statistics but are rarely used in practice.
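
Here is a sketch of how such a randomized test can be checked by simulation, with the coin probability computed from the binomial probabilities so that the level comes out at \(\alpha=0.1\) (the seed and B = 1e5 runs are arbitrary choices):

```r
# randomized test for n = 10, p0 = 0.5: reject if S > 7, flip a coin if S = 7
set.seed(111)
n <- 10; p0 <- 0.5; alpha <- 0.1; B <- 1e5
gam <- (alpha - sum(dbinom(8:10, n, p0)))/dbinom(7, n, p0)  # coin probability
gam
S <- rbinom(B, n, p0)                         # data generated under H0
reject <- (S > 7) | (S == 7 & runif(B) < gam)
mean(reject)   # simulated type I error, close to 0.1
```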

Neyman Pearson Theory

Definition (5.2.7)

Any collection C of tests is called a class of tests.

Example (5.2.8)

Let \(X_1, ..., X_n\sim N(\mu,\sigma)\), \(\sigma\) known, and assume we want to test

\[H_0: \mu=\mu_0\text{ vs }H_a: \mu>\mu_0\]

  1. C = “reject \(H_0\) if \(\{\bar{x} > cv\}\)”

  2. C = “reject \(H_0\) if \(\{\bar{x} > cv\}\) or reject \(H_0\) if {Median > cv}”

  3. C = “let T be any unbiased estimator of \(\mu\), reject \(H_0\) if \(\{T > cv\}\)”

are all classes of tests.

Definition (5.2.9)

A test is called a level \(\alpha\) test if

\[P(\text{reject }H_0\vert H_0 \text{ true}) \le \alpha\]

Definition (5.2.10)

Let C be a class of tests for testing

\[ H_0: \theta \in \Theta_0\text{ vs } H_a: \theta \in \Theta_0^c \]

A test in C with power function Pow(\(\theta\)) is a uniformly most powerful (UMP) class C test if

\[Pow(\theta) \ge Pow'(\theta)\]

for all \(\theta \in \Theta_0^c\), where Pow’ is the power function of any other test in C.

If the class C is the class of all tests with level \(\alpha\), it is called the UMP level \(\alpha\) test.

Theorem (5.2.11)

Neyman-Pearson lemma

Consider testing

\[H_0: \theta=\theta_0\text{ vs }H_a: \theta=\theta_1\]

using a test with rejection region R given by

\[ x \in R \text{ if }\frac{f(x|\theta_1)}{f(x|\theta_0)}> k \]

and

\[ x \in R^c \text{ if } \frac{f(x|\theta_1)}{f(x|\theta_0)} <k \]

for some \(k \ge 0\) and \(\alpha=P(X \in R| \theta_0)\).

Then

  1. (sufficiency) Any test of this form is a UMP level \(\alpha\) test.

  2. (necessity) If there exists a test of this form with k>0, then every UMP level \(\alpha\) test is of this form.

Note: we have written the theorem in terms of the densities f, but we could of course also have used the likelihood function L.

Example (5.2.12)

Let \(X_1, ..., X_n\sim N(\mu,\sigma)\), \(\sigma\) known, and assume we want to test

\[H_0: \mu=\mu_0\text{ vs }H_a: \mu=\mu_1\]

Then

\[ \begin{aligned} &f(\pmb{x}\vert\mu) = \\ &(2\pi\sigma^2)^{-n/2}\exp \left\{-\frac1{2\sigma^2}\sum_{i=1}^n\left(x_i-\mu\right)^2 \right\} = \\ &(2\pi)^{-n/2}\sigma^{-n}\exp \left\{-\frac{(n-1)s^2}{2\sigma^2} \right\}\exp \left\{-\frac{n(\mu-\bar{x})^2}{2\sigma^2} \right\} \end{aligned} \] so

\[ \begin{aligned} &\frac{f(\pmb{x}\vert\mu_1)}{f(\pmb{x}\vert\mu_0)} = \\ &\frac{(2\pi)^{-n/2}\sigma^{-n}\exp \left\{-\frac{(n-1)s^2}{2\sigma^2} \right\}\exp \left\{-\frac{n(\mu_1-\bar{x})^2}{2\sigma^2} \right\}}{(2\pi)^{-n/2}\sigma^{-n}\exp \left\{-\frac{(n-1)s^2}{2\sigma^2} \right\}\exp \left\{-\frac{n(\mu_0-\bar{x})^2}{2\sigma^2} \right\}} = \\ &\exp \left\{\frac{n(\mu_0-\bar{x})^2}{2\sigma^2}-\frac{n(\mu_1-\bar{x})^2}{2\sigma^2} \right\} = \\ &\exp \left\{-\frac{n}{2\sigma^2}\left[(\mu_1-\bar{x})^2-(\mu_0-\bar{x})^2\right] \right\} >k \end{aligned} \]

if and only if

\[-\frac{n}{2\sigma^2}\left[(\mu_1-\bar{x})^2-(\mu_0-\bar{x})^2\right]>\log k\] if and only if

\[(\mu_1-\bar{x})^2-(\mu_0-\bar{x})^2<-\frac{2\sigma^2}{n}\log k\] if and only if

\[2\bar{x}(\mu_0-\mu_1)+\mu_1^2-\mu_0^2<-\frac{2\sigma^2}{n}\log k\]

if and only if

\[ \left\{ \begin{array}{lll} \bar{x}<-\frac{\sigma^2/n\log k+\mu_1^2/2-\mu_0^2/2}{\mu_0-\mu_1}&\text{if}&\mu_0>\mu_1\\ \bar{x}>-\frac{\sigma^2/n\log k+\mu_1^2/2-\mu_0^2/2}{\mu_0-\mu_1}&\text{if}&\mu_0<\mu_1 \end{array} \right. \] Here we kept track of the terms on the right, but in fact they don’t matter because we don’t know k anyway. Therefore a UMP level \(\alpha\) test is of the form \(\{\bar{x} < cv\}\) (if \(\mu_0>\mu_1\)), where

\[\alpha=P(\bar{X} <cv\vert\mu_0)\]
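
For the case \(\mu_0>\mu_1\) the critical value then follows directly from the null distribution \(\bar{X}\sim N(\mu_0,\sigma/\sqrt{n})\); a sketch with the arbitrary choices \(\mu_0=1\), \(\sigma=1\), \(n=25\):

```r
# critical value of the UMP test "reject if xbar < cv" when mu0 > mu1
mu0 <- 1; sd <- 1; n <- 25; alpha <- 0.05
cv <- qnorm(alpha, mu0, sd/sqrt(n))  # alpha-quantile of xbar under H0
round(cv, 4)
pnorm(cv, mu0, sd/sqrt(n))           # check: this recovers alpha
```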


Note

\[\frac{f(x|\theta_1)}{f(x|\theta_0)}=\frac{L(\theta_1|x)}{L(\theta_0|x)}\]

so a Neyman-Pearson type test is based on the ratio of the likelihood functions.

Example (5.2.13)

Say we have \(X\sim \text{Pois}(\lambda)\) and we want to test \(H_0: \lambda=1\) vs \(H_a: \lambda=2\).

Now

\[ \begin{aligned} &f(x;\lambda) = \frac{\lambda^x}{x!}e^{-\lambda} \\ &\frac{f(x;2)}{f(x;1)} = \frac{\frac{2^x}{x!}e^{-2}}{\frac{1^x}{x!} e^{-1}} = 2^x/e\\ &2^x/e>k \text{ iff } \\ &2^x>ek \text{ iff } \\ &x > \log(ek)/\log(2) = [1+\log(k)]/\log(2)\\ &\alpha = P(X\ge cv|\lambda=1) \end{aligned} \]

so the test rejects for large values of x. We have

1-ppois(2:3, 1)
## [1] 0.08030139 0.01898816

so \(P(X\ge 3\vert\lambda=1)=0.0803\) is larger than 0.05 but \(P(X\ge 4\vert\lambda=1)=0.0190\) is not, and the test rejects the null if we observe \(x\ge 4\).


The Neyman-Pearson lemma only covers tests of a simple vs a simple hypothesis, and these are very rare. It can also be shown that the theorem does not extend to more general situations. However, in most cases tests based on the likelihood ratio turn out to be very good, even though they are not necessarily the best.