Inference for Binomial p

We have data \(Z_1, \dots, Z_n \sim Ber(p)\) and we want to develop a confidence interval for p.

First we need a corresponding hypothesis test:

\[H_0: p=p_0\text{ vs }H_a: p \ne p_0\]

If half-open intervals (“upper or lower limits”) are desired, use the corresponding one-sided alternative hypotheses \(H_a: p<p_0\) or \(H_a: p>p_0\).

Let \(X=\sum_{i=1}^n Z_i\); then \(X \sim Bin(n,p)\).

Note: in this example X is the obvious choice to base a test on because \(\bar{z} = x/n\) is an unbiased, sufficient and consistent estimator of p. 

Next we need a rejection region; that is, we need to decide which values of X will lead us to reject the null hypothesis.

And we get to the first fork in the road:

Testing Idea 1:

Let \(Y \sim Bin(n,p_0)\), reject \(H_0\) if

\[P(Y<x|p_0) < \alpha/2\]

or if

\[P(Y>x|p_0) < \alpha/2\]

Example (6.4.1)

n <- 20; p0 <- 0.3; alpha <- 0.05
x <- 3
pbinom(x-1, n, p0)
## [1] 0.03548313

0.0355 > 0.025, and so we fail to reject the null hypothesis.

x <- 2
pbinom(x-1, n, p0)
## [1] 0.00763726

0.0076 < 0.025 and so we reject the null hypothesis.
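
For comparison with the next idea, we can find the complete rejection region of this exact test by applying the rule to every possible value of x. Here is a quick sketch (n, p0 and alpha as above):

x <- 0:n
reject <- pbinom(x-1, n, p0) < alpha/2 | 1 - pbinom(x, n, p0) < alpha/2
x[reject]
## [1]  0  1  2 10 11 12 13 14 15 16 17 18 19 20

so this exact test rejects for \(x\le 2\) or \(x\ge 10\). Note that this differs slightly from the region we will find next with the normal approximation.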

Testing Idea 2:

Let \(TS=\frac{x-np_0}{\sqrt{np_0(1-p_0)}}\) and reject \(H_0\) if \(|TS|>z_{\alpha/2}\). This idea is obviously based on the central limit theorem.

Example (6.4.2)

x <- 0:20
TS <- round((x-n*p0)/sqrt(n*p0*(1-p0)), 3)
df <- data.frame(x=x, a=abs(TS),
        b=ifelse(abs(TS)>qnorm(1-alpha/2), "Yes", "No"),   
        Decision=ifelse(abs(TS)>qnorm(1-alpha/2), 
                        "Reject", "Fail to reject"))
colnames(df)[2:3] <- c('|TS|', '|TS|>crit')
# critical value
qnorm(1-0.05/2)
## [1] 1.959964
kable.nice(df, do.row.names = FALSE)
 x   |TS|   |TS|>crit   Decision
 0   2.928  Yes         Reject
 1   2.440  Yes         Reject
 2   1.952  No          Fail to reject
 3   1.464  No          Fail to reject
 4   0.976  No          Fail to reject
 5   0.488  No          Fail to reject
 6   0.000  No          Fail to reject
 7   0.488  No          Fail to reject
 8   0.976  No          Fail to reject
 9   1.464  No          Fail to reject
10   1.952  No          Fail to reject
11   2.440  Yes         Reject
12   2.928  Yes         Reject
13   3.416  Yes         Reject
14   3.904  Yes         Reject
15   4.392  Yes         Reject
16   4.880  Yes         Reject
17   5.367  Yes         Reject
18   5.855  Yes         Reject
19   6.343  Yes         Reject
20   6.831  Yes         Reject

and so we reject \(H_0\) if \(x<2\) or \(x>10\).

Now we have some tests. To get to a confidence interval we have to “invert” these tests.

The interval will contain all the parameter values that would have led to accepting the null hypothesis, given the observed data.

Interval Idea 1:

For a fixed x find \(p_1\) so that

\[P(Y<x|p_1)=1-\alpha/2\]

and \(p_2\) so that

\[P(Y>x|p_2)=1-\alpha/2\]

or

\[P(Y\le x|p_2)=\alpha/2\]

then \((p_1, p_2)\) is the confidence interval.

Example (6.4.3)

x <- 3; n <- 20; alpha <- 0.05
p <- seq(0.01, 0.4, length=1000)
y1 <- pbinom(x-1, n, p)
y2 <- pbinom(x, n, p)
df <- data.frame(p=c(p, p), y=c(y1, y2),
              which=rep(c("Lower", "Upper"), each=1000))
ggplot(df, aes(p, y, color=which)) +
  geom_line()

We see that the curves are strictly decreasing, so we can find the solutions with

round(c(p[y1<1-alpha/2][1], p[y2<alpha/2][1]), 4)
## [1] 0.0323 0.3793

This method is called the Clopper-Pearson method. It was invented in 1934.

Above we found \(p_1\) and \(p_2\) via a grid search. One could also write a numerical routine to do that.
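
For example, here is a minimal sketch using uniroot; this works because, as the graph shows, both probabilities are strictly decreasing in p:

# solve P(Y < x | p1) = 1 - alpha/2 and P(Y <= x | p2) = alpha/2
p1 <- uniroot(function(p) pbinom(x-1, n, p) - (1-alpha/2),
              interval=c(0, 1), tol=1e-10)$root
p2 <- uniroot(function(p) pbinom(x, n, p) - alpha/2,
              interval=c(0, 1), tol=1e-10)$root
round(c(p1, p2), 4)
## [1] 0.0321 0.3789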

Or we can use a theorem from probability theory: if \(Z\sim Beta(x,n-x+1)\) and \(Y\sim Bin(n,p)\), then

\[P(Z\le p)=P(Y\ge x)\]

For the lower bound this gives

\[ \begin{aligned} &1-\alpha/2 = P(Y< x|p_1) = 1-P(Y\ge x|p_1) = 1-P(Z\le p_1) \\ &\text{so } p_1 = qbeta(\alpha/2;x,n-x+1) \end{aligned} \]

and for the upper bound, with \(Z'\sim Beta(x+1,n-x)\),

\[ \begin{aligned} &\alpha/2 = P(Y\le x|p_2) = 1-P(Y\ge x+1|p_2) = 1-P(Z'\le p_2) \\ &\text{so } p_2 = qbeta(1-\alpha/2;x+1,n-x) \end{aligned} \]

round(c(qbeta(alpha/2, x, n-x+1), qbeta(1-alpha/2, x+1, n-x)), 4)
## [1] 0.0321 0.3789

This is already implemented in the base R function binom.test:

round(c(binom.test(3, 20)$conf.int), 4)
## [1] 0.0321 0.3789

Interval Idea 2:

The acceptance region is

\[ -z_{\alpha/2}<\frac{x-np_0}{\sqrt{np_0(1-p_0)}}<z_{\alpha/2} \]

so we need to solve these equations for \(p_0\). Again we have choices:

Option 1:

\(x/n\) is an estimator of \(p_0\), so let’s replace the \(p_0\) in the denominator with \(x/n\):

\[ p=x/n\pm z_{\alpha/2}\sqrt{\frac{(x/n)(1-x/n)}{n}} \]

Example (6.4.4)

x <- 3; n <- 20; alpha <- 0.05
round(x/n + c(-1, 1)*qnorm(1-alpha/2)*sqrt(x/n*(1-x/n)/n), 4)
## [1] -0.0065  0.3065

This is called a Wald type interval (because we replace var(X) with an estimate). It was invented in 1948.

We already see one obvious “bad” feature: the lower bound is negative. First of all, p can never be negative. More than that, because we observed x=3 we know p must be positive.

Option 2

Work a little harder:

\[ \begin{aligned} &\frac{x-np}{\sqrt{np(1-p)}}=\pm z_{\alpha/2} \\ &\frac{(x-np)^2}{np(1-p)}= z_{\alpha/2}^2 =: z \\ &(x-np)^2 = znp(1-p)\\ &n(n+z)p^2-n(2x+z)p+x^2=0\\ &p_{1,2} = \frac1{2n(n+z)}\left(n(2x+z)\pm\sqrt{[n(2x+z)]^2-4n(n+z)x^2}\right) = \\ &\frac1{n+z}\left(x+z/2\pm\sqrt{zx+z^2/4-zx^2/n}\right) \end{aligned} \]

Example (6.4.5)

x <- 3; n <- 20; alpha <- 0.05
z <- qnorm(1-alpha/2)^2
round( (x+z/2+c(-1, 1)*sqrt(z*x+z^2/4-z*x^2/n))/(n+z), 4)
## [1] 0.0524 0.3604

This is called a Wilson interval. It was invented in 1927.

Ad-Hoc Adjustments

There have been a number of adjustments suggested for these intervals. Here is one example:

We are approximating a discrete random variable by a continuous one, so it may be a good idea to correct for that by subtracting 1/2 from x for the lower bound and adding 1/2 to x for the upper bound (a continuity correction).

This can be applied to both the Wald and the Wilson type intervals. For the Wilson interval this is already done in the built-in function prop.test, which applies the continuity correction by default (argument correct=TRUE).
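
For our data this gives (a quick check):

round(c(prop.test(3, 20)$conf.int), 4)
## [1] 0.0396 0.3886

which is indeed a bit wider than the plain Wilson interval (0.0524, 0.3604) found above.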

Note there is a similar adjustment for Clopper-Pearson intervals called mid-p intervals, which we won’t have time to discuss.

How to choose:

How do we pick between these? First we need to make sure that the methods actually achieve their nominal coverage.

Example (6.4.6)

Say we have n=20 and x=3, so x/n=0.15, and the true p is probably somewhere between 0.05 and 0.3. Let’s do the coverage graphs:
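
Here is a minimal sketch of how such a coverage graph can be computed exactly. The names coverage, CI.cp and CI.wald are helpers made up for this illustration; any of the intervals above can be plugged in the same way:

# coverage(p) = P(L(X) <= p <= U(X)): sum the binomial probabilities
# of all x whose interval covers p
coverage <- function(p, n, CI) {
  x <- 0:n
  ints <- t(sapply(x, CI, n=n))  # one interval per possible x
  sum(dbinom(x, n, p)[ints[, 1] <= p & p <= ints[, 2]])
}
CI.cp <- function(x, n) c(binom.test(x, n)$conf.int)  # Clopper-Pearson
CI.wald <- function(x, n)                             # Wald
  x/n + c(-1, 1)*qnorm(0.975)*sqrt(x/n*(1-x/n)/n)
p <- seq(0.05, 0.3, length=250)
df <- data.frame(p=rep(p, 2),
        Coverage=c(sapply(p, coverage, n=20, CI=CI.cp),
                   sapply(p, coverage, n=20, CI=CI.wald)),
        Method=rep(c("Clopper-Pearson", "Wald"), each=250))
ggplot(df, aes(p, Coverage, color=Method)) +
  geom_line() +
  geom_hline(yintercept=0.95)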

The graphs show that the only two methods that “work”, in the sense that their actual coverage stays at or above the nominal 95%, are Clopper-Pearson and Wilson with continuity correction.

How do we choose between these two? One possible criterion is the expected length of the interval:

\[e(p)=E\left[U(X)-L(X)\right]=\sum_{i=0}^n (u(i)-l(i)){n\choose i}p^i(1-p)^{n-i}\]

drawn here:
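
Here is a sketch of how \(e(p)\) can be computed and drawn, reusing the CI.cp helper from the coverage sketch together with a hypothetical CI.wilson.cc, one way to write the continuity-corrected Wilson interval:

# e(p): weight each possible interval length by its binomial probability
exp.length <- function(p, n, CI) {
  x <- 0:n
  sum(sapply(x, function(i) diff(CI(i, n))) * dbinom(x, n, p))
}
CI.wilson.cc <- function(x, n, z=qnorm(0.975)) {
  wilson <- function(pc, s)  # Wilson formula at a corrected estimate
    (pc + z^2/(2*n) + s*z*sqrt(pc*(1-pc)/n + z^2/(4*n^2)))/(1 + z^2/n)
  c(max(0, wilson(max(0, x/n - 1/(2*n)), -1)),
    min(1, wilson(min(1, x/n + 1/(2*n)), 1)))
}
p <- seq(0.05, 0.3, length=250)
df <- data.frame(p=rep(p, 2),
        Length=c(sapply(p, exp.length, n=20, CI=CI.cp),
                 sapply(p, exp.length, n=20, CI=CI.wilson.cc)),
        Method=rep(c("Clopper-Pearson", "Wilson cc"), each=250))
ggplot(df, aes(p, Length, color=Method)) +
  geom_line()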

so for p around 0.15 they are pretty much the same.