Difference in Proportions

Example (7.2.1)

In a survey of 1000 likely voters, 523 said they will vote for party A and the other 477 for party B. Find a 95% confidence interval for the lead of one party over the other.

First we need a probability model for this experiment. Let \(X_i=1\) if the i-th vote is for A and 0 if it is for B, so \(X_i\sim Ber(p)\), where \(p\) is the proportion of voters for A. We can assume that \(X_1, \ldots, X_n\) are independent. The parameter of interest is the difference in proportions \(\theta=p-(1-p)= 2p-1\).

Frequentist Analysis

We already know that the mle of p is \(\bar{x}\), so the mle of \(\theta\) is \(2\bar{x}-1\), here \(2\times 0.523-1= 0.046\).

Of course if \(\theta=2p-1\) we have \(p=[1+\theta]/2\). Let \(y=\sum x_i\), then

\[ \begin{aligned} &L(p\vert\pmb{x})=f(\pmb{x}\vert p) = p^y(1-p)^{n-y}\\ &L(\theta\vert\pmb{x}) =\left(\frac{1+\theta}2\right)^y\left(1-\frac{1+\theta}2\right)^{n-y} = (1+\theta)^y(1-\theta)^{n-y}/2^n\\ &\lambda(\pmb{x}) = \frac{L(\theta_0\vert\pmb{x})}{L(\hat{\theta}\vert\pmb{x})}= \frac{(1+\theta_0)^y(1-\theta_0)^{n-y}}{(1+[2\bar{x}-1])^y(1-[2\bar{x}-1])^{n-y}} = \\ &\frac{(1+\theta_0)^y(1-\theta_0)^{n-y}}{2^n(y/n)^y(1-y/n)^{n-y}} \end{aligned} \]

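It is easy to show that, as a function of \(y/n\), \(\lambda(\pmb{x})\) attains its maximum of 1 at \(y/n=(1+\theta_0)/2\) and decreases as \(y/n\) moves away from this value in either direction. So \(\lambda(\pmb{x})\) is small, and the likelihood ratio test rejects, when \(\vert y/n-(1+\theta_0)/2\vert\) is large.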
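The likelihood in \(\theta\) can also be checked numerically. The following is a minimal sketch, not part of the original analysis: it maximizes the log-likelihood \(y\log(1+\theta)+(n-y)\log(1-\theta)\) with `optimize` and evaluates the likelihood ratio at a few null values. The names `loglik` and `lambda` and the particular values of \(\theta_0\) are chosen here for illustration only.

```r
y <- 523; n <- 1000

# log-likelihood in theta, dropping the constant -n*log(2)
loglik <- function(theta) y*log(1 + theta) + (n - y)*log(1 - theta)

# numerical maximizer on (-1, 1); should agree with the closed form 2*y/n - 1
theta_hat <- optimize(loglik, interval = c(-0.99, 0.99), maximum = TRUE)$maximum
round(c(numerical = theta_hat, closed_form = 2*y/n - 1), 3)

# likelihood ratio for a hypothetical null value theta0 (illustrative values only)
lambda <- function(theta0) exp(loglik(theta0) - loglik(2*y/n - 1))
lambda(c(-0.05, 0, 0.1))
```

The ratio never exceeds 1 and shrinks as \(\theta_0\) moves away from the mle in either direction.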

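Now by the central limit theorem we have, approximately for large n,

\[ \begin{aligned} &\frac{\bar{X}-p}{\sqrt{p(1-p)/n}} \sim N(0,1) \\ &\frac12 \frac{\hat{\theta}-\theta}{\sqrt{p(1-p)/n}}=\frac12 \frac{(2\bar{X}-1)-(2p-1)}{\sqrt{p(1-p)/n}} = \frac{\bar{X}-p}{\sqrt{p(1-p)/n}} \sim N(0,1) \end{aligned} \] and so, replacing p in the standard error by its estimate \(\bar{x}\), a \((1-\alpha)100\%\) confidence interval for \(\theta\) is given by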

\[\left(2\bar{x}-1-2z_{\alpha/2}\sqrt{\bar{x}(1-\bar{x})/n}\text{, }2\bar{x}-1+2z_{\alpha/2}\sqrt{\bar{x}(1-\bar{x})/n} \right)\] Notice that the estimation error is twice the one for a Binomial p. 

For our numbers we get the interval

x <- 523; n <- 1000; alpha <- 0.05
round(2*x/n-1 + 
        c(-1, 1)*2*qnorm(1-alpha/2)*sqrt(x/n*(1-x/n)/n), 3)
## [1] -0.016  0.108
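
As a check on the normal approximation behind this interval, one can compute its exact coverage for a given true p by summing Binomial probabilities over all counts whose interval contains \(\theta\). This is a sketch under assumptions: the helper names `ci_theta` and `coverage` and the choice \(p=0.5\) are not from the original text.

```r
n <- 1000; alpha <- 0.05

# interval for theta = 2p - 1 from a count y out of n (same formula as above)
ci_theta <- function(y, n, alpha = 0.05) {
  phat <- y/n
  half <- 2*qnorm(1 - alpha/2)*sqrt(phat*(1 - phat)/n)
  cbind(lower = 2*phat - 1 - half, upper = 2*phat - 1 + half)
}

# exact coverage at an assumed true p: total Binomial probability of
# all counts y whose interval contains theta = 2p - 1
coverage <- function(p, n, alpha = 0.05) {
  y <- 0:n
  ci <- ci_theta(y, n, alpha)
  theta <- 2*p - 1
  sum(dbinom(y, n, p)[ci[, "lower"] < theta & theta < ci[, "upper"]])
}

round(ci_theta(523, 1000), 3)
coverage(0.5, n)
```

For a sample size this large the coverage should be close to the nominal 95%.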

Bayesian Analysis

Our parameter is \(\theta\) with values in [-1,1], so we need a prior with values in this interval. The simplest way to get one is to put the usual \(Beta(\alpha, \beta)\) prior on p and transform to \(\theta=2p-1\). Then we have

\[ \begin{aligned} &p\vert\pmb{x}\sim Beta(\alpha+y,n-y+\beta) \\ &F_{\theta\vert\pmb{x}}(t) =P(\theta<t\vert\pmb{x}) =\\ &P(2p-1<t\vert\pmb{x}) = \\ &P\left(p<\frac{t+1}2\vert\pmb{x}\right) \end{aligned} \]

and so we find the posterior distribution to be

\[ \begin{aligned} &f_{\theta\vert\pmb{x}}(t) = f_{p\vert\pmb{x}}\left(\frac{t+1}2\right)\frac12 =\\ &\frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+y)\Gamma(\beta-y+n)}\left(\frac{t+1}2\right)^{\alpha+y-1}\left(1-\frac{t+1}2\right)^{\beta-y+n-1}\frac12 = \\ &\frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+y)\Gamma(\beta-y+n)}(1+t)^{\alpha+y-1}(1-t)^{\beta-y+n-1}/2^{\alpha+\beta+n-1} \end{aligned} \] for \(-1<t<1\).
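
The change of variables \(f_{\theta\vert\pmb{x}}(t) = f_{p\vert\pmb{x}}\left(\frac{t+1}2\right)\frac12\) is easy to code with `dbeta` and `pbeta`. The sketch below assumes a \(Beta(1,1)\) prior, written `a` and `b` to avoid a clash with the error level `alpha`. The function names `dtheta`, `ptheta` and `dtheta_explicit` are illustrative and not from the original text.

```r
y <- 523; n <- 1000
a <- 1; b <- 1   # assumed prior parameters: Beta(1, 1) on p

# posterior density of theta via f_theta(t) = f_p((t+1)/2) / 2
dtheta <- function(t) dbeta((t + 1)/2, a + y, b + n - y)/2

# posterior cdf: P(theta < t | x) = P(p < (t+1)/2 | x)
ptheta <- function(t) pbeta((t + 1)/2, a + y, b + n - y)

# the Gamma-function form, evaluated on the log scale to avoid overflow
dtheta_explicit <- function(t) {
  p <- (t + 1)/2
  exp(lgamma(a + b + n) - lgamma(a + y) - lgamma(b + n - y) +
        (a + y - 1)*log(p) + (b + n - y - 1)*log(1 - p) - log(2))
}

t_grid <- seq(-0.2, 0.3, by = 0.05)
all.equal(dtheta(t_grid), dtheta_explicit(t_grid))

# posterior mean of theta, from E[p | x] = (a + y)/(a + b + n)
post_mean <- 2*(a + y)/(a + b + n) - 1
```

Working on the log scale matters here: \(\Gamma(\alpha+\beta+n)\) overflows double precision for n = 1000 if computed directly.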

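For an equal-tail \((1-\alpha)100\%\) credible interval \((l, u)\) we put probability \(\alpha/2\) in each tail, so the lower limit solves

\[\alpha/2 =P(\theta<l\vert\pmb{x})=P\left(p<\frac{l+1}2\vert\pmb{x}\right)\] and the upper limit solves the same equation with \(1-\alpha/2\) on the left. So \(l=2qbeta(\alpha/2,\alpha+y,\beta-y+n )-1\) and \(u=2qbeta(1-\alpha/2,\alpha+y,\beta-y+n )-1\). Note that the \(\alpha\) in the first argument is the error level, whereas the \(\alpha\) and \(\beta\) in the shape arguments are the prior parameters. With the prior \(Beta(1,1)\) we find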

round(2*qbeta(c(alpha/2, 1-alpha/2), 1+x, n-x+1)-1, 3)
## [1] -0.016  0.108
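
The quantile formula can be checked by simulation: draw p from its posterior, transform to \(\theta=2p-1\) and take empirical quantiles. The sketch below again assumes the \(Beta(1,1)\) prior used in the call above, with prior parameters named `a` and `b`. The seed and the number of draws are arbitrary. It also computes the posterior probability that party A leads, \(P(\theta>0\vert\pmb{x})=P(p>1/2\vert\pmb{x})\), which is an added illustration, not part of the original example.

```r
set.seed(1)      # arbitrary seed, for reproducibility only
y <- 523; n <- 1000; alpha <- 0.05
a <- 1; b <- 1   # assumed prior parameters: Beta(1, 1) on p

# simulate from the posterior of p and transform to theta = 2p - 1
theta_draws <- 2*rbeta(1e5, a + y, b + n - y) - 1
mc <- quantile(theta_draws, c(alpha/2, 1 - alpha/2), names = FALSE)

# exact equal-tail limits from the Beta quantile function
exact <- 2*qbeta(c(alpha/2, 1 - alpha/2), a + y, b + n - y) - 1
round(rbind(mc, exact), 3)

# posterior probability that A leads: P(theta > 0 | x) = P(p > 1/2 | x)
p_lead <- 1 - pbeta(0.5, a + y, b + n - y)
p_lead
```

The simulated and exact limits should agree to within Monte Carlo error.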