Two-way Tables

Categorical Data

Example (7.5.1)

Psychological and social factors can influence the survival of patients with serious diseases. One study examined the relationship between survival of patients with coronary heart disease and pet ownership. Each of 92 patients was classified by whether or not they owned a pet and by whether they survived for one year.

Here are the data, from Erika Friedmann et al., “Animal companions and one-year survival of patients after discharge from a coronary care unit”:

Status               Alive   Dead
Owns a Pet              50      3
Does not own a Pet      28     11

Question: is there a statistically significant relationship (association) between Ownership and Survival?

What is an appropriate probability model here? For each patient in the population there are four possibilities: owns a pet and alive, owns a pet and dead, does not own a pet and alive, does not own a pet and dead. We can model this using a multinomial distribution: \((X,Y)\) takes values (1,1), (1,2), (2,1), (2,2) with \(P((X,Y)=(i,j))=p_{ij}\). Of course we have

\(0 \le p_{ij} \le 1\) and \(\sum_{ij} p_{ij}=1\).

The math that follows gets a little easier if we reparametrize the problem as follows: a discrete random vector with finitely many values is always equivalent to a multinomial distribution. So let Z be a random variable with values 1,..,4 and probabilities \(p_1,..,p_4\).
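As a quick illustration of this equivalence (the cell probabilities here are hypothetical, chosen just for illustration), we can simulate Z directly and recover multinomial counts:

```r
# simulate the reparametrized variable Z with values 1-4
# (probabilities are hypothetical, just for illustration)
set.seed(1111)
p <- c(0.5, 0.3, 0.05, 0.15)
Z <- sample(1:4, size=1000, replace=TRUE, prob=p)
# the vector of counts (z1,..,z4) has a multinomial(1000, p) distribution
z <- as.numeric(table(factor(Z, levels=1:4)))
round(z/1000, 3)  # relative frequencies, close to p
```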

Frequentist Analysis

Let’s begin by finding the mle’s of the \(p_i\)’s. Let \(z_i= \sum_{j=1}^n I[Z_j=i]\) be the number of observations in cell i; then

\[ \begin{aligned} &f(\pmb{z}|\pmb{p}) = \prod p_i^{z_i}\\ & l(\pmb{p}) = \sum_{i} z_i \log p_i\\ \end{aligned} \]

Now we need to be careful because we need to maximize this function subject to the additional condition \(\sum p_i=1\) (otherwise the maximum is at infinity anyway), so we need to use Lagrange multipliers:

\[ \begin{aligned} &h(\pmb{p}) = \sum_{i} z_i \log p_i-\lambda (\sum_{i} p_i-1)\\ &\frac{dh}{dp_i} =\frac{z_i}{p_i}-\lambda=0 \\ &z_i=\lambda p_i\\ &n =\sum z_i =(\sum p_i)\lambda=\lambda \end{aligned} \]

and so we find \(\hat{p}_i=\frac{z_i}{n}\).
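We can verify this result numerically; here is a minimal sketch using optim, with a softmax reparametrization to enforce the constraint \(\sum p_i=1\) (counts taken from the example):

```r
# observed counts, in the order (owns-alive, not-alive, owns-dead, not-dead)
z <- c(50, 28, 3, 11)
# write p as a softmax of 3 free parameters so that sum(p)=1 automatically
negloglike <- function(eta) {
  p <- exp(c(eta, 0))
  p <- p/sum(p)
  -sum(z*log(p))
}
fit <- optim(rep(0, 3), negloglike)
p.num <- exp(c(fit$par, 0))
p.num <- p.num/sum(p.num)
round(rbind(numerical=p.num, analytic=z/sum(z)), 4)
```

The numerical maximizer agrees with \(\hat{p}_i=z_i/n\).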

What does our question mean in terms of the \(p_i\)’s? If there is no relationship between Ownership and Survival then X and Y are independent and we should have

\[P((X,Y)=(i,j))=P(X=i)P(Y=j)\]

for \(i,j=1,2\), or

\(p_1=(p_1+p_2)(p_1+p_3)\)

\(p_2=(p_1+p_2)(p_2+p_4)\)

\(p_3=(p_1+p_3)(p_3+p_4)\)

\(p_4=(p_2+p_4)(p_3+p_4)\)

It’s easy to see why if you think in terms of marginals:

Status               Alive    Dead     Total
Owns a Pet           p1       p3       p1+p3
Does not own a Pet   p2       p4       p2+p4
Total                p1+p2    p3+p4    1
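As a quick numerical sanity check of these equations, with arbitrarily chosen marginals: if the cell probabilities are products of the marginals, each of the four equations holds.

```r
# hypothetical marginals: P(owns a pet)=0.6, P(alive)=0.8
p.owns <- 0.6; p.alive <- 0.8
# cells in the order p1=owns-alive, p2=not-alive, p3=owns-dead, p4=not-dead
p <- c(p.owns*p.alive, (1-p.owns)*p.alive,
       p.owns*(1-p.alive), (1-p.owns)*(1-p.alive))
# first equation: p1 = (p1+p2)(p1+p3)
c(p[1], (p[1]+p[2])*(p[1]+p[3]))
```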

So let’s carry out the likelihood ratio test (LRT) for this problem:

\(H_0\): X independent of Y, which is equivalent to \(H_0\): the equations above hold.

We already found the unrestricted mle’s, so now we need to find the mle’s under the null hypothesis for the numerator. First note that:

\[ \begin{aligned} &p_1 =(p_1+p_2)(p_1+p_3)= \\ &p_1^2+p_1(p_2+p_3)+p_2p_3 = \\ &p_1^2+p_1(1-p_1-p_4)+p_2p_3 = \\ &p_1-p_1p_4+p_2p_3 \end{aligned} \] so we find

\[p_1p_4-p_2p_3=0\] In the same way we can verify that the other equations also lead to this one. So we need to maximize

\[\max\left\{ \sum z_i \log (p_i) \vert p_i\ge 0;\sum p_i=1;p_1p_4-p_2p_3=0\right\}\] Again we use Lagrange multipliers:

\[ \begin{aligned} &h(\pmb{p}) = \sum_{i} z_i \log p_i-\lambda_1 (\sum_{i} p_i-1)+\lambda_2(p_1p_4-p_2p_3)\\ &\frac{dh}{dp_1} =\frac{z_1}{p_1}-\lambda_1+\lambda_2p_4=0 \\ &\frac{dh}{dp_2} =\frac{z_2}{p_2}-\lambda_1-\lambda_2p_3=0 \\ &\frac{dh}{dp_3} =\frac{z_3}{p_3}-\lambda_1-\lambda_2p_2=0 \\ &\frac{dh}{dp_4} =\frac{z_4}{p_4}-\lambda_1+\lambda_2p_1=0 \end{aligned} \] which has the solution

\[ \begin{aligned} &\hat{\hat{p}}_1 =\frac{z_1+z_2}{n}\frac{z_1+z_3}{n} \\ &\hat{\hat{p}}_2 =\frac{z_2+z_1}{n}\frac{z_2+z_4}{n} \\ &\hat{\hat{p}}_3 =\frac{z_3+z_1}{n}\frac{z_3+z_4}{n} \\ &\hat{\hat{p}}_4 =\frac{z_4+z_2}{n}\frac{z_4+z_3}{n} \\ \end{aligned} \]
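For our data these constrained mle’s are easy to compute; note that they sum to 1 and satisfy the independence condition \(p_1p_4-p_2p_3=0\):

```r
z <- c(50, 28, 3, 11)
n <- sum(z)
# constrained mle's: products of the estimated marginals
p.null <- c((z[1]+z[2])*(z[1]+z[3]),
            (z[1]+z[2])*(z[2]+z[4]),
            (z[3]+z[4])*(z[1]+z[3]),
            (z[3]+z[4])*(z[2]+z[4]))/n^2
round(p.null, 4)
# both constraints are satisfied
c(sum(p.null), p.null[1]*p.null[4] - p.null[2]*p.null[3])
```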

Now let \(E_1=(z_1+z_2)(z_1+z_3)/n\) and so on, so that \(\hat{\hat{p}}_i=E_i/n\); then

\[ \begin{aligned} &\lambda(\pmb{x}) =\prod \frac{(E_i/n)^{z_i}}{(z_i/n)^{z_i}} \\ &-2\log \lambda(\pmb{x}) = -2\sum z_i\log \frac{E_i}{z_i} =\\ &-2\sum z_i \log\left(1+\left(\frac{E_i}{z_i} -1\right)\right)\approx\\ &-2\sum z_i\left[\left(\frac{E_i}{z_i} -1\right)-\frac12\left(\frac{E_i}{z_i} -1\right)^2\right] =\\ &-2\sum\left[E_i-z_i-\frac12\frac{(E_i-z_i)^2}{z_i} \right]=\\ &\sum \frac{(z_i-E_i)^2}{z_i}\approx \sum \frac{(z_i-E_i)^2}{E_i} \end{aligned} \] using the Taylor expansion \(\log(1+x)\approx x-\frac{x^2}{2}\), the fact that \(\sum E_i=\sum z_i=n\), and finally \(E_i\approx z_i\).
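We can check numerically how good these approximations are for our data, comparing the exact \(-2\log\lambda\) with the chi-square statistic:

```r
O <- c(50, 28, 3, 11)
n <- sum(O)
E <- c((O[1]+O[2])*(O[1]+O[3]),
       (O[1]+O[2])*(O[2]+O[4]),
       (O[3]+O[4])*(O[1]+O[3]),
       (O[3]+O[4])*(O[2]+O[4]))/n
G2 <- -2*sum(O*log(E/O))  # exact -2 log lambda
X2 <- sum((O-E)^2/E)      # the chi-square approximation
round(c(G2, X2), 3)       # quite close
```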

This shows how one eventually ends up with the famous chi-square statistic:

\[X^2=\sum \frac{(O-E)^2}{E}\] For our data we have:

O <- c(50, 28, 3, 11)
n <- sum(O)
E <- c((O[1]+O[2])*(O[1]+O[3]),
       (O[1]+O[2])*(O[2]+O[4]),
       (O[3]+O[4])*(O[1]+O[3]),
       (O[3]+O[4])*(O[2]+O[4]))/n
chi2 <- sum((O-E)^2/E)
round(c(chi2, 1-pchisq(chi2, 1)), 3) 
## [1] 8.851 0.003

Observed counts, with expected counts in parentheses:

Status               Alive       Dead
Owns a Pet           50 (44.9)   3 (8.1)
Does not own a Pet   28 (33.1)   11 (5.9)

Under the null hypothesis this statistic has (approximately) a chi-square distribution with \((r-1)(c-1)=1\) degree of freedom. With a p-value of 0.003 we reject the null hypothesis: there is a statistically significant association between Ownership and Survival.
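R’s built-in chisq.test carries out the same test (correct=FALSE turns off the continuity correction, which we have not used here):

```r
tbl <- matrix(c(50, 28, 3, 11), nrow=2)  # rows = ownership, columns = survival
out <- chisq.test(tbl, correct=FALSE)
round(c(out$statistic, out$p.value), 3)
```

This reproduces the statistic and p-value computed above.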

So again we see that one of the famous methods in Statistics can be derived from the likelihood ratio test (plus some extra approximations).


We have done this for a 2×2 table, but the generalization to an r×c table is straightforward.

Bayesian Analysis

As always, this starts with a prior. If we again use the parametrization \((p_1,..,p_4)\), then the vector of counts \(Z=(z_1,..,z_4)\) has a multinomial distribution with parameters \((n,p_1,..,p_4)\), where n is assumed to be known.

A conjugate prior for the multinomial is the Dirichlet distribution with density

\[\pi(p)\propto\prod p_i^{\alpha_i-1}\]

and then the posterior is \(p|z \sim \text{Dirichlet}(\alpha_1+z_1,.., \alpha_4+z_4)\).

The choice \(\alpha_1=..=\alpha_4=1\) gives a non-informative (uniform) prior, with prior mean \(p_i=1/4\). The null hypothesis of independence again translates into the condition \(p_1p_4-p_2p_3=0\), the same as above, and one can study the posterior distribution of this quantity. Indeed, under the non-informative prior one can again recover the chi-square test.
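A sketch of such a simulation-based analysis: draw from the Dirichlet posterior via the standard gamma representation (base R has no Dirichlet sampler) and examine the posterior distribution of \(p_1p_4-p_2p_3\):

```r
set.seed(1111)
z <- c(50, 28, 3, 11)
alpha <- rep(1, 4)  # non-informative prior
B <- 10000
# Dirichlet(a1,..,a4) draws: independent Gamma(ai) variables, normalized
g <- matrix(rgamma(4*B, shape=rep(alpha + z, each=B)), nrow=B)
p <- g/rowSums(g)
delta <- p[, 1]*p[, 4] - p[, 2]*p[, 3]
# posterior probability that p1*p4 - p2*p3 > 0
round(mean(delta > 0), 3)
```

The posterior puts essentially all its mass on \(p_1p_4-p_2p_3>0\), in agreement with the frequentist conclusion.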