Priors

The main issue in a Bayesian analysis is always the choice of priors. Here are some of the common methods:

Subjective Prior

In many ways this is what is required by the Bayesian paradigm, namely to find a prior that “encodes” your subjective belief about the parameter.

Example (3.2.1)

Let’s consider again the coin example from before. Here is a very different, but probably more realistic, prior. Before we flip the coin we reason as follows: either the coin is fair, and we think that is most likely the case, or, if it is not, we have no idea what it might be. We can “encode” this belief in the following prior:

Let \(\delta_{1/2}\) be the point mass at 1/2, that is a random variable which always takes the value 1/2, or \(P(\delta_{1/2} = 1/2) = 1\). Let \(U\sim U[0,1]\) and let \(Z\sim Ber(\alpha)\). Now \(p = Z\delta_{1/2} + (1-Z)U\).

So with probability \(\alpha\) p is just 1/2 (and the coin is fair), and with probability \(1-\alpha\) p is uniform on [0,1] (and we don’t have any idea what’s going on).

This prior is a mixture of a discrete and a continuous random variable, so we will need to be a bit careful with the calculations.

The cdf and the prior density are as follows

\[ \begin{aligned} &F(x) = P(p\le x) = \\ &P(p\le x\vert Z=0)P(Z=0)+P(p\le x\vert Z=1)P(Z=1) = \\ &\left\{ \begin{array}{lll} x(1-\alpha)+0\times \alpha & \text{if} &x<\frac12 \\ x(1-\alpha)+1\times \alpha & \text{if} & x\ge \frac12 \end{array} \right. = \\ &\left\{ \begin{array}{lll} x(1-\alpha) & \text{if} &x<\frac12 \\ \alpha+x(1-\alpha) & \text{if} & x\ge \frac12 \end{array} \right. \end{aligned} \] and so for \(x\ne \frac12\) the density is \(f(x)=1-\alpha\), because there the cdf is a straight line with slope \(1-\alpha\). Also

\[P(p= \frac12 )=\alpha+ \frac12 (1-\alpha)- \frac12 (1-\alpha)=\alpha\]
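
To get a feel for this prior we can simulate from it directly. Here is a minimal sketch; the value \(\alpha=0.8\) and the number of draws are just for illustration:

set.seed(111)
alpha <- 0.8
B <- 1e5
Z <- rbinom(B, 1, alpha)             # Z = 1: the coin is fair
p <- ifelse(Z == 1, 0.5, runif(B))   # otherwise p is uniform on [0, 1]
mean(p == 0.5)                       # should be close to alpha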

The joint density of p and y is:

\[f(y,p)=f(y|p)\pi(p) = \left\{ \begin{array}{lll} (1-\alpha){n \choose y}p^y(1-p)^{n-y} & \text{if} & p\ne \frac12 \\ \alpha{n \choose y}p^y(1-p)^{n-y} & \text{if} & p= \frac12 \end{array}\right.\]

and the marginal distribution of y:

\[ \begin{aligned} &m(y) = P(Y=y\vert Z=0)P(Z=0)+P(Y=y\vert Z=1)P(Z=1) =\\ &P(Y=y\vert p=\frac12)\alpha+P(Y=y\vert p\sim U[0,1])(1-\alpha) = \\ &\alpha{n \choose y}p^y(1-p)^{n-y}+\int_0^1 (1-\alpha){n \choose y}p^y(1-p)^{n-y}dp = \\ &\alpha{n \choose y}(\frac12)^y(1-(\frac12))^{n-y}+\\ &(1-\alpha){n \choose y}\frac{\Gamma(y+1)\Gamma(n-y+1)}{\Gamma(n+2)}\int_0^1 \frac{\Gamma(n+2)}{\Gamma(y+1)\Gamma(n-y+1)}p^{(y+1)-1}(1-p)^{(n-y+1)-1}dp = \\ &\alpha{n \choose y}(\frac12)^n+(1-\alpha){n \choose y}\frac{\Gamma(y+1)\Gamma(n-y+1)}{\Gamma(n+2)} = \\ &\alpha{n \choose y}(\frac12)^n+(1-\alpha)\frac{n!}{(n-y)!y!}\frac{y!(n-y)!}{(n+1)!} =\\ &\alpha{n \choose y}/2^n+\frac{1-\alpha}{n+1} \end{aligned} \]

because the uniform is also a Beta(1,1) and we can use the result above.
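
As a quick sanity check we can verify numerically that this marginal is a proper probability distribution, that is, that it sums to 1 over \(y=0,...,n\) (the values of n and \(\alpha\) below are just for illustration):

n <- 10; alpha <- 0.3
y <- 0:n
m <- alpha*choose(n, y)/2^n + (1-alpha)/(n+1)
sum(m)   # should be exactly 1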

Next we find the posterior distribution of p given y:

\[f(p\vert y) = \left\{ \begin{array}{lll} (1-\alpha){n \choose y}p^y(1-p)^{n-y}/\left[\alpha{n \choose y}/2^n+\frac{1-\alpha}{n+1}\right] & \text{if} &p\ne \frac12 \\ \alpha{n \choose y}p^y(1-p)^{n-y}/\left[\alpha{n \choose y}/2^n+\frac{1-\alpha}{n+1}\right] & \text{if} & p= \frac12 \end{array}\right.\]

Finally as above we can find the mean of the posterior distribution to get an estimate of p for a given y. To simplify the notation let \(K=\alpha{n \choose y}/2^n+\frac{1-\alpha}{n+1}\), and then

\[ \begin{aligned} &E[p|y] = \\ &\frac12 \alpha{n \choose y}/2^n/K + \int_0^1 p(1-\alpha){n \choose y}p^y(1-p)^{n-y}/Kdp = \\ & \alpha{n \choose y}/2^{n+1}/K + \\ &(1-\alpha){n \choose y}\int_0^1 p^{y+1}(1-p)^{n-y}/Kdp = \\ & \alpha{n \choose y}/2^{n+1}/K + \\ &(1-\alpha){n \choose y}\frac{\Gamma(y+2)\Gamma(n-y+1)}{\Gamma(n+3)}\int_0^1 \frac{\Gamma(n+3)}{\Gamma(y+2)\Gamma(n-y+1)}p^{(y+2)-1}(1-p)^{(n-y+1)-1}dp/K = \\ &\alpha{n \choose y}/2^{n+1}/K + (1-\alpha){n \choose y}\frac{\Gamma(y+2)\Gamma(n-y+1)}{\Gamma(n+3)}/K = \\ &\left[\alpha{n \choose y}/2^{n+1} + (1-\alpha)\frac{n!}{(n-y)!y!}\frac{(y+1)!(n-y)!}{(n+2)!}\right]/K = \\ &\left[\alpha{n \choose y}/2^{n+1} + (1-\alpha)\frac{y+1}{(n+2)(n+1)}\right]/\left[\alpha{n \choose y}/2^n+\frac{1-\alpha}{n+1}\right] \end{aligned} \]

Again let’s study the effect of the parameters on the posterior distribution:

  • effect of the sample size on the estimate of p
n <- 10*1:50       # sample sizes 10, 20, ..., 500
y <- 3*1:50        # keeps the observed proportion fixed at y/n = 0.3
phat <- 0*n
alpha <- 0.999     # prior puts almost all weight on the coin being fair
for (i in 1:50) 
  phat[i] <- (alpha*choose(n[i], y[i])/2^(n[i]+1) + 
    (1-alpha)*(y[i]+1)/(n[i]+2)/(n[i]+1))/(alpha*choose(n[i], y[i])/2^n[i] + (1-alpha)/(n[i]+1))
ggplot(data.frame(n=n, phat=phat), aes(n, phat)) +
  geom_point()

  • the effect of alpha on the estimate of p
alpha <- seq(0.01, 0.99, length = 100)  # range of prior weights on a fair coin
phat <- 0 * alpha
for (i in 1:100)                        # n = 10 and y = 3 are kept fixed
  phat[i] <- (alpha[i]*choose(10, 3)/2^(10 + 1) + 
      (1-alpha[i])*(3+1)/(10+2)/(10+1))/(alpha[i]*
      choose(10, 3)/2^10 + (1 - alpha[i])/(10+1))
ggplot(data.frame(alpha=alpha, phat=phat), aes(alpha, phat)) +
  geom_point()

A larger alpha means a prior that puts more weight on the coin being fair.

Generally finding a good subjective prior can be done using utility theory.

Conjugate Priors

These are priors which, together with the sampling distribution, lead to a posterior distribution of the same family as the prior. We already saw one example:

Example (3.2.2)

Binomial-Beta

Say \(X_1,...,X_n\sim Ber(p)\) and \(p\sim Beta(\alpha,\beta)\), then

\[p|\pmb{X=x}\sim Beta(\alpha+\sum x_i,\beta+n-\sum x_i)\]
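
Here is a minimal numerical check of this conjugacy, comparing the analytic posterior with the grid-normalized product of likelihood and prior; all numerical values are just for illustration:

set.seed(2023)
n <- 20
x <- rbinom(n, 1, 0.4)                    # Bernoulli data
a <- 2; b <- 3                            # prior Beta(a, b)
p.grid <- seq(0.001, 0.999, length = 500)
post <- sapply(p.grid, function(p) prod(dbinom(x, 1, p))*dbeta(p, a, b))
post <- post/sum(post)/diff(p.grid)[1]    # normalize on the grid
# should (up to grid error) agree with the analytic posterior
max(abs(post - dbeta(p.grid, a + sum(x), b + n - sum(x))))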

Here is another:

Example (3.2.3)

Poisson-Gamma

Say \(X_1,...,X_n\sim Pois(\lambda)\) and \(\lambda\sim Gamma(\alpha,\beta)\), with \(\beta\) a rate parameter, then

\[\lambda|\pmb{X=x}\sim Gamma(\alpha+\sum x_i,\beta+n)\]
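
The same kind of grid check works here, with \(\beta\) used as a rate parameter; again the values are purely illustrative:

set.seed(17)
n <- 15
x <- rpois(n, 3.2)
a <- 2; b <- 1                            # prior Gamma(a, b), b a rate
l.grid <- seq(0.01, 10, length = 500)
post <- sapply(l.grid, function(l) prod(dpois(x, l))*dgamma(l, a, rate = b))
post <- post/sum(post)/diff(l.grid)[1]
# should (up to grid error) agree with the analytic posterior
max(abs(post - dgamma(l.grid, a + sum(x), rate = b + n)))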

and of course we have

Example (3.2.4)

Normal-Normal

\(X_1,...,X_n\sim N(\mu,\sigma)\) and \(\mu\sim N(\tau,\theta)\), where \(\sigma,\tau\) and \(\theta\) are assumed to be known.

Then from (2.3.2) we have

\[f(x_1,...,x_n|\mu) =(2\pi\sigma^2)^{-n/2}\exp \left\{-\frac1{2\sigma^2}\sum(x_i-\mu)^2 \right\}\] and so the joint density is given by

\[f(x_1,...,x_n,\mu) =\\(2\pi\sigma^2)^{-n/2}\exp \left\{-\frac1{2\sigma^2}\sum(x_i-\mu)^2 \right\}(2\pi\theta^2)^{-1/2}\exp \left\{-\frac1{2\theta^2}(\mu-\tau)^2 \right\}\] Now \[ \begin{aligned} &\sum(x_i-\mu)^2 = \\ &\sum(x_i-\bar{x}+\bar{x}-\mu)^2 = \\ &\sum \left[(x_i-\bar{x})^2+2(x_i-\bar{x})(\bar{x}-\mu)+(\bar{x}-\mu)^2 \right]= \\ &\sum (x_i-\bar{x})^2+2(\bar{x}-\mu)\sum (x_i-\bar{x})+n(\bar{x}-\mu)^2 = \\ &\sum (x_i-\bar{x})^2+2(\bar{x}-\mu) (\sum x_i-n\bar{x})+n(\bar{x}-\mu)^2 = \\ &\sum (x_i-\bar{x})^2+n(\bar{x}-\mu)^2 \end{aligned} \]

therefore

\[ \begin{aligned} &f(x_1,...,x_n,\mu) = \\ &(2\pi\sigma^2)^{-n/2}\exp \left\{-\frac1{2\sigma^2}\left[\sum (x_i-\bar{x})^2+n(\bar{x}-\mu)^2\right] \right\} \frac1{\sqrt{2\pi\theta^2}}\exp \left\{ -\frac1{2\theta^2}(\mu-\tau)^2 \right\} = \\ &\exp \left\{-\frac1{2\sigma^2}\left[\sum (x_i-\bar{x})^2\right] \right\} (2\pi\sigma^2)^{-n/2}\frac1{\sqrt{2\pi\theta^2}}\exp \left\{-\frac12 \left[\frac{n(\bar{x}-\mu)^2}{\sigma^2} +\frac{(\mu-\tau)^2}{\theta^2}\right]\right\} \end{aligned} \] Now

\[T=\frac{n(\bar{x}-\mu)^2}{\sigma^2} +\frac{(\mu-\tau)^2}{\theta^2} = \frac{n\theta^2(\bar{x}-\mu)^2+\sigma^2(\mu-\tau)^2}{\sigma^2\theta^2}\]


In the numerator we have

\[ \begin{aligned} &n\theta^2(\bar{x}-\mu)^2+\sigma^2(\mu-\tau)^2=\\ &n\theta^2(\bar{x}^2-2\bar{x}\mu+\mu^2)+\sigma^2(\mu^2-2\tau\mu+\tau^2) = \\ &(n\theta^2+\sigma^2)\mu^2-2(n\theta^2\bar{x}+\sigma^2\tau)\mu+n\theta^2\bar{x}^2+\sigma^2\tau^2 = \\ &(n\theta^2+\sigma^2)\left\{\mu^2-2\frac{n\theta^2\bar{x}+\sigma^2\tau}{n\theta^2+\sigma^2}\mu\right\}+n\theta^2\bar{x}^2+\sigma^2\tau^2 = \\ &(n\theta^2+\sigma^2)\left\{\mu^2-2\frac{n\theta^2\bar{x}+\sigma^2\tau}{n\theta^2+\sigma^2}\mu+\left(\frac{n\theta^2\bar{x}+\sigma^2\tau}{n\theta^2+\sigma^2}\right)^2-\left(\frac{n\theta^2\bar{x}+\sigma^2\tau}{n\theta^2+\sigma^2}\right)^2\right\}\\ \text{ } &+n\theta^2\bar{x}^2+\sigma^2\tau^2 = \\ &(n\theta^2+\sigma^2)\left\{\mu-\frac{n\theta^2\bar{x}+\sigma^2\tau}{n\theta^2+\sigma^2}\right\}^2-\frac{(n\theta^2\bar{x}+\sigma^2\tau)^2}{n\theta^2+\sigma^2}+n\theta^2\bar{x}^2+\sigma^2\tau^2 \\ \end{aligned} \] and so

\[ \begin{aligned} &T = \\ &\left\{ (n\theta^2+\sigma^2)\left\{\mu-\frac{n\theta^2\bar{x}+\sigma^2\tau}{n\theta^2+\sigma^2}\right\}^2-\frac{(n\theta^2\bar{x}+\sigma^2\tau)^2}{n\theta^2+\sigma^2}+n\theta^2\bar{x}^2+\sigma^2\tau^2\right\}/(\sigma^2\theta^2) = \\ &(n/\sigma^2+1/\theta^2)\left\{\mu-\frac{n\theta^2\bar{x}+\sigma^2\tau}{n\theta^2+\sigma^2}\right\}^2-\frac{(n\theta^2\bar{x}+\sigma^2\tau)^2}{(n\theta^2+\sigma^2)\sigma^2\theta^2}+n\bar{x}^2/\sigma^2+\tau^2/\theta^2 \end{aligned} \]

Let’s define

\[ \begin{aligned} &\mu_1=\frac{n\theta^2\bar{x}+\sigma^2\tau}{n\theta^2+\sigma^2} = \\ &\frac{n\bar{x}/\sigma^2+\tau/\theta^2}{(n\theta^2+\sigma^2)/(\sigma^2\theta^2)} = \\ &\frac{n\bar{x}/\sigma^2+\tau/\theta^2}{\frac{n}{\sigma^2}+\frac1{\theta^2}} \end{aligned} \]

and

\[\sigma_1^2=\frac{\sigma^2\theta^2}{n\theta^2+\sigma^2}=\frac{1}{\frac{n}{\sigma^2}+\frac1{\theta^2}}\]

and so

\[ \begin{aligned} &T=\\ &(n/\sigma^2+1/\theta^2)\left\{\mu-\frac{n\theta^2\bar{x}+\sigma^2\tau}{n\theta^2+\sigma^2}\right\}^2-\frac{(n\theta^2\bar{x}+\sigma^2\tau)^2}{(n\theta^2+\sigma^2)\sigma^2\theta^2}+n\bar{x}^2/\sigma^2+\tau^2/\theta^2 =\\ &\left(\frac{\mu-\mu_1}{\sigma_1}\right)^2-\left(\frac{n\theta^2\bar{x}+\sigma^2\tau}{n\theta^2+\sigma^2}\right)^2\frac{n\theta^2+\sigma^2}{\sigma^2\theta^2}+n\bar{x}^2/\sigma^2+\tau^2/\theta^2 = \\ &\left(\frac{\mu-\mu_1}{\sigma_1}\right)^2-\mu_1^2/\sigma_1^2+n\bar{x}^2/\sigma^2+\tau^2/\theta^2 \end{aligned} \]

Now we put it all together:

\[ \begin{aligned} &f(x_1,...,x_n,\mu) = \\ &\exp \left\{-\frac1{2\sigma^2}\left[\sum (x_i-\bar{x})^2\right] \right\}\times\\ &(2\pi\sigma^2)^{-n/2}\frac1{\sqrt{2\pi\theta^2}}\exp \left\{-\frac12 \left[\frac{n(\bar{x}-\mu)^2}{\sigma^2} +\frac{(\mu-\tau)^2}{\theta^2}\right]\right\} = \\ &\exp \left\{-\frac1{2\sigma^2}\left[\sum (x_i-\bar{x})^2\right] \right\}\times \\ &(2\pi\sigma^2)^{-n/2}\frac1{\sqrt{2\pi\theta^2}}\exp \left\{-\frac12 \left[ \left(\frac{\mu-\mu_1}{\sigma_1}\right)^2-\mu_1^2/\sigma_1^2+n\bar{x}^2/\sigma^2+\tau^2/\theta^2\right]\right\} = \\ &\exp \left\{-\frac1{2\sigma^2}\left[\sum (x_i-\bar{x})^2-\mu_1^2\sigma^2/\sigma_1^2+n\bar{x}^2+\sigma^2\tau^2/\theta^2\right] \right\} (2\pi\sigma^2)^{-n/2}\frac1{\sqrt{2\pi\theta^2}}\sqrt{2\pi\sigma_1^2}\\ &\frac1{\sqrt{2\pi\sigma_1^2}}\exp \left\{-\frac12 \left(\frac{\mu-\mu_1}{\sigma_1}\right)^2\right\} \\ \end{aligned} \]

\[ \begin{aligned} &m(\pmb{x}) = \\ &\exp \left\{-\frac1{2\sigma^2}\left[\sum (x_i-\bar{x})^2-\mu_1^2\sigma^2/\sigma_1^2+n\bar{x}^2+\sigma^2\tau^2/\theta^2\right] \right\} (2\pi\sigma^2)^{-n/2}\frac1{\sqrt{2\pi\theta^2}}\sqrt{2\pi\sigma_1^2}\\ &f(\mu|\pmb{x}) =\frac{f(\pmb{x},\mu)}{m(\pmb{x})} = \frac1{\sqrt{2\pi\sigma_1^2}}\exp \left\{-\frac12 \left(\frac{\mu-\mu_1}{\sigma_1}\right)^2\right\} \\ \end{aligned} \] and so the posterior distribution is a \(N(\mu_1,\sigma_1)\).
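
Here is a small sketch that computes \(\mu_1\) and \(\sigma_1\) for a simulated data set and checks them against a grid version of the posterior; all numerical values are chosen purely for illustration:

set.seed(333)
n <- 25; sigma <- 2; tau <- 10; theta <- 3
x <- rnorm(n, 12, sigma)
xbar <- mean(x)
mu1 <- (n*xbar/sigma^2 + tau/theta^2)/(n/sigma^2 + 1/theta^2)
sigma1 <- sqrt(1/(n/sigma^2 + 1/theta^2))
mu.grid <- seq(mu1 - 4*sigma1, mu1 + 4*sigma1, length = 500)
post <- sapply(mu.grid, function(m) prod(dnorm(x, m, sigma))*dnorm(m, tau, theta))
post <- post/sum(post)/diff(mu.grid)[1]   # normalize on the grid
# should (up to grid error) agree with dnorm(mu.grid, mu1, sigma1)
max(abs(post - dnorm(mu.grid, mu1, sigma1)))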

Non-informative Priors

Just what it says: a prior that does not contain any “information” about the parameter.

Example (3.2.5)

\(X_1, ..., X_n\sim Ber(p)\), then \(p\sim U[0,1]\) is a non-informative prior.

Example (3.2.6)

\(X_1, ..., X_n\sim N(\mu,\sigma)\), \(\sigma\) known. Now \(\mu\) can be any real number, so a prior has to be a density on the whole real line. But \(\pi(\mu)=c\) is not possible, because it integrates to infinity for any \(c>0\)!

There are two solutions to this problem:

  • Allow improper priors, that is priors with an infinite integral. This is generally ok as long as the posterior is a proper density:

\[ \begin{aligned} &f(\pmb{x}|\mu) =\prod_{i=1}^n \frac1{\sqrt{2\pi\sigma^2}}\exp \left\{-\frac{(x_i-\mu)^2}{2\sigma^2} \right\} =\\ &(2\pi\sigma^2)^{-n/2}\exp \left\{-\frac{\sum (x_i-\mu)^2}{2\sigma^2} \right\} = \\ &(2\pi\sigma^2)^{-n/2}\exp \left\{-\frac{\sum (x_i-\bar{x})^2+n(\mu-\bar{x})^2}{2\sigma^2} \right\} = \\ &(2\pi\sigma^2)^{-(n-1)/2}/\sqrt{n}\exp \left\{-\frac{\sum (x_i-\bar{x})^2}{2\sigma^2} \right\} \frac1{\sqrt{2\pi\sigma^2/n}}\exp \left\{-\frac{(\mu-\bar{x})^2}{2\sigma^2/n} \right\} \end{aligned} \]

where we used (2.2.4). Next we find the marginal:

\[ \begin{aligned} &m(\pmb{x})=\int_{-\infty}^{\infty}f(\pmb{x}|\mu)\pi(\mu)d\mu=\\ &(2\pi\sigma^2)^{-(n-1)/2}/\sqrt{n}\exp \left\{-\frac{\sum (x_i-\bar{x})^2}{2\sigma^2} \right\}\int_{-\infty}^{\infty}\frac1{\sqrt{2\pi\sigma^2/n}}\exp \left\{-\frac{(\mu-\bar{x})^2}{2\sigma^2/n} \right\} d\mu = \\ &(2\pi\sigma^2)^{-(n-1)/2}/\sqrt{n}\exp \left\{-\frac{\sum (x_i-\bar{x})^2}{2\sigma^2} \right\} \end{aligned} \]

because the integral is over a normal density and therefore equal to 1. Finally we find the posterior distribution:

\[ \begin{aligned} &f(\mu|\pmb{x}) =\frac{f(\pmb{x}|\mu)\pi(\mu)}{m(\pmb{x})} = \\ &\frac{(2\pi\sigma^2)^{-(n-1)/2}/\sqrt{n}\exp \left\{-\frac{\sum (x_i-\bar{x})^2}{2\sigma^2} \right\} \frac1{\sqrt{2\pi\sigma^2/n}}\exp \left\{-\frac{(\mu-\bar{x})^2}{2\sigma^2/n} \right\}}{(2\pi\sigma^2)^{-(n-1)/2}/\sqrt{n}\exp \left\{-\frac{\sum (x_i-\bar{x})^2}{2\sigma^2} \right\}} = \\ &\frac1{\sqrt{2\pi\sigma^2/n}}\exp \left\{-\frac{(\mu-\bar{x})^2}{2\sigma^2/n} \right\} \end{aligned} \]

and so we find \(\mu|\pmb{X=x}\sim N(\bar{x},\sigma/\sqrt{n})\)

One justification for this is that we can usually express the improper prior as the limit of a sequence of proper priors: we already saw that if \(X_1,...,X_n\sim N(\mu,\sigma)\) and \(\mu\sim N(\tau ,\theta)\), then \(\mu|\pmb{X=x}\sim N(\mu_1,\sigma_1)\) with

\[ \begin{aligned} &\mu_1=\frac{n\bar{x}/\sigma^2+\tau/\theta^2}{\frac{n}{\sigma^2}+\frac1{\theta^2}}\rightarrow \bar{x}\\ &\sigma_1^2=\frac{1}{\frac{n}{\sigma^2}+\frac1{\theta^2}}\rightarrow \sigma^2/n \end{aligned} \] as \(\theta\rightarrow\infty\), which is exactly the posterior we just found with the improper prior.
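
A quick numerical illustration of this limit (the data and the parameter values below are purely illustrative):

set.seed(5)
n <- 20; sigma <- 1; tau <- 0
x <- rnorm(n, 2, sigma)
xbar <- mean(x)
theta <- c(1, 10, 100, 1000)
mu1 <- (n*xbar/sigma^2 + tau/theta^2)/(n/sigma^2 + 1/theta^2)
sigma1 <- sqrt(1/(n/sigma^2 + 1/theta^2))
round(rbind(theta, mu1, sigma1), 4)   # mu1 -> xbar, sigma1 -> sigma/sqrt(n)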

  • A second solution is to think a little more about what “non-informative” really means.

Example (3.2.7)

Say we have \(X_1, .., X_n\sim N(\mu,\sigma)\) and we want to estimate \(\sigma\). At first it seems we should use \(\pi(\sigma)=1\), \(\sigma>0\). It turns out, though, that this is not really “completely non-informative” because of the following: say we estimate \(\sigma=2.7\), then there is a small interval \((0,2.7)\) “below” our estimate but a very large interval \((2.7, \infty )\) “above” it.

There is a class of priors that were developed explicitly to express this idea of “complete lack of knowledge”, called Jeffreys priors:

Let’s say \(X_1, .., X_n\sim f(x|\theta)\). Jeffreys considered one-to-one transformations of the parameter \(\delta =u(\theta)\). From our discussions in probability we know that the prior density of \(\delta\) is given by

\[ f_\delta(\delta)=f_\theta\left(u^{-1}(\delta)\right)\left|\frac{d}{d\delta}u^{-1}(\delta)\right| \]

Jeffreys argued that true lack of knowledge means that any transformation of the parameter should yield an equivalent result. Essentially it should not matter whether we measure temperature in Fahrenheit or Centigrade. He showed that this implies that the prior should be proportional to the square root of the Fisher information:

\[\pi(\theta) \propto \sqrt{J(\theta)}\] where

\[J(\theta)=-E\left[\frac{d^2\log f(x|\theta)}{d\theta^2}\right]\]

Here, treating \(\mu\) as fixed, this means

\[ \begin{aligned} &\log f(x|\sigma) =\log\left[\frac1{\sqrt{2\pi\sigma^2}}\exp \left\{-\frac1{2\sigma^2}(x-\mu)^2 \right\} \right] =\\ &-\frac12\log(2\pi)-\log\sigma -\frac1{2\sigma^2}(x-\mu)^2 \\ &\frac{d\log f(x|\sigma)}{d\sigma} = -\frac1\sigma +\frac1{\sigma^3}(x-\mu)^2\\ &\frac{d^2\log f(x|\sigma)}{d\sigma^2}= \frac1{\sigma^2} -\frac3{\sigma^4}(x-\mu)^2\\ &E\left[\frac{d^2\log f(X|\sigma)}{d\sigma^2}\right] = \\ &E\left[\frac1{\sigma^2} -\frac3{\sigma^4}(X-\mu)^2\right] =\\ &\frac1{\sigma^2} -\frac3{\sigma^4}E[(X-\mu)^2] =\\ &\frac1{\sigma^2} -\frac3{\sigma^4}\sigma^2 = -2/\sigma^2\\ &J(\sigma)=-E\left[\frac{d^2\log f(X|\sigma)}{d\sigma^2}\right]=2/\sigma^2 \end{aligned} \]

and so the Jeffreys prior for \(\sigma\) is proportional to \(\sqrt{2/\sigma^2}\propto 1/\sigma\). Note that this is an improper prior.
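
We can verify the expectation in this calculation with a simple Monte Carlo check; the values of \(\mu\) and \(\sigma\) below are just for illustration:

set.seed(99)
mu <- 0; sigma <- 1.5
x <- rnorm(1e6, mu, sigma)
d2 <- 1/sigma^2 - 3*(x - mu)^2/sigma^4   # second derivative of log f(x|sigma)
c(mean(d2), -2/sigma^2)                  # the two numbers should be close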

Empirical Bayes

Example (3.2.8)

Say \(X_1, .., X_n\sim Pois( \lambda )\) and \(\lambda\sim Gamma(\alpha,\beta)\); then we know that

\[\lambda|X=x\sim Gamma(\alpha+\sum x_i, \beta+n)\]

But how do we choose \(\alpha\) and \(\beta\)? In a subjective Bayesian analysis we would need to use prior knowledge to choose them. The idea of empirical Bayes is to use the data itself to estimate the “hyper-parameters” \(\alpha\) and \(\beta\). For example, in the rate parameterization used above the mean of a Gamma(\(\alpha\),\(\beta\)) is \(\alpha/\beta\) and the variance is \(\alpha/\beta^2\). So we choose \(\alpha\) and \(\beta\) as the solutions of the system of non-linear equations

\(\bar{X} =\alpha/\beta\)
\(s^2=\alpha/\beta^2\)

or

\(s^2=\frac{\alpha}{\beta}\times\frac{1}{\beta}=\bar{X}\times\frac{1}{\beta}\)

so

\(\beta=\bar{X}/s^2\) and \(\alpha=\beta\bar{X}=\bar{X}^2/s^2\).
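
Here is a minimal sketch of this recipe on simulated Poisson counts, assuming the rate parameterization implied by the update \(\beta+n\); all values are illustrative:

set.seed(42)
x <- rpois(50, 4)
xbar <- mean(x); s2 <- var(x)
beta.hat <- xbar/s2      # method-of-moments estimates of the hyper-parameters
alpha.hat <- xbar^2/s2
c(alpha.hat, beta.hat)
# empirical Bayes posterior: lambda | x ~ Gamma(alpha.hat + sum(x), beta.hat + length(x))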

One can go even further and use the empirical distribution function as the cdf of the prior itself. In many ways, though, empirical Bayes goes against the spirit of Bayesian analysis. It also violates the likelihood principle!

These are just some of the methods for finding priors; there are many others.

The situation becomes much more difficult in multi-dimensional problems, and there is a lot of disagreement among Bayesians about what constitutes a good prior there.

Calculation Issues

In most cases actual calculations have to be done using numerical methods and/or simulation.

Numerical Solutions

Example (3.2.9)

The following is a data set from a Beta distribution with parameters \(\alpha\) and \(\beta=2.5\):

0.62 0.62 0.34 0.58 0.50 0.46 0.64 0.40 0.44 0.29
0.76 0.45 0.49 0.55 0.76 0.75 0.77 0.19 0.58 0.65
0.30 0.50 0.42 0.46 0.36 0.68 0.07 0.67 0.48 0.17
0.03 0.36 0.76 0.30 0.12 0.22 0.29 0.33 0.89 0.78
0.29 0.09 0.33 0.03 0.45 0.30 0.86 0.44 0.49 0.06

We want to find an estimate for \(\alpha\), using the prior \(\alpha\sim N(1.5, 1/4)\) and the median of the posterior distribution as our estimate. The marginal turns out to be

\[m(\pmb{x})=\int_0^\infty \left(\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\right)^n\left(\prod x_i\right)^{\alpha-1}\left(\prod (1-x_i)\right)^{\beta-1}\frac1{\sqrt{\pi/8}}\exp\left\{-8(\alpha-1.5)^2\right\}d\alpha\] clearly there is no way to find this integral analytically, so we will do it numerically:

find.alpha <- function(data) {
  # while marginal = 1 the "integrand" is just the joint density
  # f(x, alpha); once marginal has been computed it is the posterior
  marginal <- 1
  integrand <- function(t) {
    y <- numeric(length(t))
    for(i in seq_along(t)) 
      y[i] <- prod(dbeta(data, t[i], 2.5)) *
              dnorm(t[i], 1.5, 0.25)/marginal
    y
  } 
  # the marginal m(x)
  marginal <- integrate(integrand, 0, 4)$value
  # find the posterior median: increase a until the posterior cdf
  # passes 0.5, refining the search in steps of 1, 0.1 and 0.01
  step <- c(1, 0.1, 0.01)
  a <- 0
  for(i in 1:3) {
    repeat {
      a <- a + step[i]
      I <- integrate(integrand, 0, a)$value
      print(c(a, I))
      if(I > 0.5) break
    }
    a <- a - step[i]
  } 
  a 
}
find.alpha(beta.data)
## [1] 1.000000e+00 1.108583e-07
## [1] 2.0000000 0.9560108
## [1] 1.100000e+00 5.531253e-06
## [1] 1.2000000000 0.0001362987
## [1] 1.300000000 0.001806294
## [1] 1.40000000 0.01382355
## [1] 1.50000000 0.06505595
## [1] 1.6000000 0.1998564
## [1] 1.7000000 0.4262885
## [1] 1.8000000 0.6760933
## [1] 1.7100000 0.4518548
## [1] 1.7200000 0.4775687
## [1] 1.730000 0.503323
## [1] 1.72

Simulation

Because analytic solutions are only possible in simple problems, and numerical solutions become difficult in problems with more than 3 or 4 parameters, the most common approach today is to simulate data from the posterior distribution.

This is possible because, using a methodology called Markov Chain Monte Carlo (MCMC), one can sample from the posterior without knowing any normalizing constants, that is, without having to find the marginal m(\(\pmb{x}\)).

Example (3.2.10)

Let’s do the example above again. The routine betaMCMC() generates data from the posterior and finds the sample median. Notice that in the definition of the posterior we just use the joint density, without the m(\(\pmb{x}\)).

betaMCMC <- function (x, just.alpha=TRUE, B = 10000) {
  # un-normalized posterior: the joint density of data and parameter(s);
  # the marginal m(x) is never needed
  if(just.alpha) 
    posterior <- function(a, b) 
      prod(dbeta(x, a, b)) * dnorm(a, 1.5, 0.25) 
  else    
    posterior <- function(a, b) 
      prod(dbeta(x, a, b)) * 
      dnorm(a, 1.5, 0.25) * dnorm(b, 2.5, 0.5)
                                  
  A <- matrix(0, B, 2)
  A[1, ] <- c(1.7, 2.5)
  for (i in 2:B) {
    # propose new values uniformly in a window around the current ones
    u <- runif(1, max(0, A[i-1, 1]-0.5), A[i-1, 1]+0.5)
    if(!just.alpha) 
      v <- runif(1, max(0, A[i-1, 2]-0.5), A[i-1, 2]+0.5)
    else v <- 2.5
    # accept or reject the proposal
    if (runif(1) < posterior(u, v)/posterior(A[i-1, 1], 
                                   A[i-1, 2])) 
            A[i, ] <- c(u, v)
    else A[i, ] <- A[i - 1, ]
  }
  # histograms of the generated values
  bw <- diff(range(A[, 1]))/50 
  plot(ggplot(data.frame(x=A[, 1]), aes(x)) +
      geom_histogram(aes(y = ..density..),
        color = "black", 
        fill = "white", 
        binwidth = bw))  
  if(just.alpha) return(median(A[, 1]))    
  bw <- diff(range(A[, 2]))/50 
  plot(ggplot(data.frame(x=A[, 2]), aes(x)) +
      geom_histogram(aes(y = ..density..),
        color = "black", 
        fill = "white", 
        binwidth = bw))  
  apply(A, 2, median)
}
betaMCMC(beta.data)

## [1] 1.722946

It is fairly simple to include \(\beta\) as a free parameter as well. As a prior for \(\beta\) we will use a N(2.5, 0.5):

betaMCMC(beta.data, just.alpha = FALSE)

## [1] 1.606672 2.148831

Want to know why this generates data from the posterior? Come to ESMA 5015 Simulation!