Bayesian Statistics

Prior and Posterior Distribution

Say we have a sample \(\textbf{X}=(X_1,..,X_n)\), iid from some probability density \(f(\cdot| \theta)\), and we want to do some inference on the parameter \(\theta\).

A Bayesian analysis begins by specifying a prior distribution \(\pi(\theta)\). Then one uses Bayes’ formula to calculate the posterior distribution:

\[ f(\theta| \textbf{x}) = f(\textbf{x}|\theta)\pi(\theta)/m(\textbf{x}) \]

where \(m(\textbf{x})\) is the marginal distribution

\[ m(\textbf{x}) = \int \cdots \int f(\textbf{x}|\theta)\pi(\theta) d \theta \]
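
To make the formula concrete, here is a minimal sketch (not one of the examples below) that carries out Bayes' formula numerically on a grid; the model and numbers (a single observation \(x=0.5\) from a \(N(\theta, 1)\), prior \(\theta \sim N(0, 1)\)) are made up for illustration:

theta <- seq(-4, 4, length=400) # a grid of parameter values
like <- dnorm(0.5, theta, 1) # f(x|theta)
prior <- dnorm(theta, 0, 1) # pi(theta)
mx <- sum(like*prior)*(theta[2]-theta[1]) # marginal m(x) as a Riemann sum
posterior <- like*prior/mx # Bayes' formula
plot(theta, posterior, type="l")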

Example: Normal mean, normal prior

\(X \sim N(\mu, \sigma)\), with \(\sigma\) known and prior \(\mu \sim N(a, b)\) (here the second parameter of \(N\) is the standard deviation).

Now

\[ \begin{aligned} &m(x) = \int_{-\infty}^\infty f(x|\mu)\pi(\mu) d\mu = \\ &\int_{-\infty}^\infty \frac{1}{\sqrt{2\pi \sigma^2}} e^{ -\frac1{2\sigma^2} (x-\mu)^2 } \frac{1}{\sqrt{2\pi b^2}} e^{ -\frac1{2 b^2} (\mu-a)^2 } d\mu \\ \end{aligned} \]

Note

\[ \begin{aligned} &(x-\mu)^2/\sigma^2 + (\mu-a)^2/b^2 = \\ &x^2/\sigma^2 - 2x\mu/\sigma^2 + \mu^2/\sigma^2 + \mu^2/b^2 - 2a\mu/b^2 +a^2/b^2 = \\ &(1/\sigma^2+1/b^2)\mu^2 -2(x/\sigma^2+a/b^2 )\mu + K_1 = \\ &(1/\sigma^2+1/b^2) \left( \mu^2 -2 \frac{x/\sigma^2+a/b^2}{1/\sigma^2+1/b^2}\mu \right) +K_2 =\\ &\frac{(\mu-d)^2}{c^2}+K_3 \end{aligned} \]

where \(d=\frac{x/\sigma^2+a/b^2}{1/\sigma^2+1/b^2}\) and \(c=1/\sqrt{1/\sigma^2+1/b^2}\).

therefore

\[ \begin{aligned} &m(x) = K_4 \int_{-\infty}^\infty e^{ -\frac1{2c^2} (\mu-d)^2 } d \mu= K_5\\ \end{aligned} \]

because the integrand is, up to a constant, a normal density with mean \(d\) and standard deviation \(c\), so the integral is just some constant that does not depend on \(\mu\).

\[ \begin{aligned} &f(\mu| x) = f(x|\mu)\pi(\mu)/m(x) = \\ & K_6 e^{ -\frac1{2c^2} (\mu-d)^2 } \\ \end{aligned} \] Notice that we don’t need to worry about what exactly \(K_6\) is: the posterior is a proper probability density, so \(K_6\) has to be whatever makes it integrate to 1!

So we found

\[ \mu|X=x \sim N\left(\frac{x/\sigma^2+a/b^2}{1/\sigma^2+1/b^2}, 1/\sqrt{1/\sigma^2+1/b^2}\right) \] Let’s do a little simulation to see whether we got this right:

a <- 0.2 # just as an example
b <- 2.3
sigma <- 0.5
mu <- rnorm(1e5, a, b) # draw mu's from the prior
x <- rnorm(1e5, mu, sigma) # one observation for each mu
x0 <- 0.1
cc <- 1/sqrt(1/sigma^2 + 1/b^2)
d <- (x0/sigma^2 + a/b^2)/(1/sigma^2 + 1/b^2)
z <- mu[x > x0-0.05 & x < x0+0.05] # keep the mu's whose observation landed near x0
hist(z, 50, freq=FALSE)
curve(dnorm(x, d, cc), -2, 2, add=TRUE) # the theoretical posterior N(d, c)

x0 <- (-1.1) # repeat for a different observed value
d <- (x0/sigma^2 + a/b^2)/(1/sigma^2 + 1/b^2)
z <- mu[x > x0-0.05 & x < x0+0.05]
hist(z, 50, freq=FALSE)
curve(dnorm(x, d, cc), -3, 2, add=TRUE)

Example: Binomial proportion, Uniform prior

\(X \sim Bin(n, p)\), \(p \sim U[0,1]\)

\[ \begin{aligned} &m(x) = \int_0^1 f(x|p)\pi(p) dp = \\ &\int_0^1 {{n}\choose{x}}p^x(1-p)^{n-x}\cdot 1 \, dp =\\ &K_1\int_0^1 p^{(x+1)-1} (1-p)^{(n-x+1)-1}dp = K_2\\ \end{aligned} \]

because this is (up to a constant) a Beta density which will integrate to 1. So

\[ \begin{aligned} &f(p| x) = f(x|p)\pi(p)/m(x) = \\ & K_3 p^{(x+1)-1} (1-p)^{(n-x+1)-1} \\ \end{aligned} \] and we find

\[ p|X=x \sim \text{Beta}(x+1, n-x+1) \]

n <- 10
p <- runif(1e5) # draw p's from the U[0,1] prior
x <- rbinom(1e5, n, p) # one binomial observation for each p
x0 <- 3
z <- p[x == x0] # keep the p's whose observation equals x0
hist(z, 50, freq=FALSE)
curve(dbeta(x, x0+1, n-x0+1), 0, 1, add=TRUE) # the theoretical posterior

Inference

In a Bayesian analysis any inference is done from the posterior distribution. For example, point estimates can be found as the mean, median, mode or any other measure of central tendency:

Example: Normal mean, normal prior

Say the following is a sample \(X_1,..,X_n\) from a normal distribution with known standard deviation \(\sigma=2.3\):

dta.norm
##  [1]  1.1  1.8  2.9  3.3  4.3  4.5  4.6  5.1  5.2  5.3  5.8  6.2  6.6  6.7
## [15]  6.9  7.1  7.7  7.7  7.8 10.5

If we decide to base our analysis on the sample mean we have \(\bar X \sim N(\mu, \sigma/\sqrt{n})\), so the normal-normal result above applies with \(x\) replaced by \(\bar x\) and \(\sigma^2\) by \(\sigma^2/n\). Using the posterior mean as a point estimate we find

\[ E[\mu|\bar X=\bar x] = \frac{\bar x/(\sigma^2/n)+a/b^2}{1/(\sigma^2/n)+1/b^2} \]

a weighted average of the sample mean \(\bar x\) and the prior mean \(a\). Now we need to decide which \(a\) and \(b\) to use. If we have some prior information we can use that. Say we expect a priori that \(\mu=5\), and of course we know \(\sigma=2.3\); then we could use \(a=5\) and \(b=3\):

d <- (mean(dta.norm)/(2.3^2/20) + 5/3^2)/(1/(2.3^2/20) + 1/3^2)
d
## [1] 5.539

If we wanted to find a 95% interval estimate (now called a credible interval) we can find it directly from the posterior distribution:

cc <- 1/sqrt(1/(2.3^2/20) + 1/3^2)
cc
## [1] 0.5069
round(qnorm(c(0.025, 0.975), d, cc), 2)
## [1] 4.55 6.53

Note: the standard frequentist solution (a z interval, since \(\sigma\) is known) would be

round(mean(dta.norm), 2)
## [1] 5.55
round(mean(dta.norm)+c(-1, 1)*qnorm(0.975)*2.3/sqrt(20), 2)
## [1] 4.55 6.56

Example: Binomial proportion, uniform prior

Say in a class with 134 students 31 received an A. Find a 90% credible interval for the true percentage of students who get an A in this class.

round(qbeta(c(0.05, 0.95), 31 + 1, 134 - 31 + 1)*100, 1)
## [1] 17.8 29.7
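
As a quick check we can redo the earlier simulation with these numbers; the sample quantiles should come out close to the interval above:

p <- runif(1e6) # U[0,1] prior
x <- rbinom(1e6, 134, p)
round(quantile(p[x == 31], c(0.05, 0.95))*100, 1)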

Example: Normal mean, Gamma prior

Let’s say that \(\mu\) is a physical quantity, like the mean amount of money paid on sales. In that case it makes more sense to use a prior that forces \(\mu\) to be non-negative. For example we could use \(\mu \sim \text{Gamma}(\alpha, \beta)\). However, now we need to find

\[ m(x) = \int_0^\infty \frac{1}{\sqrt{2\pi \sigma^2}} e^{ -\frac1{2\sigma^2} (x-\mu)^2 } \frac1{\Gamma(\alpha)\beta^\alpha}\mu^{\alpha-1} e^{-\mu/\beta} d\mu \]

and this integral has no closed-form solution. We will have to use numerical methods instead. Let’s again find a point estimate based on the posterior mean. As prior we will use \(\mu \sim \text{Gamma}(5, 1)\)

fmu <- function(mu) # likelihood of the sample mean, times the prior
  dnorm(mean(dta.norm), mu, 2.3/sqrt(20))*
  dgamma(mu, 5, 1)
mx <- integrate(fmu, lower=0, upper=Inf)$value # the marginal m(x)
posterior.density <- function(mu) fmu(mu)/mx
posterior.mean <- 
  integrate(
    function(mu) {mu*posterior.density(mu)}, 
    lower = 0, 
    upper = Inf)$value
round(posterior.mean, 2)
## [1] 5.49

How about a 95% credible interval? For this we need to solve the equations

\[ F(\mu|\textbf{x})=0.025\text{, }F(\mu|\textbf{x})=0.975 \]

where \(F\) is the posterior distribution function. Again we need to work numerically. We can use a simple bisection algorithm:

pF <- function(t) integrate(posterior.density, # posterior cdf F(t|x)
            lower=3, upper=t)$value # the posterior has essentially no mass below 3
cc <- (1-0.95)/2
l <- 3
h <- posterior.mean 
repeat {
    m <- (l+h)/2
    if(pF(m)<cc) l <- m
    else h <- m
    if(h-l<m/1000) break
}
left.endpoint <- m
h <- 8
l <- posterior.mean 
repeat {
    m <- (l+h)/2
    if(pF(m)<1-cc) l <- m
    else h <- m
    if(h-l<m/1000) break
}
right.endpoint <- m
round(c(left.endpoint, right.endpoint), 2)
## [1] 4.50 6.47
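
Instead of hand-rolling the bisection we could also use R’s built-in root finder uniroot to solve the same two equations; a minimal sketch:

left <- uniroot(function(t) pF(t) - 0.025,
                interval=c(3, posterior.mean))$root
right <- uniroot(function(t) pF(t) - 0.975,
                 interval=c(posterior.mean, 8))$root
round(c(left, right), 2)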

Let’s generalize all this and write a routine that will find a point estimate and a \((1-\alpha)100\%\) credible interval for any problem with one parameter:

bayes.credint <- function(x, df, prior, conf.level=0.95, acc=0.001,
          lower, upper,  Show=TRUE) {
  if(missing(lower) || missing(upper))
    stop("Need to give lower and upper boundary")
  # unnormalized posterior; mx is set to 1 first and then replaced by the
  # actual marginal, so that afterwards posterior.density integrates to 1
  posterior.density <- function(par, x) {
    y <- numeric(length(par))
    for(i in seq_along(par))
      y[i] <- df(x, par[i])*prior(par[i])/mx
    y
  }
  mx <- 1
  mx <- integrate(posterior.density, 
          lower=lower, upper=upper, x=x)$value
  if(Show) {
     par <- seq(lower, upper, length=250)
     y <- posterior.density(par, x)
     plot(par, y, type="l")
  }
  # point estimate: the posterior mean
  f.expectation <- function(par, x) par*posterior.density(par, x)
  parhat <- integrate(f.expectation, 
              lower=lower, upper=upper, x=x)$value
  if(Show) abline(v=parhat)
  # posterior cdf
  pF <- function(t, x) integrate(posterior.density, 
            lower=lower, upper=t, x=x)$value 
  cc <- (1-conf.level)/2
  # bisection for the left endpoint
  l <- lower
  h <- parhat 
  repeat {
    m <- (l+h)/2
    if(pF(m, x)<cc) l <- m
    else h <- m
    if(h-l<acc*m) break
  }
  left.endpoint <- m
  # bisection for the right endpoint
  h <- upper
  l <- parhat 
  repeat {
    m <- (l+h)/2
    if(pF(m, x)<1-cc) l <- m
    else h <- m
    if(h-l<acc*m) break
  }
  right.endpoint <- m
  if(Show) abline(v=c(left.endpoint, right.endpoint))
  c(parhat, left.endpoint, right.endpoint)
}

Example: Normal mean, normal prior

df <- function(x, par) dnorm(x, par, 2.3/sqrt(20))
prior <- function(par) dnorm(par, 5, 2.3)
round(bayes.credint(mean(dta.norm), df=df, prior=prior,
      lower=3, upper=8, Show=TRUE), 2)

## [1] 5.53 4.54 6.51
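
Since this is again the normal-normal case we can verify the result with the exact formulas from the beginning of this section (with \(x \to \bar x\), \(\sigma^2 \to \sigma^2/n\), \(a=5\), \(b=2.3\)):

d <- (mean(dta.norm)/(2.3^2/20) + 5/2.3^2)/(1/(2.3^2/20) + 1/2.3^2)
cc <- 1/sqrt(1/(2.3^2/20) + 1/2.3^2)
round(c(d, qnorm(c(0.025, 0.975), d, cc)), 2)
## [1] 5.53 4.54 6.51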

Example: Binomial proportion, uniform prior

df <- function(x, par) dbinom(x, 134, par)
prior <- function(par) dunif(par)
round(100*bayes.credint(x=31, df=df, prior=prior, acc=0.0001,
   conf.level = 0.9,  lower=0.1, upper=0.4, Show=TRUE), 1)

## [1] 23.5 17.8 29.7

which is the same as before:

round(qbeta(c(0.05, 0.95), 31 + 1, 134 - 31 + 1)*100, 1)
## [1] 17.8 29.7

Example: Normal mean, Gamma prior

df <- function(x, par) dnorm(x, par, 2.3/sqrt(20))
prior <- function(par) dgamma(par, 5, 1)
round(bayes.credint(mean(dta.norm), df=df, prior=prior,
      lower=4, upper=7, Show=TRUE), 2)

## [1] 5.49 4.51 6.46

Example: Binomial proportion, Lincoln’s hat prior

Say we pick a coin from our pocket. We flip it 1000 times and get 578 heads. We want to find a 95% credible interval for the proportion of heads.

What would be a good prior here? We might reason as follows: on the one hand we are quite sure that it is an “almost” fair coin. On the other hand, if it is not a fair coin we really don’t know how unfair it might be. We can encode this in the Lincoln’s hat prior:

prior <- function(x) dunif(x) + dunif(x, 0.45, 0.55) # integrates to 2, not 1, but bayes.credint normalizes anyway
curve(prior, 0, 1, ylim=c(0, 11))

df <- function(x, par) dbinom(x, 1000, par)
round(bayes.credint(x=578, df=df, prior=prior, acc=0.0001,
      lower=0.5, upper=0.65, Show=TRUE), 3)

## [1] 0.567 0.535 0.607

So, have we just solved the Bayesian estimation problem for one parameter?

Example: Beta density, Gamma prior

Consider the following sample:

dta.beta1 <- round(rbeta(1000, 2, 2), 3)
hist(dta.beta1, 50)

Let’s say we know that this is from a Beta(a, a) distribution and we want to estimate a. As a prior we want to use Gamma(2, 1).

Now what is df? Because this is an independent sample we find \(df(x, a) = \prod_{i=1}^n \text{dbeta}(x_i, a, a)\), so

df <- function(x, par) prod(dbeta(x, par, par))
prior <- function(par) dgamma(par, 2, 1)
round(bayes.credint(dta.beta1, df=df, prior=prior,
      lower=1.5, upper=2.5, Show=TRUE), 2)

## [1] 2.07 1.91 2.24

So far, so good. But now

dta.beta2 <- round(rbeta(10000, 2, 2), 3)
bayes.credint(dta.beta2, df=df, prior=prior,
      lower=1.5, upper=2.5, Show=TRUE)
## Error in integrate(posterior.density, lower = lower, upper = upper, x = x): non-finite function value

Why does this not work? The problem is that a product of 10000 density values quickly leaves the range of floating point numbers R can handle: here the values of \(\prod_{i=1}^n \text{dbeta}(x_i, a, a)\) overflow to Inf (with densities that are mostly below 1 the product would instead underflow to 0), and integrate fails on the non-finite values.
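
We can see this directly (checking the likelihood at, say, \(a=2\), with the samples generated above):

prod(dbeta(dta.beta1, 2, 2)) # n = 1000: huge, but still finite
prod(dbeta(dta.beta2, 2, 2)) # n = 10000: overflows to Inf
sum(dbeta(dta.beta2, 2, 2, log=TRUE)) # the log-likelihood is perfectly fine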

Occasionally one can avoid this problem by immediately choosing a statistic \(T(\textbf{x})\), that is, a function of the data, so that \(T(\textbf{X})\) has a distribution that avoids the product. That of course is just what we did above by going to \(\bar X\) in the case of the normal! In fact, it is also what we did in the case of the Binomial, because we replaced the actual data (a sequence of Bernoulli trials) with their sum.
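
A general remedy (not used in bayes.credint above) is to do all calculations on the log scale and only exponentiate after rescaling. A minimal grid-based sketch for the Beta(a, a) example; the function and variable names here are mine:

log.posterior <- function(a, x) # log-likelihood plus log-prior, up to a constant
  sapply(a, function(a1)
    sum(dbeta(x, a1, a1, log=TRUE)) + dgamma(a1, 2, 1, log=TRUE))
a.grid <- seq(1.5, 2.5, length=500)
lp <- log.posterior(a.grid, dta.beta2)
lp <- lp - max(lp) # rescale before exponentiating, so nothing over- or underflows
post <- exp(lp)/sum(exp(lp)) # normalized posterior weights on the grid
sum(a.grid*post) # grid-based posterior mean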


Can we generalize this to more than one parameter? In principle yes, but in practice no, at least not if the number of parameters is much more than 3 or 4. The main problem is the calculation of the marginal \(m(x)\), because numerical integration in higher-dimensional spaces is very difficult. In that case a completely different approach is used, namely sampling from the posterior distribution using so called MCMC (Markov Chain Monte Carlo) algorithms.
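
Just to give the flavor (MCMC is a topic of its own), here is a minimal random-walk Metropolis sketch for the normal mean/Gamma prior example above; the starting value and step size are ad hoc choices of mine:

log.post <- function(mu) # log posterior, up to a constant
  dnorm(mean(dta.norm), mu, 2.3/sqrt(20), log=TRUE) +
  dgamma(mu, 5, 1, log=TRUE)
B <- 1e4
mu <- numeric(B)
mu[1] <- 5 # starting value
for(i in 2:B) {
  prop <- mu[i-1] + rnorm(1, 0, 0.5) # propose a move
  # accept with probability min(1, posterior ratio); negative proposals
  # have log posterior -Inf and are never accepted
  if(log(runif(1)) < log.post(prop) - log.post(mu[i-1]))
    mu[i] <- prop
  else mu[i] <- mu[i-1]
}
round(c(mean(mu), quantile(mu, c(0.025, 0.975))), 2)

The point estimate and interval should come out close to the numerical results above.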

Example: Poisson mean, Jeffreys prior

As a final one-parameter example, say we have \(X_1,..,X_{10}\), iid \(\text{Pois}(\lambda)\), so that \(\sum X_i \sim \text{Pois}(10\lambda)\), and we use the sum as our statistic together with the (improper) Jeffreys prior \(\pi(\lambda)=1/\sqrt{\lambda}\). Say the observed sum is 37:

df <- function(x, par) dpois(x, 10*par) # distribution of the sum
prior <- function(par) 1/sqrt(par) # improper prior; the routine normalizes the posterior
bayes.credint(x=37, df=df, prior=prior, lower=0, upper=10, Show=TRUE)

## [1] 3.750 2.646 5.041
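
In this case we can actually verify the answer: the posterior is proportional to \(\lambda^{37.5-1}e^{-10\lambda}\), that is, a Gamma(37.5, 10) (shape and rate), so

round(c(37.5/10, qgamma(c(0.025, 0.975), 37.5, 10)), 3)

should reproduce the three numbers above.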