Say we have a sample \(\textbf{X}=(X_1,\dots,X_n)\), iid from some probability density \(f(\cdot| \theta)\), and we want to do some inference on the parameter \(\theta\).
A Bayesian analysis begins by specifying a prior distribution \(\pi(\theta)\). Then one uses Bayes’ formula to calculate the posterior distribution:
\[ f(\theta| \textbf{x}) = f(\textbf{x}|\theta)\pi(\theta)/m(\textbf{x}) \]
where \(m(\textbf{x})\) is the marginal density
\[ m(\textbf{x}) = \int \cdots \int f(\textbf{x}|\theta)\pi(\theta)\, d\theta \]
Example: \(X \sim N(\mu, \sigma)\) with \(\sigma\) known, and prior \(\mu \sim N(a, b)\).
Now
\[ \begin{aligned} &m(x) = \int_{-\infty}^\infty f(x|\mu)\pi(\mu) d\mu = \\ &\int_{-\infty}^\infty \frac{1}{\sqrt{2\pi \sigma^2}} e^{ -\frac1{2\sigma^2} (x-\mu)^2 } \frac{1}{\sqrt{2\pi b^2}} e^{ -\frac1{2 b^2} (\mu-a)^2 } d\mu \\ \end{aligned} \]
Note
\[ \begin{aligned} &(x-\mu)^2/\sigma^2 + (\mu-a)^2/b^2 = \\ &x^2/\sigma^2 - 2x\mu/\sigma^2 + \mu^2/\sigma^2 + \mu^2/b^2 - 2a\mu/b^2 +a^2/b^2 = \\ &(1/\sigma^2+1/b^2)\mu^2 -2(x/\sigma^2+a/b^2 )\mu + K_1 = \\ &(1/\sigma^2+1/b^2) \left( \mu^2 -2 \frac{x/\sigma^2+a/b^2}{1/\sigma^2+1/b^2}\mu \right) +K_2 =\\ &\frac{(\mu-d)^2}{c^2}+K_3 \end{aligned} \]
where \(d=\frac{x/\sigma^2+a/b^2}{1/\sigma^2+1/b^2}\) and \(c=1/\sqrt{1/\sigma^2+1/b^2}\). Therefore
\[ \begin{aligned} &m(x) = K_4 \int_{-\infty}^\infty e^{ -\frac1{2c^2} (\mu-d)^2 } d \mu= K_5\\ \end{aligned} \]
because the integrand is (up to a constant) a normal density with mean \(d\) and standard deviation \(c\), and so it integrates to 1 once the constants are chosen correctly.
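In fact the marginal distribution can be identified exactly: \(X = \mu + \epsilon\) with independent \(\mu \sim N(a,b)\) and \(\epsilon \sim N(0, \sigma)\), so marginally \(X \sim N(a, \sqrt{\sigma^2+b^2})\). A quick numerical check, using the same example values as in the simulation below:
a <- 0.2; b <- 2.3; sigma <- 0.5
f <- function(mu) dnorm(0.1, mu, sigma)*dnorm(mu, a, b)
integrate(f, -Inf, Inf)$value # m(x) at x = 0.1, by numerical integration
dnorm(0.1, a, sqrt(sigma^2 + b^2)) # the closed form; the two should agree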
Putting things together,
\[ \begin{aligned} &f(\mu| x) = f(x|\mu)\pi(\mu)/m(x) = \\ & K_6 e^{ -\frac1{2c^2} (\mu-d)^2 } \\ \end{aligned} \] Notice that we don’t need to worry about what exactly \(K_6\) is: the posterior has to be a proper probability density, so \(K_6\) will be what it has to be!
So we found
\[ \mu|X=x \sim N\left(\frac{x/\sigma^2+a/b^2}{1/\sigma^2+1/b^2}, 1/\sqrt{1/\sigma^2+1/b^2}\right) \] Let’s do a little simulation to see whether we got this right:
a <- 0.2 # prior parameters, just as an example
b <- 2.3
sigma <- 0.5
mu <- rnorm(1e5, a, b) # draw mu from the prior
x <- rnorm(1e5, mu, sigma) # then draw x given mu
x0 <- 0.1
cc <- 1/sqrt(1/sigma^2 + 1/b^2) # posterior standard deviation
d <- (x0/sigma^2+a/b^2)/(1/sigma^2 + 1/b^2) # posterior mean
z <- mu[x>x0-0.05 & x<x0+0.05] # the mu's whose x came out close to x0
hist(z, 50, freq=FALSE)
curve(dnorm(x, d, cc), -2, 2, add=TRUE) # theoretical posterior density
x0 <- (-1.1) # and again for a second value of x0
d <- (x0/sigma^2+a/b^2)/(1/sigma^2 + 1/b^2)
z <- mu[x>x0-0.05 & x<x0+0.05]
hist(z, 50, freq=FALSE)
curve(dnorm(x, d, cc), -3, 2, add=TRUE)
Example: \(X \sim \text{Bin}(n, p)\), with prior \(p \sim U[0,1]\).
\[ \begin{aligned} &m(x) = \int_0^1 f(x|p)\pi(p) dp = \\ &\int_0^1 {{n}\choose{x}}p^x(1-p)^{n-x}1dp =\\ &K_1\int_0^1 p^{(x+1)-1} (1-p)^{(n-x+1)-1}dp = K_2\\ \end{aligned} \]
because the integrand is (up to a constant) a Beta density, which integrates to 1. So
\[ \begin{aligned} &f(p| x) = f(x|p)\pi(p)/m(x) = \\ & K_3 p^{(x+1)-1} (1-p)^{(n-x+1)-1} \\ \end{aligned} \] and we find
\[ p|X=x \sim \text{Beta}(x+1, n-x+1) \]
n <- 10
p <- runif(1e5) # draw p from the U[0,1] prior
x <- rbinom(1e5, n, p) # then draw x given p
x0 <- 3
z <- p[x==x0] # the p's whose x came out equal to x0
hist(z, 50, freq=FALSE)
curve(dbeta(x, x0+1, n-x0+1), 0, 1, add=TRUE) # theoretical posterior density
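We can also check the agreement numerically, by comparing quantiles of the conditional sample with those of the claimed posterior (up to simulation error the two should match):
round(quantile(z, c(0.1, 0.25, 0.5, 0.75, 0.9)), 3)
round(qbeta(c(0.1, 0.25, 0.5, 0.75, 0.9), x0+1, n-x0+1), 3)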
In a Bayesian analysis any inference is done from the posterior distribution. For example, point estimates can be found as the mean, median, mode or any other measure of central tendency:
Say the following is a sample \(X_1,\dots,X_n\), \(n=20\), from a normal distribution with standard deviation \(\sigma=2.3\):
dta.norm <- c(1.1, 1.8, 2.9, 3.3, 4.3, 4.5, 4.6, 5.1, 5.2, 5.3,
              5.8, 6.2, 6.6, 6.7, 6.9, 7.1, 7.7, 7.7, 7.8, 10.5)
dta.norm
## [1] 1.1 1.8 2.9 3.3 4.3 4.5 4.6 5.1 5.2 5.3 5.8 6.2 6.6 6.7
## [15] 6.9 7.1 7.7 7.7 7.8 10.5
If we decide to base our analysis on the sample mean we have \(\bar X \sim N(\mu, \sigma/\sqrt{n})\), so we can apply the result above with \(\bar x\) in place of \(x\) and \(\sigma/\sqrt{n}\) in place of \(\sigma\). Using the posterior mean as a point estimate we find
\[ E[\mu|\bar X=\bar x] = \frac{\bar x/(\sigma^2/n)+a/b^2}{1/(\sigma^2/n)+1/b^2} \]
a weighted average of the sample mean \(\bar x\) and the prior mean \(a\). Now we need to decide which \(a\) and \(b\) to use. If we have some prior information we can use that. Say we expect a priori that \(\mu=5\), and of course we know \(\sigma=2.3\); then we could use \(a=5\) and \(b=3\):
d <- (mean(dta.norm)/(2.3^2/20) + 5/3^2)/(1/(2.3^2/20) + 1/3^2) # posterior mean
d
## [1] 5.539
If we wanted to find a 95% interval estimate (now called a credible interval) we can find it directly from the posterior distribution:
cc <- 1/sqrt(1/(2.3^2/20) + 1/3^2) # posterior standard deviation
cc
## [1] 0.5069
round(qnorm(c(0.025, 0.975), d, cc), 2)
## [1] 4.55 6.53
Note: the standard frequentist solution (a z interval, since \(\sigma\) is known) would be
round(mean(dta.norm), 2)
## [1] 5.55
round(mean(dta.norm)+c(-1, 1)*qnorm(0.975)*2.3/sqrt(20), 2)
## [1] 4.55 6.56
Say in a class with 134 students 31 received an A. Find a 90% credible interval for the true percentage of students who get an A in this class.
round(qbeta(c(0.05, 0.95), 31 + 1, 134 - 31 + 1)*100, 1)
## [1] 17.8 29.7
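The posterior also yields a point estimate directly; the mean of a Beta(\(x+1\), \(n-x+1\)) distribution is \((x+1)/(n+2)\):
round(100*(31 + 1)/(134 + 2), 1)
## [1] 23.5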
Let’s say that \(\mu\) is a physical quantity, like the mean amount of money paid in sales. In that case it makes more sense to use a prior that forces \(\mu\) to be non-negative. For example we could use \(\mu \sim \text{Gamma}(\alpha, \beta)\). However, now we need to find
\[ m(x) = \int_0^\infty \frac{1}{\sqrt{2\pi \sigma^2}} e^{ -\frac1{2\sigma^2} (x-\mu)^2 } \frac1{\Gamma(\alpha)\beta^\alpha}\mu^{\alpha-1} e^{-\mu/\beta} d\mu \]
and this integral has no closed form, so we will have to use numerical methods instead. Let’s again find a point estimate based on the posterior mean, working with \(\bar x\) as before. As prior we will use \(\mu \sim \text{Gamma}(5, 1)\):
fmu <- function(mu) # numerator of Bayes' formula: likelihood of the sample mean times prior
  dnorm(mean(dta.norm), mu, 2.3/sqrt(20))*dgamma(mu, 5, 1)
mx <- integrate(fmu, lower=0, upper=Inf)$value # the marginal m(x)
posterior.density <- function(mu) fmu(mu)/mx
posterior.mean <- integrate(function(mu) mu*posterior.density(mu),
                            lower=0, upper=Inf)$value
round(posterior.mean, 2)
## [1] 5.49
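The posterior mean is only one possible point estimate; the posterior mode, say, can be found by numerical optimization (a quick sketch using R’s optimize):
optimize(posterior.density, c(0, 10), maximum=TRUE)$maximum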
How about a 95% credible interval? For this we need to solve the equations
\[ F(\mu|\textbf{x})=0.025\text{, }F(\mu|\textbf{x})=0.975 \]
where \(F\) is the posterior distribution function. Again we need to work numerically. We can use a simple bisection algorithm:
pF <- function(t) integrate(posterior.density, # posterior cdf; there is essentially no mass below 3
                            lower=3, upper=t)$value
cc <- (1-0.95)/2
l <- 3 # bisect between l and h for the left endpoint
h <- posterior.mean
repeat {
  m <- (l+h)/2
  if(pF(m)<cc) l <- m
  else h <- m
  if(h-l<m/1000) break # stop when the bracket is small relative to m
}
left.endpoint <- m
h <- 8 # and again for the right endpoint
l <- posterior.mean
repeat {
  m <- (l+h)/2
  if(pF(m)<1-cc) l <- m
  else h <- m
  if(h-l<m/1000) break
}
right.endpoint <- m
round(c(left.endpoint, right.endpoint), 2)
## [1] 4.50 6.47
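Instead of hand-rolling the bisection we could also use R’s built-in root finder uniroot (a sketch; pF and cc are as defined above):
round(c(uniroot(function(t) pF(t)-cc, c(3, posterior.mean))$root,
        uniroot(function(t) pF(t)-(1-cc), c(posterior.mean, 8))$root), 2)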
Let’s generalize all this and write a routine that will find a point estimate and a \((1-\alpha)100\%\) credible interval for any problem with one parameter:
bayes.credint <- function(x, df, prior, conf.level=0.95, acc=0.001,
                          lower, upper, Show=TRUE) {
  if(missing(lower) || missing(upper))
    stop("Need to give lower and upper boundary")
  posterior.density <- function(par, x) {
    y <- numeric(length(par))
    for(i in seq_along(par))
      y[i] <- df(x, par[i])*prior(par[i])/mx
    y
  }
  mx <- 1 # temporarily, so the next line integrates the unnormalized density
  mx <- integrate(posterior.density,
                  lower=lower, upper=upper, x=x)$value
  if(Show) { # plot the posterior density
    par <- seq(lower, upper, length=250)
    y <- posterior.density(par, x)
    plot(par, y, type="l")
  }
  # point estimate: the posterior mean
  f.expectation <- function(par, x) par*posterior.density(par, x)
  parhat <- integrate(f.expectation,
                      lower=lower, upper=upper, x=x)$value
  if(Show) abline(v=parhat)
  # posterior cdf
  pF <- function(t, x) integrate(posterior.density,
                                 lower=lower, upper=t, x=x)$value
  cc <- (1-conf.level)/2
  l <- lower # bisection for the left endpoint
  h <- parhat
  repeat {
    m <- (l+h)/2
    if(pF(m, x)<cc) l <- m
    else h <- m
    if(h-l<acc*m) break
  }
  left.endpoint <- m
  h <- upper # bisection for the right endpoint
  l <- parhat
  repeat {
    m <- (l+h)/2
    if(pF(m, x)<1-cc) l <- m
    else h <- m
    if(h-l<acc*m) break
  }
  right.endpoint <- m
  if(Show) abline(v=c(left.endpoint, right.endpoint))
  c(parhat, left.endpoint, right.endpoint)
}
Let’s try it on the normal example, this time with a \(N(5, 2.3)\) prior (the output is the point estimate followed by the interval endpoints):
df <- function(x, par) dnorm(x, par, 2.3/sqrt(20)) # density of the sample mean
prior <- function(par) dnorm(par, 5, 2.3)
round(bayes.credint(mean(dta.norm), df=df, prior=prior,
                    lower=3, upper=8, Show=TRUE), 2)
## [1] 5.53 4.54 6.51
Next the grades example from above:
df <- function(x, par) dbinom(x, 134, par)
prior <- function(par) dunif(par)
round(100*bayes.credint(x=31, df=df, prior=prior, acc=0.0001,
                        conf.level = 0.9, lower=0.1, upper=0.4, Show=TRUE), 1)
## [1] 23.5 17.8 29.7
which is the same as before:
round(qbeta(c(0.05, 0.95), 31 + 1, 134 - 31 + 1)*100, 1)
## [1] 17.8 29.7
And the normal example with a Gamma(5, 1) prior:
df <- function(x, par) dnorm(x, par, 2.3/sqrt(20))
prior <- function(par) dgamma(par, 5, 1)
round(bayes.credint(mean(dta.norm), df=df, prior=prior,
                    lower=4, upper=7, Show=TRUE), 2)
## [1] 5.49 4.51 6.46
Say we pick a coin from our pocket. We flip it 1000 times and get 578 heads. We want to find a 95% credible interval for the proportion of heads.
What would be a good prior here? We might reason as follows: on the one hand we are quite sure that it is indeed an “almost” fair coin. On the other hand, if it is not a fair coin, we really don’t know how unfair it might be. We can encode this in the “Lincoln’s hat” prior:
prior <- function(x) dunif(x) + dunif(x, 0.45, 0.55) # flat everywhere, with an extra bump on [0.45, 0.55]
curve(prior, 0, 1, ylim=c(0, 11))
Note that this prior does not integrate to 1, but that does not matter: any constant cancels when bayes.credint divides by \(m(x)\).
df <- function(x, par) dbinom(x, 1000, par)
round(bayes.credint(x=578, df=df, prior=prior, acc=0.0001,
lower=0.5, upper=0.65, Show=TRUE), 3)
## [1] 0.567 0.535 0.607
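For comparison, with a flat prior the posterior would be a Beta(579, 423); the extra prior weight near 1/2 pulls the estimate slightly toward 0.5:
round(qbeta(c(0.025, 0.975), 578+1, 1000-578+1), 3)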
So, have we just solved the Bayesian estimation problem for one parameter?
Consider the following sample:
dta.beta1 <- round(rbeta(1000, 2, 2), 3)
hist(dta.beta1, 50)
Let’s say we know that this is from a Beta(\(a, a\)) distribution and we want to estimate \(a\). As a prior we will use a Gamma(2, 1).
Now what is df? Because this is an independent sample we find \(df(x, a) = \prod_{i=1}^n \text{dbeta}(x_i, a, a)\), so
df <- function(x, par) prod(dbeta(x, par, par))
prior <- function(par) dgamma(par, 2, 1)
round(bayes.credint(dta.beta1, df=df, prior=prior,
lower=1.5, upper=2.5, Show=TRUE), 2)
## [1] 2.07 1.91 2.24
So far, so good. But now
dta.beta2 <- round(rbeta(10000, 2, 2), 3)
bayes.credint(dta.beta2, df=df, prior=prior,
lower=1.5, upper=2.5, Show=TRUE)
## Error in integrate(posterior.density, lower = lower, upper = upper, x = x): non-finite function value
Why does this not work? The problem is that the product \(\prod_{i=1}^n \text{dbeta}(x_i, a, a)\) of 10000 density values quickly leaves the range of numbers that R can handle (floating point under- and overflow)!
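A quick check makes the problem visible (the exact values depend on the simulated sample, but the product is not representable while the sum of the logs is perfectly fine):
prod(dbeta(dta.beta2, 2, 2)) # over/underflows to Inf or 0
sum(dbeta(dta.beta2, 2, 2, log=TRUE)) # the log-likelihood is no problem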
Occasionally one can avoid this problem by immediately choosing a statistic T(x), aka a function of the data, so that T(X) has a distribution that avoids the product. That of course is just what we did above by going to \(\bar X\) in the case of the normal! In fact, it is also what we did in the case of the Binomial, because we replace the actual data (a sequence of Bernoulli trials) with their sum.
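Another standard work-around, just as a sketch (it is not needed in what follows): compute the log-likelihood and subtract a convenient constant before exponentiating. Any constant factor cancels when bayes.credint divides by \(m(x)\), so we are free to rescale:
loglik <- function(x, par) sum(dbeta(x, par, par, log=TRUE))
C <- loglik(dta.beta2, 2) # a reference value near the maximum, to keep exp() in range
df <- function(x, par) exp(loglik(x, par) - C)
prior <- function(par) dgamma(par, 2, 1)
bayes.credint(dta.beta2, df=df, prior=prior,
              lower=1.5, upper=2.5, Show=FALSE)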
Can we generalize this to more than one parameter? In principle yes, but in practice no, at least not if the number of parameters is much more than 3 or 4. The main problem is the calculation of the marginal \(m(\textbf{x})\): numerical integration in higher-dimensional spaces is very difficult. In that case a completely different approach is used, namely sampling from the posterior distribution with so-called MCMC (Markov Chain Monte Carlo) algorithms.
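To give at least a taste of that approach, here is a minimal random-walk Metropolis sampler for the Beta(\(a, a\)) example above (a sketch only; the name metro and the tuning values n.iter, start and sd.prop are just illustrative choices, and it works on the log scale to avoid the overflow problem):
metro <- function(x, n.iter=1e4, start=2, sd.prop=0.05) {
  logpost <- function(a) { # log posterior, up to an additive constant
    if(a <= 0) return(-Inf)
    sum(dbeta(x, a, a, log=TRUE)) + dgamma(a, 2, 1, log=TRUE)
  }
  a <- numeric(n.iter)
  a[1] <- start
  for(i in 2:n.iter) {
    cand <- a[i-1] + rnorm(1, 0, sd.prop) # propose a new value near the current one
    # accept with probability min(1, posterior ratio), else stay put
    if(log(runif(1)) < logpost(cand) - logpost(a[i-1]))
      a[i] <- cand
    else a[i] <- a[i-1]
  }
  a
}
z <- metro(dta.beta2)[-(1:1000)] # drop a burn-in period
round(quantile(z, c(0.5, 0.025, 0.975)), 2) # point estimate and 95% credible interval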
One last example: say \(X_1,\dots,X_{10}\) are iid Pois(\(\mu\)) and we use the (improper) prior \(\pi(\mu)=1/\sqrt{\mu}\). Since \(\sum X_i \sim \text{Pois}(10\mu)\), the sum is exactly the kind of statistic that avoids the product:
df <- function(x, par) dpois(sum(x), 10*par) # density of the sum, the sufficient statistic
prior <- function(par) 1/sqrt(par) # improper prior; again, normalization does not matter
x <- rpois(10, 5) # simulated data; in this run the sum came out as 37, which we pass directly
bayes.credint(x=37, df=df, prior=prior, lower=0, upper=10, Show=TRUE)
## [1] 3.750 2.646 5.041