Our previous discussions focused on statistical methods that belong to the Frequentist school. There is, however, an entirely different approach to statistics, called Bayesian statistics.
Say we pick a coin from our pocket. We flip it 10 times and get 3 heads. What can we conclude?
According to our previous discussion our best guess for the probability of heads would be \(3/10=0.3\). But of course nobody would actually believe that: this is a regular coin, so almost certainly \(\pi=0.5\).
The issue is that we already know a lot about the experiment (the coin) before we ever do the experiment. However, Frequentist statistics does not allow us to make use of this knowledge. Bayesian statistics does!
There are other differences between these schools of statistics, all the way to the very meaning of the word probability.
Say we have a sample \(\textbf{x}=(x_1,..,x_n)\), iid from some probability density \(f(.; \theta)\), and we want to do some inference on the parameter \(\theta\).
A Bayesian analysis begins by specifying a prior distribution \(\pi(\theta)\). This prior is supposed to encode our knowledge of the parameter before an experiment is done. Then one uses Bayes’ formula to calculate the posterior distribution:
\[ f(\theta | \textbf{x}) = f(\textbf{x} | \theta)\pi(\theta)/m(\textbf{x}) \]
where \(m(\textbf{x})\) is the marginal distribution
\[ m(\textbf{x}) = \int \cdots \int f(\textbf{x}|\theta)\pi(\theta)\, d\theta \]
Let’s say that we have two coins. One is a fair coin with \(\pi=0.5\) and the other is a loaded coin with \(\pi=0.6\). We randomly choose a coin and flip it 100 times. We get 58 heads. What can we say?
Now we have \(X \sim Bin(100, \pi)\), or
\[ P_\pi(X=58) = {{100}\choose{58}}\pi^{58}(1-\pi)^{100-58} = K\pi^{58}(1-\pi)^{42} \] and \(\pi=0.5\) or \(\pi=0.6\), each with probability \(1/2\).
The marginal can be found using the law of total probability
\[ \begin{aligned} &m(58) = \\ &P_{\pi=0.5}(X=58)P(\pi=0.5)+P_{\pi=0.6}(X=58)P(\pi=0.6)=\\ & K0.5^{58}(1-0.5)^{42}\frac12 + K0.6^{58}(1-0.6)^{42}\frac12 = \\ &K/2\left\{ 0.5^{100} + 0.6^{58}0.4^{42}\right\} \\ \end{aligned} \]
and the posterior distribution is given by
\[ \begin{aligned} &P_{x=58}(\pi=0.5) = \\ &\frac{P_{\pi=0.5}(X=58)P(\pi=0.5)}{m(58)} = \\ &\frac{K0.5^{58}(1-0.5)^{42}1/2}{K/2\left\{ 0.5^{100} + 0.6^{58}0.4^{42}\right\}} = \\ &\frac{0.5^{100}}{0.5^{100} + 0.6^{58}0.4^{42}}\\ \end{aligned} \] and so
round(0.5^100/(0.5^100+ 0.6^58*0.4^42), 4)
## [1] 0.231
and so the probability that this is the fair coin is 0.231.
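The same calculation can also be done directly in R with dbinom. A minimal sketch (the object names prior, like and post are chosen just for illustration):
prior <- c(0.5, 0.5)                  # P(pi=0.5), P(pi=0.6)
like <- dbinom(58, 100, c(0.5, 0.6))  # P(X=58 | pi)
post <- like*prior/sum(like*prior)    # Bayes' formula
round(post[1], 3)
which should reproduce the 0.231 found above.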
Let’s assume we have no idea what \(\pi\) might be, then a uniform distribution might make good sense as a prior:
\[ X \sim Bin(n, \pi), \pi \sim U[0,1] \]
Now we find
\[ \begin{aligned} &m(x) = \int_0^1 f(x|p)\pi(p)\, dp = \\ &\int_0^1 {{n}\choose{x}}p^x(1-p)^{n-x}\cdot 1\, dp =\\ &K_1\int_0^1 K_2\, p^{(x+1)-1} (1-p)^{(n-x+1)-1}\, dp = K_3\\ \end{aligned} \]
because this is (for the right value of \(K_2\)) a Beta density which will integrate to 1. So
\[ \begin{aligned} &f(\theta | \textbf{x}) = f(\textbf{x}|\theta)\pi(\theta)/m(\textbf{x}) = \\ & K_4 \pi^{(x+1)-1} (1-\pi)^{(n-x+1)-1} \\ \end{aligned} \] and we find
\[ \pi|X=x \sim \text{Beta}(x+1, n-x+1) \] Let’s see what this looks like:
ggcurve(fun=function(x) dbeta(x, 58+1, 100-58+1), A=0.4, B=0.75)
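If the ggcurve helper used here is not available, the same picture can be drawn with plain ggplot2 (a sketch, assuming ggplot2 is loaded and using \(x=58\), \(n=100\) as above):
ggplot(data.frame(x = c(0.4, 0.75)), aes(x)) +
  stat_function(fun = dbeta, args = list(shape1 = 59, shape2 = 43)) +
  labs(x = expression(pi), y = "Density")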
In the above calculation we could ignore the exact values of the constants because in the end the posterior distribution turned out to have a known form. This is always the case if we choose the prior to be the conjugate prior for the model; here the uniform U[0,1] is the Beta(1,1) distribution, and the Beta family is conjugate to the binomial.
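In fact the binomial case works out very generally: if \(\pi \sim \text{Beta}(\alpha, \beta)\), the same calculation as above gives
\[ \pi|X=x \sim \text{Beta}(\alpha+x, \beta+n-x) \]
and the uniform prior used above is just the special case \(\alpha=\beta=1\).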
Next consider \(X \sim N(\mu, \sigma)\) with \(\sigma\) known, and the prior \(\mu \sim N(a, b)\).
Now
\[ \begin{aligned} &m(x) = \int_{-\infty}^\infty f(x|\mu)\pi(\mu) d\mu = \\ &\int_{-\infty}^\infty \frac{1}{\sqrt{2\pi \sigma^2}} e^{ -\frac1{2\sigma^2} (x-\mu)^2 } \frac{1}{\sqrt{2\pi b^2}} e^{ -\frac1{2 b^2} (\mu-a)^2 } d\mu \\ \end{aligned} \]
Note
\[ \begin{aligned} &(x-\mu)^2/\sigma^2 + (\mu-a)^2/b^2 = \\ &x^2/\sigma^2 - 2x\mu/\sigma^2 + \mu^2/\sigma^2 + \mu^2/b^2 - 2a\mu/b^2 +a^2/b^2 = \\ &(1/\sigma^2+1/b^2)\mu^2 -2(x/\sigma^2+a/b^2 )\mu + K_1 = \\ &(1/\sigma^2+1/b^2) \left( \mu^2 -2 \frac{x/\sigma^2+a/b^2}{1/\sigma^2+1/b^2}\mu \right) +K_2 =\\ &\frac{(\mu-d)^2}{c^2}+K_3 \end{aligned} \]
where \(d=\frac{x/\sigma^2+a/b^2}{1/\sigma^2+1/b^2}\) and \(c=1/\sqrt{1/\sigma^2+1/b^2}\)
therefore
\[ \begin{aligned} &m(x) = K_4 \int_{-\infty}^\infty e^{ -\frac1{2c^2} (\mu-d)^2 } d \mu= K_5\\ \end{aligned} \]
because the integrand is a normal density with mean d and standard deviation c, so it will integrate to 1 as long as the constants are correct.
\[ \begin{aligned} &f(\mu| x) = f(x|\mu)\pi(\mu)/m(x) = \\ & K_6 e^{ -\frac1{2c^2} (\mu-d)^2 } \\ \end{aligned} \] Notice that we don’t need to worry about what exactly \(K_6\) is: the posterior has to be a proper probability density, so \(K_6\) will be whatever it has to be!
So we found
\[ \mu|X=x \sim N\left(\frac{x/\sigma^2+a/b^2}{1/\sigma^2+1/b^2}, 1/\sqrt{1/\sigma^2+1/b^2}\right) \] Let’s do a little simulation to see whether we got this right:
a <- 0.2      # prior mean (just as an example)
b <- 2.3      # prior standard deviation
sigma <- 0.5  # known standard deviation of the data
mu <- rnorm(1e5, a, b)      # generate mu from the prior
x <- rnorm(1e5, mu, sigma)  # generate x given mu
cc <- 1/sqrt(1/sigma^2 + 1/b^2)
x0 <- (-1.1)
d <- (x0/sigma^2+a/b^2)/(1/sigma^2 + 1/b^2)
z <- mu[x>x0-0.05 & x<x0+0.05]  # keep the mu's whose x turned out close to x0
bw <- diff(range(z))/50
ggplot(data.frame(x=z), aes(x)) +
  geom_histogram(aes(y = ..density..),
                 color = "black",
                 fill = "white",
                 binwidth = bw) +
  labs(x = "x", y = "Density") +
  stat_function(fun = dnorm,
                colour = "blue",
                args=list(mean=d, sd=cc))
Note that one thing a frequentist and a Bayesian analysis have in common is the likelihood function.
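In both approaches the data enter the analysis only through
\[ L(\theta ; \textbf{x}) = \prod_{i=1}^n f(x_i; \theta) \]
which the frequentist maximizes to find the mle, and which the Bayesian multiplies by the prior to find the posterior.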
In a Bayesian analysis any inference is done from the posterior distribution. For example, point estimates can be found as the mean, median, mode or any other measure of central tendency.
Interval estimates (now called credible intervals) can be found using quantiles of the posterior distribution.
For the binomial example above we found the posterior distribution to be
\[ \pi|X=x \sim \text{Beta}(x+1, n-x+1) \] From probability theory we know that if \(Y \sim \text{Beta}(\alpha, \beta)\) we have \(EY=\frac{\alpha}{\alpha+ \beta}\), so here we find \(\hat{\pi}=\frac{x+1}{n+2}\). Recall that the frequentist solution (the mle) was \(\hat{\pi}=\frac{x}{n}\).
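For the coin example with 58 heads in 100 flips the two point estimates are easily compared (a quick sketch):
x <- 58; n <- 100
round(c(bayes = (x+1)/(n+2), mle = x/n), 3)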
Recall the survey of 567 people, of whom 235 said they prefer Coke over Pepsi. A \(95\%\) credible interval for the true proportion is given by
ci <- qbeta(c(0.025, 0.975), 235+1, 567-235+1)
round(ci, 3)
## [1] 0.375 0.455
The frequentist confidence interval was
phat <- 235/567
round(phat + c(-1, 1)*qnorm(0.975)*sqrt(phat*(1-phat)/567), 3)
## [1] 0.374 0.455
and we see the two are quite close. This tends to be true as long as there is enough data.
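As a quick illustration of what happens with less data (the numbers 3 successes in 10 trials are chosen purely for illustration), we can redo both calculations for a small sample:
x <- 3; n <- 10
round(qbeta(c(0.025, 0.975), x+1, n-x+1), 3)                  # credible interval
phat <- x/n
round(phat + c(-1, 1)*qnorm(0.975)*sqrt(phat*(1-phat)/n), 3)  # confidence interval
Here the two intervals differ quite a bit more.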
Say the following is a sample \(x_1,..,x_n\) from a normal distribution with standard deviation \(\sigma=2.3\):
dta.norm
## [1] 2.1 2.8 3.3 3.5 4.0 4.9 5.4 5.5 5.5 5.7 6.0 6.3 6.3 6.3 6.5 7.0 7.1
## [18] 7.3 7.3 7.4
If we decide to base our analysis on the sample mean we have \(\bar X \sim N(\mu, \sigma/\sqrt{n})\). Using the posterior mean from above, with \(\bar x\) in place of \(x\) and \(\sigma^2/n\) in place of \(\sigma^2\), we find
\[ E[\mu|\bar X=\bar x] = \frac{\bar x/(\sigma^2/n)+a/b^2}{1/(\sigma^2/n)+1/b^2} \]
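It can help to rewrite this as a weighted average of the sample mean and the prior mean: with \(w = \frac{1/(\sigma^2/n)}{1/(\sigma^2/n)+1/b^2}\) we have
\[ E[\mu|\bar X=\bar x] = w\,\bar x + (1-w)\,a \]
so the more data we have, or the vaguer the prior (large \(b\)), the closer the Bayesian estimate is to \(\bar x\).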
Now we need to decide what \(a\) and \(b\) to use. If we have some prior information we can use that. Say we expect a priori that \(\mu=5\), and of course we know \(\sigma=2.3\); then we could use \(a=5\) and \(b=3\):
d <- (mean(dta.norm)/(2.3^2/20) + 5/3^2)/(1/(2.3^2/20) + 1/3^2)
round(d, 2)
## [1] 5.5
A 95% credible interval is:
cc <- 1/sqrt(1/(2.3^2/20) + 1/3^2)
round(qnorm(c(0.025, 0.975), d, cc), 2)
## [1] 4.50 6.49
The standard frequentist solution would be
round(mean(dta.norm)+c(-1, 1)*qt(0.975, 12)*2.3/sqrt(20), 2)
## [1] 4.39 6.63
Now let’s say that \(\mu\) is a physical quantity, like the mean amount of money paid per sale. In that case it makes more sense to use a prior that forces \(\mu\) to be non-negative. For example, we could use \(\mu \sim \text{Gamma}(\alpha, \beta)\). However, now we need to find
\[ m(x) = \int_0^\infty \frac{1}{\sqrt{2\pi \sigma^2}} e^{ -\frac1{2\sigma^2} (x-\mu)^2 } \frac1{\Gamma(\alpha)\beta^\alpha}\mu^{\alpha-1} e^{-\mu/\beta}\, d\mu \]
and this integral does not have a closed-form solution, so we will have to use numerical methods instead. Let’s again find a point estimate based on the posterior mean. As prior we will use \(\mu \sim \text{Gamma}(5, 1)\):
# (unnormalized) posterior: likelihood of the sample mean times the prior
fmu <- function(mu)
  dnorm(mean(dta.norm), mu, 2.3/sqrt(20))*
  dgamma(mu, 5, 1)
# marginal m(x), found by numerical integration
mx <- integrate(fmu, lower=0, upper=Inf)$value
posterior.density <- function(mu) fmu(mu)/mx
posterior.mean <-
  integrate(
    function(mu) {mu*posterior.density(mu)},
    lower = 0,
    upper = Inf)$value
round(posterior.mean, 2)
## [1] 5.44
How about a 95% credible interval? For this we need to solve the equations
\[ F(\mu)=0.025\text{, }F(\mu)=0.975 \]
where \(F\) is the posterior distribution function. Again we need to work numerically. We can use a simple bisection algorithm:
# posterior cdf F(t); the posterior puts essentially no mass below mu=3
pF <- function(t) integrate(posterior.density,
                            lower=3, upper=t)$value
cc <- (1-0.95)/2
# bisection for the left endpoint, between 3 and the posterior mean
l <- 3
h <- posterior.mean
repeat {
  m <- (l+h)/2
  if(pF(m)<cc) l <- m
  else h <- m
  if(h-l<m/1000) break
}
left.endpoint <- m
# bisection for the right endpoint, between the posterior mean and 8
h <- 8
l <- posterior.mean
repeat {
  m <- (l+h)/2
  if(pF(m)<1-cc) l <- m
  else h <- m
  if(h-l<m/1000) break
}
right.endpoint <- m
round(c(left.endpoint, right.endpoint), 2)
## [1] 4.45 6.44
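As a sanity check, the same two endpoints can also be found with R’s built-in root finder uniroot instead of the hand-written bisection (this reuses pF and posterior.mean from above):
left <- uniroot(function(t) pF(t) - 0.025, c(3, posterior.mean))$root
right <- uniroot(function(t) pF(t) - 0.975, c(posterior.mean, 8))$root
round(c(left, right), 2)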
Let’s generalize all this and write a routine that will find a point estimate and a \((1-\alpha)100\%\) credible interval for any problem with one parameter:
bayes.credint <- function(x, df, prior, conf.level=0.95, acc=0.001,
                          lower, upper, Show=TRUE) {
  if(any(c(missing(lower), missing(upper))))
    stop("Need to give lower and upper boundary")
  # posterior density, normalized by the marginal mx
  posterior.density <- function(par, x) {
    y <- 0*seq_along(par)
    for(i in seq_along(par))
      y[i] <- df(x, par[i])*prior(par[i])/mx
    y
  }
  # find the marginal: with mx=1 this integral is m(x) itself
  mx <- 1
  mx <- integrate(posterior.density,
                  lower=lower, upper=upper, x=x)$value
  if(Show) {
    par <- seq(lower, upper, length=250)
    y <- posterior.density(par, x)
    # build the plotting data frame inline so the argument df (the density) is not shadowed
    plt <- ggplot(data.frame(x=par, y=y), aes(x, y)) +
      geom_line(color="blue", size=1.2)
  }
  # point estimate: posterior mean
  f.expectation <- function(par, x) par*posterior.density(par, x)
  parhat <- integrate(f.expectation,
                      lower=lower, upper=upper, x=x)$value
  # posterior cdf
  pF <- function(t, x) integrate(posterior.density,
                                 lower=lower, upper=t, x=x)$value
  cc <- (1-conf.level)/2
  # bisection for the left endpoint
  l <- lower
  h <- parhat
  repeat {
    m <- (l+h)/2
    if(pF(m, x)<cc) l <- m
    else h <- m
    if(h-l<acc*m) break
  }
  left.endpoint <- m
  # bisection for the right endpoint
  h <- upper
  l <- parhat
  repeat {
    m <- (l+h)/2
    if(pF(m, x)<1-cc) l <- m
    else h <- m
    if(h-l<acc*m) break
  }
  right.endpoint <- m
  if(Show)
    print(plt +
            geom_vline(xintercept=
                         c(left.endpoint, parhat, right.endpoint)))
  c(parhat, left.endpoint, right.endpoint)
}
df <- function(x, par) dnorm(x, par, 2.3/sqrt(20))
prior <- function(par) dnorm(par, 5, 2.3)
round(bayes.credint(mean(dta.norm), df=df, prior=prior,
lower=3, upper=8, Show=TRUE), 2)
## [1] 5.49 4.50 6.47
df <- function(x, par) dnorm(x, par, 2.3/sqrt(20))
prior <- function(par) dgamma(par, 5, 1)
round(bayes.credint(mean(dta.norm), df=df, prior=prior,
lower=4, upper=7, Show=TRUE), 2)
## [1] 5.44 4.47 6.42
Say we pick a coin from our pocket. We flip it 1000 times and get 578 heads. We want to find a 95% credible interval for the proportion of heads.
What would be a good prior here? We might reason as follows: on the one hand we are quite sure that it is indeed an “almost” fair coin. On the other hand, if it is not a fair coin, we really don’t know how unfair it might be. We can encode this in the Lincoln’s hat prior:
prior <- function(x) dunif(x) + dunif(x, 0.45, 0.55)
ggcurve(fun=prior, A=0, B=1)
(Note that this prior integrates to 2 rather than 1; that is fine here, because bayes.credint normalizes the posterior by the marginal anyway.)
df <- function(x, par) dbinom(x, 1000, par)
round(bayes.credint(x=578, df=df, prior=prior, acc=0.0001,
lower=0.5, upper=0.65, Show=TRUE), 3)
## [1] 0.567 0.535 0.607
So, have we just solved the Bayesian estimation problem for one parameter?
Consider the following sample:
dta.beta1 <- round(rbeta(100, 2, 2), 3)
bw <- diff(range(dta.beta1))/50
ggplot(data.frame(x=dta.beta1), aes(x)) +
  geom_histogram(aes(y = ..density..),
                 color = "black",
                 fill = "white",
                 binwidth = bw) +
  labs(x = "", y = "")
Let’s say we know that this is from a Beta(a, a) distribution and we want to estimate a. As a prior we want to use a Gamma(2, 1) distribution.
Now what is df? Because this is an independent sample we find \[ f(x, a) = \prod_{i=1}^n \text{dbeta}(x_i, a, a) \] so
df <- function(x, par) prod(dbeta(x, par, par))
prior <- function(par) dgamma(par, 2, 1)
round(bayes.credint(dta.beta1, df=df, prior=prior,
lower=1.5, upper=2.5, Show=TRUE), 2)
## [1] 2.09 1.65 2.46
So far, so good. But now
dta.beta2 <- round(rbeta(10000, 2, 2), 3)
bayes.credint(dta.beta2, df=df, prior=prior,
lower=1.5, upper=2.5, Show=FALSE)
## Error in integrate(posterior.density, lower = lower, upper = upper, x = x): non-finite function value
Why does this not work? The problem is that the product \(\prod_{i=1}^n \text{dbeta}(x_i, a, a)\) of 10000 density values ends up far outside the range of numbers that R’s floating point arithmetic can represent, so the integration routine encounters non-finite values.
Occasionally one can avoid this problem by immediately choosing a statistic T(x), aka a function of the data, so that T(X) has a distribution that avoids the product. That of course is just what we did above by going to \(\bar X\) in the case of the normal! In fact, it is also what we did in the case of the Binomial, because we replaced the actual data (a sequence of Bernoulli trials) with their sum. It is however not clear what one could use in the case of the Beta distribution.
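A minimal illustration of this idea in the Bernoulli/binomial case (simulated data, purely to show the effect): the product of the individual densities underflows to 0, while the density of the sufficient statistic \(T(\textbf{x})=\sum x_i\) is perfectly well behaved.
x <- rbinom(1e4, 1, 0.5)   # 10000 Bernoulli trials
prod(dbinom(x, 1, 0.5))    # this is 0.5^10000, which underflows to 0
dbinom(sum(x), 1e4, 0.5)   # density of the sum: no numerical problem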
Can we generalize this to more than one parameter? In principle yes, but in practice no, at least not if the number of parameters is much more than 3 or 4. The main problem is the calculation of the marginal \(m(x)\), because numerical integration in higher-dimensional spaces is very difficult. In that case a completely different approach is used, namely sampling from the posterior distribution using so called MCMC (Markov Chain Monte Carlo) algorithms.
Another difficulty arises in the choice of priors. There are a number of different approaches known for low-dimensional problems, however these can fail badly in higher dimensions.
Example
Let’s do a Bayesian analysis to find interval estimates for the parameters in the normal mixture model for the Old Faithful data. As priors we will use \(\alpha \sim U[0, 1]\) and flat (improper) priors for the other parameters: \(\pi(\mu_1)=\pi(\mu_2)=\pi(\sigma_1)=\pi(\sigma_2)=1\).
The following routine implements the Metropolis-Hastings algorithm for sampling from the posterior distribution and then finds intervals from the sample quantiles:
old.faithfull.MCMC <- function(B=1e4) {
  eps <- c(0.01, 1, 0.1, 1, 0.1)  # proposal step sizes for (alpha, mu1, sigma1, mu2, sigma2)
  x <- faithful$Waiting.Time
  # log likelihood ratio at p1 vs p2 (the priors are flat, so this is also the posterior ratio)
  h <- function(p1, p2) {
    tmp1 <- p1[1]*dnorm(x, p1[2], p1[3]) +
      (1-p1[1])*dnorm(x, p1[4], p1[5])
    tmp2 <- p2[1]*dnorm(x, p2[2], p2[3]) +
      (1-p2[1])*dnorm(x, p2[4], p2[5])
    sum(log(tmp1)) - sum(log(tmp2))
  }
  param <- matrix(0, B, 5)
  param[1, ] <- c(0.35, 55, 5.4, 80, 5.9)  # starting values
  for(i in 2:B) {
    # propose a new value with a symmetric uniform move around the current one
    new.param <- param[i-1, ] + runif(5, -eps, eps)
    # accept with probability min(1, posterior ratio)
    if(log(runif(1))<h(new.param, param[i-1, ]))
      param[i, ] <- new.param
    else
      param[i, ] <- param[i-1, ]
  }
  param
}
param <- old.faithfull.MCMC()
# discard the first 1000 iterations as burn-in; columns are alpha, mu1, sigma1, mu2, sigma2
round(apply(param[-c(1:1000), ], 2, quantile, probs=c(0.025, 0.975)), 3)
## [,1] [,2] [,3] [,4] [,5]
## 2.5% 0.301 53.218 5.295 78.989 5.247
## 97.5% 0.423 56.179 7.325 81.098 6.802
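Point estimates can be found from the same sample, for example posterior means (again dropping the burn-in):
round(apply(param[-c(1:1000), ], 2, mean), 3)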
There are a number of programs for Bayesian analysis, such as JAGS, OpenBUGS, WinBUGS and Stan, all of which can be used from within R. However, we don’t have enough time to discuss these.