In the classical, or frequentist, approach to Statistics we consider a parameter \(\theta\) as a fixed although unknown quantity. A random sample \(X_1, ..., X_n\) is drawn from a population indexed by \(\theta\) and, based on the observed values in the sample, we gain knowledge about the true value of \(\theta\).
In the Bayesian approach \(\theta\) is considered a quantity whose variation can be described by a probability distribution (called the prior distribution), a subjective distribution describing the experimenter's beliefs, formulated before the data is seen. A sample is then taken from a population indexed by \(\theta\), and the prior distribution is updated with this new information. The updated distribution is called the posterior distribution. The updating is done using Bayes’ formula, hence the name Bayesian Statistics.
It is not that Bayesians don’t think a parameter is a fixed number. Rather, it is our lack of knowledge about the parameter that is described by the prior distribution.
Let’s denote the prior distribution by \(\pi(\theta)\) and the sampling distribution by \(f(\pmb{x}|\theta)\). Then the joint density of \(\pmb{X}\) and \(\theta\) is given by
\[ f(\pmb{x}, \theta) = f(\pmb{x} | \theta)\pi(\theta) \]
the marginal distribution of \(\pmb{X}\) is \[ m(\pmb{x})=\int f(\pmb{x}|\theta)\pi(\theta)d\theta \]
and finally the posterior distribution is the conditional distribution of \(\theta\) given the sample \(\pmb{x}\) and is given by \[ \pi(\theta|\pmb{x}) = \frac{f(\pmb{x}|\theta)\pi(\theta)}{m(\pmb{x})} \]
Note the distribution \(m(\pmb{x})\) is often called the prior predictive distribution.
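As a quick numerical illustration of these formulas, here is a minimal sketch with a made-up model and data point (nothing from the examples below): we approximate the posterior on a grid of \(\theta\) values, replacing the integral in \(m(\pmb{x})\) by a Riemann sum.
# made-up example: one observation x=1.7 from N(theta, 1),
# prior theta ~ N(0, 2)
theta <- seq(-5, 5, length=1000)           # grid of theta values
prior <- dnorm(theta, 0, 2)                # pi(theta)
lik <- dnorm(1.7, theta, 1)                # f(x|theta), x fixed at 1.7
mx <- sum(lik*prior)*(theta[2]-theta[1])   # m(x) via Riemann sum
posterior <- lik*prior/mx                  # pi(theta|x) on the grid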
You want to see whether it is really true that coins come up heads and tails with probability 1/2 each. You take a coin from your pocket and flip it 10 times. It comes up heads 3 times. As frequentists we would now use the sample mean as an estimate of the true probability of heads \(p\), and find \(\hat{p} = 0.3\).
A Bayesian analysis would proceed as follows: let \(X_1, ..., X_n\sim Ber(p)\). Then \(Y= X_1+ ... +X_n\sim Bin(n,p)\). Now we need a prior on \(p\). Of course \(p\) is a probability, so it takes values in [0,1]. One distribution on [0,1] we know is the Beta, so we will use a Beta(\(\alpha\),\(\beta\)) as our prior. Remember, this is a perfectly subjective choice, and anybody could use a different one.
We find
\[ \begin{aligned} &f(y,p) =\left[{n\choose y}p^y(1-p)^{n-y}\right]\left[\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}\right] =\\ &{n\choose y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{y+\alpha-1}(1-p)^{n-y+\beta-1} \end{aligned} \] for the marginal we find
\[ \begin{aligned} &m(y) =\int_0^1 f(y,p)dp = \\ &\int_0^1 {n\choose y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{y+\alpha-1}(1-p)^{n-y+\beta-1} dp = \\ &{n\choose y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^1p^{y+\alpha-1}(1-p)^{n-y+\beta-1} dp = \\ &{n\choose y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\frac{\Gamma(y+\alpha)\Gamma(n-y+\beta)}{\Gamma(n+\alpha+\beta)}\times\\ &\int_0^1\frac{\Gamma(n+\alpha+\beta)}{\Gamma(y+\alpha)\Gamma(n-y+\beta)}p^{y+\alpha-1}(1-p)^{n-y+\beta-1} dp = \\ &{n\choose y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\frac{\Gamma(y+\alpha)\Gamma(n-y+\beta)}{\Gamma(n+\alpha+\beta)} \end{aligned} \] because the last integral is over a \(Beta(y+\alpha,n-y+\beta)\) density and is therefore equal to 1. This is known as the beta-binomial distribution.
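We can check this formula numerically; here is a quick sketch with made-up values for n, y, \(\alpha\) and \(\beta\), comparing the closed form to direct numerical integration of \(f(y,p)\):
n <- 10; y <- 3; alpha <- 2; beta <- 3   # made-up values
# closed form of the beta-binomial probability
choose(n, y) * gamma(alpha+beta)/gamma(alpha)/gamma(beta) *
  gamma(y+alpha)*gamma(n-y+beta)/gamma(n+alpha+beta)
# numerical integration of f(y,p) over p
integrate(function(p) dbinom(y, n, p)*dbeta(p, alpha, beta), 0, 1)$value
Both expressions should print the same number.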
Finally, the posterior distribution of \(p|y\) is
\[ \begin{aligned} &\pi(p|y) = \frac{f(y|p)\pi(p)}{m(y)} = \\ &\frac{{n\choose y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{y+\alpha-1}(1-p)^{n-y+\beta-1}}{{n\choose y}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\frac{\Gamma(y+\alpha)\Gamma(n-y+\beta)}{\Gamma(n+\alpha+\beta)}} = \\ &\frac{\Gamma(n+\alpha+\beta)}{\Gamma(y+\alpha)\Gamma(n-y+\beta)}p^{y+\alpha-1}(1-p)^{n-y+\beta-1} \\ \end{aligned} \]
and we find that \(p|y\sim Beta(y+\alpha,n-y+\beta)\).
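One way to convince yourself of this result is by simulation; here is a sketch with made-up prior parameters: draw many \((p, y)\) pairs from the joint distribution, keep those with the observed \(y\), and compare the histogram to the Beta density.
set.seed(111)
n <- 10; alpha <- 2; beta <- 3       # made-up prior parameters
p <- rbeta(1e6, alpha, beta)         # p drawn from the prior
yy <- rbinom(1e6, n, p)              # y|p drawn from the binomial
df <- data.frame(p=p[yy == 3])       # condition on observing y=3
ggplot(df, aes(p)) +
  geom_histogram(aes(y=after_stat(density)), bins=50) +
  stat_function(fun=dbeta, args=list(shape1=3+alpha, shape2=n-3+beta))
The histogram should match the \(Beta(3+\alpha, n-3+\beta)\) curve.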
Of course we still need to “extract” some information about the parameter \(p\) from the posterior distribution. Once the sampling distribution and the prior are chosen, the posterior distribution is fixed (even though it may not be easy or even possible to find it analytically), but how we proceed now is completely open and there are in general many choices. If we want to estimate \(p\), a natural estimator is the mean of the posterior distribution, here given by
\[\hat{p}_B = (y+\alpha)/(\alpha+\beta+n)\]
This can be written as
\[\hat{p}_B=\left(\frac{n}{\alpha+\beta+n}\right)\left(\frac{y}{n}\right)+\left(\frac{\alpha+\beta}{\alpha+\beta+n}\right)\left(\frac{\alpha}{\alpha+\beta}\right)\]
and we see that the posterior mean is a weighted average of the sample mean and the prior mean, with weights determined by the sample size \(n\) and the prior "strength" \(\alpha+\beta\).
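A quick numerical check of this identity, using the numbers from the coin example below:
y <- 3; n <- 10; alpha <- 100; beta <- 100
(y+alpha)/(alpha+beta+n)          # posterior mean directly
n/(alpha+beta+n)*(y/n) +
  (alpha+beta)/(alpha+beta+n)*(alpha/(alpha+beta))  # as a weighted average
Both lines give 0.4905.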
How about our problem with the 3 heads in the 10 flips? Well, we have to completely specify the prior distribution, that is, we have to choose \(\alpha\) and \(\beta\). The choice again depends on our beliefs. For example, if we feel strongly that this coin is just like any other coin, and therefore really should be a fair coin, we should choose them so that the prior puts almost all its weight at around 1/2. For example, with \(\alpha=\beta=100\) we get E[p]=0.5 and var(p)=0.0012. Then
\[\hat{p}_B = (3+100)/(100+100+10) = 0.4905\]
is our estimate for the probability of heads. Clearly, for such a strong prior the actual sample almost does not matter. For example, for y=0 we would have found \(\hat{p}_B = 0.476\), and for y=10 it would be \(\hat{p}_B = 0.524\).
Maybe we have never even heard the word “coin” and have no idea what one looks like, let alone what the probability of “heads” might be. Then we could choose \(\alpha=\beta=1\), that is, the uniform distribution, as our prior. This would indicate our complete lack of knowledge regarding \(p\) (it is called an uninformative prior). Now we find \(\hat{p}_B = (3+1)/(1+1+10) = 0.333\), which is quite close to the sample mean.
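To see how different the two analyses are, we can draw both priors together with their posteriors (a sketch in the style of the graphs below; the posterior parameters follow from the formula above):
p <- seq(0, 1, length=250)
df <- data.frame(
  p=rep(p, 4),
  y=c(dbeta(p, 100, 100), dbeta(p, 103, 107),
      dbeta(p, 1, 1), dbeta(p, 4, 8)),
  Which=rep(c("prior Beta(100,100)", "posterior Beta(103,107)",
              "prior Beta(1,1)", "posterior Beta(4,8)"), each=250))
ggplot(data=df, aes(p, y, color=Which)) + geom_line()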
Let’s study the effects of the different parts of our estimator:
# posterior mean for samples of increasing size n, with
# mean(x)=0.3 throughout and the Beta(100,100) prior
df <- data.frame(n=10*1:50,
                 y=(3*1:50 + 100)/(10*1:50 + 200))
ggplot(data=df, aes(n, y)) +
  geom_point() +
  labs(x="Sample Size", y="phat")
so as the sample size increases, the estimate moves from close to 0.5 (the prior mean) to 0.3 (the mean of the data). The more data we have the less influence the prior has.
# posterior mean for fixed data (y=3, n=10) as the prior
# Beta(alpha, alpha) becomes more and more concentrated
df <- data.frame(alpha=1:100,
                 y=(3+1:100)/(10+2*1:100))
ggplot(data=df, aes(alpha, y)) +
  geom_point() +
  labs(x=expression(alpha), y="phat")
A larger \(\alpha\) means a prior more concentrated around 1/2. The “stronger” the prior, the more it influences the estimate.
Here is a different problem: say \(X\sim Bin(n,p)\) with \(p\) known but \(n\) unknown. Again, a Bayesian analysis begins with a prior, this time on \(n\). Now \(n=1, 2, ...\), and so a prior is any sequence \(a_1, a_2, ...\) with
\[a_i \ge 0\text{ and }\sum a_i=1\]
If we observe \(x\) successes, then clearly \(n\ge x\), and
\[ \begin{aligned} &f(x,n) =\left[{n\choose x}p^x(1-p)^{n-x}\right]a_n \\ &m(x) =\sum_{n=x}^\infty \left[{n\choose x}p^x(1-p)^{n-x}\right]a_n \\ &\pi(n|x) = \frac{f(x|n)\pi(n)}{m(x)} = \frac{{n\choose x}p^x(1-p)^{n-x}a_n}{\sum_{k=x}^\infty {k\choose x}p^x(1-p)^{k-x}a_k} \end{aligned} \]
and in general this sum cannot be found analytically.
If we want to find an estimate for \(n\) we can use, for example, the mode, that is, the \(n\) which has the largest posterior probability.
Here are some specific examples: say we observe x=217 and we know p=0.37. Also, we will restrict \(n\) to at most 750, so a prior is given by a vector of 750 probabilities.
The routine bayes.bin.n draws the prior and the posterior curve and finds the mode:
bayes.bin.n <- function (x, p, a) {
  # x = observed number of successes, p = known success probability,
  # a = prior probabilities for n = 1, ..., N
  N <- length(a)
  q <- 1 - p
  f <- rep(0, N)
  # posterior is proportional to choose(n,x)*p^x*q^(n-x)*a[n] for n >= x;
  # the factor p^x is the same for all n and can be dropped
  f[x:N] <- choose(x:N, x) * q^((x:N) - x) * a[x:N]
  f <- f/sum(f)  # normalize so the posterior sums to 1
  mode <- (1:N)[which.max(f)]  # posterior mode
  df <- data.frame(
    n=c(1:N, 1:N),
    y=c(a, f),
    Which=rep(c("prior", "posterior"), each=N))
  print(ggplot(data=df, aes(n, y, color=Which)) +
    geom_point() +
    geom_vline(xintercept = mode))
  mode
}
# uniform prior on n = 1, ..., 750:
bayes.bin.n(217, 0.37, rep(1/750, 750))
## [1] 586
# prior concentrated around n = 500:
bayes.bin.n(217, 0.37, dnorm(1:750, 500, 50))
## [1] 562
# prior that puts equal mass on the multiples of 50:
bayes.bin.n(217, 0.37, ifelse(c(1:750)%%50, 0, 1))
## [1] 600
Finally, say we know that \(n\) must be one of 510, 525, 550 or 575:
n <- c(510, 525, 550, 575)
a <- rep(0, 750)
a[n] <- 1
bayes.bin.n(217, 0.37, a)
## [1] 575
As we can see, the Bayesian method lets us include such knowledge in a very simple manner!
Should you be a Bayesian?
Bayesian Statistics has a lot of good features. To begin with, it answers the right question, namely P(Hypothesis|Data). There are other arguments as well:
Decision Theory
There is a branch of mathematics concerned with decision making, called decision theory. It is conceptually very useful and important. Typical questions it considers:
Should you buy a new car, or keep the old one for another year?
Should you invest your money into the stock market or buy fixed-interest bonds?
Should the government lower the taxes or instead use the taxes for direct investments?
In decision theory one starts out by choosing a loss function, that is, a function that assigns a value (maybe in terms of money) to every possible action and every possible outcome.
You are offered the following game: you can either take $10 (let’s call this action a), or you can flip a coin (action b). If the coin comes up heads you win $50, if it comes up tails you lose $10. So there are two possible actions, take the $10 or flip the coin, and three possible outcomes: you win $10, you win $50, or you lose $10.
We need a value for each combination. One obvious answer is this one:
L(a)=10, L(b,“heads”)=50, L(b,“tails”)=-10
But there are other possibilities. Say you are in a bar. You have already had food and drinks, and your tab is $27. Now you notice that you only have $8 in your pocket (and no credit card etc.). If you win or lose $10 it doesn’t matter: either way you can’t pay your bill, and it will be very embarrassing when it comes time to pay. But if you win $50 you are fine. Now your loss function might be:
L(a)=0, L(b,“heads”)=1000, L(b,“tails”)=0
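If you believe the coin is fair, you can compare the two actions by their expected values under each loss function; a small sketch of the arithmetic:
# first loss function: taking the $10 is worth 10 for sure
0.5*50 + 0.5*(-10)   # expected value of flipping: 20 > 10
# bar-tab loss function: taking the $10 is worth 0
0.5*1000 + 0.5*0     # expected value of flipping: 500 > 0
Under both loss functions flipping has the higher expected value here, but the margins are very different, and a different loss function could easily reverse the decision.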
The next piece in decision theory is the decision function. The idea is this: we carry out an experiment, and depending on its outcome we choose an action.
Let’s do this: we wait until tomorrow. If the Dow Jones goes up, we invest in the stock market, otherwise in bonds.
In decision theory a decision rule is called inadmissible if there is another rule that is better no matter what the outcome of the experiment. Obviously it makes no sense to pick an inadmissible rule.
So what’s the connection to Bayesian Statistics? First there are Bayesian decision rules, which combine prior knowledge with the outcome of the experiment.
Now there is a famous theorem (the complete class theorem) that says that, under some technical conditions, all admissible rules are Bayesian decision rules for some prior.
Optimality
Obviously, when we do something, it would be nice to do it in an optimal (best) way. It turns out that in Bayesian statistics it is often possible to show that a certain method is best, or at least as good as any other.
Should you be a Frequentist?
There are also arguments in favor of Frequentist statistics:
priors are bad
or, better said, you may not like the subjectivity introduced by priors. In Bayesian statistics it is entirely possible that two scientists who have the same data available and use the same method of analysis come to different conclusions, because they started with different priors.
Frequentist methods work
For most of the history of Statistics, that is from about 1900 to about 1960, there was (essentially) only frequentist Statistics. In this time (and since) many methods were developed that work very well in practice. Many of those turn out to also be Bayesian methods when the right prior is used, but not all!
For example, one of the most useful modern methods, the bootstrap, is a purely frequentist method with no Bayesian theory behind it. (There is something called the Bayesian bootstrap, but it is not the same as the classical bootstrap.)
A standard technique in regression is to study the residuals. This, though, violates the likelihood principle and is therefore not allowed under the Bayesian paradigm. (In practice, most Bayesians study the residuals anyway.)
Frequentist methods are often fairly simple
Even for the easiest problems (“estimate the mean GPA of students at the Colegio”) a Bayesian analysis always seems to be complicated: choose a prior and a loss function, calculate the posterior, extract the estimate from the posterior, and try to do all of this optimally. Frequentist solutions are often quick and easy.
So? Be Both!
Often, in a specific problem, one approach just makes more sense than the other.