The Likelihood Function and Likelihood Principle

Likelihood Function

One of the most important “objects” in Statistics is the likelihood function defined as follows:

Let \(\pmb{X}=(X_1, .., X_n)\) be a random vector with joint density \(f(\pmb{x}|\theta)\). Then the likelihood function L is defined as

\[ L(\theta | \pmb{x})=f(\pmb{x}| \theta) \]

This must be one of the most deceptively simple equations in mathematics; at first glance it seems to be just a change in notation: L instead of f. What really matters, and makes a huge difference, is that in the density we consider the x’s as variables and \(\theta\) as fixed, whereas in the likelihood function we consider \(\theta\) as the variable(s) and the x’s as fixed. Essentially we have:

  • \(f(\pmb{x}| \theta)\) tells us what we can expect to happen when we do an experiment.

  • \(L(\theta | \pmb{x})\) tells us something about the parameter(s) after the experiment is done.

Things simplify a bit if \(X_1, .., X_n\) is an iid sample. Then the joint density is given by

\[f(\pmb{x} | \theta) = \prod_{i=1}^n f(x_i| \theta)\]

Example (2.3.1)

\(X_1, .., X_n\sim Ber(p)\):

\[ \begin{aligned} & f(\pmb{x} | p) =\prod_{i=1}^n p^{x_i}(1-p)^{1-x_i}= \\ &p^{\sum x_i}(1-p)^{n-\sum x_i}\\ &L(p|\pmb{x}) = p^{\sum x_i}(1-p)^{n-\sum x_i} \end{aligned} \]
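This likelihood can be drawn with the ggcurve routine used throughout these notes. Here is a quick sketch with simulated data (the sample size and true value of p below are chosen purely for illustration):

n <- 100; p0 <- 0.5        # p0, not pi, so the built-in constant pi is not masked
dta <- rbinom(n, 1, p0)
like_p <- function(p) p^sum(dta)*(1-p)^(n-sum(dta))   # L(p|x) from the formula above
ggcurve(fun=like_p, A=0.4, B=0.6)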

Example (2.3.2)

\(X_1, .., X_n\sim N(\mu,\sigma)\):

\[ \begin{aligned} &L(\mu,\sigma|\pmb{x}) = \prod_{i=1}^n \frac1{\sqrt{2\pi \sigma^2}}\exp\{-\frac12 \frac{(x_i-\mu)^2}{\sigma^2} \} = \\ &(2\pi \sigma^2)^{-n/2}\exp\{-\frac1{2\sigma^2} \sum (x_i-\mu)^2 \} \end{aligned} \]

Example (2.3.3)

\(X_1, .., X_n\sim \Gamma(\alpha,\beta)\):

\[ \begin{aligned} &L(\alpha,\beta|\pmb{x}) = \prod_{i=1}^n \frac1{\Gamma(\alpha)\beta^\alpha}x_i^{\alpha-1}\exp\{-x_i/\beta\}=\\ &\frac1{\Gamma(\alpha)^n\beta^{n\alpha}}(\prod_{i=1}^n x_i)^{\alpha-1}\exp\{-(\sum x_i)/\beta\} \end{aligned} \]

Example (2.3.4)

\(Y_1\sim N(\mu_1,\sigma_1), Y_2\sim N(\mu_2,\sigma_2), Z\sim Ber(p)\) and \(X=(1-Z)Y_1+ZY_2\).

\[L(p,\mu_1,\sigma_1,\mu_2,\sigma_2|\pmb{x})=\\ \prod_{i=1}^n \left\{ \frac{1-p}{\sqrt{2\pi\sigma_1^2}}\exp\left(-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}\right)+\frac{p}{\sqrt{2\pi\sigma_2^2}}\exp\left(-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}\right) \right\}\]

Example (2.3.5)

An urn contains N balls. \(n_i\) of these balls have the number “i” on them, i=1,..,k, and \(\sum n_i=N\). Say we randomly select a ball from the urn, note its number and put it back. We repeat this m times. Let the rv \(X_i\) be the number of times a ball with the number i was drawn, and let \(\pmb{X}=(X_1, .., X_k)\). Then the density of \(\pmb{X}\) is given by

\[f(\pmb{x}|m,n_1,..,n_k)= \frac{m!}{x_1!\cdot..\cdot x_k!}\prod_{i=1}^k \left(\frac{n_i}{N}\right)^{x_i}\]

for any \(x_1,..,x_k\) with \(x_i \in \{0,1,..,m\}\) and \(\sum x_i=m\).

Now let’s assume we don’t know \(n_1,..,n_k\) and want to estimate them. First we can make a slight change in the parameterization: \(p_i=n_i/N\), i=1,..,k. The resulting random vector is called the multinomial rv with parameters m, \(p_1, .., p_k\).

Note that if k=2 we have \(X_1\sim Bin(m,p_1)\) and \(X_2=m-X_1\sim Bin(m,p_2)\), so the multinomial is a generalization of the binomial.

The likelihood function is given by

\[L(m,p_1,..,p_k|\pmb{x})= \frac{m!}{x_1!\cdot..\cdot x_k!}\prod_{i=1}^k p_i^{x_i}\]

where \(p_1+..+p_k=1\) and \(x_1+..+x_k=m\).
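Since the likelihood is just the multinomial density viewed as a function of the \(p_i\), it can be evaluated with R's dmultinom routine. Here is a quick check of the formula, with made-up counts and hypothetical cell probabilities:

x <- c(3, 5, 2); m <- sum(x)                 # made-up counts from m = 10 draws
p <- c(0.2, 0.5, 0.3)                        # hypothetical cell probabilities
factorial(m)/prod(factorial(x))*prod(p^x)    # the formula above
dmultinom(x, prob=p)                         # built-in density, same value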


There is a common misconception about the likelihood function: because it equals the density, it supposedly has the same properties. This is not true, because the likelihood function is a function of the parameters, not of the variables.

Example (2.3.6)

\(X\sim Ber(p)\), so \(f(x)=p^x(1-p)^{1-x}\), \(x=0,1; 0<p<1\)

As a function of x with a fixed p we have \(f(x) \ge 0\) for all x and f(0)+f(1)=1, but as a function of p with a fixed x, say x=1, we have

\[\int_{-\infty}^{\infty} L(p|1) dp=\int_0^1pdp=\frac12\]
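So the likelihood function does not integrate to 1 over the parameter; in general it is not a density in \(\theta\). A quick numerical check in R:

# L(p|x=1) = p for 0 < p < 1; the integral over p is 1/2, not 1
integrate(function(p) p, 0, 1)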


Log Likelihood

It turns out that for many problems the log of the likelihood function is a more manageable entity, mainly because it turns the product into a sum:

Example (2.3.7)

\(X_1, .., X_n\sim Ber(p)\)

\[ \begin{aligned} &l(p|\pmb{x}) = \log L(p|\pmb{x}) =\\ &\log \left(p^{\sum x_i}(1-p)^{n-\sum x_i}\right)=\\ &(\sum x_i)\log p + (n-\sum x_i)\log (1-p) \end{aligned} \]

(worry about the cases \(x_i=0\) for all i or \(x_i=1\) for all i yourself)

n <- 100; p0 <- 0.5  # p0 rather than pi, so the built-in constant pi is not masked
dta <- rbinom(n, 1, p0)
loglike_pi <- function(p) {
  y <- 0*p
  for(i in seq_along(p)) 
    y[i] <- sum(dta)*log(p[i])+(n-sum(dta))*log(1-p[i])
  y
}
ggcurve(fun=loglike_pi, A=0.4, B=0.6)

Example (2.3.8)

\(X_1, .., X_n\sim N(\mu,\sigma)\):

\[ \begin{aligned} &l(\mu,\sigma|\pmb{x})= \log L(\mu,\sigma|\pmb{x}) =\\ &\log \left( (2\pi \sigma^2)^{-n/2}\exp\{-\frac1{2\sigma^2} \sum (x_i-\mu)^2 \}\right) =\\ &-\frac{n}2 \log(2\pi \sigma^2)-\frac1{2\sigma^2} \sum (x_i-\mu)^2 \end{aligned} \]

This log-likelihood function is drawn below

  • as a function of \(\mu\), with \(\sigma\) fixed:
n <- 100; mu <- 5; sig <- 2
dta <- rnorm(n, mu, sig)
loglike_mu <- function(x) {
  y <- x
  for(i in seq_along(x))
    y[i] <- (-n/2*log(2*pi*sig^2)-sum((dta-x[i])^2)/(2*sig^2))
  y
}
ggcurve(fun=loglike_mu, A=4, B=6.5)

  • as a function of \(\sigma\), with \(\mu\) fixed:
loglike_sig <- function(x) {
  y <- x
  for(i in seq_along(x)) 
    y[i] <- 
      (-n/2*log(2*pi*x[i]^2)-sum((dta-mu)^2)/(2*x[i]^2))
  y
}
ggcurve(fun=loglike_sig, A=0.75, B=20)

Example (2.3.9)

\(X_1, .., X_n\sim \Gamma(\alpha,\beta)\):

\[ \begin{aligned} &l(\alpha,\beta|\pmb{x}) =\log L(\alpha,\beta|\pmb{x}) = \\ &\log\left[\frac1{\Gamma(\alpha)^n\beta^{n\alpha}}(\prod_{i=1}^n x_i)^{\alpha-1}\exp\{-(\sum x_i)/\beta\} \right] = \\ &-n\log \Gamma(\alpha)-n\alpha\log\beta +(\alpha-1)\sum (\log x_i) -(\sum x_i)/\beta \end{aligned} \]

alpha <- 0.5; beta <- 2.1
dta <- rgamma(n, alpha, 1/beta)  # rgamma's third argument is the rate; beta in the text is the scale
loglike_alpha <- function(x) {
  y <-  x
  for (i in seq_along(x)) 
    y[i] <- sum(dgamma(dta, x[i], 1/beta, log=TRUE))
  y
}
ggcurve(fun=loglike_alpha, A=0.1, B=1)

Example (2.3.10)

\(Y_1\sim N(\mu_1,\sigma_1)\), \(Y_2\sim N(\mu_2,\sigma_2)\), \(Z\sim Ber(p)\) and \(X=(1-Z)Y_1+ZY_2\).

\[l(p,\mu_1,\sigma_1,\mu_2,\sigma_2|\pmb{x}) = \log L(p,\mu_1,\sigma_1,\mu_2,\sigma_2|\pmb{x})=\\ \sum_{i=1}^n \log \left\{ \frac{1-p}{\sqrt{2\pi\sigma_1^2}}\exp\left(-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}\right)+\frac{p}{\sqrt{2\pi\sigma_2^2}}\exp\left(-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}\right) \right\}\]
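As in the earlier examples this can be drawn as a function of one parameter with the others held fixed. Here is a sketch as a function of p, using simulated data and the ggcurve helper; the parameter values below are chosen purely for illustration:

n <- 100; p0 <- 0.3; mu1 <- 0; sig1 <- 1; mu2 <- 5; sig2 <- 2
z <- rbinom(n, 1, p0)                                  # Z ~ Ber(p)
dta <- (1-z)*rnorm(n, mu1, sig1) + z*rnorm(n, mu2, sig2)
loglike_p <- function(p) {
  y <- 0*p
  for(i in seq_along(p))
    y[i] <- sum(log((1-p[i])*dnorm(dta, mu1, sig1) +
                      p[i]*dnorm(dta, mu2, sig2)))
  y
}
ggcurve(fun=loglike_p, A=0.1, B=0.6)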

Example (2.3.11)

Say X has a multinomial distribution with parameters m, \(p_1,..,p_k\). Then

\[l(m,p_1,..,p_k|\pmb{x}) = \log L(m,p_1,..,p_k|\pmb{x})=\\ \log\left(\frac{m!}{x_1!\cdot..\cdot x_k!}\prod_{i=1}^k p_i^{x_i}\right)=\\ \log\left(\frac{m!}{x_1!\cdot..\cdot x_k!}\right)+\sum_{i=1}^k x_i \log(p_i)\]

Example (2.3.12)

Say X belongs to an exponential family, then

\[ \begin{aligned} &l(\theta|\pmb{x}) = \\ &\log \left[ h(x)\exp \left\{ \theta^T T(x) -A(\theta) \right\} \right] = \\ &\log h(x) +\theta^T T(x) -A(\theta) \end{aligned} \] and in this case the log-likelihood function simplifies nicely.
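As an illustration, for an iid sample \(X_1, .., X_n\sim Ber(p)\) the joint density can be written in exponential family form with the natural parameter \(\theta=\log\frac{p}{1-p}\):

\[ \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = \exp\left\{ \theta\sum x_i - n\log(1+e^\theta) \right\} \]

so here \(h(\pmb{x})=1\), \(T(\pmb{x})=\sum x_i\), \(A(\theta)=n\log(1+e^\theta)\) and \(l(\theta|\pmb{x}) = \theta\sum x_i - n\log(1+e^\theta)\).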

The Likelihood Principle

Definition (2.3.13)

Likelihood Principle

If \(\pmb{x}\) and \(\pmb{y}\) are two sample points such that

\[L(\theta|\pmb{x})=C(\pmb{x},\pmb{y})\times L(\theta|\pmb{y})\]

for all \(\theta\), then the conclusion drawn from \(\pmb{x}\) and \(\pmb{y}\) should be identical.

So if two sample points have proportional likelihoods, they contain the same information about the parameter.

Example (2.3.14)

Say \(X_1, .., X_n \sim N(\mu,\sigma)\) and \(Y_1, .., Y_n \sim N(\mu,\sigma)\), and assume \(\sigma\) is known. Then

\[ \begin{aligned} &\frac{L(\mu|\pmb{x})}{L(\mu|\pmb{y})} = \\ &\frac{(2\pi\sigma^2)^{-n/2}\exp \left\{-\frac1{2\sigma^2}\sum (x_i-\mu)^2 \right\} }{(2\pi\sigma^2)^{-n/2}\exp \left\{-\frac1{2\sigma^2}\sum (y_i-\mu)^2 \right\}} = \\ &\exp \left\{-\frac1{2\sigma^2}\left[\sum (x_i-\mu)^2-\sum (y_i-\mu)^2\right] \right\} \end{aligned} \]

Now

\[ \begin{aligned} &\sum(x_i-\mu)^2 = \\ &\sum(x_i-\bar{x}+\bar{x}-\mu)^2 = \\ &\sum \left[(x_i-\bar{x})^2+2(x_i-\bar{x})(\bar{x}-\mu)+(\bar{x}-\mu)^2 \right]= \\ &\sum (x_i-\bar{x})^2+2(\bar{x}-\mu)\sum (x_i-\bar{x})+n(\bar{x}-\mu)^2 = \\ &\sum (x_i-\bar{x})^2+2(\bar{x}-\mu) (\sum x_i-n\bar{x})+n(\bar{x}-\mu)^2 = \\ &\sum (x_i-\bar{x})^2+n(\bar{x}-\mu)^2 \end{aligned} \] and so

\[ \begin{aligned} &\frac{L(\mu|\pmb{x})}{L(\mu|\pmb{y})} = \\ &\exp \left\{-\frac1{2\sigma^2}\left[\sum (x_i-\bar{x})^2+n(\bar{x}-\mu)^2-\sum (y_i-\bar{y})^2-n(\bar{y}-\mu)^2\right] \right\} = \\ &\exp \left\{-\frac{n}{2\sigma^2}\left[(\bar{x}-\mu)^2-(\bar{y}-\mu)^2\right] \right\}\times\\ &\exp \left\{-\frac1{2\sigma^2}\left[\sum (x_i-\bar{x})^2-\sum (y_i-\bar{y})^2\right] \right\} =\\ &C(\pmb{x},\pmb{y}) \end{aligned} \]

iff

\[(\bar{x}-\mu)^2-(\bar{y}-\mu)^2 = 0\]

This has to hold for all \(\mu\). Since \((\bar{x}-\mu)^2-(\bar{y}-\mu)^2=\bar{x}^2-\bar{y}^2-2\mu(\bar{x}-\bar{y})\), this is possible only if \(\bar{x}=\bar{y}\).

So according to the likelihood principle, if two experiments under the probability model \(N(\mu,\sigma)\) with \(\sigma\) known yield the same sample mean, they should lead to the same conclusions about \(\mu\).
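Here is a quick numerical check of this calculation (a sketch with simulated data; the second sample is shifted so that both samples have the same mean):

n <- 50; sig <- 2
x <- rnorm(n, 5, sig)
y <- rnorm(n, 5, sig)
y <- y - mean(y) + mean(x)       # force equal sample means
loglike <- function(mu, d) sum(dnorm(d, mu, sig, log=TRUE))
# log of the likelihood ratio at several values of mu;
# it is the same number for every mu, so L(mu|x) = C(x,y)*L(mu|y)
sapply(c(4, 5, 6), function(mu) loglike(mu, x) - loglike(mu, y))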

Example (2.3.15)

Consider the following problem: we have a Bernoulli trial with success parameter p, and we wish to estimate p. 

Experiment 1: in this experiment we repeat the Bernoulli trial 20 times, so the rv X~Bin(20,p). We find x=7. Therefore

\[L_1(p|7)= {{20}\choose 7} p^7(1-p)^{13}\]

Experiment 2: in this experiment we repeat the Bernoulli trials until the 7th success, so the rv Y~NegBin(7,p). We find y=20, therefore

\[L_2(p|20)= {{19}\choose 6} p^7(1-p)^{13}\]

So \(L_1(p|7)=cL_2(p|20)\), and according to the likelihood principle both experiments should result in the same estimate of p, even though we performed completely different experiments.
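We can verify the proportionality in R: the ratio of the two likelihoods is \(\binom{20}{7}/\binom{19}{6}=20/7\) no matter what p is. Note that dnbinom counts the failures before the 7th success, here 20-7=13:

p <- seq(0.1, 0.9, 0.2)
L1 <- dbinom(7, 20, p)      # Bin(20, p), 7 successes observed
L2 <- dnbinom(13, 7, p)     # 13 failures before the 7th success
L1/L2                       # constant in p: choose(20, 7)/choose(19, 6) = 20/7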

The likelihood principle is a good general principle for a statistical procedure, but there are common situations where it is violated. For example, an important task in Statistics is model checking. Say we have the following probability model: \(X_1, .., X_n\sim N(\mu,\sigma)\), and we want to estimate \(\mu\). But then we worry whether our data really follow a normal distribution, so we do some checking, for example draw a boxplot. This, though, violates the likelihood principle, because for one data set we might decide that the normal assumption is wrong whereas for another we might accept it, even though both have the same sample mean.