The Likelihood Function

One of the most important "objects" in Statistics is the "likelihood function", defined as follows:

Let X=(X1, .., Xn) be a random vector with joint pdf f(x|θ). Then the likelihood function L is defined as

L(θ|x)=f(x|θ)

This must be one of the most deceptively simple equations in mathematics; at first it looks like nothing more than a change in notation, L instead of f. What really matters, and makes a huge difference, is that in the pdf we consider the x's as the variables and θ as fixed, whereas in the likelihood function we consider θ as the variable and the x's as fixed. Essentially we consider what happens after the experiment is done, that is, what the data can tell us about the parameter.
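
To make the distinction concrete, here is a minimal numeric sketch in Python. The Bernoulli pmf p^x(1-p)^(1-x) is used only as an illustration (it anticipates the first example below), and all numbers are made up:

def f(x, p):
    # Bernoulli pmf/likelihood: p^x * (1-p)^(1-x) for x in {0, 1}
    return p**x * (1 - p)**(1 - x)

# read as a pmf: p fixed at 0.3, x varies over {0, 1}
print([f(x, 0.3) for x in (0, 1)])          # [0.7, 0.3], sums to 1

# read as a likelihood: x fixed at 1, p varies
print([f(1, p) for p in (0.1, 0.5, 0.9)])   # [0.1, 0.5, 0.9], need not sum to 1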

Things simplify a bit if X1, .., Xn is an iid sample. Then the joint density is given by

f(x|θ)=∏f(xi|θ)

Example X1, .., Xn~Ber(p): here f(x|p)=p^x(1-p)^(1-x), x=0,1, so

L(p|x) = ∏ p^xi (1-p)^(1-xi) = p^(∑xi) (1-p)^(n-∑xi)
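
As a quick illustration, a sketch in Python that evaluates this likelihood for a small made-up data set; the observations and the candidate values of p are assumptions for the example only:

import numpy as np

x = np.array([1, 0, 1, 1, 0])           # illustrative 0/1 data
n, s = len(x), x.sum()                  # n = 5, sum of x_i = 3

def L(p):
    # L(p|x) = p^(sum x_i) * (1-p)^(n - sum x_i)
    return p**s * (1 - p)**(n - s)

for p in (0.2, 0.6, 0.8):
    print(p, L(p))                      # largest at p near sum(x)/n = 0.6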

Example X1, .., Xn~N(μ,σ): here

L(μ,σ|x) = ∏ 1/√(2πσ²) exp(-(xi-μ)²/(2σ²)) = (2πσ²)^(-n/2) exp(-∑(xi-μ)²/(2σ²))
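
A similar sketch, using scipy's normal density to compute the product; the data and the two candidate values of μ are made up:

import numpy as np
from scipy.stats import norm

x = np.array([4.2, 5.1, 4.8, 5.5, 4.9])     # illustrative data

def L(mu, sigma):
    # product of N(mu, sigma) densities = (2 pi sigma^2)^(-n/2) exp(-sum (x_i - mu)^2 / (2 sigma^2))
    return np.prod(norm.pdf(x, loc=mu, scale=sigma))

print(L(5.0, 0.5), L(3.0, 0.5))             # the first is far larger: mu = 5 fits these x better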

Example X1, .., Xn~Γ(α,β), using the scale parameterization f(x|α,β) = 1/(Γ(α)β^α) x^(α-1) e^(-x/β): here

L(α,β|x) = ∏ 1/(Γ(α)β^α) xi^(α-1) e^(-xi/β) = (Γ(α)β^α)^(-n) (∏xi)^(α-1) e^(-∑xi/β)
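
Again a sketch, assuming the scale parameterization written above and using scipy.stats.gamma; the data and the parameter values are made up:

import numpy as np
from scipy.stats import gamma

x = np.array([1.2, 0.8, 2.5, 1.7, 0.9])     # illustrative data

def L(alpha, beta):
    # scale parameterization: f(x|alpha, beta) = x^(alpha-1) e^(-x/beta) / (Gamma(alpha) beta^alpha)
    return np.prod(gamma.pdf(x, a=alpha, scale=beta))

print(L(2.0, 0.7), L(2.0, 5.0))             # compare two candidate parameter values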

Example An urn contains N balls. ni of these balls have the number "i" on them, i=1,..,k, and ∑ni=N. Say we randomly select a ball from the urn, note its number and put it back. We repeat this m times. Let the rv Xi be the number of draws that show the number i, and let X=(X1, .., Xk).

Since the draws are made with replacement, the pmf of X is given by

P(X1=x1, .., Xk=xk) = m!/(x1! ·· xk!) ∏ (ni/N)^xi

for any x1,..,xk with xi≥0 and ∑xi=m.
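
As a quick check of this pmf, the direct formula should agree with scipy's multinomial distribution; the urn composition and the observed counts below are made up:

import numpy as np
from math import factorial
from scipy.stats import multinomial

N, n_counts = 10, [5, 3, 2]                  # illustrative urn: n_1=5, n_2=3, n_3=2 balls
m = 6                                        # number of draws, with replacement
x = [3, 2, 1]                                # observed counts, sum = m
p = [ni / N for ni in n_counts]

# direct formula: m!/(x_1! .. x_k!) * prod (n_i/N)^x_i
direct = factorial(m) / np.prod([factorial(xi) for xi in x]) * np.prod([pi**xi for pi, xi in zip(p, x)])
print(direct, multinomial.pmf(x, n=m, p=p))  # the two values agree (up to rounding)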

Now let's assume we don't know n1,..,nk and want to estimate them. First we can make a slight change in the parameterization:

pi=ni/N, i=1,..,k

The resulting random vector is called the multinomial rv with parameters m, p1, .., pk.

Note that if k=2, X1~Bin(m,p1) and X2=m-X1~Bin(m,p2), so the multinomial is a generalization of the binomial.
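
A one-line numeric check of this claim (the values of m and p1 are illustrative):

from scipy.stats import binom, multinomial

m, p1 = 6, 0.3
for x1 in range(m + 1):
    # with k=2 the multinomial pmf at (x1, m-x1) equals the Bin(m, p1) pmf at x1
    assert abs(multinomial.pmf([x1, m - x1], n=m, p=[p1, 1 - p1]) - binom.pmf(x1, m, p1)) < 1e-12

print("multinomial with k=2 matches the binomial")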

The likelihood function is given by

L(p1, .., pk|x1, .., xk) = m!/(x1! ·· xk!) ∏ pi^xi

where p1+..+pk=1 and x1+..+xk=m.
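
A small sketch evaluating this likelihood at two candidate probability vectors, for made-up counts; a vector whose proportions are close to xi/m scores much higher:

from scipy.stats import multinomial

x = [3, 2, 1]                     # illustrative observed counts
m = sum(x)                        # m = 6

def L(p):
    # L(p_1,..,p_k | x) = m!/(x_1!..x_k!) * prod p_i^x_i
    return multinomial.pmf(x, n=m, p=p)

print(L([0.5, 0.3, 0.2]), L([0.1, 0.1, 0.8]))   # proportions near x/m = (0.5, 0.33, 0.17) win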

There is a common misconception about the likelihood function: because it has the same formula as the pdf (pmf), it must have the same properties. This is not true, because the likelihood function is a function of the parameters, not of the variables. In particular, it need not integrate (or sum) to 1 over the parameter.

Example X~Ber(p), so f(x) = p^x (1-p)^(1-x), x=0,1, 0<p<1

As a function of x with a fixed p we have f(x)≥0 for all x and f(0)+f(1)=1, but as a function of p with a fixed x, say x=1, we have L(p|1)=p, and ∫₀¹ p dp = 1/2 ≠ 1, so the likelihood function does not integrate to 1 over p.
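
A quick numerical confirmation of that integral, as a sketch using scipy's quadrature:

from scipy.integrate import quad

# L(p|x=1) = p on (0,1); integrate it over the parameter
area, _ = quad(lambda p: p, 0, 1)
print(area)    # 0.5, not 1: the likelihood is not a density in p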

It turns out that for many problems the log of the likelihood function is a more manageable entity, mainly because it turns the product into a sum:

l(θ|x) = log L(θ|x) = ∑ log f(xi|θ)

Example X1, .., Xn~Ber(p): here

l(p|x) = ∑ [xi log p + (1-xi) log(1-p)] = (∑xi) log p + (n-∑xi) log(1-p)

(worry about Xi=0 for all i or Xi=1 for all i yourself)
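
A sketch with simulated 0/1 data showing why the sum form is more manageable in practice: for large n the raw product underflows to 0 in floating point, while the log-likelihood stays a perfectly ordinary number (the sample size and seed below are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=5000)            # simulated 0/1 data
n, s = len(x), x.sum()

def loglik(p):
    # l(p|x) = (sum x_i) log p + (n - sum x_i) log(1 - p)
    return s * np.log(p) + (n - s) * np.log(1 - p)

print(loglik(0.5))                  # roughly -3466, easy to work with
print(0.5**s * 0.5**(n - s))        # the raw likelihood L(0.5|x) underflows to 0.0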

Example X1, .., Xn~N(μ,σ): here

l(μ,σ|x) = -n/2 log(2πσ²) - ∑(xi-μ)²/(2σ²)
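
As a sanity check, this formula can be compared against the sum of scipy's log-densities; the data and parameter values are made up:

import numpy as np
from scipy.stats import norm

x = np.array([4.2, 5.1, 4.8, 5.5, 4.9])      # illustrative data
mu, sigma = 5.0, 0.5

# direct formula: -n/2 * log(2 pi sigma^2) - sum (x_i - mu)^2 / (2 sigma^2)
direct = -len(x)/2 * np.log(2*np.pi*sigma**2) - np.sum((x - mu)**2) / (2*sigma**2)
print(direct, norm.logpdf(x, loc=mu, scale=sigma).sum())   # the two values agree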

Notice that as a function of μ the log-likelihood of a normal is a quadratic. The coefficient of the quadratic term is negative, so its graph is a parabola that opens downward. Moreover, the log-likelihood is a sum of iid terms involving (Xi-μ)², so an application of the central limit theorem shows that for large n the log-likelihood is often approximately quadratic even when the original distribution is not normal.

Example Say Xi~Exp(λ), i=1,..,n, with density f(x|λ)=λe^(-λx), and the Xi are independent. Then

l(λ|x) = n log λ - λ ∑xi

In the following 4 graphs we draw this curve for n=10, λ=1 and 4 random samples:
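
The original graphs are not reproduced here; the following is a sketch of how such curves could be drawn with numpy and matplotlib, assuming the rate parameterization used above (the seed and the grid of λ values are arbitrary):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
lam_true, n = 1.0, 10
lam_grid = np.linspace(0.2, 3, 200)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax in axes.flat:
    x = rng.exponential(scale=1/lam_true, size=n)          # one random sample of size 10
    loglik = n * np.log(lam_grid) - lam_grid * x.sum()     # l(lambda|x) = n log(lambda) - lambda * sum x_i
    ax.plot(lam_grid, loglik)
    ax.set_xlabel("lambda")
    ax.set_ylabel("log-likelihood")
plt.tight_layout()
plt.show()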