Point Estimation

Properties of Estimators

Estimators can have a number of nice properties. One of them is unbiasedness.

An estimator T = T(X1, ..,Xn) is called unbiased for θ if

E[T] = θ

The quantity

B(T) = E[T] - θ

is called the bias of the estimator T.

Example: say X1, ..,Xn are iid with mean μ and standard deviation σ. Then

EX̅ = E[(1/n)∑Xi] = (1/n)∑E[Xi] = (1/n)∑μ = (1/n)*nμ = μ

so X̅ is an unbiased estimator of μ.

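How about the variance σ²? Here we will use as an estimator the sample variance

S² = (1/(n-1))∑(Xi-X̅)²

A direct calculation shows that E[S²] = σ², so S² is an unbiased estimator of σ², and this of course explains why we use "n-1" instead of "n".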
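The effect of dividing by n-1 rather than n can be checked by simulation. The following Python sketch assumes normally distributed data with mean 0 (any distribution with finite variance would do); the function name and the values n=5, σ=2 are illustrative choices, not part of the notes.

```python
import random

def average_variance_estimates(n, sigma, reps, seed=0):
    """Average the n-1 and the n version of the variance estimator over many samples."""
    rng = random.Random(seed)
    total_unbiased = 0.0
    total_biased = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        sum_sq = sum((x - mean) ** 2 for x in sample)
        total_unbiased += sum_sq / (n - 1)   # sample variance S^2
        total_biased += sum_sq / n           # same sum of squares divided by n
    return total_unbiased / reps, total_biased / reps

# True variance is sigma^2 = 4; the n version should average about (n-1)/n * 4 = 3.2
mean_s2, mean_biased = average_variance_estimates(n=5, sigma=2.0, reps=20000)
print(mean_s2, mean_biased)
```

With a small n the difference between the two averages is easy to see; it shrinks as n grows because (n-1)/n approaches 1.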

How about the standard deviation σ? Unfortunately

E[√S²] ≠ √E[S²]

(actually, a famous inequality called Jensen's inequality says E[√S²] ≤ √E[S²])

so S = √S² is NOT an unbiased estimator of the standard deviation σ. In fact, in this generality nothing more can be said. However, if we also assume that the Xi come from a normal distribution it can be shown (Holzman 1950) that cnS is an unbiased estimator of σ, where

cn = Γ((n-1)/2)·√((n-1)/2) / Γ(n/2)

Surprised that the Gamma function shows up here? Remember that if X1,..,Xn are iid Normal, (n-1)S²/σ² has a chi-square distribution with n-1 degrees of freedom, and the chi-square is a special case of the Gamma distribution.
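The correction factor is easy to evaluate numerically. The sketch below computes cn with the standard library's log-Gamma function and checks by simulation that cn times the average of S is close to σ for normal data. The function names and the values n=5, σ=2 are assumptions made for illustration.

```python
import math
import random

def c_n(n):
    """Correction factor c_n = Gamma((n-1)/2) * sqrt((n-1)/2) / Gamma(n/2)."""
    # Work with log-Gamma so the ratio does not overflow for large n
    log_ratio = math.lgamma((n - 1) / 2) - math.lgamma(n / 2)
    return math.exp(log_ratio) * math.sqrt((n - 1) / 2)

def mean_sample_sd(n, sigma, reps, seed=0):
    """Monte Carlo estimate of E[S] for normal samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        total += math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return total / reps

n, sigma = 5, 2.0
mean_s = mean_sample_sd(n, sigma, reps=20000)
# mean_s underestimates sigma; multiplying by c_n should bring it close to 2
print(c_n(n), mean_s, c_n(n) * mean_s)
```

The factor is always larger than 1 and tends to 1 as n grows, so the bias of S matters mostly for small samples.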

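Example: say X1, ..,Xn are iid U[0,θ]. Find an unbiased estimator of θ.
We have Xi<θ for all i, so max{Xi}<θ as well. In fact we would expect the largest of the Xi to be close to θ. So let's try as an estimator T=max{Xi}. For 0<t<θ we have P(T≤t) = P(X1≤t)·..·P(Xn≤t) = (t/θ)^n, so T has density n·t^(n-1)/θ^n on (0,θ) and, integrating from 0 to θ,

E[T] = ∫ t·n·t^(n-1)/θ^n dt = nθ/(n+1)

so T is not an unbiased estimator of θ, but of course it is easy to make it so, just use (n+1)/n·T!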
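A quick simulation illustrates the bias of the maximum and the effect of the correction factor (n+1)/n. This is a sketch only; the function name and the values n=10, θ=3 are illustrative assumptions.

```python
import random

def mean_of_max_estimators(n, theta, reps, seed=0):
    """Average of T = max{Xi} and of (n+1)/n * T over many U[0, theta] samples."""
    rng = random.Random(seed)
    total_max = 0.0
    for _ in range(reps):
        total_max += max(rng.uniform(0.0, theta) for _ in range(n))
    mean_max = total_max / reps
    return mean_max, (n + 1) / n * mean_max

# E[T] should be near n*theta/(n+1) = 30/11, the corrected version near theta = 3
mean_t, mean_corrected = mean_of_max_estimators(n=10, theta=3.0, reps=20000)
print(mean_t, mean_corrected)
```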

Method of Moments

Let X = X1, ..,Xn be a sample from a distribution with pmf (pdf) f(x|θ1,..,θk). Then the ith sample moment is defined by

mi = (X1^i+..+Xn^i)/n

Analogously define the ith population moment by

μi = E[X^i]

Of course μi is a function of θ1,..,θk. So we can find estimators of θ1,..,θk by solving the system of k equations in k unknowns

mi = μi, i=1,..,k

Example: say X1, ..,Xn are iid U[0,θ]

we have

μ1 = E[X1] = θ/2

so we get

m1 = μ1 = θ/2

or θ̂ = 2m1 = 2X̅

Example: say X1, ..,Xn are iid N(μ,σ).
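For the normal case the two moment equations can be solved in closed form. The following is a sketch of the standard calculation, reading σ in N(μ,σ) as the standard deviation (an assumption about the notation) and writing hats for the resulting estimators:

```latex
\begin{aligned}
\mu_1 &= E[X] = \mu, \qquad \mu_2 = E[X^2] = \sigma^2 + \mu^2 \\
m_1 &= \hat\mu, \qquad m_2 = \hat\sigma^2 + \hat\mu^2 \\
\hat\mu &= m_1 = \bar X, \qquad
\hat\sigma^2 = m_2 - m_1^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2
\end{aligned}
```

Note that the method of moments estimator of σ² divides by n, not n-1, so it differs from the sample variance by the factor (n-1)/n.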

Maximum Likelihood

The idea here is this: the likelihood function gives the likelihood (not the probability!) of a value of the parameter given the observed data, so why not choose the value that best "matches" the observed data, that is, the value that gives the greatest likelihood?

Example: say X1, ..,Xn are iid Ber(p). First notice that a function f has an extremal point at x iff log(f) does as well because

d/dx{log(f(x))}=f'(x)/f(x)=0 iff f'(x)=0
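For the Bernoulli sample this means we can work with the log-likelihood. A sketch of the standard calculation follows; the shorthand y = x1+..+xn is introduced here for convenience, and 0 < y < n is assumed so that the stationary point lies inside (0,1):

```latex
\begin{aligned}
L(p|x) &= \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{y}(1-p)^{n-y} \\
l(p|x) &= \log L(p|x) = y\log p + (n-y)\log(1-p) \\
\frac{dl}{dp} &= \frac{y}{p} - \frac{n-y}{1-p} = 0
  \quad\Longrightarrow\quad \hat p = \frac{y}{n} = \bar x \\
\frac{d^2 l}{dp^2} &= -\frac{y}{p^2} - \frac{n-y}{(1-p)^2} < 0
\end{aligned}
```

The second derivative is negative for every p in (0,1), so the stationary point is the unique maximum.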


Here we verified that the extremum is really a maximum. Recall that we previously said that the log-likelihood of the normal distribution as a function of μ is a quadratic opening downward, and that because of the central limit theorem the log-likelihood function of many distributions approaches that of a normal. For this reason the log-likelihood function (almost) always has a maximum at the mle. There are exceptions, though:

Example: say X1, ..,Xn are iid U[0,θ], θ>0. Then

L(θ|x) = 1/θ^n if θ ≥ max(xi), and L(θ|x) = 0 otherwise

Now L(θ|x) is 0 on (0,max(xi)), at max(xi) it jumps to 1/(max(xi))^n and then monotonically decreases as θ gets bigger, so the maximum is attained at θ=max(xi). Therefore the mle is max(xi).

Here is an example: say n=10 and max{xi}=1.55, then the likelihood curve looks like this

and it has its maximum at 1.55

Notice that here the log-likelihood is of no use, because L(θ|x)=0 for values of θ just below the point where the maximum is attained, so the maximum cannot be found by setting a derivative equal to 0.
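The shape of this likelihood can be checked numerically. The Python sketch below evaluates L(θ|x) on a grid for n=10 and max{xi}=1.55 and picks the grid point with the largest value. The function name and the grid from 1.00 to 2.50 are choices made here for illustration.

```python
def uniform_likelihood(theta, n, x_max):
    """Likelihood of U[0, theta] for n observations whose largest value is x_max."""
    if theta < x_max:
        return 0.0            # at least one observation would exceed theta
    return theta ** (-n)      # product of n densities 1/theta

n, x_max = 10, 1.55
# Build the grid from integers so that 1.55 is hit exactly
grid = [k / 100 for k in range(100, 251)]
values = [uniform_likelihood(t, n, x_max) for t in grid]
theta_hat = grid[values.index(max(values))]
print(theta_hat)
```

The likelihood is zero up to the largest observation and strictly decreasing afterwards, which is why a grid search works here while setting a derivative to zero does not.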

Example X1, .., Xn~N(μ,σ):
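For the normal case the maximization can be sketched as follows, reading σ as the standard deviation and maximizing over μ and σ² jointly. This is the standard textbook calculation and not necessarily the presentation originally used in these notes; the hats denote the resulting estimators.

```latex
\begin{aligned}
l(\mu,\sigma^2|x) &= -\frac{n}{2}\log(2\pi\sigma^2)
   - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2 \\
\frac{\partial l}{\partial \mu} &= \frac{1}{\sigma^2}\sum_{i=1}^n (x_i-\mu) = 0
   \quad\Longrightarrow\quad \hat\mu = \bar x \\
\frac{\partial l}{\partial \sigma^2} &= -\frac{n}{2\sigma^2}
   + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i-\hat\mu)^2 = 0
   \quad\Longrightarrow\quad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i-\bar x)^2
\end{aligned}
```

So the mle of σ² divides by n and equals (n-1)/n times the sample variance.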

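Example: say X has a multinomial distribution with parameters p1,..,pk (we assume m is known). Then l(p1,..,pk) = c + ∑xi·log(pi), where c does not depend on the pi. If we simply find the derivatives of the log-likelihood we find

dl/dpi = xi/pi = 0, i=1,..,k

and this system has no solution. The problem is that we are ignoring the condition p1+..+pk=1. So we really have the problem

Maximize l(p1,..,pk) given p1+..+pk=1

One way to do this is with the method of Lagrange multipliers: maximize

l(p1,..,pk) - λ(p1+..+pk-1)

Setting the derivatives with respect to pi equal to 0 gives xi/pi = λ, so pi = xi/λ. Summing over i and using x1+..+xk=m gives λ=m, and so the mle of pi is xi/m.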

Maximum likelihood estimators have a number of nice properties. One of them is their invariance under transformations. That is, if θ̂ is the mle of θ, then g(θ̂) is the mle of g(θ).

Example: say X1, .., Xn~Ber(p), so we know that the mle of p is X̅. Say we are interested in θ=p-q=p-(1-p)=2p-1, the difference in proportions. Then by invariance 2X̅-1 is the mle of θ.

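Let's see whether we can verify that. First if θ=2p-1 we have p=(1+θ)/2 and 1-p=(1-θ)/2, and so with y=x1+..+xn

l(θ|x) = y·log((1+θ)/2) + (n-y)·log((1-θ)/2)

dl/dθ = y/(1+θ) - (n-y)/(1-θ) = 0

which gives y(1-θ) = (n-y)(1+θ), or θ̂ = 2y/n-1 = 2X̅-1, as expected.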

Example: say X1, .., Xn~N(μ,σ), so we know that the mle of σ² is (1/n)∑(Xi-X̅)² = ((n-1)/n)S². But then the mle of σ is the square root of this, √((n-1)/n)·S.


Let X=(X1,..,Xn) be a random sample from some pmf (pdf) f(.|θ). Then the quantity

I(θ) = E[(d/dθ log f(X|θ))²]

is called the Fisher information number of the sample.

If the sample is independent and identically distributed things simplify:

I(θ) = n·E[(d/dθ log f(X1|θ))²]

We said before that the log-likelihood function of many distributions approaches that of a normal. We can now make this statement precise:

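Theorem: Let X1, .., Xn be iid f(x|θ). Let θ̂ denote the mle of θ. Under some regularity conditions on f we have

√n(θ̂-θ) converges in distribution to a normal distribution with mean 0 and variance 1/v

where v is the Fisher information number of a single observation.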

Example: say X1, .., Xn are iid Ber(p) and we want to estimate p. We saw above that the mle is given by X̅.
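As a sketch of how the theorem applies in this case, the Fisher information of a single Bernoulli observation can be computed directly. This is a standard calculation, with v read as the information in one observation, which is an assumption about the intended statement of the theorem.

```latex
\begin{aligned}
\log f(x|p) &= x\log p + (1-x)\log(1-p) \\
\frac{d}{dp}\log f(x|p) &= \frac{x}{p} - \frac{1-x}{1-p} = \frac{x-p}{p(1-p)} \\
v &= E\left[\left(\frac{X-p}{p(1-p)}\right)^2\right]
   = \frac{\mathrm{Var}(X)}{p^2(1-p)^2} = \frac{1}{p(1-p)}
\end{aligned}
```

So √n(X̅-p) is approximately normal with mean 0 and variance 1/v = p(1-p), which agrees with what the central limit theorem gives for X̅ directly.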

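Example: say Xi~Exp(λ), i=1,2,.. and the Xi are independent. Then we saw before that if Sn=X1+..+Xn

l(λ;x) = n·log(λ)-λSn

so dl/dλ = n/λ-Sn = 0 gives the mle n/Sn = 1/X̅, and d²l/dλ² = -n/λ², so the Fisher information number of a single observation is v = 1/λ².

Here are 4 examples with n=10 and λ=1. The true log-likelihood curve is in black and the parabola from the normal approximation is in blue.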
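The comparison between the log-likelihood and its parabola can be reproduced numerically. The Python sketch below simulates one sample with n=10 and λ=1 and tabulates both curves around the mle. It assumes the parabola is the second-order Taylor expansion of the log-likelihood at the mle, with curvature -n/λ̂²; the function names and the seed are illustrative.

```python
import math
import random

def exp_loglik(lam, n, s_n):
    """Log-likelihood n*log(lam) - lam*S_n of an Exp(lam) sample."""
    return n * math.log(lam) - lam * s_n

def normal_approx(lam, n, s_n):
    """Parabola through the maximum of the log-likelihood with curvature -n/lam_hat**2."""
    lam_hat = n / s_n
    return exp_loglik(lam_hat, n, s_n) - n / (2 * lam_hat ** 2) * (lam - lam_hat) ** 2

rng = random.Random(1)
n = 10
sample = [rng.expovariate(1.0) for _ in range(n)]   # true rate lambda = 1
s_n = sum(sample)
lam_hat = n / s_n                                    # maximum likelihood estimate

# Tabulate both curves at a few points around the mle
for factor in (0.6, 0.8, 1.0, 1.2, 1.4):
    lam = factor * lam_hat
    print(round(lam, 3),
          round(exp_loglik(lam, n, s_n), 3),
          round(normal_approx(lam, n, s_n), 3))
```

The two curves agree at the mle and stay close near it; further away the log-likelihood is visibly asymmetric, and the asymmetry fades as n grows.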

Bayesian Point Estimation

We have already seen how to use a Bayesian approach to find point estimators, namely by using the mean of the posterior distribution. Of course one could also use the median or any other measure of central tendency. A popular choice, for example, is the mode of the posterior distribution.

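Example: Let's say we have X1, .., Xn~Ber(p) and p~Beta(α,β), then we already know that

p|x ~ Beta(α+∑xi, β+n-∑xi)

and so we can estimate p as follows:

posterior mean: (α+∑xi)/(α+β+n)
posterior mode: (α+∑xi-1)/(α+β+n-2), if both parameters of the posterior are greater than 1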
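A minimal Python sketch of these estimates, using the standard conjugate result that the posterior is Beta(α+y, β+n-y) with y the number of successes. The function names and the data (3 successes in 10 trials with a Beta(1,1) prior) are hypothetical and chosen only for illustration.

```python
def beta_posterior(alpha, beta, n, y):
    """Posterior Beta parameters after y successes in n Bernoulli trials."""
    return alpha + y, beta + n - y

def posterior_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def posterior_mode(a, b):
    """Mode of a Beta(a, b) distribution; this formula needs a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula needs a > 1 and b > 1")
    return (a - 1) / (a + b - 2)

# Hypothetical data: 3 successes in 10 trials, uniform prior Beta(1, 1)
a, b = beta_posterior(alpha=1, beta=1, n=10, y=3)
print(posterior_mean(a, b), posterior_mode(a, b))
```

With the uniform prior the posterior mode coincides with the mle y/n, while the posterior mean is pulled slightly towards 1/2.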

Example: A switchboard keeps track of the number of phone calls it receives per hour during the day:

1 5 4 4 3 1 5 5 5 6 7 5 4 4 4 2 4 7 0 3 4 9 4 2

Let's say they believe that the number of calls has a Poisson distribution with rate λ.

a) Find the maximum likelihood estimator of the rate λ

l(λ|x) = -nλ + (∑xi)·log(λ) - ∑log(xi!)

dl/dλ = -n + (∑xi)/λ = 0

so the mle is ∑xi/n = X̅, and we find X̅ = 98/24 = 4.08

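b) Find the Bayes estimator for λ as the median of the posterior distribution. Use as a prior π(λ)=1/λ, λ>0

This is of course an improper prior because ∫π(λ)dλ=∞. The posterior is nevertheless proper:

π(λ|x) ∝ λ^(∑xi)·e^(-nλ)·(1/λ) = λ^(∑xi-1)·e^(-nλ)

so λ|x has a Gamma distribution with shape ∑xi=98 and rate n=24. Its median is about 4.07, and we get essentially the same answer! (The posterior mean, 98/24, is exactly X̅.)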
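Both parts of this example can be checked with a few lines of Python. The sketch below assumes that the posterior under the prior 1/λ is a Gamma distribution with shape ∑xi and rate n, and approximates its median by simulation because the standard library has no Gamma quantile function. The variable names, seed and number of draws are illustrative.

```python
import random
import statistics

calls = [1, 5, 4, 4, 3, 1, 5, 5, 5, 6, 7, 5, 4, 4, 4, 2, 4, 7, 0, 3, 4, 9, 4, 2]
n = len(calls)
total = sum(calls)

# a) maximum likelihood estimate of the Poisson rate: the sample mean
mle = total / n

# b) posterior median under the prior 1/lambda.
# random.gammavariate takes a shape and a SCALE parameter, so scale = 1/rate = 1/n.
rng = random.Random(0)
draws = [rng.gammavariate(total, 1.0 / n) for _ in range(100000)]
posterior_median = statistics.median(draws)

print(round(mle, 2), posterior_median)
```

The simulated median carries a small Monte Carlo error, so it should be read to about two decimal places only.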