Point Estimation

Properties of Estimators

Estimators can have a number of nice properties. One of them is unbiasedness:

Definition

An estimator θ̂ = T(X₁, ..,X_n) is called unbiased for θ if

E[θ̂] = θ

The quantity

B(θ̂) = E[θ̂] - θ

is called the bias of the estimator θ̂.

Example: say X₁, ..,X_n are iid with mean μ and standard deviation σ. Then

EX̅ = E[(1/n)∑X_i] = (1/n)∑E[X_i] = (1/n)∑μ = (1/n)*nμ = μ

so X̅ is an unbiased estimator of μ.

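How about the variance σ²? Here we will use as an estimator the sample variance

S² = (1/(n-1))∑(X_i-X̅)²

Using ∑(X_i-X̅)² = ∑X_i² - nX̅², E[X_i²] = σ²+μ² and E[X̅²] = σ²/n+μ² we find

E[∑(X_i-X̅)²] = n(σ²+μ²) - n(σ²/n+μ²) = (n-1)σ²

so E[S²] = σ², and this of course explains why we use "n-1" instead of "n".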
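The effect of the divisor can be checked with a small simulation. The sketch below is only an illustration, not part of the original notes: it assumes normal data with σ=2 and n=5 (both arbitrary choices) and averages the two candidate estimators, one with divisor n-1 and one with divisor n, over many samples.

```python
import random

def average_variance_estimates(n=5, sigma=2.0, reps=20000, seed=1):
    """Average the divisor-(n-1) and divisor-n variance estimates over many samples."""
    rng = random.Random(seed)
    sum_div_n_minus_1 = 0.0
    sum_div_n = 0.0
    for _ in range(reps):
        x = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(x) / n
        ss = sum((v - xbar) ** 2 for v in x)   # sum of squared deviations
        sum_div_n_minus_1 += ss / (n - 1)
        sum_div_n += ss / n
    return sum_div_n_minus_1 / reps, sum_div_n / reps

avg_s2, avg_biased = average_variance_estimates()
# The true variance is 4; the divisor-n version should be near 4*(n-1)/n = 3.2.
print(avg_s2, avg_biased)
```

The first average should land close to σ²=4, the second close to (n-1)σ²/n.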

How about the standard deviation σ? Unfortunately

E√S²≠ √ES²

(actually, a famous inequality called Jensen's inequality says E√S²≤ √ES²)

so S = √S² is NOT an unbiased estimator of the standard deviation σ. In fact, in this generality nothing more can be said. However, if we also assume that the X_i come from a normal distribution it can be shown (Holzman 1950) that

E[c_nS]=σ

where

c_n = Γ((n-1)/2)·√((n-1)/2) / Γ(n/2)

Surprised that the Gamma function shows up here? Remember that if X₁,..,X_n are iid Normal, (n-1)S²/σ² has a chi-square distribution with n-1 degrees of freedom, and the chi-square is a special case of the Gamma distribution.
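To get a feeling for the size of the correction, here is a minimal sketch that evaluates c_n = Γ((n-1)/2)·√((n-1)/2)/Γ(n/2). The function name c_n and the use of log-Gamma values (to avoid overflow for large n) are choices made for this illustration.

```python
import math

def c_n(n):
    """Correction factor sqrt((n-1)/2) * Gamma((n-1)/2) / Gamma(n/2), for n >= 2."""
    if n < 2:
        raise ValueError("n must be at least 2")
    # Work with log-Gamma so that large n does not overflow.
    log_ratio = math.lgamma((n - 1) / 2) - math.lgamma(n / 2)
    return math.sqrt((n - 1) / 2) * math.exp(log_ratio)

for n in (2, 5, 10, 100):
    print(n, c_n(n))
```

For n=2 the factor equals √(π/2), and it approaches 1 as n grows, so the bias of S matters mostly for small samples.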

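Example: say X₁, ..,X_n are iid U[0,θ]. Find an unbiased estimator of θ.

We have X_i<θ for all i, so max{X_i}<θ as well. In fact we would expect the largest of the X_i to be close to θ. So let's try as an estimator T=max{X_i}. For 0<t<θ

P(T≤t) = P(X₁≤t,..,X_n≤t) = (t/θ)ⁿ

so T has density f_T(t) = ntⁿ⁻¹/θⁿ on (0,θ), and

E[T] = ∫ t·ntⁿ⁻¹/θⁿ dt = nθ/(n+1)

where the integral runs from 0 to θ. So T is not an unbiased estimator of θ, but of course it is easy to make it so, just use (n+1)/n·T!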
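A quick simulation makes the bias of the maximum visible. This is only a sketch under assumed values (n=5, θ=3, chosen arbitrarily); it averages T=max{X_i} over many samples and then rescales by (n+1)/n.

```python
import random

def average_max(n=5, theta=3.0, reps=20000, seed=2):
    """Average of max{X_i} over many simulated U[0,theta] samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += max(rng.uniform(0.0, theta) for _ in range(n))
    return total / reps

n, theta = 5, 3.0
avg_T = average_max(n, theta)
corrected = (n + 1) / n * avg_T
# avg_T should be near n/(n+1)*theta = 2.5, corrected near theta = 3.
print(avg_T, corrected)
```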

Method of Moments

Let X = X₁, ..,X_n be a sample from a distribution with pmf(pdf) f(x|θ₁,..,θ_k).

Definition

The i^th sample moment is defined by

m_i = (Xⁱ₁+..+Xⁱ_n)/n

Analogously define the i^th population moment by

μ_i = EXⁱ

Of course μ_i is a function of the θ₁,..,θ_k. So we can find estimators of θ₁,..,θ_k by solving the system of k equations in k unknowns

m_i=μ_i i=1,..,k

Example: say X₁, ..,X_n are iid U[0,θ]

we have

μ₁ = EX₁ = θ/2

so we get

m₁ = μ₁ = θ/2

or θ̂ = 2m₁ = 2X̅

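Example: say X₁, ..,X_n are iid N(μ,σ). Here k=2 and

μ₁ = EX₁ = μ

μ₂ = EX₁² = σ²+μ²

so solving m₁=μ₁, m₂=μ₂ gives the estimators X̅ for μ and m₂-m₁² = (1/n)∑(X_i-X̅)² for σ².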
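For the normal case the two moment equations can be solved directly. The sketch below assumes the standard solution (mean estimated by m₁, variance by m₂-m₁²); the function name and the small data set are made up for illustration.

```python
def method_of_moments_normal(x):
    """Method of moments estimates (mu, sigma) for a normal sample x."""
    n = len(x)
    m1 = sum(x) / n                   # first sample moment
    m2 = sum(v * v for v in x) / n    # second sample moment
    var_hat = max(m2 - m1 ** 2, 0.0)  # guard against tiny negative rounding error
    return m1, var_hat ** 0.5

# Made-up data with mean 5 and (divisor-n) standard deviation 2.
mu_hat, sigma_hat = method_of_moments_normal([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(mu_hat, sigma_hat)
```

Note that the variance estimate here has divisor n, so it differs slightly from the sample variance with divisor n-1.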

Maximum Likelihood

The idea here is this: the likelihood function gives the likelihood (not the probability!) of a value of the parameter given the observed data, so why not choose the value of the parameter that best "matches" the observed data, that is, the value with the greatest likelihood?

Example: say X₁, ..,X_n are iid Ber(p). First notice that a positive function f has an extremal point at x iff log(f) does as well because

d/dx{log(f(x))}=f'(x)/f(x)=0 iff f'(x)=0

Now with y=∑x_i

L(p|x) = ∏p^{x_i}(1-p)^{1-x_i} = p^y(1-p)^{n-y}

l(p) = log L(p|x) = y·log(p)+(n-y)·log(1-p)

dl/dp = y/p-(n-y)/(1-p) = 0

which (for 0<y<n) gives the mle y/n = x̅. Also

d²l/dp² = -y/p²-(n-y)/(1-p)² < 0

Here we verified that the extremum is really a maximum. Recall that we previously said that the log-likelihood of the normal distribution as a function of μ is a quadratic opening downward, and that because of the central limit theorem the log-likelihood function of many distributions approaches that of a normal. For this reason an extremal point of the log-likelihood is (almost) always a maximum. There are exceptions, though:
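The Bernoulli result can also be checked numerically. The following sketch evaluates the log-likelihood y·log(p)+(n-y)·log(1-p) on a grid and picks the largest value; the data are made up (6 successes in 10 trials) and the grid search is just a crude stand-in for calculus.

```python
import math

def bernoulli_loglik(p, x):
    """Log-likelihood of iid Ber(p) data x, for 0 < p < 1."""
    y = sum(x)
    n = len(x)
    return y * math.log(p) + (n - y) * math.log(1 - p)

x = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]          # made-up data
grid = [k / 1000 for k in range(1, 1000)]   # p = 0.001, ..., 0.999
p_best = max(grid, key=lambda p: bernoulli_loglik(p, x))
print(p_best, sum(x) / len(x))
```

The grid maximizer agrees with the sample mean, as the derivation says it should.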

Example: say X₁, ..,X_n are iid U[0,θ], θ>0. Then

L(θ|x) = ∏(1/θ)I(0≤x_i≤θ) = 1/θⁿ if θ≥max(x_i), and 0 otherwise

Now L(θ|x) is 0 on (0,max(x_i)), at max(x_i) it jumps to 1/(max(x_i))ⁿ and then monotonically decreases as θ gets bigger, so the maximum is attained at θ=max(x_i). Therefore the mle is max(x_i).

Here is an example: say n=10 and max{x_i}=1.55, then the likelihood curve looks like this

[Figure: likelihood L(θ|x) plotted against θ]

and it has its maximum at 1.55
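The shape of this likelihood is easy to tabulate. The sketch below uses the values from the example (n=10, max{x_i}=1.55) and assumes that all observations are nonnegative, so that the likelihood depends on the data only through the maximum; the grid from 1.00 to 2.50 is an arbitrary choice.

```python
def uniform_likelihood(theta, x_max, n):
    """Likelihood of theta for iid U[0,theta] data with largest observation x_max."""
    if theta < x_max:
        return 0.0          # some observation would exceed theta
    return theta ** (-n)

n, x_max = 10, 1.55
grid = [k / 100 for k in range(100, 251)]            # theta = 1.00, 1.01, ..., 2.50
values = [uniform_likelihood(t, x_max, n) for t in grid]
theta_best = grid[values.index(max(values))]
print(theta_best)
```

The likelihood is 0 up to the largest observation, jumps there, and decreases afterwards, so the grid maximizer sits exactly at the jump.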

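Notice that here differentiating the log-likelihood is of no use because L(θ|x)=0 for values of θ just below the point where the maximum is attained, so the maximum is at a jump and not at a point where the derivative is 0.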

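Example X₁, .., X_n~N(μ,σ):

l(μ,σ) = -(n/2)log(2π)-n·log(σ)-(1/(2σ²))∑(x_i-μ)²

∂l/∂μ = (1/σ²)∑(x_i-μ) = 0

∂l/∂σ = -n/σ+(1/σ³)∑(x_i-μ)² = 0

so the mle's are x̅ for μ and (1/n)∑(x_i-x̅)² for σ².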

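Example say X has a multinomial distribution with parameters p₁,..,p_k (we assume m is known). The log-likelihood is

l(p₁,..,p_k) = c+x₁log(p₁)+..+x_klog(p_k)

where c does not depend on the p_i. If we simply find the derivatives of the log-likelihood we find

∂l/∂p_i = x_i/p_i = 0, i=1,..,k

and this system has no solution. The problem is that we are ignoring the condition p₁+..+p_k=1. So we really have the problem

Maximize l(p₁,..,p_k) given p₁+..+p_k=1

One way to do this is with the method of Lagrange multipliers: find the stationary points of

l(p₁,..,p_k)-λ(p₁+..+p_k-1):

x_i/p_i-λ = 0, i=1,..,k

so p_i = x_i/λ. Summing over i and using x₁+..+x_k=m gives 1 = m/λ, so λ=m and the mle of p_i is x_i/m.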
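The constrained problem has the well-known solution p_i = x_i/m, the observed proportions. As a rough numerical check (with made-up counts and an arbitrary perturbation size), the sketch below compares the log-likelihood at the observed proportions with the log-likelihood at a nearby point that still satisfies p₁+..+p_k=1.

```python
import math

def multinomial_loglik(p, x):
    """Multinomial log-likelihood, leaving out the constant that is free of p."""
    return sum(xi * math.log(pi) for xi, pi in zip(x, p) if xi > 0)

x = [12, 7, 1]                     # made-up counts, m = 20
m = sum(x)
p_hat = [xi / m for xi in x]       # observed proportions

# Shift a little probability from the first cell to the second; the sum stays 1.
eps = 0.01
p_other = [p_hat[0] - eps, p_hat[1] + eps, p_hat[2]]
better = multinomial_loglik(p_hat, x) > multinomial_loglik(p_other, x)
print(p_hat, better)
```

Any other admissible perturbation should give the same verdict; one comparison is of course not a proof.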

Maximum likelihood estimators have a number of nice properties. One of them is their invariance under transformations. That is, if θ̂ is the mle of θ, then g(θ̂) is the mle of g(θ).

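Example say X₁, .., X_n~Ber(p) so we know that the mle is X̅. Say we are interested in θ=p-q=p-(1-p)=2p-1, the difference in proportions. Therefore 2X̅-1 is the mle of θ.

Let's see whether we can verify that. First if θ=2p-1 we have p=(1+θ)/2 and 1-p=(1-θ)/2, and so with y=∑x_i

l(θ) = y·log((1+θ)/2)+(n-y)·log((1-θ)/2)

dl/dθ = y/(1+θ)-(n-y)/(1-θ) = 0

so y(1-θ) = (n-y)(1+θ), which gives the mle (2y-n)/n = 2x̅-1.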

Example say X₁, .., X_n~N(μ,σ) so we know that the mle of σ² is (1/n)∑(X_i-X̅)² (with divisor n, not n-1). But then the mle of σ is its square root, √[(1/n)∑(X_i-X̅)²].

Definition

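Let X=(X₁,..,X_n) be a random sample from some pmf (pdf) f(.,θ). Then the quantity

I(θ) = E[(∂/∂θ log f(X|θ))²]

where f(X|θ) is the joint pmf (pdf) of the sample, is called the Fisher information number of the sample.

If the sample is identically distributed things simplify:

I(θ) = n·E[(∂/∂θ log f(X₁|θ))²]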

We said before that the log-likelihood function of many distributions approaches that of a normal. We can now make this statement precise:

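Theorem Let X₁, .., X_n be iid f(x|θ). Let θ̂ denote the mle of θ. Under some regularity conditions on f we have

√n[θ̂-θ]→N(0,1/√v)

where v = E[(∂/∂θ log f(X₁|θ))²] is the Fisher information number of a single observation. (As before the second parameter of the normal is the standard deviation, so the limiting variance is 1/v.)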

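Example say X₁, .., X_n are iid Ber(p) and we want to estimate p. We saw above that the mle is given by X̅. Now

log f(x|p) = x·log(p)+(1-x)·log(1-p)

d/dp log f(x|p) = x/p-(1-x)/(1-p) = (x-p)/(p(1-p))

v = E[(X-p)²]/(p(1-p))² = 1/(p(1-p))

so for large n √n[X̅-p] is approximately normal with mean 0 and standard deviation √(p(1-p)), just as the central limit theorem says.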
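The asymptotic statement can be illustrated by simulation. The sketch below assumes that the Fisher information of a single Ber(p) observation is v=1/(p(1-p)) and that the limiting normal has standard deviation 1/√v; the values p=0.3, n=200 and the number of repetitions are arbitrary.

```python
import math
import random

def sd_of_scaled_error(p=0.3, n=200, reps=5000, seed=3):
    """Sample standard deviation of sqrt(n)*(xbar - p) over many Ber(p) samples."""
    rng = random.Random(seed)
    z = []
    for _ in range(reps):
        xbar = sum(rng.random() < p for _ in range(n)) / n
        z.append(math.sqrt(n) * (xbar - p))
    mean_z = sum(z) / reps
    return math.sqrt(sum((v - mean_z) ** 2 for v in z) / (reps - 1))

p = 0.3
fisher_info = 1 / (p * (1 - p))            # assumed v for one observation
simulated_sd = sd_of_scaled_error(p)
theoretical_sd = 1 / math.sqrt(fisher_info)
print(simulated_sd, theoretical_sd)
```

The two numbers should be close; here the agreement is no surprise because for the Bernoulli the theorem reduces to the central limit theorem.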

Example say X_i~Exp(λ), i=1,2,.. and the X_i are independent. Then we saw before that if S_n=(X₁+..+X_n)

l(λ;x) = nlog(λ)-λS_n .

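Now

dl/dλ = n/λ-S_n = 0

so the mle is n/S_n = 1/X̅. Also d²l/dλ² = -n/λ², so the Fisher information number of a single observation is v=1/λ², and near the mle the log-likelihood is close to the parabola

l(λ̂;x)-n(λ-λ̂)²/(2λ̂²)

where λ̂=n/S_n.

Here are 4 examples with n=10, λ=1. The true log-likelihood curve is in black and the parabola from the normal approximation is in blue.

[Figure: four panels, each showing the log-likelihood curve and its parabola approximation]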
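Since the plots are not reproduced here, the comparison can be done numerically. The sketch below simulates one sample with n=10 and λ=1 and evaluates both the log-likelihood n·log(λ)-λS_n and its second-order Taylor expansion around the mle n/S_n; taking the parabola to be exactly this Taylor expansion is an assumption of this illustration.

```python
import math
import random

def exp_loglik(lam, n, s_n):
    """Log-likelihood of iid Exp(lam) data with sum s_n."""
    return n * math.log(lam) - lam * s_n

def quadratic_approx(lam, n, s_n):
    """Second-order Taylor expansion of the log-likelihood around the mle."""
    lam_hat = n / s_n
    return exp_loglik(lam_hat, n, s_n) - n * (lam - lam_hat) ** 2 / (2 * lam_hat ** 2)

rng = random.Random(4)
n = 10
s_n = sum(rng.expovariate(1.0) for _ in range(n))
lam_hat = n / s_n
for factor in (0.5, 0.9, 1.0, 1.1, 1.5):
    lam = factor * lam_hat
    print(round(lam, 3), round(exp_loglik(lam, n, s_n), 3), round(quadratic_approx(lam, n, s_n), 3))
```

Close to the mle the two columns nearly agree; further away the parabola and the true curve drift apart, which is what one would expect for n as small as 10.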

Bayesian Point Estimation

We have already seen how to use a Bayesian approach to find point estimators, namely by using the mean of the posterior distribution. Of course one could also use the median or any other measure of central tendency. A popular choice, for example, is the mode of the posterior distribution.

Example Let's say we have X₁, .., X_n~Ber(p) and p~Beta(α,β), then we already know that

p|x₁,..x_n~Beta(α+∑x_i,n-∑x_i+β)

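and so we can estimate p as follows: using the posterior mean we get

(α+∑x_i)/(α+β+n)

and using the posterior mode (if α+∑x_i>1 and n-∑x_i+β>1)

(α+∑x_i-1)/(α+β+n-2)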
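Both posterior summaries follow from the standard formulas for a Beta(a,b) distribution, mean a/(a+b) and mode (a-1)/(a+b-2) when a,b>1. The sketch below applies them to the posterior Beta(α+∑x_i, n-∑x_i+β); the data and the uniform prior α=β=1 are made up for illustration.

```python
def beta_posterior_estimates(x, alpha, beta):
    """Posterior mean and mode of p for Ber(p) data x and a Beta(alpha, beta) prior."""
    n = len(x)
    s = sum(x)
    a = alpha + s            # first posterior parameter
    b = beta + n - s         # second posterior parameter
    post_mean = a / (a + b)
    # The mode formula is only valid when both parameters exceed 1.
    post_mode = (a - 1) / (a + b - 2) if a > 1 and b > 1 else None
    return post_mean, post_mode

x = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]     # made-up data, 6 successes in 10 trials
post_mean, post_mode = beta_posterior_estimates(x, alpha=1.0, beta=1.0)
print(post_mean, post_mode)
```

With the uniform prior the posterior mode coincides with the mle x̅, while the posterior mean is pulled slightly towards 1/2.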

Example: A switchboard records the number of phone calls it receives per hour during the day:

1 5 4 4 3 1 5 5 5 6 7 5 4 4 4 2 4 7 0 3 4 9 4 2

Let's say they believe that the number of calls has a Poisson distribution with rate λ.

a) Find the maximum likelihood estimator of the rate λ

l(λ) = log(λ)∑x_i-nλ-∑log(x_i!)

dl/dλ = ∑x_i/λ-n = 0

so the mle is X̅, and we find X̅ = 98/24 = 4.08

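b) Find the Bayes estimator for λ as the median of the posterior distribution. Use as a prior π(λ)=1/λ, λ>0

This is of course an improper prior because ∫π(λ)dλ=∞

Now

π(λ|x) ∝ λ^{∑x_i}e^{-nλ}·(1/λ) = λ^{∑x_i-1}e^{-nλ}

which is the kernel of a Gamma distribution with shape ∑x_i=98 and rate n=24. Its median has no closed form but is easily found numerically: it is about 4.07, and so we get essentially the same answer!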
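Both parts of the example can be reproduced with a few lines of code. The sketch below assumes that the posterior under the prior 1/λ is a Gamma distribution with shape ∑x_i and rate n. Because the shape is an integer here, the Gamma cdf can be written as a Poisson tail probability, so only the standard library is needed; the bisection bounds and tolerance are arbitrary choices.

```python
import math

calls = [1, 5, 4, 4, 3, 1, 5, 5, 5, 6, 7, 5, 4, 4, 4, 2, 4, 7, 0, 3, 4, 9, 4, 2]
n, total = len(calls), sum(calls)
mle = total / n                      # sample mean

def gamma_cdf_integer_shape(lam, shape, rate):
    """P(Gamma(shape, rate) <= lam) for integer shape, via P(Poisson(rate*lam) >= shape)."""
    mu = rate * lam
    if mu <= 0:
        return 0.0
    # Sum of Poisson probabilities for 0, ..., shape-1, each computed on the log scale.
    below = sum(math.exp(-mu + j * math.log(mu) - math.lgamma(j + 1)) for j in range(shape))
    return 1.0 - below

def posterior_median(shape, rate, lo=0.0, hi=50.0, tol=1e-10):
    """Bisection for the point where the posterior cdf equals 1/2."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gamma_cdf_integer_shape(mid, shape, rate) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

median = posterior_median(total, n)
print(round(mle, 3), round(median, 3))
```

The posterior median comes out slightly below the mle, which is typical for a right-skewed posterior, but the difference is negligible for these data.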