Definition
An estimator θ̂ = T(X1, .., Xn) is called unbiased for θ if
E[θ̂] = θ
B(θ̂) = E[θ̂] - θ
is called the bias of the estimator T.
Example: say X1, ..,Xn are iid with mean μ and standard deviation σ. Then
E[X̅] = E[(1/n)∑Xi] = (1/n)∑E[Xi] = (1/n)∑μ = (1/n)·nμ = μ
so X̅ is an unbiased estimator of μ.
How about the variance σ²? Here we will use as an estimator the sample variance
S² = 1/(n-1)·∑(Xi - X̅)²
A short calculation shows that E[S²] = σ², so S² is an unbiased estimator of σ², and this of course explains why we use "n-1" instead of "n".
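Here is a small simulation in Python (not part of the original notes) illustrating both facts; the values μ = 5, σ = 2 and n = 10 are arbitrary choices:

```python
import numpy as np

# Estimate E[X-bar], E[S^2] and E[(1/n)*sum(Xi - X-bar)^2] by averaging
# over many simulated samples.
rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)      # divides by n-1 (the sample variance S^2)
s2_biased = x.var(axis=1)       # divides by n

print(np.mean(xbar))        # ~ 5.0 = mu, so X-bar is unbiased
print(np.mean(s2))          # ~ 4.0 = sigma^2, so S^2 is unbiased
print(np.mean(s2_biased))   # ~ 3.6 = (n-1)/n * sigma^2, biased low
```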
How about the standard deviation σ? Unfortunately
E[√S²] ≠ √E[S²]
(actually, a famous inequality called Jensen's inequality says E[√S²] ≤ √E[S²])
so S = √S² is NOT an unbiased estimator of the standard deviation σ. In fact, in this generality nothing more can be said. However, if we also assume that the Xi come from a normal distribution it can be shown (Holtzman 1950) that
E[cnS]=σ
where
cn = √((n-1)/2)·Γ((n-1)/2)/Γ(n/2)
Surprised that the Gamma function shows up here? Remember that if X1, .., Xn is iid Normal, (n-1)S²/σ² has a chi-square distribution with n-1 degrees of freedom, and the chi-square is a special case of the Gamma distribution.
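As a quick numerical sanity check (a Python sketch, with the arbitrary choices σ = 2 and n = 5), we can compute cn via the log-Gamma function and verify E[cn·S] ≈ σ by simulation:

```python
import numpy as np
from scipy.special import gammaln

# c_n = sqrt((n-1)/2) * Gamma((n-1)/2) / Gamma(n/2), computed on the log scale
def cn(n):
    return np.sqrt((n - 1) / 2) * np.exp(gammaln((n - 1) / 2) - gammaln(n / 2))

rng = np.random.default_rng(1)
sigma, n, reps = 2.0, 5, 200_000
s = rng.normal(0.0, sigma, size=(reps, n)).std(axis=1, ddof=1)

print(np.mean(s))            # noticeably below sigma = 2 for small n
print(cn(n) * np.mean(s))    # ~ 2.0
```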
Example: say X1, ..,Xn are iid U[0,θ]. Find an unbiased estimator of θ.
We have Xi ≤ θ for all i, so max{Xi} ≤ θ as well. In fact we would expect the largest of the Xi to be close to θ. So let's try as an estimator T = max{Xi}.
T has cdf P(T ≤ t) = P(X1 ≤ t, .., Xn ≤ t) = (t/θ)^n for 0 ≤ t ≤ θ, so its density is n·t^(n-1)/θ^n and
E[T] = ∫ t·n·t^(n-1)/θ^n dt = n/(n+1)·θ    (the integral running from 0 to θ)
so T is not an unbiased estimator of θ, but of course it is easy to make it so, just use (n+1)/n·max{Xi}!
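A short simulation (Python, with the arbitrary values θ = 3 and n = 10) confirms this:

```python
import numpy as np

# For U[0, theta], (n+1)/n * max(X_i) should be unbiased, while max(X_i) is not.
rng = np.random.default_rng(2)
theta, n, reps = 3.0, 10, 100_000

x = rng.uniform(0, theta, size=(reps, n))
t = x.max(axis=1)

print(np.mean(t))                    # ~ n/(n+1)*theta = 2.727
print(np.mean((n + 1) / n * t))      # ~ 3.0 = theta
```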
A general way to construct estimators is the method of moments. Let X = (X1, .., Xn) be a sample from a distribution with pmf (pdf) f(x|θ1,..,θk).
Definition
the ith sample moment is defined by
mi = (X1^i + .. + Xn^i)/n
Analogously define the ith population moment by
μi = E[X1^i]
Of course μi is a function of the θ1,..,θk. So we can find estimators of θ1,..,θk by solving the system of k equations in k unknowns
mi = μi, i = 1, .., k
Example: say X1, ..,Xn are iid U[0,θ]
we have
μ1 = E[X1] = θ/2
so we get
m1 = μ1 = θ/2
or θ̂ = 2m1 = 2X̅
Example: say X1, .., Xn are iid N(μ,σ). Here μ1 = E[X1] = μ and μ2 = E[X1²] = σ² + μ², so solving m1 = μ and m2 = σ² + μ² gives the method of moments estimators μ̂ = m1 = X̅ and σ̂² = m2 - m1² = (1/n)∑(Xi - X̅)².
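These two estimators are easy to try out; here is a brief Python sketch on simulated data (the values μ = 10, σ = 3, n = 500 are arbitrary):

```python
import numpy as np

# Method of moments for N(mu, sigma) on simulated data.
rng = np.random.default_rng(3)
mu, sigma, n = 10.0, 3.0, 500
x = rng.normal(mu, sigma, n)

m1 = x.mean()               # first sample moment
m2 = np.mean(x**2)          # second sample moment

mu_hat = m1
sigma2_hat = m2 - m1**2     # = (1/n) * sum((x - xbar)^2)

print(mu_hat, np.sqrt(sigma2_hat))   # close to 10 and 3
```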
Next we turn to maximum likelihood estimation. The idea here is this: the likelihood function gives the likelihood (not the probability!) of a value of the parameter given the observed data, so why not choose the value of the parameter that "matches" the observed data best, that is, the one that gives the greatest likelihood.
Example: say X1, .., Xn are iid Ber(p). First notice that a positive function f has an extremal point at x iff log(f) does as well because
d/dx{log(f(x))}=f'(x)/f(x)=0 iff f'(x)=0
Now
L(p|x) = ∏ p^xi·(1-p)^(1-xi) = p^∑xi·(1-p)^(n-∑xi)
l(p|x) = log L(p|x) = ∑xi·log(p) + (n-∑xi)·log(1-p)
dl/dp = ∑xi/p - (n-∑xi)/(1-p) = 0
which gives p̂ = ∑xi/n = X̅. Moreover d²l/dp² = -∑xi/p² - (n-∑xi)/(1-p)² < 0.
Here we verified that the extremum is really a maximum. Recall that we previously said that the log-likelihood of the normal distribution as a function of μ is a quadratic opening downward, and that because of the central limit theorem the log-likelihood function of many distributions approaches that of a normal. For this reason the log-likelihood function (almost) always has a maximum at the mle. There are exceptions, though:
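A quick Python check (with a hypothetical 0/1 sample) that numerical maximization of this log-likelihood reproduces p̂ = X̅:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Numerically maximize the Bernoulli log-likelihood and compare with x-bar.
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])   # hypothetical 0/1 data
n, s = len(x), x.sum()

def neg_loglik(p):
    return -(s * np.log(p) + (n - s) * np.log(1 - p))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())    # both ~ 0.7
```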
Example: say X1, .., Xn are iid U[0,θ], θ>0. Then
L(θ|x) = ∏ (1/θ)·I[0 ≤ xi ≤ θ] = (1/θ^n)·I[θ ≥ max(xi)]
Now L(θ|x) is 0 on (0, max(xi)), at max(xi) it jumps to 1/(max(xi))^n and then decreases monotonically as θ gets bigger, so the maximum is attained at θ = max(xi); therefore the mle is θ̂ = max{xi}.
Here is an example: say n = 10 and max{xi} = 1.55; then the likelihood curve is 0 to the left of 1.55 and equal to 1/θ^10 from there on, so it has its maximum at 1.55.
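The curve can be reproduced with a few lines of Python (a sketch of the graph described above, not the original figure):

```python
import numpy as np
import matplotlib.pyplot as plt

# U[0, theta] likelihood for n = 10 and max(x_i) = 1.55:
# 0 for theta < 1.55, then 1/theta^10.
n, m = 10, 1.55
theta = np.linspace(1.0, 3.0, 400)
L = np.where(theta >= m, theta**(-n), 0.0)

plt.plot(theta, L)
plt.axvline(m, linestyle="--")
plt.xlabel("theta")
plt.ylabel("L(theta | x)")
plt.show()
```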
Notice that here taking logarithms is of no use because L(θ|x) = 0 for values of θ just below the point where the maximum is attained; the maximum is not found by differentiation.
Example X1, .., Xn~N(μ,σ): here the log-likelihood is
l(μ,σ|x) = -n/2·log(2πσ²) - 1/(2σ²)·∑(xi-μ)²
Setting the partial derivatives with respect to μ and σ² equal to 0 gives the mles
μ̂ = X̅ and σ̂² = (1/n)∑(xi-X̅)²
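Here is a Python sketch (simulated data with the arbitrary true values μ = 1, σ = 2) checking the closed-form mles against a direct numerical maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.normal(1.0, 2.0, size=50)
n = len(x)

def neg_loglik(par):
    # parametrize by (mu, log sigma) so the optimizer is unconstrained
    mu, log_sigma = par
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((x - mu)**2) / (2 * sigma2)

res = minimize(neg_loglik, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

print(mu_hat, x.mean())                           # agree
print(sigma_hat**2, np.mean((x - x.mean())**2))   # agree
```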
Example: say X = (X1, .., Xk) has a multinomial distribution with parameters p1,..,pk (we assume m is known). The log-likelihood is
l(p1,..,pk) = log(m!/(x1!··xk!)) + ∑ xj·log(pj)
so if we simply find the derivatives of the log-likelihood we find
∂l/∂pj = xj/pj = 0, j = 1,..,k
and this system has no solution. The problem is that we are ignoring the condition p1+..+pk=1. So we really have the problem
Maximize l(p1,..,pk) given p1+..+pk=1
One way to do this is with the method of Lagrange multipliers: maximize
l(p1,..,pk) - λ(p1+..+pk-1)
Setting the partial derivatives equal to 0 gives xj/pj - λ = 0, or pj = xj/λ. Summing over j and using p1+..+pk = 1 and x1+..+xk = m shows λ = m, so the mle is p̂j = xj/m.
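A quick numerical sanity check of p̂j = xj/m (hypothetical counts; the softmax reparametrization is just one convenient way to keep the pj on the simplex during the optimization):

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([12, 30, 58])          # hypothetical counts, m = 100
m = x.sum()

def neg_loglik(z):
    p = np.exp(z) / np.exp(z).sum()     # softmax enforces sum(p) = 1, p > 0
    return -np.sum(x * np.log(p))

res = minimize(neg_loglik, x0=np.zeros(len(x)))
p_hat = np.exp(res.x) / np.exp(res.x).sum()

print(p_hat)       # ~ [0.12, 0.30, 0.58]
print(x / m)       # closed form
```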
Maximum likelihood estimators have a number of nice properties. One of them is their invariance under transformations. That is, if θ̂ is the mle of θ, then g(θ̂) is the mle of g(θ).
Example say X1, .., Xn~Ber(p), so we know that the mle is X̅. Say we are interested in θ = p-q = p-(1-p) = 2p-1, the difference in proportions. Therefore 2X̅-1 is the mle of θ.
Let's see whether we can verify that. First if θ = 2p-1 we have p = (1+θ)/2 and so
l(θ|x) = ∑xi·log((1+θ)/2) + (n-∑xi)·log((1-θ)/2)
dl/dθ = ∑xi/(1+θ) - (n-∑xi)/(1-θ) = 0
which gives nθ = 2∑xi - n, or θ̂ = 2X̅ - 1, as claimed.
Example say X1, .., Xn~N(μ,σ), so we know that the mle of σ² is σ̂² = (1/n)∑(Xi-X̅)². But then the mle of σ is σ̂ = √σ̂².
Definition
Let X = (X1,..,Xn) be a random sample from some pmf (pdf) f(·|θ). Then the quantity
I(θ) = E[(∂/∂θ log f(X|θ))²]
is called the Fisher information number of the sample.
If the sample is identically (and independently) distributed things simplify: in that case
I(θ) = n·E[(∂/∂θ log f(X1|θ))²]
so the information in the sample is n times the information in a single observation.
We said before that the log-likelihood function of many distributions approaches that of a normal. We can now make this statement precise:
Theorem Let X1, .., Xn be iid f(x|θ). Let θ̂ denote the mle of θ. Under some regularity conditions on f we have
√n(θ̂ - θ) → N(0, 1/√v)
where v = E[(∂/∂θ log f(X1|θ))²] is the Fisher information number of a single observation.
Example say X1, .., Xn are iid Ber(p) and we want to estimate p. We saw above that the mle is given by X̅. Here
log f(x|p) = x·log(p) + (1-x)·log(1-p)
∂/∂p log f(x|p) = x/p - (1-x)/(1-p)
v = E[(X1/p - (1-X1)/(1-p))²] = 1/p + 1/(1-p) = 1/(p(1-p))
so √n(X̅ - p) → N(0, √(p(1-p))), which is just the central limit theorem for Bernoulli random variables.
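A small simulation (Python, arbitrary values p = 0.3 and n = 200) illustrating this limit:

```python
import numpy as np

# sqrt(n)*(p-hat - p) should have standard deviation close to sqrt(p*(1-p)).
rng = np.random.default_rng(5)
p, n, reps = 0.3, 200, 50_000

x = rng.binomial(1, p, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - p)

print(z.std())                  # ~ 0.458
print(np.sqrt(p * (1 - p)))     # = 0.458...
```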
Example say Xi~Exp(λ), i=1,2,.., and the Xi are independent. Then we saw before that with Sn = X1+..+Xn
l(λ;x) = nlog(λ)-λSn .
Now
dl/dλ = n/λ - Sn = 0, so the mle is λ̂ = n/Sn = 1/X̅
d²l/dλ² = -n/λ²
and expanding l around λ̂ gives the quadratic (normal) approximation
l(λ;x) ≈ l(λ̂;x) - n/(2λ̂²)·(λ - λ̂)²
Here are four examples with n = 10 and λ = 1; the true log-likelihood curve is in black and the parabola from the normal approximation is in blue.
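One such panel can be reproduced with a short Python sketch (simulated data, so the random draw will of course differ from the original figure):

```python
import numpy as np
import matplotlib.pyplot as plt

# Exponential log-likelihood vs. its quadratic approximation at the mle.
rng = np.random.default_rng(6)
n, lam_true = 10, 1.0
x = rng.exponential(1 / lam_true, n)
Sn = x.sum()
lam_hat = n / Sn

lam = np.linspace(0.2, 3.0, 400)
loglik = n * np.log(lam) - lam * Sn
quad = n * np.log(lam_hat) - lam_hat * Sn - n / (2 * lam_hat**2) * (lam - lam_hat)**2

plt.plot(lam, loglik, color="black", label="log-likelihood")
plt.plot(lam, quad, color="blue", label="normal approximation")
plt.legend()
plt.xlabel("lambda")
plt.show()
```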
We have already seen how to use a Bayesian approach to find point estimators, namely using the mean of the posterior distribution. Of course one could also use the median or any other measure of central tendency. A popular choice, for example, is the mode of the posterior distribution.
Example Let's say we have X1, .., Xn~Ber(p) and p~Beta(α,β), then we already know that
p|x1,..xn~Beta(α+∑xi,n-∑xi+β)
and so we can estimate p as follows: using the mean of the posterior distribution we get
p̂ = (α+∑xi)/(α+β+n)
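For example, in Python (with a hypothetical 0/1 sample and the arbitrary prior choice α = β = 2) the posterior mean, median and mode are:

```python
import numpy as np
from scipy.stats import beta

alpha, b = 2.0, 2.0                       # hypothetical Beta(alpha, beta) prior
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
n, s = len(x), x.sum()

a_post, b_post = alpha + s, b + n - s     # posterior is Beta(a_post, b_post)
print(a_post / (a_post + b_post))             # posterior mean
print(beta.median(a_post, b_post))            # posterior median
print((a_post - 1) / (a_post + b_post - 2))   # posterior mode
```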
Example: A switchboard keeps track of the number of phone calls it receives per hour during the day:
1 5 4 4 3 1 5 5 5 6 7 5 4 4 4 2 4 7 0 3 4 9 4 2
Let's say they believe that the number of calls has a Poisson distribution with rate λ.
a) Find the maximum likelihood estimator of the rate λ. The log-likelihood is
l(λ|x) = ∑xi·log(λ) - nλ - ∑log(xi!)
dl/dλ = ∑xi/λ - n = 0
so λ̂ = X̅, and we find X̅ = 98/24 = 4.08.
b) Find the Bayes estimator for λ as the median of the posterior distribution. Use as a prior π(λ) = 1/λ, λ>0.
This is of course an improper prior because ∫π(λ)dλ=∞
Now
π(λ|x) ∝ f(x|λ)·π(λ) ∝ λ^∑xi·e^(-nλ)·(1/λ) = λ^(∑xi-1)·e^(-nλ)
so the posterior is a Gamma distribution with shape ∑xi = 98 and rate n = 24. Its median is about 4.07, so we get essentially the same answer as the mle!
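Both numbers can be checked with a few lines of Python (scipy parametrizes the Gamma with shape a and scale = 1/rate):

```python
import numpy as np
from scipy.stats import gamma

# The switchboard data.
x = np.array([1, 5, 4, 4, 3, 1, 5, 5, 5, 6, 7, 5, 4, 4, 4, 2, 4, 7, 0, 3, 4, 9, 4, 2])
n, s = len(x), x.sum()

print(x.mean())                           # mle: 4.083...
print(gamma.median(a=s, scale=1 / n))     # posterior median: ~ 4.07
```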