Properties of Estimators

We will use the following notation: "X has density \(f(x|\theta)\)" indicates that the density depends on a parameter \(\theta\) (which could be a vector). For example, \(\theta=(\mu, \sigma)\) for the normal distribution.

Definition (4.1.1)

Say we have \(\pmb{X} = (X_1, ... ,X_n)\) with density \(f(x|\theta)\). Then any function of the data \(T(\pmb{X}) = T(X_1, ... ,X_n)\) is called a statistic. If it is meant to estimate \(\theta\) it is called an estimator of \(\theta\).

We will now discuss a number of properties of estimators. All these properties are equally important for Bayesians and Frequentists.

Unbiased Estimators

An estimator T is called unbiased for \(\theta\) if

\[E[T(X_1, ... ,X_n)]=\theta\]

Example (4.1.2)

say \(X_1, ... ,X_n\sim U[0,\theta]\).

We will consider two possible estimators, one based on the sample mean and another based on the maximum:

  1. \(E[\bar{X}]=E[X_1]=\theta/2\)

Let \(T_1(\pmb{x})=2\bar{x}\), then \(E[T_1]=\theta\), so \(T_1\) is unbiased.

  2. Consider \(T_2=\max\{X_i\}\). Recall that if \(X_i\sim U[0,\theta]\) we have \(X_i/\theta\sim U[0,1]\). Also we know that the order statistics of independent U[0,1] have a Beta distribution, and so \(T_2/\theta\sim Beta(n,1)\). Finally

\[E[T_2]=\theta E[T_2/\theta]=\theta\frac{n}{n+1}\] and so \(\frac{n+1}{n}T_2\) is unbiased.

Bias

The bias of an estimator is defined by

\[ \text{bias}(T) = E[T]-\theta \]

Example (4.1.3)

say \(X_1, ... ,X_n\sim U[0,\theta]\), then

\[ \begin{aligned} &\text{bias}(T_2) = ET_2-\theta = \\ &\theta\frac{n}{n+1}-\theta = -\frac{\theta}{n+1} \end{aligned} \]
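A quick simulation sketch can be used to check this formula; with \(n=10\) and \(\theta=1\) (arbitrary choices) the theoretical bias of \(T_2=\max\{X_i\}\) is \(-\theta/(n+1)\approx -0.091\):

n <- 10; theta <- 1; B <- 10000
x <- matrix(runif(n * B, 0, theta), B, n)   # each row is one simulated sample
T2 <- apply(x, 1, max)                      # maximum of each sample
mean(T2) - theta                            # estimated bias, should be close to -theta/(n+1) = -0.091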

Mean Square Error

Definition (4.1.4)

The mean square error of an estimator is defined by

\[ \text{MSE}(T) = E\left[ ||T(\pmb{X})-\theta||^2\right] \] where \(||.||\) is some norm.

Example (4.1.5)

\(X_1,..X_n \sim N(\mu,1)\) and \(T(\pmb{x})=\bar{x}\). Then

\[ \begin{aligned} &E\left[ ||T(\pmb{X})-\theta||^2\right] = \\ &E\left[ |\bar{X}-\mu|^2\right] = \\ &E\left[ (\bar{X}-\mu)^2\right] = \\ &var(\bar{X})=1/n \end{aligned} \]

Example (4.1.6)

\(X_1, ... ,X_n\sim U[0,\theta]\)

\[ \begin{aligned} &E\left[ ||T_2-\theta||^2\right] = \\ &E\left[ (T_2-\frac{n\theta}{n+1}+\frac{n\theta}{n+1}-\theta)^2\right] = \\ &E(T_2-\frac{n\theta}{n+1})^2+ 2(ET_2-\frac{n\theta}{n+1})(\frac{n\theta}{n+1}-\theta)+ (\frac{n\theta}{n+1}-\theta)^2 = \\ &var(T_2)+0+(E[T_2]-\theta)^2 = \\ &var(T_2)+\text{bias}(T_2)^2 \end{aligned} \]

It turns out that this is true in general:

\[\text{MSE}(T) = var(T)+\text{bias}(T)^2\]
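Here is a small simulation sketch of this decomposition, again for the \(U[0,\theta]\) example with \(T_2=\max\{X_i\}\) (the choices \(n=10\), \(\theta=1\) are arbitrary); the simulated mean square error and \(var(T_2)+\text{bias}(T_2)^2\) should agree closely:

n <- 10; theta <- 1; B <- 10000
x <- matrix(runif(n * B, 0, theta), B, n)
T2 <- apply(x, 1, max)
mse.sim <- mean((T2 - theta)^2)          # simulated mean square error
vb <- var(T2) + (mean(T2) - theta)^2     # variance + bias^2
round(c(mse.sim, vb), 4)                 # the two numbers should be nearly equal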

Because the mean square error combines both variance and bias it seems natural that a good estimator should have a small mean square error, and in fact many estimation methods attempt to minimize the mean square error.

From the formula it is clear that an estimator with a small bias might be acceptable if it also has a small variance, because then it still has a small mean square error.

This is often referred to as the bias-variance trade-off.

Stein’s phenomenon, James-Stein estimator

Say we have \(X_1,..X_n \sim N(\mu,\sigma)\), \(\sigma\) known, and \(T(\pmb{X})=\bar{X}\). Then it can be shown that no estimator has a uniformly smaller mean square error than T, that is, T is admissible.

Let’s consider the multidimensional version of this problem:

\(X_1,..X_n \sim N(\mu,\sigma \pmb{I})\), where \(\mu \in R^d\), \(\sigma\) is known and \(\pmb{I}\) is the identity matrix. We want to estimate \(\mu\).

Because the variance-covariance matrix is diagonal the covariances are zero, and so the components of each observation are independent. It seems therefore natural that the vector of sample means should be a good estimator of \(\mu\). It therefore came as a shock to statisticians when in 1956 Stein showed that for \(d>2\) this estimator is inadmissible, that is, no matter what \(n, \mu\) and \(\sigma\) are, there exists an estimator with a lower mean square error.

An example is the James-Stein estimator, given by

\[ \hat{\mu}_{JS}=\left( 1-\frac{(d-2)\sigma^2/n}{||\bar{x}||^2} \right)\bar{x} \] Notice that if \((d-2)\sigma^2/n<||\bar{x}||^2\), this shrinks the estimate towards 0. For this reason it is an example of a class of estimators called shrinkage estimators.

A consequence of the above discussion is the following counterintuitive result: when three or more unrelated parameters are measured, their total mean square error can be reduced by using a combined estimator such as the James-Stein estimator; whereas when each parameter is estimated separately, the least squares (LS) estimator is admissible.

An example would be estimating the age of the Universe, the GPA of the undergraduate students at the Colegio, and the average beer consumption in Germany, all together!
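The following simulation sketch illustrates Stein's phenomenon; the settings \(d=5\), \(n=10\), \(\sigma=1\) and the particular mean vector are arbitrary choices. It compares the total mean square error of the vector of sample means with that of the James-Stein estimator applied to \(\bar{\pmb{x}}\):

d <- 5; n <- 10; sigma <- 1; B <- 10000
mu <- rep(0.5, d)                        # true mean vector (illustrative choice)
mse.mean <- 0; mse.js <- 0
for (i in 1:B) {
  x <- matrix(rnorm(n * d, mu, sigma), n, d, byrow = TRUE)
  xbar <- colMeans(x)                    # vector of sample means
  shrink <- 1 - (d - 2) * sigma^2/n / sum(xbar^2)
  js <- shrink * xbar                    # James-Stein estimator
  mse.mean <- mse.mean + sum((xbar - mu)^2)
  mse.js <- mse.js + sum((js - mu)^2)
}
c(mse.mean, mse.js)/B                    # total MSEs; the James-Stein value should be smaller

The total mean square error of the James-Stein estimator should come out smaller, even though each coordinate of \(\bar{\pmb{x}}\) is a perfectly reasonable estimator on its own.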

Sufficient Statistics

Definition (4.1.7)

A statistic T is a sufficient statistic for \(\theta\) if the conditional distribution of the sample X given the value of T(X) does not depend on \(\theta\).

The meaning of “sufficient statistic” is that all the information about the parameter \(\theta\) is contained in T, so any inference about \(\theta\) (such as an estimator, a hypothesis test or a confidence interval) can be based on T.

Theorem (4.1.8)

If \(f(\pmb{x}\vert \theta)\) is the joint density of X and \(q(t\vert \theta)\) is the density of T(X), then T(X) is a sufficient statistic for \(\theta\) if, for every \(\pmb{x}\) in the sample space the ratio

\[f(\pmb{x}\vert \theta)/q(t\vert \theta)\]

is constant as a function of \(\theta\).

Example (4.1.9)

say \(X_1, ... ,X_n\sim Ber(p)\) (Here \(\theta=p\)). Let \(T(\pmb{x})=\sum x_i\). T is the number of “successes” in n independent Bernoulli trials, and so \(T\sim Bin(n,p)\). Now

\[f(\pmb{x}\vert p) = \prod_{i=1}^n \left(p^{x_i}(1-p)^{1-x_i}\right)=\] \[p^{\sum_{i=1}^n x_i}(1-p)^{n-\sum_{i=1}^n x_i}=p^{T(\pmb{x})}(1-p)^{n-T(\pmb{x})}\]

and

\[f(\pmb{x}\vert p)/q(t\vert p)=\frac{p^{T(\pmb{x})}(1-p)^{n-T(\pmb{x})}}{{{n} \choose {T(\pmb{x})}}p^{T(\pmb{x})}(1-p)^{n-T(\pmb{x})}} = \frac1{{{n} \choose {T(\pmb{x})}}}\]

We see that the ratio is a constant with respect to p and so T is a sufficient statistic for p.
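Here is a small simulation sketch of what this means: conditional on \(T=t\), the probability that (say) the first trial was a success is \(t/n\), no matter what p is. The choices \(n=5\), \(t_0=3\) and the two values of p below are arbitrary.

n <- 5; t0 <- 3; B <- 1e5
check <- function(p) {
  x <- matrix(rbinom(n * B, 1, p), B, n)      # B Bernoulli samples of size n
  x <- x[rowSums(x) == t0, , drop = FALSE]    # keep only the samples with T = t0
  mean(x[, 1])                                # estimate of P(X1 = 1 | T = t0)
}
round(c(check(0.2), check(0.7), t0/n), 3)     # both conditional probabilities should be near t0/n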

Theorem (4.1.10)

(Factorization Theorem)

Let \(f(\pmb{x}\vert \theta)\) be the joint density of X. A statistic T(X) is a sufficient statistic for \(\theta\) if and only if there exist functions \(g(t\vert \theta)\) and \(h(\pmb{x})\) such that for every \(\pmb{x}\) in the sample space and all values of the parameter we have

\[f(\pmb{x}\vert \theta)=g(t\vert \theta)h(\pmb{x})\]

Example (4.1.11)

say \(X_1, ...,X_n\sim N(\mu,\sigma)\) and we assume that \(\sigma\) is known, so \(\theta=\mu\). Then

\[ \begin{aligned} &f(\pmb{x}\vert \mu) =\prod_{i=1}^n \frac1{\sqrt{2\pi\sigma^2}}\exp \left\{-\frac1{2\sigma^2}\left(x_i-\mu\right)^2 \right\} =\\ &(2\pi\sigma^2)^{-n/2}\exp \left\{-\frac1{2\sigma^2}\sum_{i=1}^n\left(x_i-\mu\right)^2 \right\} =\\ &(2\pi\sigma^2)^{-n/2}\exp \left\{-\frac1{2\sigma^2}\left[\sum_{i=1}^n\left(x_i-\bar{x}\right)^2+n\left(\mu-\bar{x}\right)^2 \right]\right\} =\\ &(2\pi\sigma^2)^{-n/2}\exp \left\{-\frac1{2\sigma^2}\sum_{i=1}^n\left(x_i-\bar{x}\right)^2\right\}\exp \left\{-\frac{n}{2\sigma^2}\left(\mu-\bar{x}\right)^2\right\} =\\ &h(\pmb{x})g(t|\mu) \end{aligned} \]

where

\[ \begin{aligned} &t=\bar{x} \\ &h(\pmb{x}) =(2\pi\sigma^2)^{-n/2}\exp \left\{-\frac1{2\sigma^2}\sum_{i=1}^n\left(x_i-t\right)^2\right\} \\ &g(t|\mu) = \exp \left\{-\frac{n}{2\sigma^2}\left(\mu-t\right)^2\right\} \end{aligned} \]

and we see that the sample mean is a sufficient statistic for the population mean \(\mu\), at least if the variance is known.

Example (4.1.12)

say \(X_1, ...,X_n\sim Pois(\lambda)\). Then

\[f(\pmb{x}|\lambda) =\prod_{i=1}^n \frac{\lambda^{x_i}}{x_i!}e^{-\lambda} = \frac{\lambda^{\sum_{i=1}^nx_i}}{\prod_{i=1}^nx_i!}e^{-n\lambda}=h(\pmb{x})g(t|\lambda)\]

where

\[ \begin{aligned} &t=\sum_{i=1}^nx_i \\ &h(\pmb{x}) =\frac{1}{\prod_{i=1}^nx_i!} \\ &g(t|\lambda) = \lambda^{t} e^{-n\lambda} \end{aligned} \] so \(T(\pmb{x})=\sum x_i\) is a sufficient statistic for \(\lambda\).

Example (4.1.13)

Say X belongs to an exponential family, then

\[ f(\pmb{x}|\theta) =h(\pmb{x})\exp \left\{ \theta^T T(\pmb{x}) -A(\theta) \right\} \]

and clearly \(T(\pmb{x})\) is a sufficient statistic.
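As a quick check, the Bernoulli sample of example (4.1.9) can be written in this form, with canonical parameter \(\theta=\log\frac{p}{1-p}\):

\[\prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = \exp \left\{ \log\left(\frac{p}{1-p}\right)\sum_{i=1}^n x_i + n\log(1-p) \right\}\]

so \(h(\pmb{x})=1\), \(T(\pmb{x})=\sum x_i\) and \(A(\theta)=-n\log(1-p)=n\log(1+e^\theta)\), and again \(\sum x_i\) is a sufficient statistic.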

Ancillary Statistics

Definition (4.1.14)

A statistic S(X) whose distribution does not depend on \(\theta\) is called an ancillary statistic.

Example (4.1.15)

say \(X_1, ...,X_n\sim U[\theta,\theta+1]\) and let R be the range of the observations, that is \(R = \max\{x_i\} -\min\{x_i\}\). It can be shown that \(R \sim Beta(n-1,2)\) for all \(\theta\), and so its distribution is independent of \(\theta\).
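A quick simulation sketch of this: the range is simulated for two quite different values of \(\theta\) (here 0 and 5, with n=10, both arbitrary choices) and a few quantiles are compared.

n <- 10; B <- 10000
R1 <- apply(matrix(runif(n * B, 0, 1), B, n), 1, function(x) max(x) - min(x))   # theta = 0
R2 <- apply(matrix(runif(n * B, 5, 6), B, n), 1, function(x) max(x) - min(x))   # theta = 5
round(rbind(quantile(R1, c(0.1, 0.5, 0.9)),
            quantile(R2, c(0.1, 0.5, 0.9))), 3)   # the two rows should be nearly identical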

Example (4.1.16)

say \(X_1, ...,X_n\sim N(\mu,\sigma)\) and let s be the sample standard deviation, then

\[(n-1)s^2/\sigma^2\sim\chi^2(n-1)\]

and so its distribution does not depend on \(\mu\).


The usefulness of an ancillary statistic lies in the fact that we need not worry about the value of the unknown parameter when calculating probabilities.

Consistency

Definition (4.1.17)

A sequence of estimators \(T_n = T_n(X_1, ..,X_n)\) is a consistent sequence of estimators for \(\theta\) if \(T_n \rightarrow \theta\) in probability. That is, for every \(\epsilon>0\) and every \(\theta\) we have

\[\lim_{n\rightarrow\infty} P(|T_n-\theta|>\epsilon) = 0\]

Example (4.1.18)

By the WLLN if \(\mu=EX\) exists the sample mean is a consistent estimator of \(\mu\).

Example (4.1.19)

say \(X_1, .., X_n\sim U[0,\theta]\) and \(T_1=2\bar{x}\) . By the WLLN \(\bar{X}\rightarrow E[X]=\theta/2\), so \(T_1=2\bar{X} \rightarrow \theta\) in probability.

Let \(T(\pmb{x})=\frac{n+1}{n}\max\{x_1, .., x_n\}\). Now \(X_i/\theta\sim U[0,1]\), and so

\[M=\max\{X_1, .., X_n\}/\theta\sim Beta(n,1)\] Now

\[ \begin{aligned} &P(\vert T-\theta\vert<\epsilon) = \\ &P(\theta-\epsilon< T<\theta+\epsilon) = \\ &P\left(\frac{n}{n+1}\left(1-\epsilon/\theta\right)< M<\frac{n}{n+1}\left(1+\epsilon/\theta\right)\right) = \\ &P\left(\frac{n}{n+1}\left(1-\epsilon/\theta\right)< M\right) = \\ &1-P\left(M<\frac{n}{n+1}\left(1-\epsilon/\theta\right)\right) = \\ &1-\left(\frac{n}{n+1}\left(1-\epsilon/\theta\right)\right)^n\rightarrow 1-0=1 \end{aligned} \] if \(\epsilon<\theta\), so that \(1-\epsilon/\theta<1\). (In the fourth line the upper limit is dropped because \(M\le 1\) and \(\frac{n}{n+1}(1+\epsilon/\theta)>1\) once n is large enough.)

Recall that by Chebyshev’s Inequality we have

\[ P(\vert T-\theta\vert>\epsilon) = P\left(\vert T-\theta\vert>\frac{\epsilon}{\sigma}\sigma\right) \le \left(\frac{\sigma}{\epsilon}\right)^2 =\sigma^2/\epsilon^2 \] where \(\sigma^2=var(T)\) (here we use the fact that T is unbiased, so \(E[T]=\theta\)).
Now

\[ \begin{aligned} &\sigma^2 =var(T) = \\ &\left(\frac{n+1}{n}\right)^2 \theta^2var(M)=\\ &\left(\frac{n+1}{n}\right)^2 \theta^2\frac{n}{(n+1)^2(n+2)}=\\ &\frac{\theta^2}{n(n+2)} \end{aligned} \] and so

\[P(\vert T-\theta\vert>\epsilon) \le \frac{\theta^2}{\epsilon^2n(n+2)}\rightarrow 0\]

This approach works quite often; however, it has the disadvantage that it does not give a good idea of the rate of convergence. For example, here Chebyshev's inequality suggests a rate of \(n^{-2}\), but the previous calculation shows that the rate is actually exponential.
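To see the difference, here is a small sketch that evaluates the exact probability (using \(M\sim Beta(n,1)\) as above) and the Chebyshev bound, with \(\theta=1\) and \(\epsilon=0.1\) as arbitrary choices:

theta <- 1; eps <- 0.1
n <- c(10, 50, 100, 500)
a <- n/(n + 1) * (1 - eps/theta)      # limits for M from the derivation above
b <- n/(n + 1) * (1 + eps/theta)
exact <- 1 - (pbeta(pmin(b, 1), n, 1) - pbeta(a, n, 1))   # exact P(|T - theta| > eps)
cheby <- theta^2/(eps^2 * n * (n + 2))                    # Chebyshev bound
round(cbind(n, exact, cheby), 6)

The exact probability goes to 0 exponentially fast, while the bound only decreases like \(n^{-2}\).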

Example (4.1.20)

say \(X_1, ..., X_n\sim Geom(p)\) and let \(T(\pmb{x})=1/\bar{x}\). Note that \(E[X]=1/p\), and so

\[ \begin{aligned} &P(\vert T-p\vert<\epsilon) = \\ &P(p-\epsilon<1/\bar{X}<p+\epsilon) = \\ &P(\frac1{p+\epsilon}<\bar{X}<\frac1{p-\epsilon}) = \\ &P(\frac1{p+\epsilon}-\frac1{p}<\bar{X}-\frac1{p}<\frac1{p-\epsilon}-\frac1{p}) = \\ &P(-\frac{\epsilon}{p(p+\epsilon)}<\bar{X}-\frac1{p}<\frac{\epsilon}{p(p-\epsilon)}) \end{aligned} \] (assuming \(\epsilon<p\)). Note that \(p-\epsilon<p+\epsilon\), and so \(\frac1{p+\epsilon}<\frac1{p-\epsilon}\) and \(\frac1{p(p+\epsilon)}<\frac1{p(p-\epsilon)}\), therefore

\[ \begin{aligned} &P(\vert T-p\vert<\epsilon) \ge \\ &P(-\frac{\epsilon}{p(p+\epsilon)}<\bar{X}-\frac1{p}<\frac{\epsilon}{p(p+\epsilon)})= \\ &P(\vert \bar{X}-\frac1{p}\vert <\frac{\epsilon}{p(p+\epsilon)})\rightarrow 1 \end{aligned} \] by the weak law of large numbers, and so T is a consistent estimator of p.
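A small simulation sketch of this result (note that R's rgeom counts the number of failures before the first success, so 1 is added to obtain a geometric variable with mean 1/p; the values p=0.3 and \(\epsilon=0.05\) are arbitrary):

p <- 0.3; B <- 1000
for (m in c(10, 100, 1000)) {
  Tn <- replicate(B, 1/mean(rgeom(m, p) + 1))   # T = 1/xbar for each simulated sample
  print(c(m, mean(abs(Tn - p) > 0.05)))         # estimated P(|T - p| > 0.05); should go to 0
}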

Example (4.1.21)

say \(X_1, .., X_n\text{ iid }N(\mu,\sigma)\), \(\sigma\) known, and let \(M=\text{median}(\pmb{x})\). Assume wlog that n is odd. Let \(\phi(x;\mu)\) and \(\Phi(x;\mu)\) be the density and cdf of a normal distribution with mean \(\mu\). Then from the theory of order statistics we know that

\[ \begin{aligned} &f_M(x;\mu)= \\ &\frac{n!}{(\frac{n+1}2-1)!(n-\frac{n+1}2)!}\Phi(x;\mu)^{\frac{n+1}2-1}(1-\Phi(x;\mu))^{n-\frac{n+1}2}\phi(x;\mu) = \\ &\frac{n!}{(\frac{n-1}2)!^2}\Phi(x;\mu)^{\frac{n-1}2}(1-\Phi(x;\mu))^{\frac{n-1}2}\phi(x;\mu) \end{aligned} \]

now

\[ \begin{aligned} &P(|M-\mu|>\epsilon) = \\ &1-P(\mu-\epsilon<M<\mu+\epsilon) = \\ &1-\int_{\mu-\epsilon}^{\mu+\epsilon} \frac{n!}{(\frac{n-1}2)!^2}\Phi(x;\mu)^{\frac{n-1}2}(1-\Phi(x;\mu))^{\frac{n-1}2}\phi(x;\mu) dx \\ \end{aligned} \]

We use the change of variable \(t=\Phi(x;\mu)\), so \(dt=\phi(x;\mu)dx\). Also let Y be a random variable with distribution Beta\((\frac{n+1}2, \frac{n+1}2)\), then

\[ \begin{aligned} & P(|M-\mu|>\epsilon) = \\ & 1-\frac{n!}{(\frac{n-1}2)!^2}\int_{\Phi(\mu-\epsilon;\mu)}^{\Phi(\mu+\epsilon;\mu)} t^\frac{n-1}2(1-t)^\frac{n-1}2dt = \\ & 1-\frac{n!}{(\frac{n-1}2)!^2}\frac{(\frac{n-1}2)!^2}{n!} \int_{\Phi(\mu-\epsilon;\mu)}^{\Phi(\mu+\epsilon;\mu)} \frac{\Gamma(\frac{n+1}2+\frac{n+1}2)}{\Gamma(\frac{n+1}2)\Gamma(\frac{n+1}2)} t^\frac{n-1}2(1-t)^\frac{n-1}2dt = \\ &1-P(\Phi(\mu-\epsilon;\mu)<Y<\Phi(\mu+\epsilon;\mu)) \end{aligned} \]

Note that \(\Phi(\mu-\epsilon;\mu)<\frac12\), so

\[\delta=\frac12-\Phi(\mu-\epsilon;\mu) = \Phi(\mu+\epsilon;\mu)-\frac12>0\]

also note that

\(E[Y]=\frac{\frac{n+1}2}{\frac{n+1}2+\frac{n+1}2}= \frac12\) and \(var(Y)=\frac{\frac{n+1}2\frac{n+1}2}{(\frac{n+1}2+\frac{n+1}2)^2(\frac{n+1}2+\frac{n+1}2+1)}=\frac1{4(n+2)}\)

and so

\[ \begin{aligned} &P(|M-\mu|>\epsilon) = \\ &1-P(\frac12-\delta<Y<\frac12+\delta) = \\ &P(|Y-\frac12|>\delta) \le \\ &\frac{Var[Y]}{\delta^2} = \frac1{4(n+2)\delta^2}\rightarrow 0 \end{aligned} \]

where the inequality follows from Chebyshev’s inequality, and so the median is a consistent estimator of the mean \(\mu\).

Efficiency

Definition (4.1.22)

Say we have a sample \(X_1, ...,X_n\) with density \(f(x\vert \theta)\), and we have two unbiased estimators \(T_1\) and \(T_2\) of \(\theta\). The efficiency of \(T_1\) relative to \(T_2\) is defined by \(\text{eff}(T_1\vert T_2) = var(T_1)/var(T_2)\), and we say that \(T_1\) is more efficient than \(T_2\) if \(\text{eff}(T_1\vert T_2)<1\).

Example (4.1.23)

let’s look again at the example above: we have \(X_1, ...,X_n\sim U[0,\theta]\). We found that \(T_1 = 2\bar{x}\) and \(T_2 = (n+1)/n\max \{x_1, ..,x_n\}\) are unbiased estimators of \(\theta\). Now

\[var(T_1) =var(2\bar{X}) =4var(X_1)/n = 4(\theta^2/12)/n=\theta^2/(3n)\] and

\[ \begin{aligned} &var(T_2) =var\left(\frac{n+1}{n}\theta M\right) =\\ &\left(\frac{(n+1)\theta}{n}\right)^2 var(M) = \left(\frac{(n+1)\theta}{n}\right)^2\frac{n}{(n+1)^2(n+2)} = \\ &\frac{\theta^2}{n(n+2)} \end{aligned} \] and so

\[eff(T_1\vert T_2) =var(T_1)/var(T_2) = \frac{\theta^2/(3n)}{\frac{\theta^2}{n(n+2)}}=\frac{n+2}{3}>1\]

for all \(n>1\). So we find that \(T_2\) is more efficient than \(T_1\) for every value of \(\theta\).

Let’s do a little simulation:

library(ggplot2)
n <- 10; theta <- 1; B <- 1000
x <-  matrix(runif(n * B, 0, theta), B, n)    # each row is one simulated sample
T1 <-  2 * apply(x, 1, mean)                  # estimator based on the sample mean
T2 <-  (n + 1)/n * apply(x, 1, max)           # estimator based on the maximum
df <- data.frame(
  x = c(T1, T2), 
  y = c(rep(1, B), rep(2, B)))          
ggplot(df, aes(x=x)) + 
    geom_histogram(data = subset(df, y == 1), 
        fill = "red", alpha = 0.2) +
    geom_histogram(data = subset(df, y == 2), 
        fill = "blue", alpha = 0.2) 

print("Means")
## [1] "Means"
c(mean(T1), mean(T2))
## [1] 0.9973566 1.0002437
print("Relative Efficiency, estimated and true")
## [1] "Relative Efficiency, estimated and true"
c(var(T1)/var(T2), (n + 2)/3)
## [1] 3.754061 4.000000

Example (4.1.24)

say \(X_1, .., X_n\sim N(\mu,1)\), and let \(T_1=\text{median}(\pmb{x})\) and \(T_2=\bar{x}\). Then \(var(T_2)=1/n\). \(var(T_1) =E[T_1^2] -(E[T_1])^2\). Now

\[ \begin{aligned} &f_M(x) = \\ &\frac{n!}{(\frac{n+1}2-1)!(n-\frac{n+1}2)!}\Phi(x)^{\frac{n+1}2-1}(1-\Phi(x))^{n-\frac{n+1}2}\phi(x) = \\ &\frac{n!}{[(n-1)/2]!^2}\left[\Phi(x)(1-\Phi(x))\right]^{(n-1)/2}\phi(x) \\ &E[M^k] = \int_{-\infty}^{\infty} x^k f_M(x) dx\\ \end{aligned} \]

and this seems a bit ugly. Let’s try and do this numerically:

# density of the sample median M for a N(0,1) sample of (odd) size n
# (the variance of the median does not depend on mu, so we can take mu = 0)
fM <- function(x, n)
    factorial(n)/factorial((n-1)/2)^2*
    (pnorm(x)*(1-pnorm(x)))^((n-1)/2)*dnorm(x)
# variance of the median, computed by numerical integration
var.median <- function(n) {
  E1 <- function(x) x*fM(x, n)
  E2 <- function(x) x^2*fM(x, n)
  integrate(E2, -3, 3)$value - 
    integrate(E1, -3, 3)$value^2
}
n <-  2*1:50+1                      # odd sample sizes 3, 5, ..., 101
y <-  0*n
for (i in 1:50) y[i] <- var.median(n[i])
df <- data.frame(n=n, y=y/(1/n))    # relative efficiency var(median)/var(mean)
ggplot(data=df, aes(n, y)) +
  geom_point() 

round(y[50]/(1/101), 3)
## [1] 1.564

It seems that for large n the relative efficiency is about 1.6; the theoretical limit as \(n\rightarrow\infty\) is \(\pi/2\approx 1.571\).

Generally it is quite possible that one estimator is more efficient than another only for a subset of the parameter space, and the other one is more efficient on the rest. Also, one estimator might be more efficient if n is small but things are reversed if n is large.


An interesting question is whether in a given problem there is an estimator that is more efficient than any other (unbiased) estimator. At least in the sense of minimum variance, such an estimator would be optimal. In order to answer this question we need the following:

Definition (4.1.25)

Let X be a random variable with density \(f(x\vert\theta)\). Then the Fisher Information of \(\theta\) is defined by

\[ I(\theta) = -E\left[\frac{d^2\log f(X\vert \theta)}{d\theta^2}\right] \]

Theorem (4.1.26)

Under suitable conditions

\[ I(\theta) = E\left[\left(\frac{d\log f(X\vert\theta)}{d\theta}\right)^2\right] \]

proof

We will do the proof for a continuous random variable. We will also assume that any interchange of integral and derivative is justified.

\[ \begin{aligned} &\frac{\partial^2}{\partial\theta^2}\log f(x\vert\theta) = \\ &\frac{\partial}{\partial\theta} \frac{\frac{\partial}{\partial\theta} f(x\vert\theta)}{f(x\vert\theta)}=\\ &\frac{(\frac{\partial^2}{\partial\theta^2} f(x\vert\theta))f(x\vert\theta)-(\frac{\partial}{\partial\theta} f(x\vert\theta))^2}{f^2(x\vert\theta)} = \\ &\frac{(\frac{\partial^2}{\partial\theta^2} f(x\vert\theta))}{f(x\vert\theta)} - \frac{(\frac{\partial}{\partial\theta} f(x\vert\theta))^2}{f^2(x\vert\theta)} = \\ &\frac{(\frac{\partial^2}{\partial\theta^2} f(x\vert\theta))}{f(x\vert\theta)} - (\frac{\partial}{\partial\theta} \log f(x\vert\theta))^2 \\ \end{aligned} \]

also

\[ \begin{aligned} &E\left[\frac{(\frac{\partial^2}{\partial\theta^2} f(x\vert\theta))}{f(x\vert\theta)} \right] = \\ &\int \frac{(\frac{\partial^2}{\partial\theta^2} f(x\vert\theta))}{f(x\vert\theta)} f(x\vert\theta)dx = \\ &\int \frac{\partial^2}{\partial\theta^2} f(x\vert\theta) dx = \\ &\frac{\partial^2}{\partial\theta^2} \int f(x\vert\theta) dx = \\ &\frac{\partial^2}{\partial\theta^2} 1=0 \end{aligned} \]

Taking expectations in the first identity and using this result gives \(-E\left[\frac{\partial^2}{\partial\theta^2}\log f(X\vert\theta)\right]=E\left[\left(\frac{\partial}{\partial\theta}\log f(X\vert\theta)\right)^2\right]\), which is the theorem.

Theorem (4.1.27)

Rao-Cramer

Let \(X_1, ...,X_n\) be a sample from density \(f(x\vert\theta)\), and let T be any estimator satisfying

  1. \(\frac{d}{d\theta} E[T(X)] = \int \frac{d}{d\theta} T(x)f(x\vert\theta) dx\)

  2. \(var(T(X))<\infty\)

then

\[var(T(X)) \ge \frac{(\frac{d}{d\theta} E[T(X)])^2}{nI(\theta)}\] The right hand side of this inequality is called the Rao-Cramer Lower Bound.


Note that if T is an unbiased estimator of \(\theta\), we have \(E[T(\pmb{X})]=\theta\) and the numerator is just 1.

Note that the Fisher Information is calculated for a single random variable. The sample size comes in by multiplying with n.


Does it make sense that the second derivative of the log-likelihood should come into play? Consider the following log-likelihood curves:

# likelihood curves for mu based on two samples of size 100:
# one with sigma = 0.1 (sharply peaked, black) and one with sigma = 1 (flat, blue)
xbar <- mean(rnorm(100))
ybar <- mean(rnorm(100, 0, 0.1))
curve(dnorm(ybar, x, 0.1/10), -0.3, 0.3, ylab="")
curve(dnorm(xbar, x, 1/10), -0.3, 0.3,
      add=TRUE, col="blue")

so the more peaked the log-likelihood curve is, the smaller the variance of the estimator. But the peakedness of a curve is found via the second derivative!

Example (4.1.28)

Let \(X_1, ...,X_n\sim N(\mu,\sigma)\) and consider estimating the standard deviation \(\sigma\), where \(\mu\) is unknown. The normal density satisfies the conditions of the theorem, and a calculation along the lines of (3.2.7) shows that

\[I(\sigma)=-E\left[\frac{d^2\log f(X|\sigma)}{d\sigma^2}\right]=\frac{2}{\sigma^2}\]

and so any unbiased estimator T of \(\sigma\) must satisfy

\[var(T) \ge\frac{1}{nI(\sigma)} = \frac{\sigma^2}{2n}\]

Example (4.1.29)

Let \(X_1, ..., X_n\sim Pois(\lambda)\). Now

\[ \begin{aligned} &\log f(x|\lambda) =\log \left(\frac{\lambda^x}{x!}e^{-\lambda} \right) = x\log\lambda-\log x! -\lambda\\ &\frac{d \log f(x\vert\lambda)}{d\lambda} =\frac{x}{\lambda}-1 \\ &\frac{d^2 \log f(x\vert\lambda)}{d\lambda^2} =-\frac{x}{\lambda^2} \\ &I(\lambda)=-E\left[\frac{d^2 \log f(X\vert\lambda)}{d\lambda^2}\right] =\\ &-E\left[-\frac{X}{\lambda^2}\right] = \frac{E[X]}{\lambda^2} = \frac{1}{\lambda}\\ \end{aligned} \] and so for any unbiased estimator T we have \(var(T)\ge \lambda/n\).

Note that

\[var(\bar{X})=var(X_1)/n=\lambda/n\] and so the sample mean is a uniformly minimum variance unbiased (UMVU) estimator for \(\lambda\).
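The two expressions for the Fisher information in theorem (4.1.26) can also be checked numerically for this example; here is a quick Monte Carlo sketch with the arbitrary choice \(\lambda=2\):

lambda <- 2; B <- 1e5
x <- rpois(B, lambda)
round(c(mean((x/lambda - 1)^2),    # E[(d/d lambda log f)^2]
        mean(x/lambda^2),          # -E[d^2/d lambda^2 log f]
        1/lambda), 4)              # theoretical value 1/lambda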

Example (4.1.30)

Again let’s look at the example of \(U[0,\theta]\) above. There we have \(f(x|\theta) = 1/\theta, 0 < x < \theta\). So

\[E\left[\left(\frac{\partial \log f(X\vert \theta)}{\partial \theta}\right)^2\right]=E\left[\left(\frac{\partial \log (1/\theta)}{\partial \theta}\right)^2\right]=\left(-\frac1{\theta}\right)^2=\frac1{\theta^2}\]

so it appears that the Rao-Cramer theorem says that for any unbiased estimator T we have \(var(T) \ge \theta^2/n\), but we have already seen that \(var(T_2) = \theta^2/(n(n+2)) < \theta^2/n\).

So, what goes wrong? Let’s check the condition of the theorem for n=1. Then \(T(X)=2X\), \(\frac{d}{d\theta}E[T(X)]=\frac{d}{d\theta}\theta=1\) but

\[ \begin{aligned} &\int_{-\infty}^\infty \frac{d}{d\theta}T(x)f(x\vert\theta)dx = \\ &\int_{-\infty}^\infty \frac{d}{d\theta}(2xI_{[0,\theta]}(x)\frac1{\theta})dx = \\ &\int_0^\theta \frac{d}{d\theta}(2x\frac1{\theta})dx = \\ &-\frac1{\theta^2}\int_0^\theta 2xdx = \\ &-\frac1{\theta^2}\theta^2 = -1\\ \end{aligned} \]

So here the first assumption of the theorem is not satisfied, something that happens quite often, especially if the parameter appears in the support of the density, such as \(0 < x < \theta\).

How about checking the theorem in a case where it does work?

Say \(X_1,..,X_n\sim N(\mu,\sigma)\), with \(\sigma\) known, and \(T(\pmb{x})=\bar{\pmb{x}}\). Now

\[ \begin{aligned} &\frac{d}{d\mu}\exp \left\{-\frac1{2\sigma^2} \sum_{i=1}^n (x_i-\mu)^2 \right\} = \\ &\exp \left\{-\frac1{2\sigma^2} \sum_{i=1}^n (x_i-\mu)^2 \right\}\left(\frac1{\sigma^2} \sum_{i=1}^n (x_i-\mu)\right) \\ &\frac{d}{d\mu} T(\pmb{x})f(\pmb{x}\vert\mu) = \\ &\left(\frac1n \sum_{i=1}^n x_i\right)(2\pi\sigma^2)^{-n/2}\exp \left\{-\frac1{2\sigma^2} \sum_{i=1}^n (x_i-\mu)^2 \right\}\left(\frac1{\sigma^2} \sum_{i=1}^n (x_i-\mu)\right) =\\ &\frac1{n\sigma^2}\left(\sum_{i=1}^n x_i\right)\left( \sum_{i=1}^n (x_i-\mu)\right)f(\pmb{x}\vert\mu)=\\ &\frac1{n\sigma^2}\left(\sum_{i,j=1}^n x_i(x_j-\mu)\right)f(\pmb{x}\vert\mu) = \\ &\frac1{n\sigma^2}\left(\sum_{i=1}^n x_i(x_i-\mu)f(\pmb{x}\vert\mu)+\sum_{i\ne j}^n x_i(x_j-\mu)f(\pmb{x}\vert\mu)\right) \end{aligned} \]

Now

\[ \begin{aligned} &\int_{-\infty}^\infty...\int_{-\infty}^\infty \sum_{i=1}^n x_i(x_i-\mu)f(\pmb{x}\vert\mu)d\pmb{x} = \\ &\sum_{i=1}^n\int_{-\infty}^\infty...\int_{-\infty}^\infty x_i(x_i-\mu)f(\pmb{x}\vert\mu)d\pmb{x} = \\ &\sum_{i=1}^n\int_{-\infty}^\infty x_i(x_i-\mu) \phi(x_i\vert\mu,\sigma) dx_i = \\ &\sum_{i=1}^n\left(E[X_i^2]-\mu E[X_i]\right) = \\ &\sum_{i=1}^n\left(var(X_i) + E[X_i]^2-E[X_i]^2\right) = n\sigma^2\\ \end{aligned} \]

where \(\phi(x_i\vert\mu,\sigma)\) is the density of a \(N(\mu,\sigma)\).

Also

\[ \begin{aligned} &\int_{-\infty}^\infty...\int_{-\infty}^\infty \sum_{i\ne j}^n x_i(x_j-\mu)f(\pmb{x}\vert\mu)d\pmb{x} = \\ &\sum_{i\ne j}^n\int_{-\infty}^\infty...\int_{-\infty}^\infty x_i(x_j-\mu)f(\pmb{x}\vert\mu)d\pmb{x} = \\ &\sum_{i\ne j}^n\int_{-\infty}^\infty \int_{-\infty}^\infty x_i(x_j-\mu) \phi(x_i\vert\mu,\sigma)\phi(x_j\vert\mu,\sigma) dx_i dx_j = \\ &\sum_{i\ne j}^n\left\{\int_{-\infty}^\infty x_i \phi(x_i\vert\mu,\sigma)dx_i\right\}\left\{\int_{-\infty}^\infty (x_j-\mu) \phi(x_j\vert\mu,\sigma)dx_j\right\} = \\ &\sum_{i\ne j}^n E[X_i]\left(E[X_j]-\mu\right) = 0\\ \end{aligned} \] and so

\[\int \frac{d}{d\mu} T(\pmb{x})f(\pmb{x}\vert\mu) d\pmb{x} = \frac1{n\sigma^2}n\sigma^2 =1\]

which agrees with \(\frac{d}{d\mu}E[T(\pmb{X})]=\frac{d}{d\mu}\mu=1\), so condition 1 of the theorem does hold here.

Robustness

In point estimation we start by assuming a parametric model for the data, such as \(X_1, ..., X_n\sim N(\mu,\sigma)\), and then try to estimate the parameters of the model. But what if our model is wrong, for example if the true model is a t distribution instead of the Normal? A robust estimator is one that does not depend too strongly on the assumed model.

Example (4.1.31)

Let \(X_1, ..., X_n\sim N(\mu,\sigma)\). It is known that the sample mean is the best estimator of \(\mu\) in the sense that it has the smallest variance of all unbiased estimators. But what happens if our assumption of the normal distribution is wrong?

Let’s consider instead a model called the \(\delta\)-contamination model:

\[X_i\sim\left\{ \begin{array}{ll} N(\mu,\sigma) & \text{with probability }1-\delta\\ f(x) & \text{with probability }\delta\\ \end{array} \right.\]

for some other density f. Suppose first we let f be any density with mean \(\theta\) and variance \(\tau^2\). Let \(Z_i\sim Ber(\delta)\) indicate whether \(X_i\) comes from f, then

\[ \begin{aligned} &E\{X_i\vert Z_i=1\} =\theta;\;E\{X_i\vert Z_i=0\} =\mu \\ &E[X_i] = E\left[E\{X_i\vert Z_i\}\right] = \\ &E\{X_i\vert Z_i=1\}P(Z_i=1)+E\{X_i\vert Z_i=0\}P(Z_i=0) = \\ &\theta\delta+\mu(1-\delta) \\ &E[X_i^2] = E\left[E\{X_i^2\vert Z_i\}\right] = \\ &E\{X_i^2\vert Z_i=1\}P(Z_i=1)+E\{X_i^2\vert Z_i=0\}P(Z_i=0) = \\ &(\tau^2+\theta^2)\delta+(\sigma^2+\mu^2)(1-\delta) \\ &var(X_i) = \\ &(\tau^2+\theta^2)\delta+(\sigma^2+\mu^2)(1-\delta) - \left(\theta\delta+\mu(1-\delta)\right)^2 =\\ &(1-\delta)\sigma^2+\delta\tau^2+(1-\delta)\delta(\mu-\theta)^2 \end{aligned} \] and so

\[var(\bar{X})=\left[(1-\delta)\sigma^2+\delta\tau^2+(1-\delta)\delta(\mu-\theta)^2\right]/n\]

Now if f is a Cauchy density we have \(\tau = \infty\) and so the variance of the sample mean is infinite as well!
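Here is a simulation sketch of this effect, comparing the sample mean with the sample median for normal samples contaminated with a small fraction of Cauchy observations (the choices \(n=50\) and \(\delta=0.1\) are arbitrary):

n <- 50; delta <- 0.1; B <- 10000
sim.mean <- rep(0, B); sim.median <- rep(0, B)
for (i in 1:B) {
  z <- rbinom(n, 1, delta)                      # 1 = observation comes from the Cauchy
  x <- ifelse(z == 1, rcauchy(n), rnorm(n))     # contaminated sample, centered at 0
  sim.mean[i] <- mean(x)
  sim.median[i] <- median(x)
}
c(var(sim.mean), var(sim.median))   # the mean's variance is typically huge and erratic

The sample median, on the other hand, remains well behaved.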

One way to measure the robustness of an estimator is as follows:

Definition (4.1.32)

Let \(T_n\) be a statistic. \(T_n\) has breakdown value b if up to a proportion b of the values in the sample can be moved to infinity without \(T_n\) becoming infinite.

Example (4.1.33)

Say \(T_n\) is the sample mean. Now

\[T_n = \frac1{n} \sum x_i = \frac1{n} \sum x_{[i]} \rightarrow \infty\]

if even a single observation, say the largest one \(x_{[n]}\), is moved to infinity, and so the sample mean has a breakdown value of 0.

Example (4.1.34)

Say \(T_n\) is the sample median. As long as fewer than half of the observations are moved to infinity the median remains finite, so \(T_n\) has a breakdown value of 1/2.
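A quick illustration sketch with n=20: moving a single observation far out ruins the mean but not the median, while moving half the sample also ruins the median.

x <- rnorm(20)
x1 <- x; x1[1] <- 1e10                  # move a single observation towards infinity
x2 <- x; x2[1:10] <- 1e10               # move half of the sample
c(mean(x), mean(x1))                    # the mean is ruined by a single observation
c(median(x), median(x1), median(x2))    # the median only breaks down once half the data is moved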