Methods for Finding Estimators

Method of Moments

Let \(\pmb{x} = (x_1,...,x_n)\) be a sample from a distribution with density \(f(x|\theta_1, ... ,\theta_k)\). Define the ith sample moment by

\[m_i = (x^i_1+..+x^i_n)/n\]

Analogously define the ith population moment by

\[\mu_i = E[X^i]\]

Of course \(\mu_i\) is a function of the \(\theta_1,...,\theta_k\). So we can find estimators of \(\theta_1,...,\theta_k\) by solving the system of k equations in k unknowns

\[m_i=\mu_i\text{; }i=1,..,k\]

Example (4.2.1)

say \(X_1, ...,X_n\sim N(\mu,\sigma)\). Here \(\theta_1=\mu\) and \(\theta_2=\sigma^2\). Then

\[ \begin{aligned} &\mu_1=E[X]=\mu =m_1 =\bar{x}\\ &\mu_2=E[X^2]=\sigma^2+\mu^2=m_2= \overline{x^2} \\ &\hat{\mu}=\bar{x}\\ &\widehat{\sigma^2}=\overline{x^2}-\bar{x}^2 \end{aligned} \]
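Here is a quick R check of these two estimators (a sketch; the sample and the true values \(\mu=10\), \(\sigma=3\) are arbitrary, so the printed values should come out close to 10 and 9):

x <- rnorm(10000, 10, 3)          # arbitrary sample for illustration
xbar <- mean(x)
x2bar <- mean(x^2)
# method of moments estimates of mu and sigma^2
round(c(xbar, x2bar - xbar^2), 2)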

Example (4.2.2)

say \(X_1, ...,X_n\sim Gamma(\alpha,\beta)\). Then

\[ \begin{aligned} &E[X^k] =\beta^k\prod_{i=1}^k (\alpha+i-1) \\ &E[X] =\alpha\beta=\bar{x} \\ &E[X^2] = (\alpha+1)\alpha\beta^2=\overline{x^2} \\ &(\alpha\beta)^2+(\alpha\beta)\beta=\overline{x^2}\\ &\bar{x}^2+\bar{x}\beta=\overline{x^2}\\ &\hat{\beta} = \left(\overline{x^2}-\bar{x}^2\right)/\bar{x}\\ &\hat{\alpha} = \bar{x}^2/\left(\overline{x^2}-\bar{x}^2\right) \end{aligned} \]

Here is an R calculation:

x <- rgamma(10000, 2.3, 1/5.6)
xbar <-  mean(x)
x2bar <-  mean(x^2)
round(c(xbar^2/(x2bar - xbar^2), x2bar/xbar - xbar), 2)
## [1] 2.26 5.69

Method of Least Squares

Example (4.2.3)

say \(X_1, .., X_n\sim N(\mu,\sigma)\), \(\sigma\) assumed known. Now \(\mu\) is the mean of the normal distribution, so the observations should be scattered around \(\mu\). If we estimate \(\mu\) by some value a, then \(\epsilon_i=X_i-a\) is called the ith residual or error. A measure of the overall error is

\[G(a) = \sum_{i=1}^n (x_i-a)^2\] and an estimator of the parameter can be found as the value \(\hat{\mu}\) that minimizes G:

\[ \begin{aligned} &\frac{d G(a)}{d a} = 2\sum_{i=1}^n (x_i-a) = 2\sum_{i=1}^n x_i-2na = 0\\ &\hat{\mu} = \frac1n \sum_{i=1}^n x_i = \bar{x} \end{aligned} \]

Obviously we did not need to use \(G(a)= \sum (x_i-a)^2\); other possible choices, checked numerically below, are

  • \(G(a)= \sum |x_i-a|\), which leads to a=median(\(\pmb{x}\)).

  • \(G(a)=\max\{|x_i-a|\}\), which leads to the midrange \(a=(\min\{x_i\}+\max\{x_i\})/2\).
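Here is a quick numerical check of these claims (a sketch; the sample is arbitrary and optimize only finds the minimizers approximately):

x <- rnorm(25, 10, 2)                 # arbitrary sample for illustration
G1 <- function(a) sum((x - a)^2)
G2 <- function(a) sum(abs(x - a))
G3 <- function(a) max(abs(x - a))
# minimizer of each criterion, followed by mean, median and midrange
round(c(optimize(G1, range(x))$minimum, mean(x)), 3)
round(c(optimize(G2, range(x))$minimum, median(x)), 3)
round(c(optimize(G3, range(x))$minimum, mean(range(x))), 3)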

Maximum Likelihood

The idea here is this: the likelihood function gives the likelihood (not the probability!) of a value of the parameter given the observed data, so we choose as our estimate the value of the parameter that gives the greatest likelihood to the observed data.

Example (4.2.4)

say \(X_1, ...,X_n\sim Ber(p)\). First notice that a (positive) function f has an extremal point at x iff \(\log f\) does as well, because \(\frac{d}{dx}\log f(x)=f'(x)/f(x)=0\) iff \(f'(x)=0\).

Let \(y=\sum x_i\), then

\[ \begin{aligned} &L(p|\pmb{x})=f(\pmb{x}|p) =p^y(1-p)^{n-y} \\ &l(p) =\log L(p|\pmb{x}) = \log\left(p^y(1-p)^{n-y}\right) = y\log p+(n-y)\log(1-p)\\ &\frac{d l}{d p} =\frac{y}{p}-\frac{n-y}{1-p}=0 \\ &\hat{p} = y/n = \bar{x} \end{aligned} \]

The second derivative is negative, so this is indeed a maximum.
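Here is a quick numerical check (a sketch; the simulated data are arbitrary): maximize the log-likelihood with optimize and compare to \(\bar{x}\).

x <- rbinom(100, 1, 0.3)          # arbitrary Bernoulli sample for illustration
y <- sum(x); n <- length(x)
loglike <- function(p) y*log(p) + (n - y)*log(1 - p)
round(c(optimize(loglike, c(0.001, 0.999), maximum = TRUE)$maximum, mean(x)), 3)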

Example (4.2.5)

say \(X_1, ..., X_n\sim U[0,\theta]\), \(\theta>0\). Then

\[ \begin{aligned} &L(\theta|\pmb{x})=f(\pmb{x}|\theta) = \prod_{i=1}^n \frac1\theta I_{[0,\theta]}(x_i) = \\ &\theta^{-n} I_{[0,\theta]}(\max\{x_i\}) = \\ &\theta^{-n} I_{[\max\{x_i\},\infty]}(\theta) \end{aligned} \] Here is a graph of this function:

n=10
theta0=2
xmax=max(runif(n, 0, theta0))
round(c(xmax, 1/xmax), 3)
## [1] 1.954 0.512
L=function(t) ifelse(t>=xmax, 1, 0)/t^n
curve(L, 1.5, 2.5)

So \(L(\theta|\pmb{x})\) is 0 on \((0,\max(x_i))\), at \(\max(x_i)\) it jumps to \(1/(\max(x_i))^n\) and then monotonically decreases as \(\theta\) gets bigger, so the maximum is obtained at \(\theta=\max\{x_i\}\), therefore the mle is \(\max\{x_i\}\).

Notice that here taking logarithms is of no use, because the likelihood is 0 for values of \(\theta\) just below the point where the maximum is attained, so the maximum is not found by setting a derivative equal to 0.

Example (4.2.6)

say \(X_1, .., X_k\sim Bin(n,p)\), both p and n unknown. We want to find the mle’s of p and n. We have

\[ \begin{aligned} &L(p,n|\pmb{x})=f(\pmb{x}|p,n) = \prod_{i=1}^k {n\choose x_i} p^{x_i}(1-p)^{n-x_i} \\ &l(p,n) =\log L(p,n|\pmb{x}) = \log\left(\prod_{i=1}^k {n\choose x_i} p^{x_i}(1-p)^{n-x_i}\right) = \\ &\sum_{i=1}^k\log {n\choose x_i}+\left(\sum_{i=1}^k x_i\right) \log p+\left(nk-\sum_{i=1}^k x_i\right) \log (1-p) \end{aligned} \]

now

\[\frac{d l}{dp}=\left(\sum_{i=1}^k x_i\right) \frac1p-\left(nk-\sum_{i=1}^k x_i\right) \frac1{1-p}=0\] and so \(\hat{p}=(\sum_{i=1}^k x_i)/(nk)\); that is, for any fixed value of n we have the mle of p. Since n has to be an integer, we only need to search through integer values of n for the overall mle. This is done in mle.bin.n below. We also need a routine that calculates log(n!) for any n; logfac does this using Stirling’s formula.

logfac <- function(n)
  ifelse(n<20, log(factorial(n)),
    0.918938533205+(n+0.5)*log(n)-n+(1/12-1/(360*n^2))/n)
mle.bin.n <- function (x, Show = FALSE) 
{
    k <-  length(x)
    xbar <-  mean(x)
    f <-  function(n) {
        phat <-  xbar/n
        f <-  k*logfac(n) + n*k*(phat* log(phat) + 
              (1-phat)*log(1-phat))
        for (i in 1:k) f <-  f - logfac(n-x[i])
        f
    }
    n <-  max(max(x) + 2, floor(xbar + 2.5 * var(x)))
    l <-  c(0, 0)
    l[1] <-  f(n)
    l[2] <-  f(n + 1)
    if (Show) 
        print(c(n, l[1], n+1, l[2]))
    if (l[1] > l[2]) {
        repeat {
            n <-  n-1
            if (n == max(x)) 
                return(c(n, mean(x)/n))
            l[2] <-  l[1]
            l[1] <-  f(n)
            if (Show) 
                print(c(n, l[1], n+1, l[2]))
            if (l[1] < l[2]) 
                return(c(n + 1, xbar/(n + 1)))
        }
    }
    repeat {
        n <-  n+1
        l[2] <- f(n + 1)
        if (Show) 
            print(c(n, l[1], n+1, l[2]))
        if (l[1] > l[2]) 
            return(c(n, xbar/n))
        l[1] <-  l[2]
    }
    
}
x <- rbinom(1000, 67, 0.2)
mle.bin.n(x, Show = TRUE)
## [1]    40.00 21713.17    41.00 21714.65
## [1]    41.00 21713.17    42.00 21715.95
## [1]    42.00 21715.95    43.00 21717.09
## [1]    43.00 21717.09    44.00 21718.08
## [1]    44.00 21718.08    45.00 21718.96
## [1]    45.00 21718.96    46.00 21719.73
## [1]    46.00 21719.73    47.00 21720.41
## [1]    47.00 21720.41    48.00 21721.02
## [1]    48.00 21721.02    49.00 21721.55
## [1]    49.00 21721.55    50.00 21722.02
## [1]    50.00 21722.02    51.00 21722.43
## [1]    51.00 21722.43    52.00 21722.80
## [1]    52.00 21722.80    53.00 21723.12
## [1]    53.00 21723.12    54.00 21723.41
## [1]    54.00 21723.41    55.00 21723.66
## [1]    55.00 21723.66    56.00 21723.88
## [1]    56.00 21723.88    57.00 21724.08
## [1]    57.00 21724.08    58.00 21724.25
## [1]    58.00 21724.25    59.00 21724.39
## [1]    59.00 21724.39    60.00 21724.52
## [1]    60.00 21724.52    61.00 21724.63
## [1]    61.00 21724.63    62.00 21724.73
## [1]    62.00 21724.73    63.00 21724.81
## [1]    63.00 21724.81    64.00 21724.87
## [1]    64.00 21724.87    65.00 21724.93
## [1]    65.00 21724.93    66.00 21724.97
## [1]    66.00 21724.97    67.00 21725.01
## [1]    67.00 21725.01    68.00 21725.03
## [1]    68.00 21725.03    69.00 21725.05
## [1]    69.00 21725.05    70.00 21725.06
## [1]    70.00 21725.06    71.00 21725.07
## [1]    71.00 21725.07    72.00 21725.07
## [1] 71.0000000  0.1905634

Example (4.2.7)

\(X_1, .., X_n\sim N(\mu,\sigma)\):

\[ \begin{aligned} &l(\mu,\sigma) = -\frac{n}2\log(2\pi)-n\log\sigma-\frac1{2\sigma^2}\sum(x_i-\mu)^2\\ &\frac{dl(\mu,\sigma)}{d\mu} = \frac1{\sigma^2}\sum(x_i-\mu)=(\sum x_i-n\mu)/\sigma^2=0\\ &\hat{\mu} =\bar{x} \\ &\frac{dl(\mu,\sigma)}{d\sigma} =-\frac{n}{\sigma}+ \frac1{\sigma^3}\sum(x_i-\mu)^2=0\\ &\hat{\sigma}=\sqrt{\frac1n \sum(x_i-\bar{x})^2 } \end{aligned} \]
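Here is a quick numerical check (a sketch; the simulated data, the starting values and the guard against \(\sigma\le 0\) are choices made just for this illustration): maximize the log-likelihood with optim and compare with \(\bar{x}\) and \(\hat{\sigma}\).

x <- rnorm(1000, 5, 2)             # arbitrary sample for illustration
negloglike <- function(par) {      # par = c(mu, sigma)
  if (par[2] <= 0) return(Inf)     # keep the search away from sigma <= 0
  -sum(dnorm(x, par[1], par[2], log = TRUE))
}
fit <- optim(c(0, 1), negloglike)
round(c(fit$par, mean(x), sqrt(mean((x - mean(x))^2))), 3)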

Example (4.2.8)

We have observations \(X_1, ...,X_n\) which are independent. We know that our population is made up of two groups (Men - Women, say) and each observation comes from one or the other group but we don’t know which. Observations from group i have a \(N(\mu_i,\sigma_i)\), i=1,2, distribution. We want to estimate the parameters.

What we have here is called a mixture distribution. Say the proportion of members of group 1 in the population is \(\alpha\). Let’s introduce a latent (unobservable) r.v. \(Z_i\), which is 1 if observation \(X_i\) comes from group 1, and 2 if it comes from group 2. Then \[ \begin{aligned} &F(x) =P(X_i\le x)= \\ &P\left(X_i\le x\vert Z_i=1\right)P(Z_i=1)+P\left(X_i\le x\vert Z_i=2\right)P(Z_i=2) = \\ &\Phi(x|\mu_1,\sigma_1)\alpha+\Phi(x|\mu_2,\sigma_2)(1-\alpha) \\ &f(x) = \alpha\phi(x|\mu_1,\sigma_1)+(1-\alpha)\phi(x|\mu_2,\sigma_2) \\ &L(\alpha,\mu_1,\sigma_1,\mu_2,\sigma_2\vert\pmb{x}) = \prod_{i=1}^n \left[\alpha\phi(x_i|\mu_1,\sigma_1)+(1-\alpha)\phi(x_i|\mu_2,\sigma_2) \right]\\ &l(\alpha,\mu_1,\sigma_1,\mu_2,\sigma_2\vert\pmb{x}) = \sum_{i=1}^n\log \left[\alpha\phi(x_i|\mu_1,\sigma_1)+(1-\alpha)\phi(x_i|\mu_2,\sigma_2) \right]\\ \end{aligned} \]

where we use the notation \(\Phi (x|\mu,\sigma)\) for the cdf of a \(N(\mu,\sigma)\) r.v and \(\phi (x|\mu,\sigma)\) for its density.

Unfortunately this expression does not simplify! Also, it is a function of 5 variables, so simply looking at it with a graph is difficult.

To start let’s keep it simple and assume we know \(\mu_1\), \(\sigma_1\), \(\mu_2\) and \(\sigma_2\) and we want to estimate \(\alpha\). Then

\[\frac{dl}{d\alpha}=\sum_{i=1}^n \frac{\phi(x_i|\mu_1,\sigma_1)-\phi(x_i|\mu_2,\sigma_2)}{\alpha\phi(x_i|\mu_1,\sigma_1)+(1-\alpha)\phi(x_i|\mu_2,\sigma_2)}=0\]

This is a non-linear equation, which cannot be solved explicitly, so we have to solve it numerically. A standard method in numerical analysis for solving equations of the form h(x)=0 is Newton’s method:

pick a starting point \(x_1\), find

\[x_{n+1} = x_n - \frac{h(x_n)}{h'(x_n)}\]

If the starting point is close enough to a solution of the equation, the sequence will converge to it.
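As a small stand-alone illustration (a sketch, not part of the mixture problem), here is the iteration applied to \(h(x)=x^2-2=0\), whose solution is \(\sqrt 2\):

h <- function(x) x^2 - 2
hprime <- function(x) 2*x
xnew <- 1                        # arbitrary starting point
repeat {                         # Newton iteration, stop when successive values agree to 10^-8
  xold <- xnew
  xnew <- xold - h(xold)/hprime(xold)
  if (abs(xnew - xold) < 10^-8) break
}
c(xnew, sqrt(2))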

Let’s implement Newton’s method for this problem. To simplify the notation set \(\phi_i= \phi (x_i|\mu_2,\sigma_2)\) and \(\psi_i= \phi (x_i|\mu_1,\sigma_1)- \phi (x_i|\mu_2,\sigma_2)\), then

\[h(\alpha)=\sum_{i=1}^n \frac{\psi_i}{\alpha\psi_i+\phi_i}\] and

\[h'(\alpha)=-\sum_{i=1}^n \frac{\psi_i^2}{(\alpha\psi_i+\phi_i)^2}\] Let’s implement this:

mixmle1 <- function(x, mu, sigma) {
    phi <- dnorm(x, mu[2], sigma[2])
    psi <- dnorm(x, mu[1], sigma[1]) - phi
    anew <- 0.5
    repeat {
        aold <- anew
        h <- sum(psi/(aold*psi+phi))
        hprime <- -sum(psi^2/(aold*psi+phi)^2)
        anew <- aold - h/hprime
        if (abs(anew - aold) < 10^-5) 
            break
    }
    anew
}
n <- 1000; alpha <- 0.3
mu <- c(0, 5); sigma <- c(1, 1)
z <- sample(c(1, 2), size=n, replace=TRUE, 
            prob=c(alpha, 1-alpha))
x <- c(rnorm(n, mu[1], sigma[1])[z == 1], 
       rnorm(n, mu[2], sigma[2])[z == 2])
alphahat <- mixmle1(x, mu, sigma)
alphahat
## [1] 0.2977849
f <- function(x) 
  alphahat*dnorm(x, mu[1], sigma[1]) +
  (1-alphahat)*dnorm(x, mu[2], sigma[2])
df <- data.frame(x=x)
bw <- diff(range(x))/50 
ggplot(df, aes(x)) +
  geom_histogram(aes(y = ..density..),
    color = "black", 
    fill = "white", 
    binwidth = bw) + 
    labs(x = "x", y = "Density") +
  stat_function(fun = f, colour = "blue")

Next let’s consider the case where we know \(\alpha\), \(\sigma_1\) and \(\sigma_2\) and want to estimate \(\mu_1\) and \(\mu_2\). For this we need a multivariate extension of Newton’s method. Say \(h(\pmb{x})\) is a real-valued function on \(\mathbb{R}^n\), and we wish to find a maximum (or more generally an extremal point) of h. Let \(\nabla h\) be the gradient of h, that is

\[(\nabla h(x))_i = \frac{\partial h(x)}{\partial x_i}\]

and let H be the Hessian matrix defined by

\[H_{ij}(x) = \frac{\partial^2 h(x)}{\partial x_i \partial x_j}\]

then

\[x_{n+1} = x_n - H^{-1}(x_n) \nabla h(x_n)\]

Here this means:

\[ \begin{aligned} &\frac{d\phi}{d\mu} =\frac{d}{d\mu}\left[\frac{1}{\sqrt{2\pi\sigma^2}}\exp \left\{-\frac{(x-\mu)^2}{2\sigma^2} \right\}\right]= \\ &\frac{1}{\sqrt{2\pi\sigma^2}}\exp \left\{-\frac{(x-\mu)^2}{2\sigma^2} \right\}\frac{x-\mu}{\sigma^2}=\frac{x-\mu}{\sigma^2}\phi(x)\\ &\frac{d^2\phi}{d\mu^2} = \frac{-1}{\sigma^2}\phi(x)+\left(\frac{x-\mu}{\sigma^2}\right)^2\phi(x) = \\ &\left(\frac{(x-\mu)^2-\sigma^2}{\sigma^4}\right)\phi(x) \end{aligned} \]

Again we use some short-cut notation: set \(\phi_i^k= \phi (x_i|\mu_k,\sigma_k)\) and \(\psi_i= \alpha\phi_i^1+(1-\alpha)\phi_i^2\), then

\[ \begin{aligned} &h_1(\mu_1,\mu_2) = \frac{\alpha}{\sigma_1^2}\sum_{i=1}^n \frac{(x_i-\mu_1)\phi_i^1}{\psi_i}\\ &h_2(\mu_1,\mu_2) = \frac{(1-\alpha)}{\sigma_2^2}\sum_{i=1}^n \frac{(x_i-\mu_2)\phi_i^2}{\psi_i}\\ &H[1,1] = \frac{\alpha}{\sigma_1^4}\sum_{i=1}^n \frac{[(x_i-\mu_1)^2-\sigma_1^2]\phi_i^1\psi_i-\alpha(x_i-\mu_1)^2(\phi_i^1)^2}{\psi_i^2} \\ &H[1,2]=H[2,1]=-\frac{\alpha(1-\alpha)}{\sigma_1^2\sigma_2^2}\sum_{i=1}^n \frac{(x_i-\mu_1)(x_i-\mu_2)\phi_i^1\phi_i^2}{\psi_i^2}\\ &H[2,2]=\frac{1-\alpha}{\sigma_2^4}\sum_{i=1}^n \frac{[(x_i-\mu_2)^2-\sigma_2^2]\phi_i^2\psi_i-(1-\alpha)(x_i-\mu_2)^2(\phi_i^2)^2}{\psi_i^2} \end{aligned} \] Also note that if

\[ \pmb{A} = \begin{pmatrix} a & b \\ b & c \\ \end{pmatrix} \] then

\[ \pmb{A}^{-1} = \frac1{ac-b^2} \begin{pmatrix} c & -b \\ -b & a \\ \end{pmatrix} \] and so we have

mixmle2 <- function (x, alpha, sigma) {
  h <- c(0, 0)
  H <- matrix(0, 2, 2)
  munew <- c(1, 4)
  repeat {
    muold <- munew
    phi1 <- dnorm(x, muold[1], sigma[1])
    phi2 <- dnorm(x, muold[2], sigma[2])
    psi <- alpha*phi1 + (1-alpha)*phi2
    print(round(c(munew, sum(log(psi))), 3))
    h[1] <- alpha/sigma[1]^2*sum((x-muold[1])*phi1/psi)
    h[2] <- (1-alpha)/sigma[2]^2*sum((x-muold[2])*phi2/psi)
    H[1, 1] <- alpha/sigma[1]^4*sum((((x-muold[1])^2-sigma[1]^2) * 
                      phi1 * psi - alpha * (x - muold[1])^2 * phi1^2)/psi^2)
    H[1, 2] <- -alpha*(1-alpha)/(sigma[1]^2 * sigma[2]^2) * 
            sum((x - muold[1]) * (x - muold[2]) * phi1 *  phi2/psi^2)
    H[2, 1] <- H[1, 2]
    H[2, 2] <- (1-alpha)/sigma[2]^4 * sum((((x-muold[2])^2 - 
            sigma[2]^2)*phi2*psi-(1-alpha) * (x - muold[2])^2 * phi2^2)/psi^2)
    Hinf <- cbind(c(H[2, 2], -H[1, 2]), c(-H[1, 2], 
                    H[1, 1]))/(H[1, 1] * H[2, 2] - H[1, 2]^2)
    munew <- muold - Hinf %*% h
    if (sum(abs(munew - muold)) < 10^-5) 
        break
  }
  round(c(munew), 3)
}
mixmle2(x, alpha, sigma)
## [1]     1.000     4.000 -2504.643
## [1]     0.743     4.499 -2177.276
## [1]     0.511     4.800 -2059.271
## [1]     0.324     4.921 -2023.829
## [1]     0.188     4.969 -2011.291
## [1]     0.098     4.989 -2006.914
## [1]     0.044     4.996 -2005.517
## [1]     0.014     4.998 -2005.104
## [1]    -0.002     4.999 -2004.987
## [1]    -0.010     4.999 -2004.954
## [1]    -0.015     4.998 -2004.945
## [1]    -0.017     4.998 -2004.943
## [1]    -0.019     4.998 -2004.942
## [1]    -0.019     4.998 -2004.942
## [1]    -0.020     4.998 -2004.942
## [1]    -0.020     4.998 -2004.942
## [1]    -0.020     4.997 -2004.942
## [1]    -0.020     4.997 -2004.942
## [1]    -0.020     4.997 -2004.942
## [1]    -0.020     4.997 -2004.942
## [1]    -0.020     4.997 -2004.942
## [1] -0.020  4.997

How about the complete problem with 5 parameters? This can be done but is quite an effort.

Example (4.2.9)

Say \(X_1,..,X_n\) are iid with density \(h(x\vert\beta) =0.95I_{[0,1]}(x)+0.5I_{[\beta,\beta+0.1]}(x)\), so \(X\sim U[0,1]\) with probability 0.95 and \(X\sim U[\beta,\beta+0.1]\) with probability 0.05. Now

\[l(\beta\vert\pmb{x})=\sum\log \left[0.95I_{[0,1]}(x_i)+0.5I_{[\beta,\beta+0.1]}(x_i)\right]\] and here is what this looks like:

a <- 0.95; b <- 0.5
x <-  c(runif(1000*a), runif(1000*(1-a), b, b+0.1))
z <-  seq(0, 0.9, length = 100)
loglike <-  function(b) 
    sum(log(a+(1-a)*10*ifelse(x>b & x<b+0.1, 1, 0)))
y <-  z
for (i in 1:100) y[i] <-  loglike(z[i])
df <- data.frame(z=z, y=y)
ggplot(data=df, aes(z, y)) +
  geom_line()

and we see that it has many local maxima. Moreover, it is a step function of \(\beta\), so it is not even continuous, let alone differentiable. So finding the mle is a very difficult task.

Example (4.2.10)

say X has a multinomial distribution with parameters \(m, p_1,..,p_k\) (we assume m is known). If we simply set the derivatives of the log-likelihood equal to 0 we find

\[ \begin{aligned} &\frac{\partial}{\partial p_i}l(p_1,..p_k) = \\ &\frac{\partial}{\partial p_i} \left[c+\sum_{j=1}^k x_j\log p_j\right] = \\ &\frac{x_i}{p_i} =0 \\ \end{aligned} \]

and this system has no solution. The problem is that we are ignoring the condition \(p_1+..+p_k=1\). So we really have the problem

Maximize \(l(p_1,..,p_k)\) subject to \(p_1+..+p_k=1\)

One way to do this is with the method of Lagrange multipliers: maximize

\(l(p_1,..,p_k)- \lambda (p_1+..+p_k-1)\)

\[ \begin{aligned} &\frac{\partial}{\partial p_i} \left[l(p_1,..,p_k)-\lambda(p_1+..+p_k-1)\right] = \\ &\frac{\partial}{\partial p_i} \left[c+\sum_{j=1}^k x_j\log p_j -\lambda(p_1+..+p_k-1)\right] = \\ &\frac{x_i}{p_i} -\lambda =0 \\ &x_i=\lambda p_i\\ &m=\sum_{j=1}^k x_j=\lambda \sum_{j=1}^k p_j=\lambda\\ &\hat{p_i}=\frac{x_i}{m} \end{aligned} \]
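Here is a quick numerical check (a sketch; the values of m and p, the reparametrization in terms of \((p_1,p_2)\) and the use of optim are choices made just for this illustration): compare \(x_i/m\) with a direct numerical maximization of the log-likelihood.

m <- 1000
p <- c(0.2, 0.3, 0.5)              # arbitrary true probabilities
x <- c(rmultinom(1, m, p))
negloglike <- function(q) {        # q = c(p1, p2), with p3 = 1 - p1 - p2
  p3 <- 1 - sum(q)
  if (any(q <= 0) || p3 <= 0) return(Inf)
  -sum(x*log(c(q, p3)))
}
fit <- optim(c(1, 1)/3, negloglike)
round(rbind(numeric = c(fit$par, 1 - sum(fit$par)), exact = x/m), 4)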

Properties of mle’s

Maximum likelihood estimators have a number of nice properties. One of them is their invariance under transformations. That is, if \(\hat{\theta}\) is the mle of \(\theta\), then \(g(\hat{\theta})\) is the mle of \(g(\theta)\).

Example (4.2.11)

say \(X_1, .., X_n\sim Ber(p)\), so we know that the mle is \(\bar{X}\) . Say we are interested in

\[\theta=p-q=p-(1-p)=2p-1\]

the difference in proportions. Therefore \(2\bar{X} -1\) is the mle of \(\theta\).

Let’s see whether we can verify that. First if \(\theta=2p-1\) we have \(p=(1+\theta)/2\). Let \(y=\sum x_i\), so

\[ \begin{aligned} &L(p\vert \pmb{x}) = p^y(1-p)^{n-y}\\ &L(\theta\vert\pmb{x}) = \left(\frac{1+\theta}2\right)^y\left(1-\frac{1+\theta}2\right)^{n-y} =\\ &\left(\frac{1+\theta}2\right)^y\left(\frac{1-\theta}2\right)^{n-y} =(1+\theta)^y(1-\theta)^{n-y}/2^n\\ &l(\theta\vert\pmb{x}) = y\log(1+\theta)+(n-y)\log(1-\theta)-n\log 2\\ &\frac{dl}{d\theta} = \frac{y}{1+\theta}-\frac{n-y}{1-\theta} = 0\\ &\hat{\theta}=2(y/n)-1=2\bar{x}-1 \end{aligned} \]

Example (4.2.12)

say \(X_1, .., X_n\sim N(\mu,\sigma)\). We found before that the mle of \(\sigma\) is \(\hat{\sigma}=\sqrt{\frac1n\sum(x_i-\bar{x})^2}\). But then, by invariance, the mle of the variance is

\[\widehat{\sigma^2}=\frac1n\sum(x_i-\bar{x})^2\]

Theorem (4.2.13)

Let \(X_1, .., X_n\) be iid \(f(x|\theta)\). Let \(\hat{\theta}\) denote the mle of \(\theta\), and let \(g(\theta)\) be a continuous function of \(\theta\). Under some regularity conditions on f we have

\[ \sqrt{n}\left[ g(\hat{\theta})-g(\theta) \right]\rightarrow N\left(0, \sqrt{v(\theta)}\right) \]

where \(v(\theta)\) is the Rao-Cramer lower bound for estimating \(g(\theta)\) based on a single observation. That is, \(g(\hat{\theta})\) is a consistent and asymptotically efficient estimator of \(g(\theta)\).

Example (4.2.14)

say \(X_1, .., X_n\sim N(\mu, \sigma)\), \(\sigma\) known. We know that the mle is \(\bar{x}\), which is unbiased. Now

\[ \begin{aligned} &f(x\vert\mu) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left\{-\frac1{2\sigma^2}(x-\mu)^2 \right\} \\ &\log f(x\vert\mu) = K-\frac1{2\sigma^2}(x-\mu)^2 \\ &\frac{d\log f}{d\mu} = \frac1{\sigma^2}(x-\mu) \\ &\frac{d^2\log f}{d\mu^2} = -\frac1{\sigma^2} \\ &v(\mu) = \frac1{1/\sigma^2} = \sigma^2 \end{aligned} \] and so

\[ \bar{X}-\mu \sim N(0, \frac{\sigma}{\sqrt{n}}) \]

Example (4.2.15)

say \(X_1, .., X_n\sim Ber(p)\) and we want to estimate \(\theta=2p-1\). We saw above that the mle is given by \(2\bar{X}-1\). We have

\[E[2\bar{X} -1]=2p-1=\theta\]

and so \(2\bar{x}-1\) is an unbiased estimator of \(\theta\). Now

\[ \begin{aligned} &var(2\bar{X}-1) = 4var(X_1)/n = 4p(1-p)/n =\\ &4\left(\frac{1+\theta}2\right)\left(\frac{1-\theta}2\right)/n =(1+\theta)(1-\theta)/n \end{aligned} \]

Note that \(E[X]=p=\frac{1+\theta}2\). Now

\[ \begin{aligned} &f(x\vert\theta) = \left(\frac{1+\theta}2\right)^x\left(\frac{1-\theta}2\right)^{1-x}\\ &\log f(x\vert\theta) =x\log(1+\theta)+(1-x)\log(1-\theta)-\log 2 \\ &\frac{d\log f(x\vert\theta)}{d\theta} = \frac{x}{1+\theta}-\frac{1-x}{1-\theta}\\ &\frac{d^2\log f(x\vert\theta)}{d\theta^2} = -\frac{x}{(1+\theta)^2}-\frac{1-x}{(1-\theta)^2}\\ &E\left[\frac{d^2\log f(X\vert\theta)}{d\theta^2}\right] = -\frac{E[X]}{(1+\theta)^2}-\frac{E[1-X]}{(1-\theta)^2} = \\ &-\frac{\frac{1+\theta}2}{(1+\theta)^2}-\frac{\frac{1-\theta}2}{(1-\theta)^2}=\\ &-\frac1{(1+\theta)(1-\theta)} \end{aligned} \]

so \(v(\theta)=(1+\theta)(1-\theta)\) and

\[2\bar{X}-1-\theta\sim N\left(0, \sqrt{(1+\theta)(1-\theta)/n}\right)\]
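Here is a small simulation check (a sketch; B, n and p are arbitrary settings): the standard deviation of the simulated values of \(2\bar{X}-1\) should be close to \(\sqrt{(1+\theta)(1-\theta)/n}\).

B <- 10000; n <- 100; p <- 0.3          # arbitrary simulation settings
theta <- 2*p - 1
thetahat <- 2*rbinom(B, n, p)/n - 1     # 2*xbar - 1 for B simulated samples
round(c(sd(thetahat), sqrt((1 + theta)*(1 - theta)/n)), 4)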

Bayesian Point Estimation

We have already seen how to use a Bayesian approach to find point estimators, namely by using the mean of the posterior distribution. Of course one could also use the median or any other measure of central tendency. A popular choice, for example, is the mode of the posterior distribution.

Example (4.2.16)

Let’s say we have \(X_1, .., X_n\sim Ber(p)\) and \(p\sim Beta(\alpha,\beta)\), then we already know that

\[p|x_1,..x_n\sim Beta(\alpha+\sum x_i,n-\sum x_i+ \beta )\]

and so we can estimate p as follows:

  • Mean \(\hat{p}=\frac{\alpha+\sum x_i}{\alpha+\beta+n}\)

  • Median \(\hat{p}=qbeta(0.5,\alpha+\sum x_i,n-\sum x_i+ \beta)\)

  • Mode

This is the point where the posterior density has its maximum, and it is easy to verify that \(\hat{p}=\frac{\alpha+\sum x_i-1}{\alpha+\beta+n-2}\) (provided both posterior parameters are greater than 1).

As \(n\rightarrow\infty\) the posterior mean and mode clearly approach \(\sum x_i/n=\bar{x}\). In fact so does the median, though that is somewhat more complicated to show.
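Here is a short R illustration of the three estimates (a sketch; the data and the Beta(2, 2) prior are arbitrary choices):

alpha <- 2; beta <- 2             # arbitrary prior parameters
x <- rbinom(50, 1, 0.4)           # arbitrary data
A <- alpha + sum(x)               # first posterior parameter
B <- length(x) - sum(x) + beta    # second posterior parameter
round(c(mean = A/(A + B),
        median = qbeta(0.5, A, B),
        mode = (A - 1)/(A + B - 2)), 3)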

Example (4.2.17)

Let’s say we have \(X_1, .., X_n\sim N(\mu,\sigma)\) and we want to estimate both \(\mu\) and \(\sigma\). So we need priors on both parameters:

  • \(\mu\): we use the improper prior \(g(\mu)=1\)

  • \(\sigma\): we use the Jeffreys prior \(\pi(\sigma)=1/\sigma\)

and we will assume that the priors for \(\mu\) and \(\sigma\) are independent. We then get the joint prior on \((\mu,\sigma)\) to be proportional to \(1/\sigma\). Therefore

\[ \begin{aligned} &f(\pmb{x}\vert\mu,\sigma) =(2\pi\sigma^2)^{-n/2}\exp \left\{-\frac1{2\sigma^2}\sum_{i=1}^n\left(x_i-\mu\right)^2 \right\} \\ &f(\pmb{x},\mu,\sigma) =(2\pi\sigma^2)^{-n/2}\exp \left\{-\frac1{2\sigma^2}\sum_{i=1}^n\left(x_i-\mu\right)^2 \right\}\times\frac1{\sigma} =\\ &(2\pi)^{-n/2}\sigma^{-n-1}\exp \left\{-\frac1{2\sigma^2}\left[\sum_{i=1}^n\left(x_i-\bar{x}\right)^2+n(\bar{x}-\mu)^2\right] \right\} =\\ &(2\pi)^{-n/2}\sigma^{-n-1}\exp \left\{-\frac{(n-1)s^2}{2\sigma^2} \right\}\exp \left\{-\frac{n(\mu-\bar{x})^2}{2\sigma^2} \right\} \end{aligned} \]

therefore

\[ \begin{aligned} &f(\mu,\sigma\vert\pmb{x}) \propto \sigma^{-n-1}\exp \left\{-\frac{(n-1)s^2}{2\sigma^2} \right\}\exp \left\{-\frac{n(\mu-\bar{x})^2}{2\sigma^2} \right\}\\ &f(\mu\vert\pmb{x},\sigma) \propto \exp \left\{-\frac{n(\mu-\bar{x})^2}{2\sigma^2} \right\} \end{aligned} \]

and so \(\mu\vert\pmb{x},\sigma\sim N(\bar{x},\sigma/\sqrt n)\). If we use the mean of the posterior distribution we get the sample mean as the estimator of the population mean.

How about \(\sigma\)? The marginal posterior of \(\sigma^2\) turns out to be a scaled inverse-\(\chi^2\) distribution; that is, \(\sigma^2\vert\pmb{x}\) has the distribution of \((n-1)s^2/Z\) where \(Z\sim\chi^2(n-1)\). Its mean is \((n-1)s^2/(n-3)\), which for large n is essentially the sample variance \(s^2\).
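Here is a quick check of this (a sketch, assuming the scaled inverse-\(\chi^2\) form just described, with n-1 degrees of freedom and scale \(s^2\)): simulate from the posterior of \(\sigma^2\) and compare its mean with \((n-1)s^2/(n-3)\) and \(s^2\).

n <- 20
x <- rnorm(n, 10, 3)                       # arbitrary data
s2 <- var(x)
sigma2 <- (n - 1)*s2/rchisq(1e5, n - 1)    # draws from the posterior of sigma^2
round(c(mean(sigma2), (n - 1)*s2/(n - 3), s2), 3)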

We see that with these priors the Bayesian and frequentist (maximum likelihood) estimators are essentially the same. If we use such “flat” priors that often turns out to be the case.