Parameter Estimation

In this section we will study the problem of parameter estimation. In its most general form it is as follows: we have a sample \(X_1,..,X_n\) from some probability density \(f(x;\theta)\). Here both \(x\) and \(\theta\) might be vectors. We will use the term density for both the discrete and the continuous case.

The problem is to find an estimate of \(\theta\) based on the data \(X_1,..,X_n\), that is a function (called a statistic) \(T(X_1,..,X_n)\) such that in some sense \(T(X_1,..,X_n) \approx \theta\).
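
For example (a small, made-up illustration): if \(X_1,..,X_n\) come from a \(N(\mu, \sigma)\) distribution, the sample mean is a statistic that estimates \(\mu\):

xbar <- function(x) mean(x)  # a statistic: a function of the data only
x <- rnorm(50, 10, 3)        # a simulated sample; true mu is 10
xbar(x)                      # should be reasonably close to 10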

Example: Binomial proportions

In a survey of 527 people, 235 said they prefer Coke over Pepsi. What is the percentage of people who prefer Coke over Pepsi?

The answer seems obvious: 235/527. However, let’s work this out in detail. First, each person is a Bernoulli trial (yes/no) with some success probability \(\pi\). So we have the density

\[ P(X_i=1)=1-P(X_i=0)=\pi \]

which we can write as

\[ f(x)=\pi^x(1-\pi)^{1-x} \text{, }x=0,1 \]

the joint density is given by

\[ \begin{aligned} &f(x_1, .., x_n)= \\ &\prod_{i=1}^n \pi^{x_i}(1-\pi)^{1-x_i} = \\ & \pi^{\sum x_i}(1-\pi)^{\sum(1-x_i)} = \\ & \pi^y(1-\pi)^{n-y} \\ \end{aligned} \] where \(y=\sum x_i\) is the number of successes.

In our case this becomes \(\pi^{235}(1-\pi)^{292}\) and the task is to estimate \(\pi\).

Likelihood function

Say we have \(X_1,..,X_n \sim f(x;\theta)\), independent. Then the joint density of \(\mathbf{X}= (X_1,..,X_n)\) is given by

\[ f(\mathbf{x};\theta)=\prod_{i=1}^n f(x_i;\theta) \]

The likelihood function L is defined by

\[ L(\theta;\mathbf{x})=\prod_{i=1}^n f(x_i;\theta) \]

This does not seem to be much of a change: the right-hand side is the same. It is, however, a very different expression: in \(f(\mathbf{x};\theta)\), \(\mathbf{x}\) is the variable and \(\theta\) is an (unknown) constant; in \(L(\theta;\mathbf{x})\), \(\theta\) is the variable and \(\mathbf{x}\) is a (known) constant.

It is the essential difference between before the experiment, when one asks questions of probability, and after the experiment, when one asks questions of statistics.
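
A small numerical illustration of the two points of view (the numbers are arbitrary): the same function dbinom(y, n, \(\pi\)) answers a probability question when \(\pi\) is fixed and the possible data vary, and is the likelihood when the data are fixed and \(\pi\) varies.

dbinom(0:10, 10, 0.3)             # probability: pi fixed at 0.3, possible outcomes vary
dbinom(3, 10, c(0.1, 0.3, 0.5))   # likelihood: data fixed at y=3, pi varies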

Closely related is the log-likelihood function

\[ l(\theta;\mathbf{x})=\log L(\theta;\mathbf{x})=\sum_{i=1}^n \log f(x_i;\theta) \] The log-likelihood is often easier to work with, not least because it turns a product into a sum.

There is a principle in Statistics, the likelihood principle, which says that any inference should be based on the likelihood function.

Example: Binomial proportions

We already found the joint density

\[ \pi^y(1-\pi)^{n-y} \]

and so the log likelihood is

\[ \begin{aligned} &l(\pi;y,n) = y\log \pi +(n-y) \log (1-\pi)\\ \end{aligned} \]

Example: Normal mean

Say \(X_i \sim N(\mu, \sigma)\) so

\[ \begin{aligned} &l(\mu;\mathbf{x}, \sigma) = \\ &\sum_{i=1}^n \log \left[ \frac1{\sqrt{2\pi \sigma^2}} \exp \left\{- \frac1{2\sigma^2} (x_i-\mu)^2 \right\} \right]= \\ &-\frac{n}2 \log (2\pi \sigma^2) - \frac1{2\sigma^2} \sum_{i=1}^n (x_i-\mu)^2 \\ \end{aligned} \] Now let \(\bar{x}=\frac1n \sum x_i\), then

\[ \begin{aligned} &\sum_{i=1}^n (x_i-\mu)^2 = \\ &\sum_{i=1}^n (x_i-\bar{x}+\bar{x}-\mu)^2 = \\ & \sum_{i=1}^n (x_i-\bar{x})^2 + 2\sum_{i=1}^n (x_i-\bar{x})(\bar{x}-\mu)+\sum_{i=1}^n (\bar{x}-\mu)^2 = \\ & \sum_{i=1}^n (x_i-\bar{x})^2 + 2(\bar{x}-\mu)\sum_{i=1}^n (x_i-\bar{x})+n(\bar{x}-\mu)^2 = \\ & \sum_{i=1}^n (x_i-\bar{x})^2 + 2(\bar{x}-\mu)(\sum_{i=1}^n x_i-n\bar{x})+n(\bar{x}-\mu)^2 = \\ & \sum_{i=1}^n (x_i-\bar{x})^2+n(\bar{x}-\mu)^2 \end{aligned} \] and so

\[ l(\mu;\mathbf{x}, \sigma) = \text{const}-\frac{1}{2\sigma^2/n} (\mu-\bar{x})^2 \] so as a function of \(\mu\) the log-likelihood is a quadratic function with vertex at \(\bar{x}\).
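
We can check this picture numerically. Here is a sketch (with simulated data; the sample size and parameter values are arbitrary) that plots \(l(\mu;\mathbf{x}, \sigma)\) and marks \(\bar{x}\):

x <- rnorm(100, 10, 3)   # simulated sample, sigma treated as known
mu <- seq(8, 12, length=250)
l <- sapply(mu, function(m) sum(dnorm(x, m, 3, log=TRUE)))
df <- data.frame(mu=mu, l=l)
ggplot(df, aes(mu, l)) +
  geom_line(color="blue", size=1.5) +
  geom_vline(xintercept = mean(x))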

Maximum Likelihood estimation

The idea of maximum likelihood estimation is to find that value of \(\theta\) that maximizes the likelihood function. In an obvious way, this is the value of the parameter that best “agrees” with the data.

Of course a function \(L\) has a maximum at \(x\) if and only if \(\log L\) has a maximum at \(x\), so we can equally well (and more easily!) maximize the log-likelihood function.

Example: Normal mean

\[ \frac{dl(\mu;\mathbf{x}, \sigma)}{d\mu} = -\frac{1}{\sigma^2/n} (\mu-\bar{x})=0 \] so \(\hat{\mu} = \bar{x}\)

This is of course a maximum, and not a minimum or an inflection point because \(-\frac{1}{\sigma^2/n}<0\).

Example: Binomial proportion

\[ \begin{aligned} &\frac{dl}{d\pi}=\frac{y}{\pi}-\frac{n-y}{1-\pi} =0 \\ &\hat{\pi} =\frac{y}{n} \end{aligned} \] and for our numbers we find \(\hat{\pi}=235/527=0.4459\)


Numerical Computation

The above calculations require some calculus. Sometimes we can let R take care of this for us:

ll <- function(pi, y, n) 
  log(dbinom(y, n, pi))
pi <- seq(0.4, 0.491, length=1000)
df <- data.frame(pi=pi, ll=ll(pi, 235, 527))
mle <- df$pi[which.max(df$ll)]
mle
## [1] 0.4459099
ggplot(df, aes(x=pi, y=ll)) + 
  geom_line(size=1.5, col="blue") +
  geom_vline(xintercept = mle, size=2)

Notice that the log-likelihood curve looks a lot like a parabola. This is not an accident, and it will come in handy soon!

Example: Beta distribution

A random variable is said to have a Beta density if

\[ f(x;\alpha,\beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}\text{, }0<x<1 \] here \(\Gamma\) is the famous gamma function

\[ \Gamma(x)=\int_0^{\infty} t^{x-1}e^{-t}dt \]
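
In R the gamma function is available as gamma(). As a quick check of the definition (recall that \(\Gamma(n)=(n-1)!\) for integer \(n\)):

gamma(5)             # = 4! = 24
factorial(4)
integrate(function(t) t^4*exp(-t), 0, Inf)  # the integral definition, numerically; also essentially 24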

Let’s simulate a sample from a Beta distribution

x <- sort(round(rbeta(500, 2, 4), 3))
beta.ex <- data.frame(x=x)
bw <- diff(range(x))/50 
ggplot(beta.ex, aes(x)) +
  geom_histogram(aes(y = ..density..),
    color = "black", 
    fill = "white", 
    binwidth = bw) + 
    labs(x = "x", y = "Density") 

and we want to estimate \(\alpha, \beta\).

Now doing this with calculus is out because the likelihood equations cannot be solved in closed form. Instead we will need to use a numerical method. Because the Beta distribution is a standard one, there is an R routine to do it for us. It is part of the MASS package:

library(MASS)
fitdistr(x, 
         densfun="Beta",
         start=list(shape1=1,shape2=1))
##     shape1      shape2  
##   2.0716029   4.2045833 
##  (0.1227766) (0.2646915)
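
The values in parentheses in the fitdistr output are the estimated standard errors. As a cross-check (a sketch; since the sample was generated without a fixed seed, the exact numbers will vary from run to run), we can maximize the Beta log-likelihood directly with optim, which should essentially reproduce the fitdistr estimates:

nll <- function(par)   # negative log-likelihood of a Beta(par[1], par[2]) sample
  -sum(dbeta(x, par[1], par[2], log=TRUE))
optim(c(1, 1), nll, method="L-BFGS-B",
      lower=c(0.01, 0.01))$par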

Example: linear density

Here are observations from a linear density \(f(x|a)=2ax+1-a\), \(0<x<1\) and \(-1<a<1\):

0.005 0.011 0.025 0.03 0.031 0.031 0.05 0.059 0.061 0.064 0.067 0.068 0.072 0.075 0.082 0.083 0.084 0.101 0.102 0.104 0.106 0.112 0.114 0.117 0.125 0.136 0.137 0.143 0.145 0.146 0.17 0.174 0.175 0.186 0.197 0.198 0.209 0.209 0.209 0.229 0.234 0.234 0.242 0.256 0.269 0.275 0.279 0.281 0.283 0.293 0.306 0.311 0.311 0.313 0.328 0.338 0.347 0.358 0.362 0.371 0.381 0.384 0.392 0.406 0.409 0.429 0.431 0.44 0.447 0.447 0.45 0.455 0.456 0.458 0.485 0.492 0.494 0.498 0.503 0.506 0.507 0.535 0.55 0.559 0.561 0.577 0.584 0.586 0.587 0.594 0.597 0.598 0.599 0.604 0.608 0.616 0.623 0.625 0.638 0.641 0.644 0.665 0.667 0.676 0.704 0.722 0.73 0.731 0.731 0.731 0.733 0.735 0.738 0.742 0.742 0.743 0.746 0.75 0.751 0.755 0.766 0.768 0.792 0.795 0.796 0.804 0.812 0.821 0.834 0.837 0.837 0.861 0.865 0.873 0.878 0.88 0.886 0.897 0.916 0.923 0.928 0.94 0.944 0.948 0.959 0.961 0.962 0.969 0.972 0.974

We want to estimate a. So let’s see:

\[ \begin{aligned} &L(a) = \prod_{i=1}^n \left[2ax_i+1-a \right] \\ &l(a) =\sum_{i=1}^n \log \left[2ax_i+1-a \right] \\ &\frac{dl}{da}=\sum_{i=1}^n\frac{2x_i-1}{2ax_i+1-a}=0 \end{aligned} \] and this equation cannot be solved analytically. Unfortunately this is not one of the distributions included in fitdistr, so we need to find a numerical solution ourselves.

Here are several:

  • Simple Grid Search

Let’s draw the curve of \(\frac{dl}{da}\). In what follows x is the data above.

f <- function(a) {
  y <- a
  for(i in seq_along(a)) 
    y[i] <- sum( (2*x-1)/(2*a[i]*x+1-a[i]) )
  y
}  
a <- seq(-0.5, 0.5, length=1000)
y <- f(a)
# find value of a where y is closest to 0
mle <- a[which.min(abs(y))]
df <- data.frame(a=a, y=y)
ggplot(df, aes(a, y)) +
  geom_line(color="blue", size=1.5) +
  geom_hline(yintercept=0) +
  geom_vline(xintercept=mle)

round(mle, 4)
## [1] -0.1176
  • Bisection Algorithm

The idea is this: our function is positive for a=-0.5 and negative for a=0.5. It is also continuous and decreasing. So we can find the zero by checking midpoints and adjusting the upper or lower limit accordingly:

low <- (-0.5)
high <- 0.5
repeat {
  mid <- (low+high)/2
  y <- f(mid)
  cat(round(mid, 4), "  ", round(y, 4),"\n")
  if(y>0) low <- mid
  else high <- mid
  if(high-low<0.0001) break  
}
## 0    -5.962 
## -0.25    6.9467 
## -0.125    0.3969 
## -0.0625    -2.7835 
## -0.0938    -1.1963 
## -0.1094    -0.4008 
## -0.1172    -0.0022 
## -0.1211    0.1973 
## -0.1191    0.0975 
## -0.1182    0.0476 
## -0.1177    0.0227 
## -0.1174    0.0102 
## -0.1173    0.004 
## -0.1172    9e-04 
  • Newton’s Method

Isaac Newton invented the following algorithm: we want to solve the equation \(f(x)=0\). Let \(x_0\) be some starting point. Then find successive points with \[ x_{n+1} = x_n - f(x_n)/f'(x_n) \]

Notice that if this sequence converges to some x we have

\[ x = x - f(x)/f'(x) \] and so \(f(x)=0\).

In our case we have \(f(a) = \frac{dl}{da}\) and so we also need

\[ f'(a) = \frac{d^2l}{da^2}=-\sum_{i=1}^n\left( \frac{2x_i-1}{2ax_i+1-a} \right)^2 \] we find

f.prime<- function(a) {
  y <- a
  for(i in seq_along(a)) 
    y[i] <- (-1)*sum( ((2*x-1)/(2*a[i]*x+1-a[i]))^2 )
  y
}
x.old <- 0
repeat {
  x.new <- x.old - f(x.old)/f.prime(x.old)
  cat(round(x.new, 6), "  ", round(f(x.new), 6),"\n")
  if(abs(x.old-x.new)<0.0001) break  
  x.old <- x.new
}
## -0.116733    -0.025417 
## -0.117231    1e-06 
## -0.117231    0

Notice that this converges much faster; it only needed three “rounds”. This is typically true; however, Newton’s method can also fail badly if the starting point is not good enough.
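
As an aside, for a one-parameter problem like this we could also lean on R’s built-in routines: uniroot finds the zero of \(\frac{dl}{da}\) by bracketing, much like the bisection above, and optimize maximizes \(l(a)\) directly. A sketch, reusing f and the data x from above; both should agree with the value of about -0.117 found earlier:

uniroot(f, interval=c(-0.5, 0.5))$root         # root of the score function dl/da
loglike <- function(a) sum(log(2*a*x+1-a))     # the log-likelihood itself
optimize(loglike, interval=c(-1, 1), maximum=TRUE)$maximum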

Example: Old Faithful geyser


Consider the waiting times of the famous Old Faithful data:

bw <- diff(range(faithful$Waiting.Time))/50 
ggplot(faithful, aes(Waiting.Time)) +
          geom_histogram(color = "black", 
                 fill = "white", 
                 binwidth = bw) + 
  labs(x = "Waiting Times", y = "Counts")

What would be a useful model for this data? We can try a normal mixture:

\[ \alpha N(\mu_1, \sigma_1) +(1-\alpha) N(\mu_2, \sigma_2) \]
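
To see what this model means, here is a small sketch of how one could simulate from such a mixture (the helper name rmix and the parameter values are just for illustration): each observation comes from the first normal with probability \(\alpha\) and from the second otherwise.

rmix <- function(n, alpha, mu1, sigma1, mu2, sigma2) {
  z <- rbinom(n, 1, alpha)     # which component each observation comes from
  ifelse(z==1, rnorm(n, mu1, sigma1), rnorm(n, mu2, sigma2))
}
sim <- rmix(272, 0.35, 54, 5.4, 80, 5.9)  # same size as the faithful data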

It seems that the two parts split at around 65, so we find

x <- faithful$Waiting.Time
round(c(mean(x[x<65]), mean(x[x>65])), 1)
## [1] 54.1 80.0
round(c(sd(x[x<65]), sd(x[x>65])), 1)
## [1] 5.4 5.9

How about \(\alpha\)? Let’s find the mle:

\[ \begin{aligned} &\phi(x, \mu, \sigma) = \frac1{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac1{2\sigma^2}(x-\mu)^2\right\}\\ &l(\alpha) = \sum \log \left[ \alpha\phi(x_i, \mu_1, \sigma_1) + (1-\alpha) \phi(x_i, \mu_2, \sigma_2) \right]\\ &\frac{dl}{d\alpha} = \sum \frac{\phi(x_i, \mu_1, \sigma_1) - \phi(x_i, \mu_2, \sigma_2)}{\alpha\phi(x_i, \mu_1, \sigma_1) + (1-\alpha) \phi(x_i, \mu_2, \sigma_2)} \\ &\frac{d^2l}{d\alpha^2} = (-1) \sum \left( \frac{\phi(x_i, \mu_1, \sigma_1) - \phi(x_i, \mu_2, \sigma_2)}{\alpha\phi(x_i, \mu_1, \sigma_1) + (1-\alpha) \phi(x_i, \mu_2, \sigma_2) } \right)^2 \end{aligned} \]

f <- function(alpha, mu1=54.1, sigma1=5.4,
              mu2=80, sigma2=5.9,
              x=faithful$Waiting.Time) {
  u <- dnorm(x, mu1, sigma1)
  v <- dnorm(x, mu2, sigma2)
  y1 <- alpha
  y2 <- alpha
  for(i in seq_along(alpha)) {
    tmp <- (u-v)/(alpha[i]*u+(1-alpha[i])*v)
    y1[i] <- sum(tmp)
    y2[i] <- (-1)*sum(tmp^2)
  }
  list(y1, y2)
}
alpha.old <- 0.5
repeat {
  tmp <- f(alpha.old)
  alpha.new <- alpha.old - tmp[[1]]/tmp[[2]]
  print(alpha.new)
  if(abs(alpha.old-alpha.new)<0.0001) break  
  alpha.old <- alpha.new
}
## [1] 0.3550228
## [1] 0.3545702
## [1] 0.3545704
alpha <- alpha.old
alpha
## [1] 0.3545702
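
As a sanity check (a sketch), we can also maximize \(l(\alpha)\) directly over (0, 1) with optimize, keeping the other parameters fixed at the values found above; this should reproduce \(\alpha \approx 0.355\):

loglike.alpha <- function(alpha, mu1=54.1, sigma1=5.4,
                          mu2=80, sigma2=5.9,
                          x=faithful$Waiting.Time)
  sum(log(alpha*dnorm(x, mu1, sigma1) +
          (1-alpha)*dnorm(x, mu2, sigma2)))
optimize(loglike.alpha, interval=c(0, 1), maximum=TRUE)$maximum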

Let’s see what this looks like:

x <- seq(min(faithful$Waiting.Time),
         max(faithful$Waiting.Time),
         length=250)
y <- alpha*dnorm(x, 54.1, 5.4) + 
     (1-alpha)*dnorm(x, 80, 5.9)
df <- data.frame(x=x, y=y)
bw <- diff(range(faithful$Waiting.Time))/50 
ggplot(faithful, aes(Waiting.Time)) +
  geom_histogram(aes(y = ..density..),
                 color = "black", 
                 fill = "white", 
                 binwidth = bw) + 
  labs(x = "Waiting Times", y = "Counts") +
  geom_line(aes(x, y),  data=df, size=1.5, col="blue")

How about the other parameters? Can we fit for them as well? What we need is a multivariate version of Newton’s method:

Say we want to find the maximum of a function \(f(x_1, .., x_n)\), that is, solve the system of equations

\[ \frac{df}{dx_i}(x_1, .., x_n)=0 \text{, } i=1,..,n \] Define the gradient \(\Delta\) and the Hessian matrix H by

\[ \Delta_i(x) = \frac{df}{dx_i}(x_1,..,x_n) \\ H_{i,j}(x) = \frac{d^2f}{dx_i dx_j}(x_1,..,x_n) \] then the algorithm is

\[ x_{new} = x_{old}-H^{-1}(x_{old})\Delta(x_{old}) \]

Let’s fix \(\alpha=0.355\), \(\sigma_1=5.4\) and \(\sigma_2=5.9\) and fit for \(\mu_1\) and \(\mu_2\).

\[ \begin{aligned} &f(\mu)=\phi(x, \mu,\sigma) = \frac1{\sqrt{2\pi\sigma^2}} \exp \left({-\frac1{2\sigma^2}} (x-\mu)^2\right)\\ &f'(\mu)=\frac{d \phi}{d\mu} =\frac1{\sqrt{2\pi\sigma^2}} \exp \left({-\frac1{2\sigma^2}} (x-\mu)^2 \right) \frac{x-\mu}{\sigma^2} = \\ & (x-\mu)f/\sigma^2\\ &f''(\mu)=-f/\sigma^2+(x-\mu)^2f/\sigma^4=\\ &\left[ (x-\mu)^2-\sigma^2\right]f/\sigma^4 \end{aligned} \]
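
It is easy to make a sign or factor mistake in these derivatives, so a quick numerical check is worthwhile. The sketch below compares the formula for \(\frac{d\phi}{d\mu}\) with a central finite difference at one arbitrarily chosen point (all numbers here are just for illustration):

x0 <- 70; mu0 <- 65; s0 <- 5; h <- 1e-5
(x0-mu0)*dnorm(x0, mu0, s0)/s0^2                   # the formula
(dnorm(x0, mu0+h, s0)-dnorm(x0, mu0-h, s0))/(2*h)  # finite difference, should match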

Let’s use the following definitions (short-hands):

\[ \begin{aligned} &f_{i, j} = \phi(x_i, \mu_j, \sigma_j) \text{, } j=1,2 \\ &\psi_i = \alpha f_{i,1}+(1-\alpha)f_{i,2} \end{aligned} \] so we have

\[ \begin{aligned} &l(\mu_1, \mu_2) = \sum \log \psi_i \\ &\Delta_1 = \frac{dl}{d\mu_1} = \alpha\sum \frac{ f_{i,1}'}{\psi_i}\\ &\Delta_2 = \frac{dl}{d\mu_2} = (1-\alpha)\sum \frac{f_{i,2}'}{\psi_i}\\ &H_{1,1} = \alpha \sum \frac{f_{i,1}''}{\psi_i}-\alpha^2 \sum \left( \frac{f_{i,1}'}{\psi_i} \right)^2\\ &H_{1,2} = -(1-\alpha)\alpha \sum \frac{ f_{i,1}'f_{i,2}'}{\psi_i^2}\\ &H_{2,1}=H_{1,2}\\ &H_{2,2} = (1-\alpha) \sum \frac{f_{i,2}''}{\psi_i}-(1-\alpha)^2 \sum \left( \frac{f_{i,2}'}{\psi_i} \right)^2\\ \end{aligned} \]

Let’s implement this:

alpha <- 0.355
sigma <- c(5.4, 5.9)
mu.old <- c(50, 80)
grad <- rep(0, 2)
H <- diag(2)
k <- 0
x <- faithful$Waiting.Time
repeat {
  k <- k+1
  f1 <- dnorm(x, mu.old[1], sigma[1])
  f1.prime <- (x-mu.old[1])*f1/sigma[1]^2
  f1.doubleprime <-
    ((x-mu.old[1])^2-sigma[1]^2)*f1/sigma[1]^4
  f2 <- dnorm(x, mu.old[2], sigma[2])
  f2.prime <- (x-mu.old[2])*f2/sigma[2]^2
  f2.doubleprime <-
    ((x-mu.old[2])^2-sigma[2]^2)*f2/sigma[2]^4
  psi <- alpha*f1+(1-alpha)*f2
  grad[1] <- alpha*sum(f1.prime/psi)
  grad[2] <- (1-alpha)*sum(f2.prime/psi)
  H[1, 1] <- alpha*sum(f1.doubleprime/psi) -
    alpha^2*sum((f1.prime/psi)^2)
  H[1, 2] <- -(1-alpha)*alpha*sum(f1.prime*f2.prime/psi^2)
  H[2, 1] <- H[1, 2]
  H[2, 2] <- (1-alpha)*sum(f2.doubleprime/psi) -
    (1-alpha)^2*sum((f2.prime/psi)^2)
  mu.new <- c(mu.old-solve(H)%*%cbind(grad))
  print(c(mu.new, sum(log(psi))))
  if(sum(abs(mu.old-mu.new))<0.001) break
  if(k>10) break
  mu.old <- mu.new
}
## [1]    50.04450    79.99726 -1061.44499
## [1]    50.08849    79.99457 -1060.91574
## [1]    50.13196    79.99192 -1060.39760
## [1]    50.17492    79.98931 -1059.89032
## [1]    50.21739    79.98675 -1059.39364
## [1]    50.25937    79.98423 -1058.90732
## [1]    50.30086    79.98175 -1058.43113
## [1]    50.34187    79.97931 -1057.96484
## [1]    50.38242    79.97691 -1057.50821
## [1]    50.42250    79.97456 -1057.06105
## [1]    50.46213    79.97224 -1056.62312

Or we can make use of R:

x <- faithful$Waiting.Time
fun <- function(mu, alpha=0.355, sigma=c(5.4, 5.9)) {
  f1 <- dnorm(x, mu[1], sigma[1])
  f2 <- dnorm(x, mu[2], sigma[2])
  psi <- alpha*f1+(1-alpha)*f2
  -sum(log(psi))
}
optim(c(50,80), fun)
## $par
## [1] 54.44039 79.99045
## 
## $value
## [1] 1034.465
## 
## $counts
## function gradient 
##       49       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

In fact why not fit for all?

x <- faithful$Waiting.Time
fun <- function(par) {
  f1 <- dnorm(x, par[2], par[3])
  f2 <- dnorm(x, par[4], par[5])
  psi <- par[1]*f1+(1-par[1])*f2
  -sum(log(psi))
}
res <- optim(c(0.5, 50, 5.4, 80, 5.9), fun)
unlist(res)
##            par1            par2            par3            par4            par5 
##       0.3610815      54.6234464       5.8871485      80.1082477       5.8562652 
##           value counts.function counts.gradient     convergence 
##    1034.0027004     211.0000000              NA       0.0000000
y <- res$par[1]*dnorm(x, res$par[2], res$par[3]) + 
     (1-res$par[1])*dnorm(x, res$par[4], res$par[5])
df <- data.frame(x=x, y=y)
bw <- diff(range(faithful$Waiting.Time))/50 
ggplot(faithful, aes(Waiting.Time)) +
  geom_histogram(aes(y = ..density..),
                 color = "black", 
                 fill = "white", 
                 binwidth = bw) + 
  labs(x = "Waiting Times", y = "Counts") +
  geom_line(aes(x, y),  data=df, size=1.5, col="blue")
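
Finally, since fun is the negative log-likelihood, asking optim for the Hessian at the minimum gives approximate standard errors for all five estimates, in the same spirit as the values fitdistr reported in parentheses for the Beta example (a sketch):

res1 <- optim(c(0.5, 50, 5.4, 80, 5.9), fun, hessian=TRUE)
round(sqrt(diag(solve(res1$hessian))), 4)  # observed-information standard errors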