In this section we will study the problem of parameter estimation. In its most general form it is as follows: we have a sample \(X_1,..,X_n\) from some probability density \(f(x;\theta)\). Here both \(x\) and \(\theta\) might be vectors. We will use the term density for both the discrete and the continuous case.
The problem is to find an estimate of \(\theta\) based on the data \(X_1,..,X_n\), that is, a function (called a statistic) \(T(X_1,..,X_n)\) such that in some sense \(T(X_1,..,X_n) \approx \theta\).
In a survey of 527 people, 235 said they prefer Coke over Pepsi. What is the percentage of people who prefer Coke over Pepsi?
The answer seems obvious: 235/527. However, let’s work this out in detail. First, each person is a Bernoulli trial (yes/no) with some success probability \(\pi\). So we have the density
\[ P(X_i=1)=1-P(X_i=0)=\pi \]
which we can write as
\[ f(x)=\pi^x(1-\pi)^{1-x} \text{, }x=0,1 \]
The joint density is given by
\[ \begin{aligned} &f(x_1, .., x_n)= \\ &\prod_{i=1}^n \pi^{x_i}(1-\pi)^{1-x_i} = \\ & \pi^{\sum x_i}(1-\pi)^{\sum(1-x_i)} = \\ & \pi^y(1-\pi)^{n-y} \\ \end{aligned} \] where \(y=\sum x_i\) is the number of successes.
In our case this becomes \(\pi^{235}(1-\pi)^{292}\), and the task is to estimate \(\pi\).
Say we have \(X_1,..,X_n \sim f(x;\theta)\), independent. Then the joint density of \(\mathbf{X}= (X_1,..,X_n)\) is given by
\[ f(\mathbf{x};\theta)=\prod_{i=1}^n f(x_i;\theta) \]
The likelihood function L is defined by
\[ L(\theta;\mathbf{x})=\prod_{i=1}^n f(x_i;\theta) \]
This does not seem like much: the right-hand side is the same. However, it is a very different expression: in \(f(\mathbf{x};\theta)\), \(\mathbf{x}\) is the variable and \(\theta\) is an (unknown) constant, whereas in \(L(\theta;\mathbf{x})\), \(\theta\) is the variable and \(\mathbf{x}\) is a (known) constant.
It is the essential difference between before the experiment, when one asks questions of probability, and after the experiment, when one asks questions of statistics.
Closely related is the log-likelihood function
\[ l(\theta;\mathbf{x})=\log L(\theta;\mathbf{x})=\sum_{i=1}^n \log f(x_i;\theta) \] The log-likelihood is often easier to work with, not least because it turns a product into a sum.
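As a quick numerical illustration (a sketch with simulated data, using an exponential model just for concreteness): the likelihood is a product of many densities and so is typically a very small number, while its log is much easier to handle.
set.seed(1111)
xs <- rexp(50, 2)                  # simulated sample of size 50
theta <- 1.5                       # some value of the parameter
prod(dexp(xs, theta))              # L(theta; x), a product of 50 densities
sum(dexp(xs, theta, log=TRUE))     # l(theta; x), the same information on the log scale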
There is a principle in statistics, called the likelihood principle, which says that any inference should be based on the likelihood function.
We already found the joint density
\[ \pi^y(1-\pi)^{n-y} \]
and so the log-likelihood is
\[ \begin{aligned} &l(\pi;y,n) = y\log \pi +(n-y) \log (1-\pi)\\ \end{aligned} \]
Say \(X_i \sim N(\mu, \sigma)\) with \(\sigma\) known, so
\[ \begin{aligned} &l(\mu;\mathbf{x}, \sigma) = \\ &\sum_{i=1}^n \log \left[ \frac1{\sqrt{2\pi \sigma^2}} \exp \left\{- \frac1{2\sigma^2} (x_i-\mu)^2 \right\} \right]= \\ &-\frac{n}2 \log (2\pi \sigma^2) - \frac1{2\sigma^2} \sum_{i=1}^n (x_i-\mu)^2 \\ \end{aligned} \] Now let \(\bar{x}=\frac1n \sum x_i\), then
\[ \begin{aligned} &\sum_{i=1}^n (x_i-\mu)^2 = \\ &\sum_{i=1}^n (x_i-\bar{x}+\bar{x}-\mu)^2 = \\ & \sum_{i=1}^n (x_i-\bar{x})^2 + 2\sum_{i=1}^n (x_i-\bar{x})(\bar{x}-\mu)+\sum_{i=1}^n (\bar{x}-\mu)^2 = \\ & \sum_{i=1}^n (x_i-\bar{x})^2 + 2(\bar{x}-\mu)\sum_{i=1}^n (x_i-\bar{x})+n(\bar{x}-\mu)^2 = \\ & \sum_{i=1}^n (x_i-\bar{x})^2 + 2(\bar{x}-\mu)(\sum_{i=1}^n x_i-n\bar{x})+n(\bar{x}-\mu)^2 = \\ & \sum_{i=1}^n (x_i-\bar{x})^2+n(\bar{x}-\mu)^2 \end{aligned} \] and so
\[ l(\mu;\mathbf{x}, \sigma) = \text{const}-\frac{1}{2\sigma^2/n} (\mu-\bar{x})^2 \] so as a function of \(\mu\) the log-likelihood is a quadratic function with vertex at \(\bar{x}\).
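Here is a quick picture of this (a sketch using simulated data, with \(\sigma\) treated as known):
set.seed(1111)
xsim <- rnorm(50, 10, 2)      # simulated sample, sigma=2 treated as known
mu.grid <- seq(8, 12, length=250)
l.mu <- sapply(mu.grid, function(m) sum(dnorm(xsim, m, 2, log=TRUE)))
# the curve is a parabola with its vertex at the sample mean
ggplot(data.frame(mu=mu.grid, l=l.mu), aes(mu, l)) +
  geom_line(color="blue", size=1.5) +
  geom_vline(xintercept = mean(xsim))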
The idea of maximum likelihood estimation is to find the value of \(\theta\) that maximizes the likelihood function. In an obvious way, this is the value of the parameter that best “agrees” with the data.
Of course a (positive) function \(L\) has a maximum at \(x\) if and only if \(\log L\) has a maximum at \(x\), so we can just as well (and more easily!) maximize the log-likelihood function:
\[ \frac{dl(\mu;\mathbf{x}, \sigma)}{d\mu} = -\frac{1}{\sigma^2/n} (\mu-\bar{x})=0 \] so \(\hat{\mu} = \bar{x}\)
This is of course a maximum, and not a minimum or an inflection point because \(-\frac{1}{\sigma^2/n}<0\).
\[ \begin{aligned} &\frac{dl}{d\pi}=\frac{y}{\pi}-\frac{n-y}{1-\pi} =0 \\ &y(1-\pi) = (n-y)\pi \\ &\hat{\pi} =\frac{y}{n} \end{aligned} \] and for our numbers we find \(\hat{\pi}=235/527=0.4459\).
The above calculations require some calculus. Sometimes we can let R take care of this for us:
ll <- function(pi, y, n)
log(dbinom(y, n, pi))
pi <- seq(0.4, 0.491, length=1000)
df <- data.frame(pi=pi, ll=ll(pi, 235, 527))
mle <- df$pi[which.max(df$ll)]
mle
## [1] 0.4459099
ggplot(df, aes(x=pi, y=ll)) +
geom_line(size=1.5, col="blue") +
geom_vline(xintercept = mle, size=2)
Notice that the log-likelihood curve looks a lot like a parabola. This is not an accident, and it will come in handy soon!
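To see just how close, here is a quick sketch: overlay the parabola that has the same value and curvature at the maximum, approximating the second derivative by a central difference (this reuses ll, df and mle from above).
h <- 0.001
# curvature of the log-likelihood at the mle
l2 <- (ll(mle+h, 235, 527) - 2*ll(mle, 235, 527) + ll(mle-h, 235, 527))/h^2
df$quad <- ll(mle, 235, 527) + l2/2*(df$pi - mle)^2
ggplot(df, aes(pi, ll)) +
  geom_line(size=1.5, col="blue") +
  geom_line(aes(y=quad), linetype=2)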
A random variable is said to have a Beta density if
\[ f(x;\alpha,\beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}\text{, }0<x<1 \] where \(\Gamma\) is the famous gamma function
\[ \Gamma(x)=\int_0^{\infty} t^{x-1}e^{-t}dt \]
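R has this function built in as gamma; here is a quick numerical check of the definition (a small sketch):
gamma(2.5)
integrate(function(t) t^(2.5-1)*exp(-t), 0, Inf)$value  # same value, up to numerical error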
Let’s simulate a sample from a Beta distribution
x <- sort(round(rbeta(500, 2, 4), 3))
beta.ex <- data.frame(x=x)
bw <- diff(range(x))/50
ggplot(beta.ex, aes(x)) +
geom_histogram(aes(y = ..density..),
color = "black",
fill = "white",
binwidth = bw) +
labs(x = "x", y = "Density")
and we want to estimate \(\alpha, \beta\).
Now doing this with calculus is out because the equations \(\frac{dl}{d\alpha}=0\) and \(\frac{dl}{d\beta}=0\) cannot be solved analytically. Instead we will need to use a numerical method. Because the Beta distribution is a standard one, there is an R routine to do it for us. It is part of the package MASS:
library(MASS)
fitdistr(x,
densfun="Beta",
start=list(shape1=1,shape2=1))
## shape1 shape2
## 2.0716029 4.2045833
## (0.1227766) (0.2646915)
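As a cross-check, we could also minimize the negative Beta log-likelihood directly with optim (a sketch; beta.negll is a name chosen here, and the estimates should essentially agree with those from fitdistr):
beta.negll <- function(par)
  -sum(dbeta(x, par[1], par[2], log=TRUE))
optim(c(1, 1), beta.negll,
      method="L-BFGS-B", lower=c(0.01, 0.01))$par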
Here are observations from a linear density \(f(x|a)=2ax+1-a\), \(0<x<1\) and \(-1<a<1\):
0.005 0.011 0.025 0.03 0.031 0.031 0.05 0.059 0.061 0.064 0.067 0.068 0.072 0.075 0.082 0.083 0.084 0.101 0.102 0.104 0.106 0.112 0.114 0.117 0.125 0.136 0.137 0.143 0.145 0.146 0.17 0.174 0.175 0.186 0.197 0.198 0.209 0.209 0.209 0.229 0.234 0.234 0.242 0.256 0.269 0.275 0.279 0.281 0.283 0.293 0.306 0.311 0.311 0.313 0.328 0.338 0.347 0.358 0.362 0.371 0.381 0.384 0.392 0.406 0.409 0.429 0.431 0.44 0.447 0.447 0.45 0.455 0.456 0.458 0.485 0.492 0.494 0.498 0.503 0.506 0.507 0.535 0.55 0.559 0.561 0.577 0.584 0.586 0.587 0.594 0.597 0.598 0.599 0.604 0.608 0.616 0.623 0.625 0.638 0.641 0.644 0.665 0.667 0.676 0.704 0.722 0.73 0.731 0.731 0.731 0.733 0.735 0.738 0.742 0.742 0.743 0.746 0.75 0.751 0.755 0.766 0.768 0.792 0.795 0.796 0.804 0.812 0.821 0.834 0.837 0.837 0.861 0.865 0.873 0.878 0.88 0.886 0.897 0.916 0.923 0.928 0.94 0.944 0.948 0.959 0.961 0.962 0.969 0.972 0.974
We want to estimate a. So let’s see:
\[ \begin{aligned} &L(a) = \prod_{i=1}^n \left[2ax_i+1-a \right] \\ &l(a) =\sum_{i=1}^n \log \left[2ax_i+1-a \right] \\ &\frac{dl}{da}=\sum_{i=1}^n\frac{2x_i-1}{2ax_i+1-a}=0 \end{aligned} \] and this equation cannot be solved analytically. Unfortunately this is not one of the distributions included in fitdistr, so we need to find a numerical solution ourselves.
Here are several approaches.
Let’s draw the curve of \(\frac{dl}{da}\). In what follows x is the data above.
f <- function(a) {
y <- a
for(i in seq_along(a))
y[i] <- sum( (2*x-1)/(2*a[i]*x+1-a[i]) )
y
}
a <- seq(-0.5, 0.5, length=1000)
y <- f(a)
# find value of a where y is closest to 0
mle <- a[which.min(abs(y))]
df <- data.frame(a=a, y=y)
ggplot(df, aes(a, y)) +
geom_line(color="blue", size=1.5) +
geom_hline(yintercept=0) +
geom_vline(xintercept=mle)
round(mle, 4)
## [1] -0.1176
The idea of the bisection method is this: our function is positive at \(a=-0.5\) and negative at \(a=0.5\). It is also continuous and decreasing. So we can find its zero by repeatedly checking the midpoint and adjusting the upper or lower limit accordingly:
low <- (-0.5)
high <- 0.5
repeat {
mid <- (low+high)/2
y <- f(mid)
cat(round(mid, 4), " ", round(y, 4),"\n")
if(y>0) low <- mid
else high <- mid
if(high-low<0.0001) break
}
## 0 -5.962
## -0.25 6.9467
## -0.125 0.3969
## -0.0625 -2.7835
## -0.0938 -1.1963
## -0.1094 -0.4008
## -0.1172 -0.0022
## -0.1211 0.1973
## -0.1191 0.0975
## -0.1182 0.0476
## -0.1177 0.0227
## -0.1174 0.0102
## -0.1173 0.004
## -0.1172 9e-04
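R also has a built-in univariate root finder, uniroot, which we can use as a quick cross-check (a sketch; f is the function defined above):
uniroot(f, interval=c(-0.5, 0.5))$root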
Isaac Newton invented the following algorithm: we want to solve the equation \(f(x)=0\). Let \(x_0\) be some starting point. Then compute successive points via \[ x_{n+1} = x_n - f(x_n)/f'(x_n) \]
Notice that if this sequence converges to some x we have
\[ x = x - f(x)/f'(x) \] and so \(f(x)=0\).
In our case we have \(f(a) = \frac{dl}{da}\) and so we also need
\[ f'(a) = \frac{d^2l}{da^2}=-\sum_{i=1}^n\left( \frac{2x_i-1}{2ax_i+1-a} \right)^2 \] In R:
f.prime<- function(a) {
y <- a
for(i in seq_along(a))
y[i] <- (-1)*sum( ((2*x-1)/(2*a[i]*x+1-a[i]))^2 )
y
}
x.old <- 0
repeat {
x.new <- x.old - f(x.old)/f.prime(x.old)
cat(round(x.new, 6), " ", round(f(x.new), 6),"\n")
if(abs(x.old-x.new)<0.0001) break
x.old <- x.new
}
## -0.116733 -0.025417
## -0.117231 1e-06
## -0.117231 0
Notice that this converges much faster: it needed only three iterations. This is typical; however, Newton’s method can also fail badly if the starting point is not good enough.
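Here is a classic toy illustration of such a failure (a sketch, unrelated to our data): for \(g(x)=x^3-2x+2\), starting at \(x_0=0\) makes Newton’s method cycle between 0 and 1 forever.
g <- function(x) x^3 - 2*x + 2
g.prime <- function(x) 3*x^2 - 2
x.old <- 0
for(k in 1:6) {
  x.new <- x.old - g(x.old)/g.prime(x.old)
  cat(x.new, " ")  # prints 1 0 1 0 1 0 - the iteration never settles down
  x.old <- x.new
}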
Consider the waiting times of the famous Old Faithful data:
bw <- diff(range(faithful$Waiting.Time))/50
ggplot(faithful, aes(Waiting.Time)) +
geom_histogram(color = "black",
fill = "white",
binwidth = bw) +
labs(x = "Waiting Times", y = "Counts")
What would be a useful model for this data? We can try a normal mixture:
\[ \alpha N(\mu_1, \sigma_1) +(1-\alpha) N(\mu_2, \sigma_2) \]
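To get a feel for this model, here is a quick sketch that simulates a sample from such a mixture (the parameter values are made up, chosen only for illustration):
set.seed(1111)
z <- rbinom(1000, 1, 0.35)    # which component each observation comes from
sim <- rnorm(1000, mean = ifelse(z==1, 55, 80), sd = ifelse(z==1, 5, 6))
ggplot(data.frame(x=sim), aes(x)) +
  geom_histogram(bins=50, color="black", fill="white")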
It seems that the two parts split at around 65, so we find
x <- faithful$Waiting.Time
round(c(mean(x[x<65]), mean(x[x>65])), 1)
## [1] 54.1 80.0
round(c(sd(x[x<65]), sd(x[x>65])), 1)
## [1] 5.4 5.9
How about \(\alpha\)? Let’s find the mle:
\[ \begin{aligned} &\phi(x, \mu, \sigma) = \frac1{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac1{2\sigma^2}(x-\mu)^2\right\}\\ &l(\alpha) = \sum \log \left[ \alpha\phi(x_i, \mu_1, \sigma_1) + (1-\alpha) \phi(x_i, \mu_2, \sigma_2) \right]\\ &\frac{dl}{d\alpha} = \sum \frac{\phi(x_i, \mu_1, \sigma_1) - \phi(x_i, \mu_2, \sigma_2)}{\alpha\phi(x_i, \mu_1, \sigma_1) + (1-\alpha) \phi(x_i, \mu_2, \sigma_2)} \\ &\frac{d^2l}{d\alpha^2} = (-1) \sum \left( \frac{\phi(x_i, \mu_1, \sigma_1) - \phi(x_i, \mu_2, \sigma_2)}{\alpha\phi(x_i, \mu_1, \sigma_1) + (1-\alpha) \phi(x_i, \mu_2, \sigma_2) } \right)^2 \end{aligned} \]
f <- function(alpha, mu1=54.1, sigma1=5.4,
mu2=80, sigma2=5.9,
x=faithful$Waiting.Time) {
u <- dnorm(x, mu1, sigma1)
v <- dnorm(x, mu2, sigma2)
y1 <- alpha
y2 <- alpha
for(i in seq_along(alpha)) {
tmp <- (u-v)/(alpha[i]*u+(1-alpha[i])*v)
y1[i] <- sum(tmp)
y2[i] <- (-1)*sum(tmp^2)
}
list(y1, y2)
}
alpha.old <- 0.5
repeat {
tmp <- f(alpha.old)
alpha.new <- alpha.old - tmp[[1]]/tmp[[2]]
print(alpha.new)
if(abs(alpha.old-alpha.new)<0.0001) break
alpha.old <- alpha.new
}
## [1] 0.3550228
## [1] 0.3545702
## [1] 0.3545704
alpha <- alpha.old
alpha
## [1] 0.3545702
Let’s see what this looks like:
x <- seq(min(faithful$Waiting.Time),
max(faithful$Waiting.Time),
length=250)
y <- alpha*dnorm(x, 54.1, 5.4) +
(1-alpha)*dnorm(x, 80, 5.9)
df <- data.frame(x=x, y=y)
bw <- diff(range(faithful$Waiting.Time))/50
ggplot(faithful, aes(Waiting.Time)) +
geom_histogram(aes(y = ..density..),
color = "black",
fill = "white",
binwidth = bw) +
labs(x = "Waiting Times", y = "Counts") +
geom_line(aes(x, y), data=df, size=1.5, col="blue")
How about the other parameters? Can we fit for them as well? What we need is a multivariate version of Newton’s method:
Say we want to maximize a function \(f(x_1, .., x_n)\), that is, solve \(\nabla f=0\). Define the gradient \(\Delta\) and the Hessian matrix H by
\[ \Delta_i(x) = \frac{\partial f}{\partial x_i}(x_1,..,x_n) \\ H_{i,j}(x) = \frac{\partial^2 f}{\partial x_i \partial x_j}(x_1,..,x_n) \] then the algorithm is
\[ x_{new} = x_{old}-H^{-1}(x_{old})\Delta(x_{old}) \]
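As a tiny sanity check of this formula (a toy example, not our likelihood): for the quadratic \(f(x,y)=(x-1)^2+2(y-2)^2\) one Newton step lands exactly on the stationary point (1, 2), no matter where we start.
gr <- function(p) c(2*(p[1]-1), 4*(p[2]-2))   # gradient
Hq <- matrix(c(2, 0, 0, 4), 2, 2)             # Hessian, constant for a quadratic
p <- c(0, 0)
p - solve(Hq) %*% gr(p)                       # one step: (1, 2)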
Let’s fix \(\alpha=0.355\), \(\sigma_1=5.4\) and \(\sigma_2=5.9\) and fit for \(\mu_1\) and \(\mu_2\).
\[ \begin{aligned} &f(\mu)=\phi(x, \mu,\sigma) = \frac1{\sqrt{2\pi\sigma^2}} \exp \left({-\frac1{2\sigma^2}} (x-\mu)^2\right)\\ &f'(\mu)=\frac{d \phi}{d\mu} =\frac1{\sqrt{2\pi\sigma^2}} \exp \left({-\frac1{2\sigma^2}} (x-\mu)^2 \right) \frac{x-\mu}{\sigma^2} = \\ & (x-\mu)f/\sigma^2\\ &f''(\mu)=-f/\sigma^2+(x-\mu)^2f/\sigma^4=\\ &\left[ (x-\mu)^2-\sigma^2\right]f/\sigma^4 \end{aligned} \]
Let’s use the following definitions (short-hands):
\[ \begin{aligned} &f_{i, j} = \phi(x_i, \mu_j, \sigma_j) \text{, } j=1,2 \\ &\psi_i = \alpha f_{i,1}+(1-\alpha)f_{i,2} \\ & \end{aligned} \] so we have
\[ \begin{aligned} &l(\mu_1, \mu_2) = \sum \log \psi_i \\ &\Delta_1 = \frac{dl}{d\mu_1} = \alpha\sum \frac{ f_{i,1}'}{\psi_i}\\ &\Delta_2 = (1-\alpha)\sum \frac{f_{i,2}'}{\psi_i}\\ &H_{1,1} = \alpha \sum \frac{f_{i,1}''\psi_i-\alpha f_{i,1}'^2 }{\psi_i^2}\\ &H_{1,2} = -(1-\alpha)\alpha \sum \frac{ f_{i,1}'f_{i,2}'}{\psi_i^2}\\ &H_{2,1}=H_{1,2}\\ &H_{2,2} = (1-\alpha) \sum \frac{f_{i,2}''\psi_i-(1-\alpha) f_{i,2}'^2 }{\psi_i^2}\\ \end{aligned} \]
Let’s implement this:
alpha <- 0.355
sigma <- c(5.4, 5.9)
mu.old <- c(50, 80)
grad <- rep(0, 2)
H <- diag(2)
k <- 0
x <- faithful$Waiting.Time
repeat {
k <- k+1
f1 <- dnorm(x, mu.old[1], sigma[1])
f1.prime <- (x-mu.old[1])*f1/sigma[1]^2
f1.doubleprime <-
((x-mu.old[1])^2-sigma[1]^2)*f1/sigma[1]^4
f2 <- dnorm(x, mu.old[2], sigma[2])
f2.prime <- (x-mu.old[2])*f2/sigma[2]^2
f2.doubleprime <-
((x-mu.old[2])^2-sigma[2]^2)*f2/sigma[2]^4
psi <- alpha*f1+(1-alpha)*f2
grad[1] <- alpha*sum(f1.prime/psi)
grad[2] <- (1-alpha)*sum(f2.prime/psi)
H[1, 1] <- alpha*sum((f1.doubleprime*psi-alpha*f1.prime^2)/psi^2)
H[1, 2] <- (alpha-1)*alpha*sum((f1.prime*f2.prime)/psi^2)
H[2, 1] <- H[1, 2]
H[2, 2] <- (1-alpha)*sum((f2.doubleprime*psi-(1-alpha)*f2.prime^2)/psi^2)
mu.new <- c(mu.old-solve(H)%*%cbind(grad))
print(c(mu.new, sum(log(psi))))
if(sum(abs(mu.old-mu.new))<0.001) break
if(k>10) break
mu.old <- mu.new
}
## [1] 50.04450 79.99726 -1061.44499
## [1] 50.08849 79.99457 -1060.91574
## [1] 50.13196 79.99192 -1060.39760
## [1] 50.17492 79.98931 -1059.89032
## [1] 50.21739 79.98675 -1059.39364
## [1] 50.25937 79.98423 -1058.90732
## [1] 50.30086 79.98175 -1058.43113
## [1] 50.34187 79.97931 -1057.96484
## [1] 50.38242 79.97691 -1057.50821
## [1] 50.42250 79.97456 -1057.06105
## [1] 50.46213 79.97224 -1056.62312
Or we can make use of R’s general-purpose optimizer optim:
x <- faithful$Waiting.Time
fun <- function(mu, alpha=0.355, sigma=c(5.4, 5.9)) {
f1 <- dnorm(x, mu[1], sigma[1])
f2 <- dnorm(x, mu[2], sigma[2])
psi <- alpha*f1+(1-alpha)*f2
-sum(log(psi))  # optim minimizes, so return the negative log-likelihood
}
optim(c(50,80), fun)
## $par
## [1] 54.44039 79.99045
##
## $value
## [1] 1034.465
##
## $counts
## function gradient
## 49 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
In fact, why not fit for all five parameters?
x <- faithful$Waiting.Time
fun <- function(par) {
  # par = (alpha, mu1, sigma1, mu2, sigma2)
f1 <- dnorm(x, par[2], par[3])
f2 <- dnorm(x, par[4], par[5])
psi <- par[1]*f1+(1-par[1])*f2
-sum(log(psi))
}
res <- optim(c(0.5, 50, 5.4, 80, 5.9), fun)
unlist(res)
## par1 par2 par3 par4 par5
## 0.3610815 54.6234464 5.8871485 80.1082477 5.8562652
## value counts.function counts.gradient convergence
## 1034.0027004 211.0000000 NA 0.0000000
y <- res$par[1]*dnorm(x, res$par[2], res$par[3]) +
(1-res$par[1])*dnorm(x, res$par[4], res$par[5])
df <- data.frame(x=x, y=y)
bw <- diff(range(faithful$Waiting.Time))/50
ggplot(faithful, aes(Waiting.Time)) +
geom_histogram(aes(y = ..density..),
color = "black",
fill = "white",
binwidth = bw) +
labs(x = "Waiting Times", y = "Counts") +
geom_line(aes(x, y), data=df, size=1.5, col="blue")
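For comparison, if the mixtools package is installed, its normalmixEM routine fits the same two-component normal mixture via the EM algorithm (a sketch; starting values are chosen automatically, so the numbers may differ slightly from run to run):
library(mixtools)
fit <- normalmixEM(faithful$Waiting.Time, k=2)
round(c(fit$lambda[1], fit$mu, fit$sigma), 3)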