Sometimes the model we wish to fit is known up to parameters. This is generally the case when a scientific theory predicts the shape of the relationship. For example, radioactive decay is known to be exponential: \(y=\alpha e^{-\beta t}\).
Here is data from an experiment on raising Florida lobster in a controlled environment. It shows the overall length and the age of a certain species of lobster.
kable.nice(lobster[1:10, ])
| Time | Length |
|---|---|
| 14 | 59 |
| 22 | 92 |
| 28 | 131 |
| 35 | 175 |
| 40 | 215 |
| 50 | 275 |
| 56 | 289 |
| 63 | 269 |
| 71 | 395 |
| 77 | 434 |
ggplot(data=lobster, aes(Time, Length)) +
  geom_point()
Now biology suggests that the relationship should be of the form
\[ y=\frac{\beta_2}{1+\frac{\beta_2 - \beta_0}{\beta_0} e^{\beta_1 t}} + \epsilon \] where \(\beta_0\) is the expected length at time \(t=0\), \(\beta_1\) (negative here) governs the growth rate, and \(\beta_2\) is the limiting length as \(t \rightarrow \infty\).
This is often called the logistic or autocatalytic model.
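These interpretations can be read off the formula directly (ignoring the error term): at \(t=0\) the exponential equals 1, and since \(\beta_1<0\) it vanishes as \(t\) grows, so
\[ y(0)=\frac{\beta_2}{1+\frac{\beta_2-\beta_0}{\beta_0}}=\frac{\beta_0 \beta_2}{\beta_0+\beta_2-\beta_0}=\beta_0, \qquad \lim_{t\rightarrow\infty} y(t)=\frac{\beta_2}{1+0}=\beta_2 \]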
How do we fit such a model, that is, how do we find “optimal” values of \(\beta_0\), \(\beta_1\) and \(\beta_2\)? Sometimes it is possible to use transformations to “linearize” the model; for the radioactive decay model, for example, we have \(\log (y)= \log(\alpha)-\beta t\), so lm can be used on the log scale. This is not possible for the logistic model, though, so we have to find a different solution.
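As an aside, such a linearization for the decay model could look something like this; the data is made up and the object name fit.lin and the true values 100 and 0.5 are purely illustrative:

t <- seq(0, 10, length=25)
y <- 100*exp(-0.5*t)*exp(rnorm(25, 0, 0.05))  # simulated decay data with multiplicative noise
fit.lin <- lm(log(y) ~ t)      # log(y) = log(alpha) - beta*t is linear in t
exp(coef(fit.lin)[1])          # estimate of alpha, should be close to 100
-coef(fit.lin)[2]              # estimate of beta, should be close to 0.5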
Previously we have always used the method of least squares to estimate the parameters in our models, that is, we minimized the “figure of merit”
\[ \text{RSS} = \sum (y_i - \beta_0 - \beta_1 x_i)^2 \] The natural extension of this is to use
\[ \text{RSS} = \sum (y_i - f(x_i; \boldsymbol{\beta} ))^2 \]
For a linear model, minimizing this expression can be done with lm. For a nonlinear model it is still a minimization problem, and we can solve it with nls:
fit <- nls(Length ~ beta[3]/(1 + ((beta[3] - beta[1])/beta[1]) * exp(beta[2] * Time)),
           start = list(beta = c(10, -0.1, 500)),
           data = lobster)
summary(fit)
##
## Formula: Length ~ beta[3]/(1 + ((beta[3] - beta[1])/beta[1]) * exp(beta[2] *
## Time))
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## beta1 32.008757 6.755720 4.738 0.000164
## beta2 -0.057557 0.004957 -11.612 8.55e-10
## beta3 465.884778 8.340739 55.857 < 2e-16
##
## Residual standard error: 21.63 on 18 degrees of freedom
##
## Number of iterations to convergence: 7
## Achieved convergence tolerance: 7.722e-06
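One way to tie this output back to the RSS criterion is to recompute the residual standard error by hand from the fitted object:

RSS <- sum(resid(fit)^2)       # residual sum of squares of the nls fit
sqrt(RSS/df.residual(fit))     # should reproduce the 21.63 reported above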
x <- seq(10, 160, 1)
df <- data.frame(x=x,
                 y=predict(fit, newdata = data.frame(Time = x)))
ggplot(data=lobster, aes(Time, Length)) +
  geom_point() +
  geom_line(data=df, aes(x, y), color="blue")
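Since this really is just a minimization problem, one could also minimize the RSS directly with a general-purpose optimizer such as optim. Here is a minimal sketch (the function name rss.logistic is ours); the estimates should agree with nls up to optimizer tolerance:

rss.logistic <- function(beta, x, y) {
  # RSS of the logistic model for a given parameter vector
  yhat <- beta[3]/(1 + ((beta[3] - beta[1])/beta[1])*exp(beta[2]*x))
  sum((y - yhat)^2)
}
optim(c(10, -0.1, 500), rss.logistic,
      x = lobster$Time, y = lobster$Length)$par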
That there are infinitely many prime numbers was first proven by the Greek mathematician Euclid around 300 BC. A serious study of how fast they grow was first begun by Adrien-Marie Legendre. He studied the function N(k), which gives the number of primes less than or equal to k. We can do the same. The primes up to 1,000,000 are available at
primes <- scan("C://Users//Wolfgang//dropbox//teaching//Computing-with-R//primes.txt")
primes <- as.integer(primes)
kable.nice(matrix(primes[1:100], ncol=10, byrow = TRUE))
2 | 3 | 5 | 7 | 11 | 13 | 17 | 19 | 23 | 29 |
31 | 37 | 41 | 43 | 47 | 53 | 59 | 61 | 67 | 71 |
73 | 79 | 83 | 89 | 97 | 101 | 103 | 107 | 109 | 113 |
127 | 131 | 137 | 139 | 149 | 151 | 157 | 163 | 167 | 173 |
179 | 181 | 191 | 193 | 197 | 199 | 211 | 223 | 227 | 229 |
233 | 239 | 241 | 251 | 257 | 263 | 269 | 271 | 277 | 281 |
283 | 293 | 307 | 311 | 313 | 317 | 331 | 337 | 347 | 349 |
353 | 359 | 367 | 373 | 379 | 383 | 389 | 397 | 401 | 409 |
419 | 421 | 431 | 433 | 439 | 443 | 449 | 457 | 461 | 463 |
467 | 479 | 487 | 491 | 499 | 503 | 509 | 521 | 523 | 541 |
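The file path above is of course specific to one machine; if the file is not at hand, a simple sieve of Eratosthenes generates the same list of primes up to one million (a sketch; the function name sieve is ours):

sieve <- function(n) {
  # sieve of Eratosthenes: returns all primes up to n
  is.prime <- rep(TRUE, n)
  is.prime[1] <- FALSE
  for(i in 2:floor(sqrt(n)))
    if(is.prime[i])
      is.prime[seq(i*i, n, by=i)] <- FALSE
  which(is.prime)
}
primes <- sieve(1e6)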
A detailed study of these primes led Legendre in 1798 to propose the function
\[ N(k)=k/(\log k - \alpha) \] Here is what that looks like for several values of \(\alpha\):
N <- function(k, alpha) {
  k/(log(k)-alpha)
}
k <- seq(1000, 1e6, length=250)
exact.counts <- k
for(i in 1:250)
  exact.counts[i] <- sum(primes<k[i])
df <- data.frame(N=c(k, k, k, k),
                 Counts=c(exact.counts, N(k, 0), N(k, 1), N(k, 2)),
                 Method=rep(c("Counts", "a=0", "a=1", "a=2"), each=250))
ggplot(df, aes(N, Counts, color=Method)) +
  geom_line()
and so it seems a value of \(\alpha=1\) is good.
Legendre, however, was not satisfied with that; he wanted to find the optimal answer. So he found the least squares solution!
fit <- nls(exact.counts ~ k/(log(k) - alpha),
           start = list(alpha = 0))
coef(fit)
## alpha
## 1.082001
and so he claimed that
\[ N(k)=k/(\log k - 1.08) \]
Around the same time the German mathematician Carl Friedrich Gauss also looked at this problem, and he made a different conjecture. He said
\[ N(k)=k/\log k \] That was a rather strange guess, because it looks like this:
N <- function(k, alpha) {
  k/(log(k)-alpha)
}
k <- seq(1000, 1e6, length=250)
exact.counts <- k
for(i in 1:250)
  exact.counts[i] <- sum(primes<k[i])
df <- data.frame(N=c(k, k),
                 Counts=c(exact.counts, N(k, 0)),
                 Method=rep(c("Counts", "Gauss"), each=250))
ggplot(df, aes(N, Counts, color=Method)) +
  geom_line()
and it surely looks like the two curves are growing further apart. However, almost 100 years later, in 1896, the French mathematician Jacques Hadamard and the Belgian mathematician Charles de la Vallée Poussin independently proved that Gauss was right, in the sense that the ratio \(N(k)/(k/\log k)\) goes to 1 as \(k \rightarrow \infty\)!
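On the range of the data, though, Legendre's formula is clearly the closer one; a quick check at \(k=10^6\), using the primes and the fitted alpha from above, makes the point:

c(exact = length(primes),                         # number of primes up to one million
  legendre = 1e6/(log(1e6) - unname(coef(fit))),  # Legendre's fitted formula
  gauss = 1e6/log(1e6))                           # Gauss' k/log(k), noticeably too small here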
From our modern point of view we might say Legendre was guilty of over-fitting!