Nonlinear Parametric Models

Sometimes the model we wish to fit is known, up to parameters. Generally that is the case when there is a scientific theory that predicts the shape of the relationship. For example, radioactive decay is known to be exponential: \(y=\alpha e^{-\beta t}\).

Example: Growth of Lobsters

The data come from an experiment to raise Florida lobster in a controlled environment. They show the overall length and the age of a certain species of lobster.

kable.nice(lobster[1:10, ])
Time  Length
  14      59
  22      92
  28     131
  35     175
  40     215
  50     275
  56     289
  63     269
  71     395
  77     434
ggplot(data=lobster, aes(Time, Length)) +
  geom_point()

Now biology suggests that the relationship should be of the form

\[ y=\frac{\beta_2}{1+\frac{\beta_2 - \beta_0}{\beta_0} e^{\beta_1 t}} + \epsilon \] where

  • \(\beta_0\) is the expected value of y at time t=0
  • \(\beta_1\) is a measure of the growth rate
  • \(\beta_2\) is the expected limit of y as \(t\rightarrow \infty\)

This is often called the logistic or autocatalytic model.
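As a quick check of these interpretations (with made-up parameter values, just for illustration): at \(t=0\) the denominator is \(1+(\beta_2-\beta_0)/\beta_0=\beta_2/\beta_0\), so \(y=\beta_0\), and for a growth curve \(\beta_1<0\), so the exponential term vanishes as \(t\rightarrow \infty\) and \(y\rightarrow \beta_2\).

f <- function(t, beta)
  beta[3]/(1 + ((beta[3] - beta[1])/beta[1]) * exp(beta[2]*t))
# made-up values for (beta0, beta1, beta2)
beta <- c(30, -0.05, 450)
f(0, beta)     # equals beta0 = 30
f(1000, beta)  # essentially beta2 = 450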

How do we fit such a model, that is, find “optimal” values of \(\beta_0\), \(\beta_1\) and \(\beta_2\)? Sometimes it is possible to use transformations to “linearize” the model: for the radioactive decay model, for example, we have of course \(\log (y)= \log(\alpha)-\beta t\). This is not possible, though, for the logistic model, so we have to find a different solution.
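To illustrate the linearization idea with the decay model, here is a small sketch with simulated data (the values of \(\alpha\) and \(\beta\) and the noise level are made up for the illustration):

set.seed(111)
t <- seq(0, 10, length=50)
# simulate decay with alpha=10, beta=0.5 and a little multiplicative noise
y <- 10*exp(-0.5*t)*exp(rnorm(50, 0, 0.05))
# on the log scale this is a linear model, so lm can be used
fit.decay <- lm(log(y) ~ t)
coef(fit.decay)  # intercept estimates log(alpha), slope estimates -beta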

Previously we have always used the method of least squares to estimate the parameters in our models; that is, we minimized the “figure of merit”

\[ \text{RSS} = \sum (y_i - \beta_0 - \beta_1 x_i)^2 \] The natural extension of this is to use

\[ \text{RSS} = \sum (y_i - f(x_i; \boldsymbol{\beta} ))^2 \]

For a linear model this minimization can be done in closed form, which is what lm does. For a nonlinear model it is still a minimization problem, just one that has to be solved numerically, and we can do it with

# beta[1], beta[2] and beta[3] play the roles of beta0, beta1 and beta2 above
fit <- nls(Length ~ beta[3]/(1 + ((beta[3] - beta[1])/beta[1]) *
             exp(beta[2] * Time)),
           start = list(beta = c(10, -0.1, 500)),
           data = lobster)
summary(fit)
## 
## Formula: Length ~ beta[3]/(1 + ((beta[3] - beta[1])/beta[1]) * exp(beta[2] * 
##     Time))
## 
## Parameters:
##         Estimate Std. Error t value Pr(>|t|)
## beta1  32.008757   6.755720   4.738 0.000164
## beta2  -0.057557   0.004957 -11.612 8.55e-10
## beta3 465.884778   8.340739  55.857  < 2e-16
## 
## Residual standard error: 21.63 on 18 degrees of freedom
## 
## Number of iterations to convergence: 7 
## Achieved convergence tolerance: 7.722e-06
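In fact nls does nothing more than minimize the residual sum of squares numerically. As a sanity check, here is a sketch that does the same minimization with the general-purpose optimizer optim; it should give essentially the same estimates as above, up to numerical tolerance:

# residual sum of squares of the logistic model, as a function of beta
rss <- function(beta, t, y) {
  yhat <- beta[3]/(1 + ((beta[3] - beta[1])/beta[1]) * exp(beta[2]*t))
  sum((y - yhat)^2)
}
optim(c(10, -0.1, 500), rss, t = lobster$Time, y = lobster$Length)$par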
# evaluate the fitted curve on a grid of times and add it to the scatterplot
x <- seq(10, 160, 1)
df <- data.frame(x=x, 
                 y=predict(fit, newdata=data.frame(Time=x)))
ggplot(data=lobster, aes(Time, Length)) +
  geom_point() +
  geom_line(data=df, aes(x, y), color="blue")

Example: Prime Number Theorem

That there are infinitely many prime numbers was first proven by the Greek mathematician Euclid around 300 BC. A serious study of how fast they grow was first begun by Adrien-Marie Legendre. He studied the function N(k), which gives the number of primes less than or equal to k. We can do the same. The primes up to 1,000,000 are available at

primes <- scan("C://Users//Wolfgang//dropbox//teaching//Computing-with-R//primes.txt")
primes <- as.integer(primes)
kable.nice(matrix(primes[1:100], ncol=10, byrow = TRUE))
2 3 5 7 11 13 17 19 23 29
31 37 41 43 47 53 59 61 67 71
73 79 83 89 97 101 103 107 109 113
127 131 137 139 149 151 157 163 167 173
179 181 191 193 197 199 211 223 227 229
233 239 241 251 257 263 269 271 277 281
283 293 307 311 313 317 331 337 347 349
353 359 367 373 379 383 389 397 401 409
419 421 431 433 439 443 449 457 461 463
467 479 487 491 499 503 509 521 523 541
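If the text file is not at hand, the primes up to one million can also be generated directly; here is a small sketch using the sieve of Eratosthenes:

sieve <- function(n) {
  is.prime <- rep(TRUE, n)
  is.prime[1] <- FALSE
  for(i in 2:floor(sqrt(n)))
    if(is.prime[i])
      is.prime[seq(i*i, n, by=i)] <- FALSE
  which(is.prime)
}
primes <- sieve(1e6)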

A detailed study of these primes led Legendre in 1798 to propose the function

\[ N(k)=k/(\log k - \alpha) \] Here is what that looks like for several values of \(\alpha\):

N <- function(k, alpha) {
  k/(log(k)-alpha)
}
k <- seq(1000, 1e6, length=250)
exact.counts <- k
for(i in 1:250) 
  exact.counts[i] <- sum(primes <= k[i])
df <- data.frame(N=c(k, k, k, k), 
      Counts=c(exact.counts, N(k, 0), N(k, 1), N(k, 2)),
      Method=rep(c("Counts", "a=0", "a=1", "a=2"),
                 each=250))
ggplot(df, aes(N, Counts, color=Method)) +
  geom_line()

and so it seems a value of \(\alpha=1\) is good.

Legendre, however, was not satisfied with that; he wanted to find the optimal value. So he found the least squares solution!

fit <- nls(exact.counts ~ k/(log(k) - alpha), 
          start = list(alpha = 0))
coef(fit)
##    alpha 
## 1.082001

and so he claimed that

\[ N(k)=k/(\log k - 1.08) \]
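As an aside, this particular model can in fact be “linearized”: solving \(N(k)=k/(\log k - \alpha)\) for \(\alpha\) gives \(\alpha=\log k - k/N(k)\), so averaging these values gives a quick alternative estimate. This is just a sketch; it weights the observations differently than least squares on the original scale, so it need not agree exactly with the nls answer.

# alpha = log(k) - k/N(k), averaged over the grid of k values
mean(log(k) - k/exact.counts)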

Around the same time the German mathematician Carl Friedrich Gauss also looked at this problem, and he made a different conjecture. He said

\[ N(k)=k/\log k \] That was a rather strange guess, because it looks like this:

N <- function(k, alpha) {
  k/(log(k)-alpha)
}
k <- seq(1000, 1e6, length=250)
exact.counts <- k
for(i in 1:250) 
  exact.counts[i] <- sum(primes <= k[i])
df <- data.frame(N=c(k, k), 
      Counts=c(exact.counts, N(k, 0)),
      Method=rep(c("Counts", "Gauss"),   each=250))
ggplot(df, aes(N, Counts, color=Method)) +
  geom_line()

and it surely looks like the two curves are growing further apart. However, almost 100 years later, in 1896, the French mathematician Jacques Hadamard and the Belgian mathematician Charles de la Vallée Poussin independently showed that Gauss was right! The prime number theorem they proved is a statement about the ratio: the number of primes up to k, divided by \(k/\log k\), tends to 1 as \(k\rightarrow \infty\). The difference between the two curves does keep growing, but relative to the size of the counts it becomes negligible.
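We can see this in our data: the ratio of the exact counts to Gauss’ approximation decreases toward 1, although very slowly. Here is a quick sketch using the objects computed above:

# ratio of the exact counts to k/log(k) at a few values of k
idx <- c(1, 50, 100, 150, 200, 250)
round(exact.counts[idx]/N(k, 0)[idx], 3)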

From our modern point of view we might say Legendre was guilty of over-fitting!