We have previously discussed polynomial regression, which is one way to go beyond linear regression. In this section we discuss two others: transforming a non-linear model into a linear one, and non-linear least squares.
It is often possible to turn a non-linear model into a linear one via a transformation. Say, for example, that we want to fit an exponential model of the form \(y=ae^{bx}\). Then
\[\log y = \log\left(ae^{bx}\right) = \log a + bx\]
so this is equivalent to a model that is linear in \(\log y\).
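The same idea works for other common models. For example, a power law \(y=ax^b\) becomes linear after taking logs of both variables:
\[\log y = \log a + b \log x\]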
In Westchester County, north of New York City, Consolidated Edison bills residential customers for electricity on a monthly basis. The company wants to predict residential usage in order to plan fuel purchases and to budget revenue flow. The data set elusage contains the usage (in kilowatt-hours per day) and the average monthly temperature for 55 consecutive months for an all-electric home in Westchester County, NY.
library(ggplot2)
head(elusage)
## Month Year Usage Temperature
## 1 8 1989 24.828 73
## 2 9 1989 24.688 67
## 3 10 1989 19.310 57
## 4 11 1989 59.706 43
## 5 12 1989 99.667 26
## 6 1 1990 49.333 41
fitlin <- lm(Usage~Temperature, data=elusage)
df <- data.frame(Residuals=resid(fitlin),
Fits = fitted(fitlin))
ggplot(data=df, aes(Fits, Residuals)) +
geom_point() +
geom_hline(yintercept = 0)
The residuals vs. fits plot shows a clear pattern, so a linear model is not a good fit. Now we fit the model with log(Usage) as the response:
fitlogy <- lm(log(Usage)~Temperature, data=elusage)
df <- data.frame(Residuals=resid(fitlogy),
Fits = fitted(fitlogy))
ggplot(data=df, aes(Fits, Residuals)) +
geom_point() +
geom_hline(yintercept = 0)
This residual plot is much better.
exp(coef(fitlogy)[1])
## (Intercept)
## 207.513
coef(fitlogy)[2]
## Temperature
## -0.03187381
so on the original scale the fitted model is
\[y=207.5e^{-0.0319x}\]
which looks like this:
x <- seq(min(elusage$Temperature), max(elusage$Temperature), length=100)
y <- exp(coef(fitlogy)[1])*exp(coef(fitlogy)[2]*x)
df <- data.frame(x=x, y=y)
ggplot(data=elusage, aes(Temperature, Usage)) +
geom_point() +
geom_line(aes(x, y), data=df, col="blue", size=1.2)
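If we want a prediction of the usage itself at some temperature, we can back-transform the log-scale prediction. Here is a minimal sketch (the temperature value 50 is chosen purely for illustration; note that exponentiating the predicted log usage gives the conditional median of the usage rather than its mean):
# predicted usage at a temperature of 50 degrees, back on the original scale
exp(predict(fitlogy, newdata = data.frame(Temperature = 50)))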
There are some issues when finding confidence intervals and doing hypothesis tests after such a transformation, but we won’t discuss these here.
Sometimes the model we wish to fit is known up to its parameters. Generally that is the case if there is a scientific theory that predicts the shape of the relationship. For example, radioactive decay is known to be exponential: \(y=\alpha e^{-\beta t}\).
Growth of Lobsters
The data come from an experiment to raise Florida lobsters in a controlled environment; for each observation we have the age (Time) and the overall length of the lobster.
kable.nice(lobster[1:10, ])
 | Time | Length |
---|---|---|
1 | 14 | 59 |
2 | 22 | 92 |
3 | 28 | 131 |
4 | 35 | 175 |
5 | 40 | 215 |
6 | 50 | 275 |
7 | 56 | 289 |
8 | 63 | 269 |
9 | 71 | 395 |
10 | 77 | 434 |
ggplot(data=lobster, aes(Time, Length)) +
geom_point()
Now biology suggests that the relationship should be of the form
\[ y=\frac{\beta_2}{1+\frac{\beta_2 - \beta_0}{\beta_0} \exp (\beta_1 t)} + \epsilon \]
where \(\beta_0\) is the expected length at age \(t=0\), \(\beta_2\) is the limiting length as \(t\to\infty\) (assuming \(\beta_1<0\)), and \(\beta_1\) controls how quickly the length approaches that limit. This is often called the logistic or autocatalytic model.
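To see where these interpretations come from, set \(t=0\) and let \(t\to\infty\) in the mean function (ignoring \(\epsilon\) and assuming \(\beta_1<0\)):
\[ y(0)=\frac{\beta_2}{1+\frac{\beta_2-\beta_0}{\beta_0}}=\frac{\beta_0\beta_2}{\beta_0+\beta_2-\beta_0}=\beta_0, \qquad \lim_{t\to\infty}\frac{\beta_2}{1+\frac{\beta_2-\beta_0}{\beta_0}e^{\beta_1 t}}=\frac{\beta_2}{1+0}=\beta_2 \]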
Previously we have always used the method of least squares to estimate the parameters in our models; that is, we minimized the “figure of merit”
\[ \text{RSS} = \sum (y_i - \beta_0 - \beta_1 x_i)^2 \]
The natural extension of this is to use
\[ \text{RSS} = \sum (y_i - f(x_i; \boldsymbol{\beta} ))^2 \]
For a linear model this minimization can be done explicitly, which is what lm does. For a general non-linear \(f\) it is still a minimization problem, just without a closed-form solution, and we can solve it numerically with nls:
# beta[1] = length at age 0, beta[2] = (negative) growth rate, beta[3] = limiting length;
# the starting values are rough guesses read off the data
fit <- nls(Length ~ beta[3]/(1 + ((beta[3] -
    beta[1])/beta[1]) * exp(beta[2] * Time)),
    start = list(beta = c(10, -0.1, 500)),
    data = lobster)
summary(fit)
##
## Formula: Length ~ beta[3]/(1 + ((beta[3] - beta[1])/beta[1]) * exp(beta[2] *
## Time))
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## beta1 32.008757 6.755720 4.738 0.000164
## beta2 -0.057557 0.004957 -11.612 8.55e-10
## beta3 465.884778 8.340739 55.857 < 2e-16
##
## Residual standard error: 21.63 on 18 degrees of freedom
##
## Number of iterations to convergence: 7
## Achieved convergence tolerance: 7.722e-06
x <- seq(10, 160, 1)
df <- data.frame(x=x,
y = predict(fit,
newdata = data.frame(Time = x)))
ggplot(data=lobster, aes(Time, Length)) +
geom_point() +
geom_line(data=df, aes(x, y), color="blue")
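Since this is nothing more than minimizing the RSS above, we could also use a general-purpose optimizer instead of nls. Here is a minimal sketch (the function name rss is just for illustration, the starting values mirror the nls call, and the default Nelder-Mead method is given a larger iteration limit to be safe):
# residual sum of squares for the logistic model, as a function of beta
rss <- function(beta, t, y) {
  yhat <- beta[3]/(1 + ((beta[3] - beta[1])/beta[1]) * exp(beta[2] * t))
  sum((y - yhat)^2)
}
opt <- optim(c(10, -0.1, 500), rss,
             t = lobster$Time, y = lobster$Length,
             control = list(maxit = 2000))
opt$par  # should be close to coef(fit)
In practice nls is preferable: it uses a derivative-based algorithm tailored to least squares, and its summary provides standard errors for the estimates.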
Prime Number Theorem
That there are infinitely many prime numbers was first proven by the Greek mathematician Euclid around 300 BC. A serious study of how fast they grow was first begun by Adrien-Marie Legendre. He studied the function \(N(k)\), which gives the number of primes less than or equal to \(k\). We can do the same. The primes up to 1,000,000 are read in from a local text file:
primes <- scan("C://Users//Wolfgang//dropbox//teaching//Esma-6616//primes.txt")
primes <- as.integer(primes)
length(primes)
## [1] 78498
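If you do not have such a file, the same vector can be generated directly in R. Here is a minimal sieve of Eratosthenes sketch (the function name sieve is just for illustration):
# mark the multiples of every prime up to sqrt(n), keep what is left
sieve <- function(n) {
  is_prime <- rep(TRUE, n)
  is_prime[1] <- FALSE
  for (i in 2:floor(sqrt(n)))
    if (is_prime[i]) is_prime[seq(i * i, n, by = i)] <- FALSE
  which(is_prime)
}
primes <- sieve(1e6)
length(primes)  # 78498, as above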
matrix(primes[1:100], ncol=10, byrow = TRUE)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2 3 5 7 11 13 17 19 23 29
## [2,] 31 37 41 43 47 53 59 61 67 71
## [3,] 73 79 83 89 97 101 103 107 109 113
## [4,] 127 131 137 139 149 151 157 163 167 173
## [5,] 179 181 191 193 197 199 211 223 227 229
## [6,] 233 239 241 251 257 263 269 271 277 281
## [7,] 283 293 307 311 313 317 331 337 347 349
## [8,] 353 359 367 373 379 383 389 397 401 409
## [9,] 419 421 431 433 439 443 449 457 461 463
## [10,] 467 479 487 491 499 503 509 521 523 541
A detailed study of these primes led Legendre in 1798 to propose the function
\[ N(k)=k/(\log k - \alpha) \] Here is what that looks like for several values of \(\alpha\):
N <- function(k, alpha) {
k/(log(k)-alpha)
}
k <- seq(1000, 1e6, length=250)
exact.counts <- k
for(i in 1:250)
exact.counts[i] <- sum(primes<k[i])
df <- data.frame(N=c(k, k, k, k),
Counts=c(exact.counts, N(k, 0), N(k, 1), N(k, 2)),
Method=rep(c("Counts", "a=0", "a=1", "a=2"),
each=250))
ggplot(df, aes(N, Counts, color=Method)) +
geom_line()
and so it seems a value of \(\alpha=1\) is good.
Legendre, however, was not satisfied with that; he wanted to find the optimal answer. So he found the least squares solution!
fit <- nls(exact.counts ~ k/(log(k) - alpha),
start = list(alpha = 0))
coef(fit)
## alpha
## 1.082001
and so he claimed that
\[ N(k)=k/(\log k - 1.08) \]
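A quick plug-in check at the upper end of our range shows how good this approximation is there (k0 is just an illustrative name; the exact count 78498 is length(primes) from above):
k0 <- 1e6
c(exact = length(primes), legendre = k0/(log(k0) - 1.08))
The two numbers differ by only a tiny fraction of a percent.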
Around the same time the German mathematician Carl Friedrich Gauss also looked at this problem, and he made a different conjecture. He proposed
\[ N(k)=k/\log k \] That seems a rather strange guess, because it looks like this:
# N() and exact.counts are already defined above
df <- data.frame(N=c(k, k),
Counts=c(exact.counts, N(k, 0)),
Method=rep(c("Counts", "Gauss"), each=250))
ggplot(df, aes(N, Counts, color=Method)) +
geom_line()
and it surely looks like the two curves are growing further apart. However, almost 100 years later, in 1896, the French mathematician Jacques Hadamard and the Belgian mathematician Charles de la Vallée Poussin independently showed that Gauss was right: the ratio \(N(k)/(k/\log k)\) converges to 1 as \(k\to\infty\).
From our modern point of view we might say Legendre was guilty of over-fitting!