Results from an experiment designed to determine how much the speed of a washing machine affects the wear on a new fabric. The machine was run at 5 different speeds (measured in rpm), with six pieces of fabric at each speed.
# arrange the wear values into one row per speed
tmp <- tapply(fabricwear[, 2],
              fabricwear[, 1],
              function(x) x)
tmp <- data.frame(
  paste0(unique(fabricwear[, 1]), " rpm:"),
  matrix(unlist(tmp), nrow=5, byrow=TRUE))
colnames(tmp) <- NULL
rownames(tmp) <- NULL
kable.nice(tmp)
| Speed    | Wear (six pieces) |      |      |      |      |      |
|----------|------|------|------|------|------|------|
| 110 rpm: | 24.9 | 24.8 | 25.1 | 26.4 | 27.0 | 26.6 |
| 130 rpm: | 27.4 | 27.3 | 26.4 | 28.5 | 28.7 | 28.5 |
| 150 rpm: | 30.4 | 30.3 | 29.5 | 31.7 | 31.6 | 31.4 |
| 170 rpm: | 37.9 | 36.9 | 37.2 | 38.5 | 38.8 | 39.1 |
| 190 rpm: | 48.7 | 42.7 | 47.8 | 49.6 | 49.9 | 49.5 |
The scatterplot of wear by speed shows a strong but non-linear relationship:
ggplot(data=fabricwear, aes(Speed, Wear)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
How strong it is is a difficult question to answer, because Pearson’s correlation coefficient measures only linear association and so won’t work here. If we tried lm, the residuals vs. fits plot would show a problem with the assumption of a linear model:
fit <- lm(Wear~Speed, data=fabricwear)
df <- data.frame(Fits=fitted(fit),
                 Residuals=resid(fit))
ggplot(df, aes(Fits, Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0)
So the question is: how do we fit models other than straight lines?
There are two basic things we can try. The first is something we have already done, namely log transformations. Unfortunately none of these looks very good here.
Some of these have names:

- log(y) vs. x is called an exponential model
- log(y) vs. log(x) is called a power model
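To see why none of them works here, we can plot all three at once. This is a minimal sketch, assuming the fabricwear data frame from above and that ggplot2 is loaded; the helper data frame trials and its labels are made up just for this illustration:

n <- nrow(fabricwear)
trials <- data.frame(
  x = c(fabricwear$Speed, log(fabricwear$Speed), log(fabricwear$Speed)),
  y = c(log(fabricwear$Wear), fabricwear$Wear, log(fabricwear$Wear)),
  model = rep(c("log(y) vs x", "y vs log(x)", "log(y) vs log(x)"),
              each = n))
ggplot(trials, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~model, scales = "free")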
The other solution to our problem is to fit a polynomial model:
- Linear: \(y=\beta_0+\beta_1 x\)
- Quadratic: \(y=\beta_0+\beta_1 x+\beta_2 x^2\)
- Cubic: \(y=\beta_0+\beta_1 x+\beta_2 x^2+\beta_3 x^3\)

and so on.
How do we fit such a model? We simply calculate the powers and use them in lm:
Speed2 <- fabricwear$Speed^2  # the squared term as its own variable
quad.fit <- lm(Wear~Speed+Speed2, data=fabricwear)
quad.df <- data.frame(Fits=fitted(quad.fit),
                      Residuals=resid(quad.fit))
ggplot(quad.df, aes(Fits, Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0)
What does such a curve look like?
# evaluate the fitted quadratic on a fine grid of speeds
x <- seq(min(fabricwear$Speed),
         max(fabricwear$Speed),
         length=250)
y <- coef(quad.fit)[1] +
  coef(quad.fit)[2]*x +
  coef(quad.fit)[3]*x^2
df.tmp <- data.frame(x=x, y=y)
ggplot(data=fabricwear, aes(Speed, Wear)) +
  geom_point() +
  geom_line(data=df.tmp, aes(x, y),
            inherit.aes = FALSE,
            color="blue", size=1.2)
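As a side note, ggplot2 can draw the same curve directly if we hand geom_smooth the quadratic formula; here formula is written in terms of the generic aesthetics x and y:

ggplot(data=fabricwear, aes(Speed, Wear)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2),
              se = FALSE, color = "blue")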
There is, however, something not so good about this solution: our x values are of order 100, so their squares are of order 10,000. Using predictors of such different magnitudes can lead to numerical trouble in a regression. Also we have
cor(fabricwear$Speed, Speed2)
## [1] 0.9969033
and using highly correlated predictors (multicollinearity) is again an issue. One way around both problems is to use the poly command:
poly.fit <- lm(Wear~poly(Speed, 2), data=fabricwear)
coef(poly.fit)
## (Intercept) poly(Speed, 2)1 poly(Speed, 2)2
## 34.10333 42.39626 13.20218
The poly command does two things: it rescales the variables so that they have mean 0 and a common scale, and it transforms them so that they are uncorrelated (so-called orthogonal polynomials). This makes the regression calculations much more stable.
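We can check the mean and correlation claims directly on the matrix that poly produces:

z <- poly(fabricwear$Speed, 2)   # the two orthogonal predictors
round(colMeans(z), 10)           # both columns have mean 0
round(cor(z[, 1], z[, 2]), 10)   # and they are uncorrelated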
If the goal is prediction, this is fine:
predict(quad.fit,
        newdata=data.frame(Speed=150, Speed2=150^2))
## 1
## 31.22238
predict(poly.fit, newdata=data.frame(Speed=150))
## 1
## 31.22238
but if we want the actual model equation, we would have to “reverse” the calculations done by poly, and we don’t even know exactly what those were.
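As a side remark, poly can be asked to skip the orthogonalization with raw=TRUE, which reproduces the coefficients of quad.fit exactly, though then the numerical problems discussed above return:

# raw (unorthogonalized) polynomial: same design matrix as Wear~Speed+Speed2
coef(lm(Wear ~ poly(Speed, 2, raw=TRUE), data=fabricwear))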
There is an intermediate solution that sometimes works well: scale the x variable first:
mn <- mean(fabricwear$Speed)
s <- sd(fabricwear$Speed)
x <- (fabricwear$Speed-mn)/s
x2 <- x^2
round(c(mn, s, cor(x, x2)), 2)
## [1] 150.00 28.77 0.00
quad.scale.fit <- lm(fabricwear$Wear~x+x2)
coef(quad.scale.fit)
## (Intercept) x x2
## 31.222381 7.872787 2.980296
So here too we get uncorrelated predictors. The correlation is exactly 0 here because the speeds are equally spaced around their mean; in general it will at least be low. Also:
\[
\begin{aligned}
x&=\frac{\text{Speed}-150}{28.77}\\
y&=31.2+7.87x+2.98x^2 \\
 &=31.2+7.87\frac{\text{Speed}-150}{28.77}+2.98\left(\frac{\text{Speed}-150}{28.77}\right)^2 \\
 &=31.2+0.2735\,\text{Speed}-41.03+0.0036\,\text{Speed}^2-1.08\,\text{Speed}+81 \\
 &=71.17-0.806\,\text{Speed}+0.0036\,\text{Speed}^2
\end{aligned}
\]
and that is the same as
coef(quad.fit)
## (Intercept) Speed Speed2
## 71.19916667 -0.80669048 0.00360119
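We can also do this back-transformation numerically, expanding \(b_1+b_2x+b_3x^2\) with \(x=(\text{Speed}-mn)/s\) in powers of Speed; this reproduces coef(quad.fit):

b <- coef(quad.scale.fit)
c(b[1] - b[2]*mn/s + b[3]*mn^2/s^2,  # intercept
  b[2]/s - 2*b[3]*mn/s^2,            # coefficient of Speed
  b[3]/s^2)                          # coefficient of Speed^2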
Again we can use the predict command to do prediction, but there are some things we need to be careful with: for quad.fit the new data frame has to contain both Speed and Speed2, for the scaled fit we have to standardize the new speed ourselves, and for a model fit on the log scale we have to log the new x and back-transform the prediction with exp:
predict(quad.fit,
        newdata=data.frame(Speed=150,
                           Speed2=150^2))
## 1
## 31.22238
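With quad.scale.fit we standardize the new speed using the same mn and s as before; at Speed=150 we have x=0, so this again yields the intercept 31.222:

newx <- (150 - mn)/s   # standardize the new speed first
predict(quad.scale.fit,
        newdata=data.frame(x=newx, x2=newx^2))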
lwear <- log(fabricwear$Wear)
lspeed <- log(fabricwear$Speed)
log.fit <- lm(lwear~lspeed)
exp(predict(log.fit,
            newdata=data.frame(lspeed=log(150))))
## 1
## 33.87835
How about interval estimation? Let’s do a simple simulation: consider the model \(y=x^2\). It is both a quadratic model and linear in log-log, so we should be able to fit it either way:
B <- 1000                # number of simulation runs
x <- 1:100/100
out <- matrix(0, B, 6)   # columns 1-3: quadratic fit, 4-6: log-log fit
lx <- log(x)
x2 <- x^2
for(i in 1:B) {
  y <- x^2 + rnorm(100, 0, 0.07)
  # quadratic model: 95% confidence interval at x=0.5
  pf <- lm(y~x+x2)
  out[i, 1:3] <- predict(pf,
                         newdata=data.frame(x=0.5, x2=0.25),
                         interval="confidence")
  # log-log model: for small x some y are negative, log(y) is
  # then NaN and lm drops those observations
  ly <- log(y)
  lf <- lm(ly~lx)
  out[i, 4:6] <- exp(predict(lf,
                             newdata=data.frame(lx=log(0.5)),
                             interval="confidence"))
}
# quadratic model
sum(out[, 2]<0.5^2 & 0.5^2<out[, 3])/B
## [1] 0.949
# log transform
sum(out[, 5]<0.5^2 & 0.5^2<out[, 6])/B
## [1] 0.794
So this works fine for the quadratic model but fails miserably for the log transform.
Why is that? Part of the problem is the error term. Note that what we really have is
\[ y=x^2+\epsilon \]
so taking logs leads to
\[ \log y = \log (x^2+\epsilon) \] and not to what we are fitting, namely
\[ \log y = \log (x^2 ) +\epsilon \]
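We can see the difference in a small simulation. At \(x=1\) (chosen so that \(x^2+\epsilon\) stays positive) the average of \(\log(x^2+\epsilon)\) is not \(\log(x^2)=0\); by Jensen’s inequality it is pulled below it:

eps <- rnorm(1e5, 0, 0.07)
mean(log(1 + eps))  # about -0.07^2/2 = -0.0025, not 0
log(1)              # what the additive-error model assumes

Moreover, the spread of \(\log(x^2+\epsilon)\) depends strongly on x (it is huge where \(x^2\) is small compared to the noise), so the equal-variance assumption fails on the log scale as well.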