In this section we want to use a model to make predictions for y for some fixed x.
A study was conducted to examine the quality of fish after several days in ice storage. Ten raw fish of the same kind and quality were caught and prepared for storage. Two of the fish were placed in ice storage immediately after being caught, two were placed there after 3 hours, and two each after 6, 9 and 12 hours. Then all the fish were left in storage for 7 days. Finally they were examined and rated according to their “freshness”.
Use this data set to estimate the quality of a fish that was put into ice 4 hours after being caught.
kable.nice(fish)
Time | Quality |
---|---|
0 | 8.5 |
0 | 8.4 |
3 | 7.9 |
3 | 8.1 |
6 | 7.8 |
6 | 7.6 |
9 | 7.3 |
9 | 7.0 |
12 | 6.8 |
12 | 6.7 |
plt <- ggplot(fish, aes(Time, Quality)) +
geom_point()
plt
fit <- lm(Quality ~ Time, data=fish)
round(fit$coef, 3)
## (Intercept) Time
## 8.460 -0.142
so we have
\[
\text{Quality} = 8.46 - 0.142 * 4 = 7.9
\] We can also let R do the calculation for us:
round(predict(fit, newdata=data.frame(Time=4)), 2)
## 1
## 7.89
Again we want an idea of the “error” in our estimate. Previously we used confidence intervals to do this. Here we will again use confidence intervals, but in the context of regression there are two types of intervals:
Confidence Interval - used to predict the mean response of many observations with the desired x value.
Prediction Interval - used to predict the individual response of one observation with the desired x value.
Warning
The terminology is a little confusing here, with the same term meaning different things: Both confidence intervals and prediction intervals as found by the regression command are confidence intervals in the sense discussed before, and both are used for prediction!
They differ in what they are trying to predict, on the one hand an individual response (PI), on the other hand the mean of many responses (CI).
Use this data set to find a 95% interval estimate for the quality of a fish that was put into storage after 4 hours.
We are talking about one fish, so we want a prediction interval:
round(predict(fit, newdata=data.frame(Time=4),
interval="prediction"), 2)
## fit lwr upr
## 1 7.89 7.6 8.19
so a 95% prediction interval for the rating of fish after 4 hours is (7.60, 8.19)
Example Again consider the Quality of Fish data. Use this data set to find a 90% interval estimate for the mean quality of fish that were put into storage after 4 hours.
Now we are interested in the mean rating of many fish, so we want a confidence interval. Also we want a 90% interval instead of 95%:
round(predict(fit, newdata=data.frame(Time=4),
interval="confidence",
level = 0.90), 2)
## fit lwr upr
## 1 7.89 7.81 7.97
so a 90% confidence interval for the mean rating of fish after 4 hours is (7.81, 7.97).
The two 90% intervals are shown in the next graph, the prediction interval in green and the confidence interval in red:
Notice that the prediction intervals are always wider than the confidence intervals.
The predict command can also be used to find a number of fits and intervals simultaneously:
round(predict(fit, newdata=data.frame(Time=1:10),
interval="prediction",
level = 0.99), 2)
## fit lwr upr
## 1 8.32 7.87 8.77
## 2 8.18 7.74 8.62
## 3 8.04 7.60 8.47
## 4 7.89 7.46 8.32
## 5 7.75 7.33 8.18
## 6 7.61 7.19 8.03
## 7 7.47 7.04 7.89
## 8 7.33 6.90 7.76
## 9 7.19 6.75 7.62
## 10 7.04 6.60 7.48
If the newdata argument is left off the prediction is done for the data itself:
round(predict(fit), 2)
## 1 2 3 4 5 6 7 8 9 10
## 8.46 8.46 8.04 8.04 7.61 7.61 7.19 7.19 6.76 6.76
There is a fundamental difference between predicting the response for an x value within the range of observed x values (=Prediction) and for an x value outside the observed x values (=Extrapolation). The problem here is that the model used for prediction is only known to be good for the range of x values that were used to find it. Whether or not it is the same outside these values is generally impossible to tell.
Note Another word for prediction is interpolation
Example: Quality of Fish data