Prediction

Categorical - Categorical

Case Study: Treatment for Hair Loss

Say we want to know the following: what percentage of men who use Rogaine will grow no hair? The answer is simple: 301 of the 714 men in the treatment group had no hair growth, or 301/714*100% = 42.2%. As always in Statistics, though, we also want an estimate of the error in this prediction. We learned in 3101 how to do this:

one.sample.prop(301, 714)
## A 95% confidence interval for the population proportion is (0.385, 0.459)
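
The `one.sample.prop` routine is specific to this course. As a rough cross-check, the standard large-sample (Wald) interval can be computed by hand in base R. This is only a sketch: it assumes a Wald interval is an acceptable stand-in, and the course routine may use a slightly different method, so the last digit can differ.

```r
# Wald (large-sample) 95% confidence interval for a proportion.
# Assumption: this is a cross-check, not necessarily the method
# used internally by one.sample.prop.
x <- 301   # men with no hair growth
n <- 714   # men in the treatment group

p_hat <- x / n                          # sample proportion
se <- sqrt(p_hat * (1 - p_hat) / n)     # estimated standard error
z <- qnorm(0.975)                       # critical value for 95% confidence

wald_ci <- p_hat + c(-1, 1) * z * se

round(100 * p_hat, 1)   # percentage with no hair growth
round(wald_ci, 3)       # interval on the proportion scale
```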

Notice, though, that this calculation uses only the numbers 301 and 714, not any of the other results of the experiment. Moreover, if we did the same calculation for all the combinations of groups we would calculate 10 confidence intervals, and again we have a problem of simultaneous inference.

It turns out that this is a type of problem too difficult for this class.

Categorical - Quantitative

Case Study: Babies and Cocaine Use by the Mother

Find a 95% confidence interval for the mean length of the babies in the Drug Free group:

attach(mothers)
one.sample.t(Length[Status=="Drug Free"], ndigit = 2)
## A 95% confidence interval for the population mean is (50.16, 52.04)

The difficulty again arises if we do this for all three groups:

  • Drug Free (50.16 cm, 52.04 cm)
  • First Trimester (48.09 cm, 50.51 cm)
  • Throughout (46.78 cm, 49.22 cm)

These are individual confidence intervals, each with its own 95% level, not a collection of intervals with the correct joint confidence level. As above, we have the problem of simultaneous inference.
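
To see why individual intervals are a problem, one can compute the chance that all of them cover their targets at once. The sketch below assumes the three intervals are independent, which seems plausible for three separate groups of babies but is an assumption, not something the output above guarantees.

```r
# Joint coverage of k independent intervals, each with individual level 0.95.
# Assumption: the intervals are independent of each other.
k <- 3
individual_level <- 0.95

joint_level <- individual_level^k
round(joint_level, 3)   # 0.857
```

Under this assumption the three 95% intervals taken together have a joint confidence level of only about 86%.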

Quantitative - Quantitative

Case Study: Quality of Fish

A study was conducted to examine the quality of fish after several days in ice storage. Ten raw fish of the same kind and quality were caught and prepared for storage. Two of the fish were placed in ice storage immediately after being caught, two were placed there after 3 hours, and two each after 6, 9 and 12 hours. Then all the fish were left in storage for 7 days. Finally they were examined and rated according to their “freshness”.

Use this data set to estimate the quality of a fish that was put into ice 4 hours after being caught.

attach(fish)
fish
##    Time Quality
## 1     0     8.5
## 2     0     8.4
## 3     3     7.9
## 4     3     8.1
## 5     6     7.8
## 6     6     7.6
## 7     9     7.3
## 8     9     7.0
## 9    12     6.8
## 10   12     6.7
splot(Quality, Time)

slr(Quality, Time)

## The least squares regression equation is: 
##  Quality  = 8.46 - 0.142 Time 
## R^2 = 96.88%

The assumptions look OK.
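
The `slr` routine belongs to this course. As a sketch of how the same fit could be obtained in base R, assuming `slr` performs an ordinary least squares fit of Quality on Time, one can use `lm` with the data listed above:

```r
# Fish data as listed above
fish <- data.frame(
  Time = c(0, 0, 3, 3, 6, 6, 9, 9, 12, 12),
  Quality = c(8.5, 8.4, 7.9, 8.1, 7.8, 7.6, 7.3, 7.0, 6.8, 6.7)
)

# Ordinary least squares fit of Quality on Time
fit <- lm(Quality ~ Time, data = fish)

round(coef(fit), 3)                      # intercept and slope
round(100 * summary(fit)$r.squared, 2)   # R^2 in percent
```

The coefficients and R^2 should agree with the equation reported by `slr`.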

Plugging Time = 4 into the fitted equation gives
\[ \text{Quality} = 8.46 - 0.142 \times 4 = 7.89 \]
We can also let R do the calculation for us:

slr.predict(Quality, Time, newx=4)
##  Time  Fit
##     4 7.89

Confidence vs. Prediction Intervals

Again we want an idea of the “error” in our estimate. Previously we used confidence intervals to do this. Here we will again use confidence intervals, but in the context of regression there are two types of intervals:

Confidence Interval - used to predict the mean response of many observations with the desired x value.

Prediction Interval - used to predict the individual response of one observation with the desired x value.

Warning: The terminology is a little confusing here, with the same term meaning different things. Both confidence intervals and prediction intervals as found by the regression command are confidence intervals in the sense discussed before, and both are used for prediction!

They differ in what they are trying to predict: on the one hand an individual response (PI), on the other hand the mean of many responses (CI).
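
For reference, under the usual simple linear regression model (independent, normally distributed errors with constant variance, which is an assumption here), the two intervals at a new value \(x_0\) have the standard textbook forms below. The symbols are introduced for this sketch only: \(\hat{y}_0\) is the fitted value at \(x_0\), \(s\) the residual standard error, \(n\) the sample size, \(\bar{x}\) the mean of the observed x values, \(S_{xx} = \sum (x_i - \bar{x})^2\), and \(t^*\) the critical value of the t distribution with \(n-2\) degrees of freedom.

\[ \text{CI:}\quad \hat{y}_0 \pm t^* \, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \]

\[ \text{PI:}\quad \hat{y}_0 \pm t^* \, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}} \]

The only difference is the extra 1 under the square root, which accounts for the variation of a single new observation around the mean response.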

Example Let’s consider the Quality of Fish data. Use this data set to find a 95% interval estimate for the quality of a fish that was put into storage after 4 hours.

We are talking about one fish, so we want a prediction interval:

slr.predict(Quality, Time, newx=4, interval="PI")
##  Time  Fit Lower Upper
##     4 7.89   7.6  8.19

So a 95% prediction interval for the rating of a fish put into storage after 4 hours is (7.60, 8.19).
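
As a cross-check outside the course routines, base R's `predict` can compute the same kind of interval. This is a sketch that assumes `slr.predict` is based on the ordinary least squares fit; the data frame is rebuilt from the listing above.

```r
# Fish data as listed above
fish <- data.frame(
  Time = c(0, 0, 3, 3, 6, 6, 9, 9, 12, 12),
  Quality = c(8.5, 8.4, 7.9, 8.1, 7.8, 7.6, 7.3, 7.0, 6.8, 6.7)
)
fit <- lm(Quality ~ Time, data = fish)

# 95% prediction interval for ONE fish stored after 4 hours
pi95 <- predict(fit, newdata = data.frame(Time = 4),
                interval = "prediction", level = 0.95)
round(pi95, 2)   # columns: fit, lwr, upr
```

The `lwr` and `upr` columns should agree with the (7.60, 8.19) interval found above.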

Example Again consider the Quality of Fish data. Use this data set to find a 90% interval estimate for the mean quality of fish that were put into storage after 4 hours.

Now we are interested in the mean rating of many fish, so we want a confidence interval. Also we want a 90% interval instead of 95%:

slr.predict(Quality, Time, newx=4, 
            interval="CI", conf.level = 90)
##  Time  Fit Lower Upper
##     4 7.89  7.81  7.97

So a 90% confidence interval for the mean rating of fish put into storage after 4 hours is (7.81, 7.97).
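
The same interval can be sketched in base R, again assuming an ordinary least squares fit. Note one difference in conventions: `predict` expects the confidence level as a fraction (`level = 0.90`), whereas the course routine takes a percentage (`conf.level = 90`).

```r
# Fish data as listed above
fish <- data.frame(
  Time = c(0, 0, 3, 3, 6, 6, 9, 9, 12, 12),
  Quality = c(8.5, 8.4, 7.9, 8.1, 7.8, 7.6, 7.3, 7.0, 6.8, 6.7)
)
fit <- lm(Quality ~ Time, data = fish)

# 90% confidence interval for the MEAN quality of fish stored after 4 hours
ci90 <- predict(fit, newdata = data.frame(Time = 4),
                interval = "confidence", level = 0.90)
round(ci90, 2)   # columns: fit, lwr, upr
```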

The two 90% intervals are shown in the next graph, the prediction interval in green and the confidence interval in red:

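Notice that, at the same confidence level, the prediction interval is always wider than the confidence interval, because a single response varies more than the mean of many responses. Prediction intervals are also the ones you want most of the time, so if you are not sure which you should use, use the prediction interval.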
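
The claim about widths can be checked numerically. The following sketch uses base R's `predict` (not the course routine) and compares the widths of the 90% intervals over a grid of storage times chosen here for illustration:

```r
# Fish data as listed above
fish <- data.frame(
  Time = c(0, 0, 3, 3, 6, 6, 9, 9, 12, 12),
  Quality = c(8.5, 8.4, 7.9, 8.1, 7.8, 7.6, 7.3, 7.0, 6.8, 6.7)
)
fit <- lm(Quality ~ Time, data = fish)

# Illustrative grid of storage times inside the observed range
grid <- data.frame(Time = 0:12)

ci_band <- predict(fit, newdata = grid, interval = "confidence", level = 0.90)
pi_band <- predict(fit, newdata = grid, interval = "prediction", level = 0.90)

ci_width <- ci_band[, "upr"] - ci_band[, "lwr"]
pi_width <- pi_band[, "upr"] - pi_band[, "lwr"]

# TRUE if the prediction interval is wider at every grid point
all(pi_width > ci_width)
```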

The slr.predict command can also be used to find a number of fits and intervals simultaneously:

slr.predict(Quality, Time, newx=1:10, 
            interval="PI", conf.level = 90)
##  Time  Fit Lower Upper
##     1 8.32  8.07  8.57
##     2 8.18  7.93  8.42
##     3 8.04  7.79  8.28
##     4 7.89  7.66  8.13
##     5 7.75  7.52  7.99
##     6 7.61  7.37  7.85
##     7 7.47  7.23  7.70
##     8 7.33  7.09  7.56
##     9 7.19  6.94  7.43
##    10 7.04  6.80  7.29

If the newx argument is left off, the prediction is done for the data itself:

slr.predict(Quality, Time, 
            interval="PI", conf.level = 90)
##  Time  Fit Lower Upper
##     0 8.46  8.20  8.72
##     0 8.46  8.20  8.72
##     3 8.04  7.79  8.28
##     3 8.04  7.79  8.28
##     6 7.61  7.37  7.85
##     6 7.61  7.37  7.85
##     9 7.19  6.94  7.43
##     9 7.19  6.94  7.43
##    12 6.76  6.50  7.02
##    12 6.76  6.50  7.02

Prediction vs. Extrapolation

There is a fundamental difference between predicting the response for an x value within the range of observed x values (=Prediction) and for an x value outside the observed x values (=Extrapolation). The problem here is that the model used for prediction is only known to be good for the range of x values that were used to find it. Whether or not it is the same outside these values is generally impossible to tell.

Note: Another word for prediction in this sense is interpolation.
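
A small sketch of the distinction, using base R rather than the course routines. The helper `is_extrapolation` is hypothetical and introduced only for this illustration, as are the storage times 20 and 60, which lie far outside the observed range of 0 to 12 hours.

```r
# Fish data as listed above
fish <- data.frame(
  Time = c(0, 0, 3, 3, 6, 6, 9, 9, 12, 12),
  Quality = c(8.5, 8.4, 7.9, 8.1, 7.8, 7.6, 7.3, 7.0, 6.8, 6.7)
)
fit <- lm(Quality ~ Time, data = fish)

# Hypothetical helper: TRUE where newx lies outside the observed x range
is_extrapolation <- function(newx, x) {
  newx < min(x) | newx > max(x)
}

newx <- c(4, 20, 60)   # 4 is inside the range, 20 and 60 are not
fits <- predict(fit, newdata = data.frame(Time = newx))

data.frame(
  Time = newx,
  Fit = round(unname(fits), 2),
  Extrapolation = is_extrapolation(newx, fish$Time)
)
```

The fitted line keeps decreasing at the same rate forever, so for a long enough delay it returns a negative quality, which is presumably not a possible rating. Nothing in the data says whether the straight line still holds out there.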

Example: Quality of Fish data