The following data set has the scores of 100 randomly selected students from my recent ESMA 3102 classes:
attach(examscores)
head(examscores)
## Exam Score
## 1 Exam 1 115
## 2 Exam 1 35
## 3 Exam 1 15
## 4 Exam 1 80
## 5 Exam 1 50
## 6 Exam 1 45
splot(Score, Exam, add.line = 1)
slr(Score, Exam)
## The least squares regression equation is:
## Score = 51.47 + 4.62 Exam
## R^2 = 0.79%
Regressing the Exam 2 scores on the Exam 1 scores gives:
\[ \text{Exam 2} = 32.1 + 0.466 \text{ Exam 1} \]
What does this mean?
Students who did well on the first exam tended to do well on the second exam too (\(\beta_1 > 0\)), but not as well as on the first one (\(\beta_1 < 1\)).
Students who did badly on the first exam tended to do badly on the second exam too (\(\beta_1 > 0\)), but not as badly as on the first one (\(\beta_1 < 1\)).
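To see what the fitted line says numerically, plug in a high and a low Exam 1 score (the values 90 and 30 are my choice, just for illustration):

```r
# predictions from the fitted line: Exam 2 = 32.1 + 0.466 * Exam 1
32.1 + 0.466 * 90   # high scorer: predicted about 74 on exam 2
32.1 + 0.466 * 30   # low scorer: predicted about 46 on exam 2
32.1 / (1 - 0.466)  # "break-even" score, about 60: the line's fixed point
```

Both predictions land closer to 60 than the corresponding Exam 1 scores did: students above the break-even score are predicted to drop, students below it to rise.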
What explains this? Maybe:
Students who did well on exam 1 didn’t think they had to work so hard on exam 2, so they were lazy?
Students who did badly on exam 1 knew they had to work harder on exam 2?
Other arguments?
Maybe, but maybe this is just
Regression to the Mean
The principle of regression to the mean was first described by Sir Francis Galton, one of the great scientists in history and one of the first statisticians (he also coined the term correlation). In one of his studies he measured the heights of 982 children and their parents (actually, fathers and sons).
attach(galton)
splot(Child, Midparent, add.line=1, jitter=TRUE)
Note that I “jittered” the data a bit because Galton measured the heights to the nearest half inch, so there are a lot of repetitions.
Again we find:
slr(Child, Midparent)
## The least squares regression equation is:
## Child = 23.942 + 0.646 Midparent
## R^2 = 21.05%
and so \(0 < \beta_1 < 1\), which implies:
sons of tall fathers tend to be tall, but not as tall as their fathers.
sons of short fathers tend to be short, but not as short as their fathers.
And that is a good thing! (??)
Galton actually invented the term regression to the mean for this phenomenon: extremes tend to regress (return) to the overall average.
Regression to the mean is one of the most misunderstood principles of statistics. There is always a tendency to look for a reason for this return to the average (the student who did well on the first exam got lazy and so did worse on the second exam).
Here is a general description of when and how regression to the mean happens:
there is an “experiment” that results in a “score”
the scores are due in part to “skill” and in part to luck
the “high scorers” are “selected” and repeat the experiment
they still have the “skill”, but will they again have the luck? Certainly not all of them.
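This mechanism is easy to simulate. Here is a sketch (my own toy model, not the exam data above): each score is a stable “skill” plus independent “luck”, and we follow the top and bottom 10% of first-round scorers into a second round.

```r
# regression to the mean in a toy model: score = skill + luck
set.seed(111)
n <- 10000
skill  <- rnorm(n, 60, 10)          # stable ability, never changes
score1 <- skill + rnorm(n, 0, 10)   # first experiment: skill plus luck
score2 <- skill + rnorm(n, 0, 10)   # repeat: same skill, fresh luck
top    <- score1 >= quantile(score1, 0.90)  # "select" the high scorers
bottom <- score1 <= quantile(score1, 0.10)  # ... and the low scorers
round(c(mean(score1[top]),    mean(score2[top])), 1)     # falls back toward 60
round(c(mean(score1[bottom]), mean(score2[bottom])), 1)  # rises toward 60
```

The top group keeps most of its edge on the second round (it really does have more skill than average) but loses the part of its first-round scores that was luck; the bottom group gains back its bad luck.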
Example Students take tests
Example Heights of fathers and sons
Example Traffic lights installed at an intersection after a series of accidents
Note that this does not say that there is no effect here, just that it may not be as great as it appears.
Example Students: there are students who really know the material
Example Heights: genetics
Example Traffic lights: they do prevent accidents (mostly)
One feature of regression to the mean is that it works both ways: replace “high” with “low” above and everything is still true.