Regression to the Mean

Case Study: Exam Scores ESMA 3102

The following graph shows the scores of 100 randomly selected students from my recent ESMA 3102 classes:

attach(examscores)
head(examscores)
##     Exam Score
## 1 Exam 1   115
## 2 Exam 1    35
## 3 Exam 1    15
## 4 Exam 1    80
## 5 Exam 1    50
## 6 Exam 1    45
splot(Score, Exam, add.line = 1)

slr(Score, Exam)

## The least squares regression equation is: 
##  Score  = 51.47 + 4.62 Exam 
## R^2 = 0.79%

Notice the following feature:

\[ \text{Exam 2} = 32.1 + 0.466 \text{ Exam 1} \]

What does this mean?

Students who did well on the first exam tended to also do well on the second exam (\(\beta_1 > 0\)), but not as well as on the first one (\(\beta_1 < 1\)).

Students who did badly on the first exam tended to also do badly on the second exam (\(\beta_1 > 0\)), but not as badly as on the first one (\(\beta_1 < 1\)).
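
To see this numerically, here is a quick check, plugging two illustrative Exam 1 scores (90 and 30, not particular students from the data set) into the fitted line:

32.1 + 0.466 * c(90, 30)

## [1] 74.04 46.08

A 90 on the first exam is predicted to drop to about 74, and a 30 is predicted to rise to about 46: both predictions are pulled toward the middle.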

What explains this? Maybe:

  • Students who did well on exam 1 didn’t think they had to work so hard on exam 2, so they were lazy?

  • Students who did badly on exam 1 knew they had to work harder on exam 2?

  • Other arguments?

Maybe, but maybe this is just

Regression to the Mean

The principle of regression to the mean was first described by Sir Francis Galton, one of the great scientists in history and one of the first statisticians (he also invented the term correlation). In one of his studies he measured the heights of 928 children and their parents (actually, fathers and sons).

attach(galton) 
## The following objects are masked from galton (pos = 11):
## 
##     Child, Midparent
splot(Child, Midparent,  add.line=1, jitter=TRUE)

Note that I “jittered” the data a bit because Galton measured the heights to the nearest half inch, so there are a lot of repetitions.

Again we find:

slr(Child, Midparent)

## The least squares regression equation is: 
##  Child  = 23.942 + 0.646 Midparent 
## R^2 = 21.05%

and so \(0 < \beta_1 < 1\), which implies:

  • sons of tall fathers tend to be tall, but not as tall as their fathers.

  • sons of short fathers tend to be short, but not as short as their fathers.

And that is a good thing! (??)
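
As a quick check, plug two illustrative midparent heights (72 and 64 inches, not particular cases from the data) into the fitted line:

23.942 + 0.646 * c(72, 64)

## [1] 70.454 65.286

A very tall midparent (72 inches) is predicted to have a child of about 70.5 inches, a short one (64 inches) a child of about 65.3 inches: both predictions move toward the overall average.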

Galton actually invented the term regression to the mean for this phenomenon: extremes tend to regress (return) to the overall average.

Regression to the mean is one of the most misunderstood principles of statistics. There is always a tendency to look for a reason for this return to the average (students who did well on the first exam got lazy and so did worse on the second exam).

Here is a general description of when and how regression to the mean happens (a small simulation sketch follows the list):

  1. there is an “experiment” that results in a “score”

  2. the scores are due in part to “skill” and in part to luck

  3. the “high scorers” are “selected” and repeat the experiment

  4. they still have the “skill”, but will they again have the luck? Certainly not all of them
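
Here is a minimal simulation sketch of this mechanism. The setup is entirely made up for illustration (1000 “students”, skill and luck both normal with standard deviation 10):

set.seed(111)
skill <- rnorm(1000, 70, 10)         # each "student" has a fixed skill
exam1 <- skill + rnorm(1000, 0, 10)  # exam 1 score = skill + luck
exam2 <- skill + rnorm(1000, 0, 10)  # exam 2 score = same skill, new luck
top <- exam1 > quantile(exam1, 0.90) # select the top 10% on exam 1
c(mean(exam1[top]), mean(exam2[top]))

The second average should come out noticeably lower than the first, even though nobody got lazy: the selected students’ skill is unchanged, only the luck is new.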

Example Students take tests

  1. “experiment” = take a test
  2. “skill” = student knows the material (or not), luck = gets questions they just studied, makes a lucky guess
  3. repeat the experiment = take another test
  4. the good students will do well again, but some will not be lucky again

Example Heights of fathers and sons

  1. “experiment” = a father’s height is measured
  2. “skill” = genetics, luck = randomness in how someone’s height turns out
  3. let’s concentrate on those who are tall, repeat the experiment = consider their sons
  4. they are still tall (because of genetics) but not as tall (because they did not get as “lucky” as their fathers)

Example Traffic lights installed at an intersection after a series of accidents

  1. “experiment” = how many accidents are there at some intersection?
  2. “skill” = some intersections are more dangerous than others, luck = the number of accidents at any one intersection fluctuates randomly.
  3. repeat the experiment = see how many accidents happen after the lights are installed
  4. the lights will help prevent accidents, but likely there would not have been as many anyway (see the sketch below).
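
A minimal simulation sketch of this example, with made-up numbers (100 intersections, yearly accident counts Poisson, average rates between 2 and 8) and, importantly, no traffic lights installed anywhere:

set.seed(111)
rate <- runif(100, 2, 8)     # "skill": some intersections are more dangerous than others
year1 <- rpois(100, rate)    # "luck": accident counts fluctuate randomly
year2 <- rpois(100, rate)    # next year, same danger levels, nothing changed
worst <- year1 >= quantile(year1, 0.90)  # the intersections with the most accidents in year 1
c(mean(year1[worst]), mean(year2[worst]))

Even without any intervention, the intersections selected for having the most accidents in year 1 should average fewer accidents in year 2; a real effect of the lights would come on top of this drop.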

Note that this does not say that there is no effect here, just that it may not be as great as it appears.

Example Students: there are students who really know the material.

Example Heights: genetics.

Example Traffic lights: they do prevent accidents (mostly).

One feature of regression to the mean is that it works both ways: replace “high” with “low” above and everything is still true.