If there is a relationship between two quantitative variables “x” and “y”, can we describe it?
We do that by finding a model, that is an equation y=f(x). Here we keep it very simple and consider only linear relationships, that is equations of the form
\[ y = mx+b \]
In Statistics though we use a slightly different notation:
\[ y = \beta_0 + \beta_1 x \]
The logic here is this: if we know x, we can compute y. Unfortunately there are always “errors” in this calculation, so the answer y varies even for the same x.
For example, let x be the number of hours a student studies for an exam, and y the score on the exam. Say we know from long experience that y=50+5x. So even if the student doesn’t study at all (x=0) he/she would still get around 50 points, and for every hour studied the score goes up by about 5 points.
But of course there are many other factors influencing the grade, such as general ability, previous experience, being healthy on the day of the exam, exam anxiety, etc., so for any specific student the score will not be exactly what the equation predicts. If three students all study 6 hours, the equation predicts a score of 50+5*6=80 for each of them, but one might get a 69, the next a 78 and the third a 99. What the equation predicts is actually their mean score.
This is illustrated in the next graph:
where the scores of the people who studied 6 hours are in red, and their mean score is marked by an X.
run.app(lsr1)
This app illustrates the meaning of the line as the mean response.
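A minimal simulation in base R makes the same point (the standard deviation of 10 points for the "errors" is just made up for illustration):

set.seed(111)                        # for reproducibility
x <- rep(6, 1000)                    # 1000 students, each studying 6 hours
y <- 50 + 5*x + rnorm(1000, 0, 10)   # value on the line plus random "errors"
mean(y)                              # close to 50 + 5*6 = 80, the mean response at x = 6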
\(\beta_0\) and \(\beta_1\) are numbers that depend on the population from which the data (X,Y) is drawn. Therefore they are parameters just like the mean or the median.
A standard problem is this: we have a data set and we believe there is a linear relationship between x and y. We would like to know the equation
\[ y = \beta_0 + \beta_1 x \]
that is we need to “guess” what \(\beta_0\) and \(\beta_1\) are. We will estimate them by a method called least squares regression, which is implemented in the R command slr.
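The name describes what the method does: among all possible lines it picks the one that minimizes the sum of the squared vertical distances between the observed points and the line. For a single predictor this leads to the standard formulas
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \overline{X})(y_i - \overline{Y})}{\sum_{i=1}^n (x_i - \overline{X})^2} \qquad \hat{\beta}_0 = \overline{Y} - \hat{\beta}_1 \overline{X} \]
In practice slr (and R's built-in lm function) does this calculation for us.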
Example: data from a study on the consumption of wine (in liters per person) and heart disease death rates (in deaths per 100,000) in 19 countries.
wine
## Country Wine.Consumption Heart.Disease.Deaths
## 1 Australia 2.5 211
## 2 Austria 3.9 167
## 3 Belgium 2.9 131
## 4 Canada 2.4 191
## 5 Denmark 2.9 220
## 6 Finland 0.8 297
## 7 France 9.1 71
## 8 Iceland 0.8 211
## 9 Ireland 0.7 300
## 10 Italy 7.9 107
## 11 Netherlands 1.8 167
## 12 New Zealand 1.9 266
## 13 Norway 0.8 227
## 14 Spain 6.5 86
## 15 Sweden 1.6 207
## 16 Switzerland 5.8 115
## 17 United Kingdom 1.3 285
## 18 United States 1.2 199
## 19 Germany 2.7 172
attach(wine)
splot(Heart.Disease.Deaths, Wine.Consumption)
So we see a clear negative correlation: the higher the wine consumption, the lower the heart disease death rate.
Careful: this was an observational study, here correlation does NOT imply causation!
So, what can we say about the actual relationship?
slr(Heart.Disease.Deaths, Wine.Consumption)
## The least squares regression equation is:
## Heart.Disease.Deaths = 260.563 - 22.969 Wine.Consumption
## R^2 = 71.03%
Note the slr command also draws a graph, which we can ignore.
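For comparison, here is a sketch of the same fit using only base R's lm function (it assumes the wine data frame printed above is loaded in your session):

fit <- lm(Heart.Disease.Deaths ~ Wine.Consumption, data = wine)
coef(fit)                  # intercept about 260.563, slope about -22.969
summary(fit)$r.squared     # about 0.7103, that is R^2 = 71.03%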
There is a nice graph called the fitted line plot, which is the scatterplot with the least square regression line added to it:
splot(Heart.Disease.Deaths, Wine.Consumption, add.line=1)
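A base-R sketch of the same fitted line plot (again assuming the wine data frame is loaded):

# scatterplot with the least squares regression line added
plot(wine$Wine.Consumption, wine$Heart.Disease.Deaths,
     xlab = "Wine Consumption (liters per person)",
     ylab = "Heart Disease Deaths (per 100,000)")
abline(lm(Heart.Disease.Deaths ~ Wine.Consumption, data = wine))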
Here is a second example, using the draft data (Draft.Number vs. Day.of.Year):
attach(draft)
slr(Draft.Number, Day.of.Year)
## The least squares regression equation is:
## Draft.Number = 225.009 - 0.226 Day.of.Year
## R^2 = 5.11%
splot(Draft.Number, Day.of.Year, add.line = 1)
run.app(lsr)
This app illustrates the least squares regression line.
Just play around with different slopes and intercepts and see how the fitted line plot and the regression equation change.
Here are two important facts about least squares regression.
First, say \(\overline{X}\) is the mean of the x vector and \(\overline{Y}\) is the mean of the y vector; then (\(\overline{X}\), \(\overline{Y}\)) is always a point on the line.
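We can check this fact for the wine data with a short base-R sketch (using lm rather than slr):

fit <- lm(Heart.Disease.Deaths ~ Wine.Consumption, data = wine)
xbar <- mean(wine$Wine.Consumption)
ybar <- mean(wine$Heart.Disease.Deaths)
coef(fit)[1] + coef(fit)[2] * xbar   # the point on the line at xbar ...
ybar                                 # ... equals the mean of y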
The second fact concerns the roles of the two variables. We have seen previously that for the correlation coefficient it does not matter which variable we choose as x and which as y, that is we have
cor(x,y) = cor(y,x)
Now let’s see what happens in regression
slr(Heart.Disease.Deaths, Wine.Consumption)
## The least squares regression equation is:
## Heart.Disease.Deaths = 260.563 - 22.969 Wine.Consumption
## R^2 = 71.03%
The least squares regression equation is
\[ \text{Heart.Disease.Deaths} = 260.563 - 22.969 \cdot \text{Wine.Consumption} \]
so
\[ 22.969 \cdot \text{Wine.Consumption} = 260.563 - \text{Heart.Disease.Deaths} \]
and
\[ \text{Wine.Consumption} = \frac{260.563}{22.969} - \frac{1}{22.969} \cdot \text{Heart.Disease.Deaths} = 11.34 - 0.044 \cdot \text{Heart.Disease.Deaths} \]
BUT
slr(Wine.Consumption, Heart.Disease.Deaths)
## The least squares regression equation is:
## Wine.Consumption = 8.935 - 0.031 Heart.Disease.Deaths
## R^2 = 71.03%
and that is not the same equation!
So in regression it is important to distinguish between the predictor or independent variable (x) and the response or dependent variable (y).
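A base-R sketch of the two fits side by side (wine data frame assumed loaded); note that although the equations differ, both report the same R², which for a single predictor is just the squared correlation coefficient:

fit.yx <- lm(Heart.Disease.Deaths ~ Wine.Consumption, data = wine)
fit.xy <- lm(Wine.Consumption ~ Heart.Disease.Deaths, data = wine)
coef(fit.yx)   # about 260.563 and -22.969
coef(fit.xy)   # about 8.935 and -0.031, not the algebraic inverse of the first fit
cor(wine$Wine.Consumption, wine$Heart.Disease.Deaths)^2   # about 0.7103, the R^2 of both fits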
It has often been noted that anyone featured with his/her picture on the cover of Sports Illustrated is then jinxed, that is their performance goes down. Some people have tried to find an explanation for this, for example that an athlete gets lazy after being successful, or maybe that they have too many media days and can't practice enough. In reality this is just an example of regression to the mean.
At the end of the 19th century Sir Francis Galton collected the heights of almost 1000 adult children together with the heights of their parents; in the data set the two parents' heights are combined into a single “midparent” height.
The fitted line plot is
attach(galton)
splot(Child, Midparent, add.line= 1)
and the least squares regression equation is
slr(Child, Midparent)
## The least squares regression equation is:
## Child = 23.942 + 0.646 Midparent
## R^2 = 21.05%
Notice: the slope of the line (0.646) is smaller than 1. Since the line passes through the point of means, this says that if, for example, the midparent height is two inches above average, the child's height is predicted to be only about 1.3 inches above average.
Let’s add the line with slope 1 to the graph:
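One way to do this in base R (a sketch; it assumes the galton data frame from these notes is loaded and uses the line y = x as the line with slope 1):

plot(galton$Midparent, galton$Child,
     xlab = "Midparent height (inches)", ylab = "Child height (inches)")
abline(lm(Child ~ Midparent, data = galton))   # least squares line, slope 0.646
abline(0, 1, lty = 2)                          # dashed line with intercept 0 and slope 1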
so we see that those observations on the extreme tend to “regress” (come back) to the “middle”.
Of course this makes good sense: a person may be tall partly because of their genes (and the child shares half of those), but also because of a lot of other factors, many of which the child does not share.
There is something about the graph that is not very nice. Because the heights were recorded only to within the nearest 0.2 inches, lots of data points repeat, but they appear only once in the graph. We can fix that by jittering the points, that is moving them randomly around just a little bit:
splot(Child, Midparent, jitter=TRUE, add.line= 1)
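In base R the same idea uses the built-in jitter function (a sketch, again assuming the galton data frame is loaded):

# jitter() adds a small amount of random noise so repeated points become visible
plot(jitter(galton$Midparent), jitter(galton$Child),
     xlab = "Midparent height (jittered)", ylab = "Child height (jittered)")
abline(lm(Child ~ Midparent, data = galton))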
Ever notice that often those students who do very well on the first exam do not do quite so well on the second? Is it that they got lazy, thought the exams were easy and that they could do well without studying?
Maybe, but maybe it is also just an example of regression to the mean!
Say we have the following data: a group of subjects is participating in a weight loss program. They are weighed before and after the program. Now we pick out the (say) 10% of people who were the heaviest at the beginning, and we notice that their average weight was 207 pounds then but is only 201 pounds at the end of the program. Can we conclude the program worked (at least for heavy people)?
Maybe, but not necessarily. Again the same outcome could be due to regression to the mean.
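Here is a small simulation (all the numbers are made up) showing how exactly this pattern can appear even when the program has no effect at all:

# each measured weight = typical weight + day-to-day fluctuation,
# and NOTHING changes between the two weighings
set.seed(222)
typical <- rnorm(1000, 170, 25)           # each person's "typical" weight
before  <- typical + rnorm(1000, 0, 10)   # first weighing
after   <- typical + rnorm(1000, 0, 10)   # second weighing, same distribution
heavy <- before >= quantile(before, 0.9)  # the 10% heaviest at the first weighing
mean(before[heavy])   # high, partly because of an unusually high fluctuation
mean(after[heavy])    # several pounds lower, although nothing really changed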
This also explains the jinx: an athlete gets on the cover after having done very well, likely a bit better than is normal even for them. After a while (cover or no cover) they will regress to the mean.
Regression towards the mean is one of those statistical phenomena that are often misunderstood, with people looking for an explanation where there is none!
The psychologist Daniel Kahneman, winner of the 2002 Nobel Memorial Prize in Economic Sciences, pointed out that regression to the mean might explain why rebukes can seem to improve performance, while praise seems to backfire:
I had the most satisfying Eureka experience of my career while attempting to teach flight instructors that praise is more effective than punishment for promoting skill-learning. When I had finished my enthusiastic speech, one of the most seasoned instructors in the audience raised his hand and made his own short speech, which began by conceding that positive reinforcement might be good for the birds, but went on to deny that it was optimal for flight cadets. He said:
On many occasions I have praised flight cadets for clean execution of some aerobatic maneuver, and in general when they try it again, they do worse. On the other hand, I have often screamed at cadets for bad execution, and in general they do better the next time. So please don’t tell us that reinforcement works and punishment does not, because the opposite is the case.
This was a joyous moment, in which I understood an important truth about the world: because we tend to reward others when they do well and punish them when they do badly, and because there is regression to the mean, it is part of the human condition that we are statistically punished for rewarding others and rewarded for punishing them. I immediately arranged a demonstration in which each participant tossed two coins at a target behind his back, without any feedback. We measured the distances from the target and could see that those who had done best the first time had mostly deteriorated on their second try, and vice versa. But I knew that this demonstration would not undo the effects of lifelong exposure to a perverse contingency.
For more on regression to the mean go here.