Generally, if we have more than one variable we are interested in their relationships. We want to investigate two questions:
1) Is there a relationship?
2) If there is a relationship, can we describe it?
If both variables are quantitative, for the first question we can find the correlation and for the second we can do a regression.
In 1970, Congress instituted a random selection process for the military draft. All 366 possible birth dates were placed in plastic capsules in a rotating drum and were selected one by one. The first date drawn from the drum received draft number one and eligible men born on that date were drafted first. In a truly random lottery there should be no relationship between the date and the draft number.
Question: was the draft really "random"?
Let's have a look at the scatterplot of "Day of the Year" and "Draft Number".
It certainly does not appear that there is a relationship between "Day of the Year" and "Draft Number", but is this really true?
What we want is a number that can tell us if there is a relationship between two variables, and if so how strong it is. Consider the following two examples:
Clearly in the case on the left we have a much stronger relationship than on the right. For example, if I knew x=0.5, then on the left I could reasonably guess that y is between 0.4 and 0.6, whereas on the right I could only guess 0.1 to 0.9.
The most popular choice for such a number is Pearson's correlation coefficient r.
Using what we have learned before the formula of Pearson's correlation coefficient r is actually very simple. First recall the z scores:
z = (x-X̅)/s
Of course now we have two variables x and y, so there are two z scores:
zx = (x-X̅)/sx
zy = (y-Y̅)/sy
r = (∑zxzy)/(n-1)
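This formula can be sketched directly in code. The following is a minimal sketch in Python (the helper name `pearson_r` is made up for illustration; in the course itself MINITAB does this calculation):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient r = sum(zx*zy)/(n-1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    # sample standard deviations (divide by n-1)
    sx = sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
    sy = sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))
    # z scores for each variable
    zx = [(v - xbar) / sx for v in x]
    zy = [(v - ybar) / sy for v in y]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)
```

For perfectly linear data the function returns ±1, for example `pearson_r([1, 2, 3], [2, 4, 6])` gives 1.0.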
As a simple numerical example consider the following data (the x and y values are determined by the sums and deviations below):
x: 1 2 3 4 5
y: 1 3 2 2 5
and then we get:
∑x = 15
X̅ = 15/5 = 3
x-X̅: -2 -1 0 1 2
S2 = 10/4 = 2.5
s = √2.5 = 1.58
zx: -1.27 -0.63 0.00 0.63 1.27
∑y = 13
Y̅ = 13/5 = 2.6
y-Y̅: -1.6 0.4 -0.6 -0.6 2.4
S2 = 9.2/4 = 2.3
s = √2.3 = 1.52
zx: -1.27 -0.63 0.00 0.63 1.27
zy: -1.06 0.26 -0.40 -0.40 1.59
r = (∑zxzy)/(n-1) = ((-1.27)*(-1.06) + (-0.63)*0.26 + .. + 1.27*1.59)/4 = 0.73
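The worked example can be checked with a short script; the data values x = 1,…,5 and y = 1, 3, 2, 2, 5 are the ones determined by the sums and deviations above:

```python
from math import sqrt

x = [1, 2, 3, 4, 5]   # gives sum(x) = 15, x̄ = 3, deviations -2 .. 2
y = [1, 3, 2, 2, 5]   # gives sum(y) = 13, ȳ = 2.6
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sx = sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))  # sqrt(2.5), about 1.58
sy = sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))  # sqrt(2.3), about 1.52
zx = [(v - xbar) / sx for v in x]
zy = [(v - ybar) / sy for v in y]
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)      # about 0.73
```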
Note that, like any other quantity computed from data, the correlation coefficient can be two things:
• it is a statistic when it is found from a sample
• it is a parameter when it belongs to a population
In the first case we use the symbol r, in the second case we use ρ.
In MINITAB we can use
Stat > Basic Statistics > Correlation
to carry out the calculations. For example, we find that the correlation of "Day of the Year" and "Draft Number" is r=-0.226.
Properties of the Correlation Coefficient:
• always -1 ≤ r ≤ 1
• r close to 0 means very small or even no correlation (relationship)
• r close to ±1 means a very strong correlation
• r=-1 or r=1 means a perfect linear correlation (that is in the scatterplot the dots form a straight line)
• r<0 means a negative relationship (as x gets bigger y gets smaller)
• r>0 means a positive relationship (as x gets bigger y gets bigger)
• r treats x and y symmetrically, that is cor(x,y) = cor(y,x)
Pearson's correlation coefficient only measures linear relationships; it does not work if a relationship is nonlinear. As examples consider the following, all of which have clearly about the same strength of relationship:
Pearson's correlation coefficient is only useful for the first case. Another situation where Pearson's correlation coefficient does not work is if there are outliers in the dataset. Even a single outlier can drastically change the correlation coefficient:
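The outlier effect is easy to see in a small simulation. The sketch below (with `pearson_r` written out inline, since no library is assumed) generates 20 unrelated points and then adds one extreme point:

```python
import random
from math import sqrt

def pearson_r(x, y):
    # Pearson's r via deviations, equivalent to sum(zx*zy)/(n-1)
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
    sy = sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / ((n - 1) * sx * sy)

random.seed(1)
# 20 unrelated points: r should be close to 0
x = [random.random() for _ in range(20)]
y = [random.random() for _ in range(20)]
r_clean = pearson_r(x, y)
# one extreme point at (10, 10) drags r toward +1 all by itself
r_outlier = pearson_r(x + [10], y + [10])
```

On a typical run r_clean is small while r_outlier is close to 1, even though 20 of the 21 points have no relationship at all.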
Weak vs. no Correlation
It is important to keep two things separate: a situation with two variables which are uncorrelated (ρ=0) and two variables with a weak correlation (ρ≠0 but small). In either case we would find an r close to 0 (but never exactly 0!). Finding out which case it is might be impossible, especially for small datasets.
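This can be illustrated by simulation. The sketch below (assumptions: bivariate normal data, n = 20, and an arbitrarily chosen weak ρ = 0.15) draws many samples from each situation and compares the sample correlations:

```python
import random
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
    sy = sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / ((n - 1) * sx * sy)

def sample_r(rho, n, rng):
    # bivariate normal pair with population correlation rho
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * xi + sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]
    return pearson_r(x, y)

rng = random.Random(0)
r_none = [sample_r(0.0, 20, rng) for _ in range(500)]   # rho = 0
r_weak = [sample_r(0.15, 20, rng) for _ in range(500)]  # rho = 0.15, weak
# the two collections of r values overlap heavily, so a single
# sample of size 20 cannot tell the two situations apart
```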
App - correlation
This app from RolkeShinyApp illustrates the correlation coefficient.
Move the slider around to see different cases of the scatterplot of correlated variables.
Include a few outliers and see how that affects the "look" of the scatterplot and the sample correlation coefficient.
On the Histogram tab we can study the effect of changing ρ and/or n on the sample correlation r.
So, how about the draft? Well, we found r=-0.226. But of course the question is whether -0.226 is close to 0, close enough to conclude that all went well. Actually, the question really is whether the corresponding parameter ρ=0! Let's do a simulation:
• there are the numbers 1-366 in the order from 1 to 366 (in "Day of the Year")
• there are the numbers 1-366 in some random order (in "Draft Number")
In MINITAB we can do this as follows: get the numbers in Day of the Year in random order using
Calc > Random Data > Sample from Columns, Sample 366 rows from "Day of the Year", store in c2
Then we can find the correlation coefficient of "Day of the Year" and c2.
Now of course we should do this many, many times. We can see what happens again with the correlation app: choose 366 for the sample size and -0.22 for the correlation coefficient, then switch to the Histogram tab. As you can see, none of the simulations had a sample correlation as large as 0! So either
• The draft went fine, but something extremely unlikely happened (something with a probability less than 1 in 1000)
• Something went wrong in the draft.
A probability of less than 1 in 1000 is generally considered too unlikely, so we will conclude that something did go wrong.
So the next time you see a sample correlation coefficient r=-0.226, can you again conclude that the corresponding population correlation coefficient ρ≠0? Unfortunately no! For example, say that instead of using the day of the year the military had used the day of the month (1-31). Now choose 31 for the sample size and -0.22 for the correlation coefficient, then switch to the Histogram tab. As you can see, now about 23% of the simulations had a sample correlation as large as 0, so this would not be unusual.
In these calculations you always have to consider the sample size: for a large one (say 366) we can distinguish 0 from -0.226, for a small one (say 31) we cannot.
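The simulation described above can also be carried out directly in code. This is a sketch of the same idea outside MINITAB (with a different random seed the exact counts will vary slightly):

```python
import random
from math import sqrt

def pearson_r(x, y):
    # Pearson's r written with sums of squares; same value as sum(zx*zy)/(n-1)
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = sqrt(sum((v - xbar) ** 2 for v in x))
    sy = sqrt(sum((v - ybar) ** 2 for v in y))
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (sx * sy)

random.seed(0)
B = 1000

def frac_as_extreme(n):
    """Fraction of truly random lotteries with |r| at least 0.226."""
    days = list(range(1, n + 1))
    hits = 0
    for _ in range(B):
        draft = random.sample(days, n)  # a perfectly random lottery
        if abs(pearson_r(days, draft)) >= 0.226:
            hits += 1
    return hits / B

p366 = frac_as_extreme(366)  # essentially 0: r=-0.226 is very unusual
p31 = frac_as_extreme(31)    # roughly 0.2: r=-0.226 would not be unusual
```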
Of course what we have just done is in essence the following:
1) Parameter: Pearson's correlation coefficient ρ
2) Method: Test for Pearson's correlation coefficient ρ
3) Assumptions: Data comes from a normal distribution. Checked.
4) α = 0.05
5) H0: ρ=0 (no relationship between Day of Year and Draft Number)
6) Ha: ρ≠0 (some relationship between Day of Year and Draft Number)
7) p<1/1000 (from simulation)
8) p<α = 0.05, so we reject the null hypothesis.
9) There is a statistically significant relationship between Day of Year and Draft Number.
MINITAB actually does this test every time you calculate a correlation.
Here is a little more on the 1970's Military Draft
Say we have found a correlation between two variables x and y. One possible explanation is a Cause-Effect relationship. This implies that if we can "change" x we expect a change in y.
x = "hours you study for an exam"
y = "score on exam"
Example: Say we have the following data: for one year in some city we have data on the fires that happened during the year. Specifically we recorded
x = "Number of firemen responding to a fire"
y = "damages done by the fire"
Say there is a positive correlation between x and y (and in real life there will be!).
Now if the correlation is due to a Cause-Effect relationship, then changing x means changing y. Clearly we want a small y (little or no damage), and because the correlation is positive we can get that by making sure x is small. So never call the fire brigade!
If this does not make sense, there has to be another explanation for the positive correlation:
Under the latent variable explanation there is a third variable z (here, for example, the size of the fire) which drives both x and y. If all correlations are positive we find:
a small z leads to a small x and a small y, so we get pairs of small (x,y)
a large z leads to a large x and a large y, so we get pairs of large (x,y)
and so cor(x,y) comes out positive!
Online Resource: Bizarre Correlations
Consider again the example
x = "hours you study for an exam"
y = "score on exam"
but there are also many other factors that determine your score on an exam, such as
• general ability
• previous experience
• being healthy on the day of the exam
• exam anxiety
• having a hangover
There have been hundreds of studies all over the world that have shown a correlation between smoking rates and lung cancer deaths, usually with correlations of about 0.5 to 0.7. And yet, none of these studies has shown that smoking causes lung cancer, because all of them were observational studies, not clinical trials.
The only perfectly satisfactory way to establish causation is to do a randomized experiment, for example a clinical trial. An observational study is always somewhat suspect because we never know about hidden biases. Nevertheless, even using only observational studies the evidence for cause-effect can be quite strong:
Things to look for when trying to establish a causation:
• correlation is strong - the correlation between smoking and lung cancer is very strong
• correlation is consistent over many experiments - many studies of different kinds of people in different countries over a long time period all have shown this correlation
• higher doses are associated with stronger responses - people who smoke more have a higher chance of lung cancer
• the cause comes before the response in time - lung cancer develops after years of smoking. The number of men dying of lung cancer rose as smoking became more common, with a lag of about 30 years. Lung cancer kills more men than any other form of cancer. Lung cancer was rare among women until women started to smoke. Lung cancer in women rose along with smoking, again with a lag of about 30 years, and has now passed breast cancer as the leading cause of cancer deaths among women.
• the cause is plausible - lab experiments on animals show that nicotine causes cancer.
Clearly we want to know whether there is a relationship between the drug use of the mother and the length of the baby. So again we have two variables (Drug Use and Length) and we want to find their "correlation". But Drug Use is a categorical variable, so how can we calculate r? Here is an idea: let's code Drug Use as follows:
Drug Free = 1
1st Trimester = 2
Throughout = 3
This is easy to do with the Data > Code > Text to Numeric command. Then
Stat > Basic Statistics > Correlation gives r=-0.4111 (p-val=0.000)
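The coding step is easy to mimic outside MINITAB. In the sketch below the (Drug Use, Length) records are invented purely for illustration; they are NOT the actual study data, so only the direction of the resulting r, not its size, mirrors the r=-0.4111 above:

```python
from math import sqrt

# code the categorical variable as in the text
code = {"Drug Free": 1, "1st Trimester": 2, "Throughout": 3}

# hypothetical (Drug Use, Length in cm) records, made up for this sketch
records = [("Drug Free", 51), ("Drug Free", 50), ("1st Trimester", 49),
           ("1st Trimester", 48), ("Throughout", 47), ("Throughout", 46)]

x = [code[use] for use, _ in records]   # Text to Numeric step
y = [length for _, length in records]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sx = sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))
sy = sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))
r = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / ((n - 1) * sx * sy)
# with these made-up numbers r comes out strongly negative
```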
Strictly speaking this analysis is wrong: coding a categorical variable does not make it a quantitative one, and Pearson's correlation coefficient is meant for two quantitative variables. Nevertheless, on occasion we do this kind of thing anyway. In this specific example, though, there is a perfectly good method to analyze the data called Analysis of Variance. If you are interested, come to ESMA 3102.
For more on the correlation coefficient see section 4.1 of the textbook.
For each of the following datasets calculate the correlation coefficient:
∑x = 110.44
X̅ = 110.44/10 = 11.044
x-X̅: -3.974 3.226 4.136 -2.914 3.396 -1.054 0.926 1.766 -3.844 -1.664
S2 = 9.551
s = 3.091
zx: -1.286 1.044 1.338 -0.943 1.099 -0.341 0.300 0.571 -1.244 -0.538
∑y = 414.62
Y̅ = 414.62/10 = 41.462
y-Y̅: -12.842 8.548 16.078 -19.682 10.128 5.878 2.128 6.718 -14.222 -2.732
s = 11.93
zy: -1.077 0.717 1.348 -1.650 0.849 0.493 0.178 0.563 -1.192 -0.229
r = (∑zxzy)/(n-1) = 0.9154
∑x = 124
X̅ = 124/12 = 10.33
x-X̅: 2.67 0.67 -2.33 -4.33 -4.33 1.67 0.67 -2.33 6.67 -0.33 -3.33 4.67
S2 = 12.425
s = 3.52
zx: 0.8 0.2 -0.7 -1.2 -1.2 0.5 0.2 -0.7 1.9 -0.1 -0.9 1.3
∑y = 228
Y̅ = 228/12 = 19
y-Y̅: -9 3 8 1 8 -3 -2 0 -8 -1 4 -1
S2 = 28.55
s = 5.34
zy: -1.68 0.56 1.50 0.19 1.50 -0.56 -0.37 0.00 -1.50 -0.19 0.75 -0.19
r = (∑zxzy)/(n-1) = -0.765
X̅ = 10.93
x-X̅: 0.5 1.09 -1.99 -3.58 -3.03 0.15 2.58 -1.89 0.94 -5.05 4.74 0.05 3.64 -3.66 -3.55 -1.26 6.37 2.43 -0.67 2.21
S2 = 9.43
sx = 3.07
n = 20
Y̅ = 51.17
y-Y̅: -10.8 9.24 3.84 3.45 -27.98 14.08 4.5 2.08 -5.43 2.77 5.07 5.32 -1.46 -2.03 11.34 4.75 -21.79 6.61 -12.73 9.23
S2 = 118.35
sy = 10.88
zx = 0.16 0.35 -0.65 -1.17 -0.99 0.05 0.84 -0.62 0.31 -1.64 1.54 0.02 1.19 -1.19 -1.16 -0.41 2.07 0.79 -0.22 0.72
zy = -0.99 0.85 0.35 0.32 -2.57 1.29 0.41 0.19 -0.5 0.25 0.47 0.49 -0.13 -0.19 1.04 0.44 -2 0.61 -1.17 0.85
r = (∑zxzy)/(n-1) = -0.084