Correlation

Generally, if there are two or more variables we are interested in their relationships. We want to investigate two questions:

1) Is there a relationship?
2) If there is a relationship, can we describe it?

If both variables are quantitative, for the first question we can find the correlation and for the second we can do a regression.


Case Study: The 1970 Military Draft

Data set: draft

In 1970, Congress instituted a random selection process for the military draft. All 366 possible birth dates were placed in plastic capsules in a rotating drum and were selected one by one. The first date drawn from the drum received draft number one and eligible men born on that date were drafted first. In a truly random lottery there should be no relationship between the date and the draft number.

Question: was the draft really "random"?

Let's have a look at the scatterplot of "Day of the Year" and "Draft Number".
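For readers working outside MINITAB, here is a minimal Python sketch that draws this scatterplot. The file name draft.csv and the exact column names are assumptions about how the data set is stored.

import pandas as pd
import matplotlib.pyplot as plt

# hypothetical file and column names; adjust to your copy of the draft data
draft = pd.read_csv("draft.csv")
plt.scatter(draft["Day of Year"], draft["Draft Number"], s=10)
plt.xlabel("Day of the Year")
plt.ylabel("Draft Number")
plt.show()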

It certainly does not appear that there is a relationship between "Day of the Year" and "Draft Number", but is this really true?

What we want is a number that can tell us whether there is a relationship between two variables, and if so how strong it is. Consider the following two examples:

Clearly in the case on the left we have a much stronger relationship than on the right. For example, if I knew x=0.5, then on the left I could reasonably guess that y is between 0.4 and 0.6, whereas on the right I could only guess somewhere between 0.1 and 0.9.

The most popular choice for such a number is Pearson's correlation coefficient r.

Computation

Using what we have learned before, the formula for Pearson's correlation coefficient r is actually very simple. First recall the z scores:

z = (x-X̅)/s

Of course now we have two variables x and y, so there are two z scores:

zx = (x-X̅)/sx and zy = (y-Y̅)/sy

Then

r = (∑zxzy)/(n-1)

As a simple numerical example consider the following:
x 1 2 3 4 5
y 1 3 2 2 5


and then we get:

For x:
n=5
∑x = 15
X̅=15/5=3

x-X̅: -2 -1 0 1 2

S2 = (4+1+0+1+4)/4 = 10/4 = 2.5

s = √2.5 = 1.58

zx: -1.27 -0.63 0.00 0.63 1.27

For y:
∑y = 13
Y̅=13/5=2.6

y-Y̅: -1.6 0.4 -0.6 -0.6 2.4

S2 = (2.56+0.16+0.36+0.36+5.76)/4 = 9.2/4 = 2.3

s = √2.3 = 1.51

zx: -1.27   -0.63   0.00   0.63   1.27
zy: -1.06    0.26  -0.40  -0.40   1.59

r = (∑zxzy)/(n-1) = (-1.27*-1.06 + -0.63*0.26 + … + 1.27*1.59)/4 = 0.73
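As a check, here is the same computation in a few lines of Python; nothing is assumed beyond the five data pairs above:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 3, 2, 2, 5])
n = len(x)

# z scores, using the sample standard deviation (ddof=1 divides by n-1)
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

print(round(np.sum(zx * zy) / (n - 1), 2))   # 0.73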


The correlation coefficient is like the mean, median, standard deviation, Q1 etc.: it comes in two versions:

• it is a statistic when it is found from a sample
• it is a parameter when it belongs to a population

In the first case we use the symbol r, in the second case we use ρ.

In MINITAB we can use

Stat > Basic Statistics > Correlation

to carry out the calculations. For example, we find that the correlation of "Day of the Year" and "Draft Number" is r=-0.226.
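Outside of MINITAB the same number is one function call away; again draft.csv and the column names are assumptions:

import pandas as pd
import numpy as np

draft = pd.read_csv("draft.csv")   # hypothetical file name
r = np.corrcoef(draft["Day of Year"], draft["Draft Number"])[0, 1]
print(round(r, 3))                 # -0.226 for the draft data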

Properties of the Correlation Coefficient:

• always -1 ≤ r ≤ 1
• r close to 0 means very small or even no correlation (relationship)
• r close to ±1 means a very strong correlation
• r=-1 or r=1 means a perfect linear correlation (that is in the scatterplot the dots form a straight line)
• r<0 means a negative relationship (as x gets bigger y gets smaller)
• r>0 means a positive relationship (as x gets bigger y gets bigger)
• r treats x and y symmetrically, that is cor(x,y) = cor(y,x) (see the quick check below)
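A quick numerical check of the last three properties; the data here are arbitrary, chosen only for illustration:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1                            # a perfect linear relationship

print(np.corrcoef(x, y)[0, 1])           # 1.0: perfect positive linear correlation
print(np.corrcoef(x, -y)[0, 1])          # -1.0: perfect negative
print(np.isclose(np.corrcoef(x, y)[0, 1],
                 np.corrcoef(y, x)[0, 1]))   # True: cor(x,y) = cor(y,x)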

Pearson's correlation coefficient only measures linear relationships; it does not work if a relationship is nonlinear. As examples consider the following, all of which clearly show about the same strength of relationship:

Pearson's correlation coefficient is only useful in the first case. Another situation where Pearson's correlation coefficient does not work is if there are outliers in the dataset. Even a single outlier can essentially determine the correlation coefficient, as the sketch below illustrates:
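A small simulated illustration of the outlier problem; the numbers are made up, only the pattern matters:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 50)
y = x + rng.normal(0, 0.1, 50)            # a strong positive relationship
print(round(np.corrcoef(x, y)[0, 1], 2))  # close to 1

x_out = np.append(x, 10.0)                # add a single extreme outlier ...
y_out = np.append(y, -10.0)
print(round(np.corrcoef(x_out, y_out)[0, 1], 2))  # ... and r becomes strongly negative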

Weak vs. no Correlation

It is important to keep two things separate: a situation with two variables which are uncorrelated (ρ=0) and two variables with a weak correlation (ρ≠0 but small). In either case we would find an r close to 0 (but never exactly 0!). Finding out which case it is might be impossible, especially for small datasets.

Back to the draft

So, how about the draft? Well, we found r=-0.226. But of course the question is whether -0.226 is close enough to 0 to conclude that all went well. Actually, the question really is whether the corresponding parameter ρ=0! Let's do a simulation:

Simulation for the 1970 Military Draft

Doing a simulation means teaching the computer to repeat the essential part of an experiment many times. Here the experiment is the draft. What are the important features of this experiment?

• there are the numbers 1-366 in the order from 1 to 366 (in "Day of the Year")

• there are the numbers 1-366 in some random order (in "Draft Number")

In MINITAB we can do this as follows: get the numbers in Day of the Year in random order using

Calc > Random Data > Sample from Columns, Sample 366 rows from "Day of the Year", store in c2

Then we can find the correlation coefficient of "Day of the Year" and c2.

The MINITAB Macro draftsim repeats the steps above 1000 times.

%k:\3101\draftsim 'Day of Year'
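For readers without MINITAB, here is a rough Python sketch of what draftsim does; the function name draft_sim and all the defaults are my own choices:

import numpy as np

def draft_sim(n=366, r_obs=-0.226, runs=1000, seed=0):
    """Shuffle the numbers 1..n `runs` times and report the proportion of
    runs whose correlation with 1..n is at least as far from 0 as r_obs."""
    rng = np.random.default_rng(seed)
    days = np.arange(1, n + 1)
    rs = np.empty(runs)
    for i in range(runs):
        rs[i] = np.corrcoef(days, rng.permutation(days))[0, 1]
    return np.mean(np.abs(rs) >= abs(r_obs))

print(draft_sim())   # essentially 0: no run gets as far from 0 as -0.226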


Here is the result of one such run:

The macro also calculates the percentage of runs with a correlation coefficient farther from 0 than -0.226, shown in the session window. Of the 1000 simulation runs none had an r as far from 0 as -0.226. This means one of two things happened:

• The draft went fine, but something extremely unlikely happened (something with a probability less than 1 in 1000)
• Something went wrong in the draft.

A probability of less than 1 in 1000 is generally considered too unlikely, so we will conclude that something did go wrong.

So the next time you see a sample correlation coefficient r=-0.226, can you again conclude that the corresponding population correlation coefficient ρ≠0? Unfortunately, no! For example, say that instead of the day of the year the military had used the day of the month (1-31). Now it looks as follows:

Make a column called "Day of Month" with the numbers 1-31
hit CTRL-L, type (or copy-paste): %k:\3101\draftsim 'Day of Month'
Here is the result of one such run:

and we see that now about 22% of the simulation runs have |r|>0.226! In these calculations you always have to consider the sample size: for a large one (say 366) we can distinguish 0 from -0.226, but for a small one (say 31) we cannot.
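With the draft_sim sketch from above this is just a change of one argument:

print(draft_sim(n=31))   # roughly 0.22, matching the 22% of runs above

This is no accident: under the null hypothesis the standard deviation of r is about 1/√(n-1), so with n=366 a value of -0.226 is more than 4 standard deviations away from 0, while with n=31 it is only about 1.2.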

Of course what we have just done is in essence the following:

1) Parameter: Pearson's correlation coefficient ρ
2) Method: Test for Pearson's correlation coefficient ρ
3) Assumptions: Data come from a normal distribution. Checked.
4) α = 0.05
5) H0: ρ=0 (no relationship between Day of Year and Draft Number)
6) Ha: ρ≠0 (some relationship between Day of Year and Draft Number)
7) p<1/1000 (from simulation)
8) p<α = 0.05, so we reject the null hypothesis.
9) There is a statistically significant relationship between Day of Year and Draft Number.

MINITAB actually does this test every time you calculate a correlation.
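The Python equivalent is scipy.stats.pearsonr, which returns both r and a p-value for the test of ρ=0 (file and column names assumed as before):

import pandas as pd
from scipy.stats import pearsonr

draft = pd.read_csv("draft.csv")   # hypothetical file name
r, p = pearsonr(draft["Day of Year"], draft["Draft Number"])
print(r, p)   # r = -0.226, p far below α = 0.05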

Here is a little more on the 1970 Military Draft

Correlation vs. Causation


Say we have found a correlation between the variables "x" and "y". How can we understand and interpret that relationship?

One possible explanation is a Cause-Effect relationship. This implies that if we can "change" x we expect a change in y.

Example

x = "hours you study for an exam"

y = "score on exam"

Example: Say we have the following data: for one year in some city we have data on the fires that happened during the year. Specifically we recorded

x = "Number of fireman responding to a fire"
y = "damages done by the fire"

Say there is a positive correlation between x and y (and in real life there will be!).

Now if the correlation is due to a cause-effect relationship, then changing x means changing y. Clearly we want a small y (little or no damage), and because the correlation is positive we can get that by making sure x is small. So never call the fire brigade!

Since this conclusion clearly does not make sense, there has to be another explanation for the positive correlation:

Under the latent variable explanation there is a third, hidden variable z that drives both x and y (in the fire example, z = "size of the fire"). If all correlations are positive we find:

small z leads to small x and small y, so we get pairs of small (x,y)

large z leads to large x and large y, so we get pairs of large (x,y)

So finally cor(x,y) is positive!
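A small simulation makes the mechanism concrete: z is the hidden variable (say, the size of the fire), and x and y each depend on z but not on each other. The coefficients here are arbitrary:

import numpy as np

rng = np.random.default_rng(42)
z = rng.uniform(0, 1, 1000)             # latent variable, e.g. size of the fire
x = 5 * z + rng.normal(0, 0.5, 1000)    # "number of firemen": driven by z only
y = 10 * z + rng.normal(0, 1.0, 1000)   # "damages": also driven by z only

print(round(np.corrcoef(x, y)[0, 1], 2))  # clearly positive, with no x -> y effect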

Online Resource: Bizarre Correlations


Please note that saying "x causes y" is not the same as saying "x determines y". There are usually many other factors besides x that influence y, some maybe even more important than x.

Example

x = "hours you study for an exam

y = "score on exam"

but there are also many other factors that determine your score in an exam such as

• general ability
• previous experience
• being healthy on the day of the exam
• exam anxiety
• having a hang-over
• etc.

Case Study: Smoking and Lung Cancer

There have been hundreds of studies all over the world that have shown a correlation between smoking rates and lung cancer deaths, usually with correlations of about 0.5 to 0.7. And yet, none of these studies has shown that smoking causes lung cancer, because all of them were observational studies, not clinical trials.

The only perfectly satisfactory way to establish causation is to do a randomized experiment, for example a clinical trial. An observational study is always somewhat suspect because we never know about hidden biases. Nevertheless, even using only observational studies the evidence for a cause-effect relationship can be quite strong:

Things to look for when trying to establish a causation:

•   correlation is strong - the correlation between smoking and lung cancer is very strong
•   correlation is consistent over many experiments - many studies of different kinds of people in different countries over a long time period all have shown this correlation
•   higher doses are associated with stronger responses - people who smoke more have a higher chance of lung cancer
•   the cause comes before the response in time - lung cancer develops after years of smoking. The number of men dying of lung cancer rose as smoking became more common, with a lag of about 30 years. Lung cancer kills more men than any other form of cancer. Lung cancer was rare among women until women started to smoke. Lung cancer in women rose along with smoking, again with a lag of about 30 years, and has now passed breast cancer as the leading cause of cancer deaths among women.
•   the cause is plausible - lab experiments on animals show that nicotine causes cancer.

Case Study: Drug Use of Mothers and the Health of the Newborn

Data set: cocain

Clearly we want to know whether there is a relationship between the drug use of the mother and the length of the baby. So again we have two variables (Drug Use and Length) and we want to find their "correlation". But Drug Use is a categorical variable, so how can we calculate r? Here is an idea: let's code Drug Use as follows:

Drug Free = 1
1st Trimester = 2
Throughout = 3

This is easy to do with the Data > Code > Text to Numeric command. Then

Stat > Basic Statistics > Correlation gives r=-0.4111 (p-val=0.000)
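The same two steps in Python; the file name cocain.csv and the column names are assumptions about the data set's layout:

import pandas as pd
from scipy.stats import pearsonr

babies = pd.read_csv("cocain.csv")   # hypothetical file and column names
codes = {"Drug Free": 1, "1st Trimester": 2, "Throughout": 3}
babies["Drug Use Coded"] = babies["Drug Use"].map(codes)

r, p = pearsonr(babies["Drug Use Coded"], babies["Length"])
print(round(r, 4), round(p, 4))   # should reproduce r = -0.4111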

Strictly speaking this analysis is wrong: coding a categorical variable does not make it a quantitative one, and Pearson's correlation coefficient is meant for two quantitative variables. Nevertheless, on occasion we do this kind of thing anyway. In this specific example, though, there is a perfectly good method to analyze the data called Analysis of Variance. If you are interested, come to ESMA 3102.

For more on the correlation coefficient see section 4.1 of the textbook.

________________________________________________________________

Practice Exercise

For each of the following datasets calculate the correlation coefficient:

a)
x 7.07 14.27 15.18 8.13 14.44 9.99 11.97 12.81 7.20 9.38
y 28.62 50.01 57.54 21.78 51.59 47.34 43.59 48.18 27.24 38.73

b)

x 13 11 8 6 6 12 11 8 17 10 7 15
y 10 22 27 20 27 16 17 19 11 18 23 18

c)

x 11.43 12.02 8.94 7.35 7.90 11.08 13.51 9.04 11.87 5.88 15.67 10.98 14.57 7.27 7.38 9.67 17.30 13.36 10.26 13.14
y 40.37 60.41 55.01 54.62 23.19 65.25 55.67 53.25 45.74 53.94 56.24 56.49 49.71 49.14 62.51 55.92 29.38 57.78 38.44 60.40
 

____________________________________________________________
____________________________________________________________

a)

x 7.07 14.27 15.18 8.13 14.44 9.99 11.97 12.81 7.20 9.38
y 28.62 50.01 57.54 21.78 51.59 47.34 43.59 48.18 27.24 38.73

n=10

∑x = 110.44

X̅ = 110.44/10 = 11.044

x-X̅: -3.974 3.226 4.136 -2.914 3.396 -1.054 0.926 1.766 -3.844 -1.664

S2 = 9.551

s = 3.091

zx: -1.286 1.044 1.338 -0.943 1.099 -0.341 0.300 0.571 -1.244 -0.538

∑y = 414.62

Y̅ = 414.62/10 = 41.462

y-Y̅: -12.842 8.548 16.078 -19.682 10.128 5.878 2.128 6.718 -14.222 -2.732

S2 =142.265

s = 11.93

zy: -1.077 0.717 1.348 -1.650 0.849 0.493 0.178 0.563 -1.192 -0.229

r = (∑zxzy)/(n-1) = 0.9154

b)

x 13 11 8 6 6 12 11 8 17 10 7 15
y 10 22 27 20 27 16 17 19 11 18 23 18

n=12

∑x = 124

X̅ = 124/12 = 10.33

x-X̅: 2.67 0.67 -2.33 -4.33 -4.33 1.67 0.67 -2.33 6.67 -0.33 -3.33 4.67

S2 = 12.425

s = 3.52

zx: 0.8 0.2 -0.7 -1.2 -1.2 0.5 0.2 -0.7 1.9 -0.1 -0.9 1.3

∑y = 228

Y̅ = 228/12 = 19

y-Y̅: -9 3 8 1 8 -3 -2 0 -8 -1 4 -1

S2 = 28.55

s = 5.34

zy: -1.68 0.56 1.50 0.19 1.50 -0.56 -0.37 0.00 -1.50 -0.19 0.75 -0.19

r = (∑zxzy)/(n-1) = -0.765


c)

x 11.43 12.02 8.94 7.35 7.90 11.08 13.51 9.04 11.87 5.88 15.67 10.98 14.57 7.27 7.38 9.67 17.30 13.36 10.26 13.14
y 40.37 60.41 55.01 54.62 23.19 65.25 55.67 53.25 45.74 53.94 56.24 56.49 49.71 49.14 62.51 55.92 29.38 57.78 38.44 60.40
n = 20

X̅ = 10.93

x-X̅: 0.5 1.09 -1.99 -3.58 -3.03 0.15 2.58 -1.89 0.94 -5.05 4.74 0.05 3.64 -3.66 -3.55 -1.26 6.37 2.43 -0.67 2.21

S2 = 9.43

sx = 3.07

n = 20

Y̅ = 51.17

y-Y̅: -10.8 9.24 3.84 3.45 -27.98 14.08 4.5 2.08 -5.43 2.77 5.07 5.32 -1.46 -2.03 11.34 4.75 -21.79 6.61 -12.73 9.23

S2 = 118.35

sy = 10.88

zx = 0.16 0.35 -0.65 -1.17 -0.99 0.05 0.84 -0.62 0.31 -1.64 1.54 0.02 1.19 -1.19 -1.16 -0.41 2.07 0.79 -0.22 0.72

zy = -0.99 0.85 0.35 0.32 -2.57 1.29 0.41 0.19 -0.5 0.25 0.47 0.49 -0.13 -0.19 1.04 0.44 -2 0.61 -1.17 0.85

r = (∑zxzy)/(n-1) = -0.084
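As a quick check, np.corrcoef reproduces all three answers up to the rounding in the hand computations above:

import numpy as np

xa = [7.07, 14.27, 15.18, 8.13, 14.44, 9.99, 11.97, 12.81, 7.20, 9.38]
ya = [28.62, 50.01, 57.54, 21.78, 51.59, 47.34, 43.59, 48.18, 27.24, 38.73]
xb = [13, 11, 8, 6, 6, 12, 11, 8, 17, 10, 7, 15]
yb = [10, 22, 27, 20, 27, 16, 17, 19, 11, 18, 23, 18]
xc = [11.43, 12.02, 8.94, 7.35, 7.90, 11.08, 13.51, 9.04, 11.87, 5.88,
      15.67, 10.98, 14.57, 7.27, 7.38, 9.67, 17.30, 13.36, 10.26, 13.14]
yc = [40.37, 60.41, 55.01, 54.62, 23.19, 65.25, 55.67, 53.25, 45.74, 53.94,
      56.24, 56.49, 49.71, 49.14, 62.51, 55.92, 29.38, 57.78, 38.44, 60.40]

for x, y in [(xa, ya), (xb, yb), (xc, yc)]:
    print(round(np.corrcoef(x, y)[0, 1], 2))   # approx. 0.92, -0.76, -0.08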