Two Quantitative Variables - Correlation

Generally, if we have two or more variables we are interested in their relationships. We want to investigate two questions:

  1. Is there a relationship?

  2. If there is a relationship, can we describe it?

If both variables are quantitative, for the first question we can find the correlation and for the second we can do a regression.

Case Study: Olympic Men’s Long Jump

Data on the gold medal winning performances in the men’s long jump for the modern Olympic games.

longjump
##              City Year LongJump
## 1          Athens 1896 249.7500
## 2           Paris 1900 282.8750
## 3        St.Louis 1904 289.0000
## 4          London 1908 294.5000
## 5       Stockholm 1912 299.2500
## 6         Antwerp 1920 281.5000
## 7           Paris 1924 293.1250
## 8       Amsterdam 1928 304.7500
## 9      LosAngeles 1932 300.7500
## 10         Berlin 1936 317.3125
## 11         London 1948 308.0000
## 12       Helsinki 1952 298.0000
## 13      Melbourne 1956 308.2500
## 14           Rome 1960 319.7500
## 15          Tokyo 1964 317.7500
## 16     MexicoCity 1968 350.5000
## 17         Munich 1972 324.5000
## 18       Montreal 1976 328.5000
## 19         Moscow 1980 336.2500
## 20     LosAngeles 1984 336.2500
## 21          Seoul 1988 343.2500
## 22      Barcelona 1992 341.5000
## 23        Atlanta 1996 334.7500
## 24         Sydney 2000 336.7500
## 25         Athens 2004 338.1880
## 26         Bejing 2008 328.3460
## 27         London 2012 327.1719
## 28 Rio de Janeiro 2016 329.9213

Both Year and LongJump are quantitative. Neither is very interesting by itself; the real interest lies in their relationship. What does the year tell us about the length of the jump?

Here we usually start by drawing a scatterplot.

attach(longjump)
splot(LongJump, Year)
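splot is part of the course's own R routines. If you only have base R, a sketch like the following reproduces the scatterplot (and computes the correlation we will meet below) with the data typed in from the table above; the distances appear to be in inches:

```r
# gold medal winning long jumps, typed in from the table above
Year <- c(1896, 1900, 1904, 1908, 1912, 1920, 1924, 1928, 1932, 1936,
          1948, 1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984,
          1988, 1992, 1996, 2000, 2004, 2008, 2012, 2016)
LongJump <- c(249.7500, 282.8750, 289.0000, 294.5000, 299.2500, 281.5000,
              293.1250, 304.7500, 300.7500, 317.3125, 308.0000, 298.0000,
              308.2500, 319.7500, 317.7500, 350.5000, 324.5000, 328.5000,
              336.2500, 336.2500, 343.2500, 341.5000, 334.7500, 336.7500,
              338.1880, 328.3460, 327.1719, 329.9213)
# base R scatterplot in place of the course's splot
plot(Year, LongJump, xlab = "Year", ylab = "Winning jump (inches)")
cor(Year, LongJump)  # strongly positive: jumps got longer over time
```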

Case Study: The 1970’s Military Draft

In 1970, Congress instituted a random selection process for the military draft. All 366 possible birth dates were placed in plastic capsules in a rotating drum and were selected one by one. The first date drawn from the drum received draft number one and eligible men born on that date were drafted first. In a truly random lottery there should be no relationship between the date and the draft number.

Question: was the draft really “random”?

head(draft[, 4:5])
##   Day.of.Year Draft.Number
## 1           1          305
## 2           2          159
## 3           3          251
## 4           4          215
## 5           5          101
## 6           6          224

Let’s have a look at the scatterplot of “Day.of.Year” and “Draft.Number”:

attach(draft)
splot(Draft.Number, Day.of.Year)

It certainly does not appear that there is a relationship between “Day of the Year” and “Draft Number”, but is this really true? What we want is a number that can tell us if there is a relationship between two quantitative variables, and if so how strong it is. Consider the following two examples:

Clearly in the case on the left we have a much stronger relationship than on the right. For example, if I knew x = 0.5, then on the left I could reasonably guess that y is between 0.4 and 0.6, whereas on the right I could only guess 0.1 to 1.0.

The most popular choice for such a number is Pearson’s correlation coefficient r, which we can find with the R command cor:

cor(Draft.Number, Day.of.Year)
## [1] -0.2260414

Correlations are usually rounded to three digits:

round(cor(Draft.Number, Day.of.Year), 3)
## [1] -0.226
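To see what cor actually computes, here is Pearson's r written out from its definition, \(r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1) s_x s_y}\), on a small made-up data set and checked against cor:

```r
# small invented data set, just for illustration
x <- c(1, 3, 4, 6, 8)
y <- c(2, 3, 5, 6, 9)
# Pearson's r from its definition
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  ((length(x) - 1) * sd(x) * sd(y))
c(r_manual, cor(x, y))  # the two values agree
```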

The correlation coefficient, like the mean, median, standard deviation, Q1, etc., comes in two versions:

  • it is a statistic when it is found from a sample
  • it is a parameter when it belongs to a population

In the first case we use the symbol r, in the second case we use \(\rho\).

Properties of the Correlation Coefficient

  • always \(-1 \le r \le 1\)

  • r close to 0 means very small or even no correlation (relationship)

  • r close to \(\pm 1\) means a very strong correlation

  • r = -1 or r = 1 means a perfect linear correlation (that is, in the scatterplot the dots form a straight line)

  • r > 0 means a positive relationship (as x gets bigger y gets bigger)

  • r < 0 means a negative relationship (as x gets bigger y gets smaller)

  • r treats x and y symmetrically, that is cor(x,y) = cor(y,x)
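Two of these properties are easy to check numerically on a small made-up data set (the numbers below are invented for illustration): r is symmetric in x and y, and it is unchanged if either variable is shifted or rescaled by a positive factor.

```r
x <- c(2, 5, 7, 8, 12)
y <- c(1, 4, 3, 9, 10)
c(cor(x, y), cor(y, x))           # symmetry: the same value twice
c(cor(x, y), cor(2 * x + 10, y))  # shifting / positive rescaling changes nothing
c(cor(x, y), cor(-x, y))          # flipping the sign of x flips the sign of r
```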

Pearson’s correlation coefficient only measures linear relationships; it does not work if a relationship is nonlinear.

Here is an example:

All four of these have about the same “strength” of relationship.

BUT

## cor(x, y1) = 0.986
## cor(x, y2) = 0.937
## cor(x, y3) = 0.88
## cor(x, y4) = -0.111

Pearson’s correlation coefficient is only useful for the first case.
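An extreme version of the last case: a perfect but purely nonlinear relationship can have a correlation of essentially 0. In this made-up example y is determined exactly by x, yet r is essentially 0 because the pattern is a symmetric parabola, not a line:

```r
x <- seq(-1, 1, length.out = 41)
y <- x^2             # y is a perfect (quadratic) function of x
round(cor(x, y), 3)  # essentially 0: no *linear* relationship
```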

Another situation where Pearson’s correlation coefficient does not work is if there are outliers in the data set. Even a single outlier can dominate the correlation coefficient:

## Correlation between x1 and y1:  -0.154
## Correlation between x2 and y2:  0.948
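A small made-up illustration of the outlier problem: five points whose correlation is exactly 0 acquire an almost perfect correlation once a single far-away point is added.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 3, 1, 2)
round(cor(x, y), 3)                # no linear relationship at all
round(cor(c(x, 50), c(y, 50)), 3)  # one outlier -> correlation close to 1
```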

Weak vs. no Correlation

It is important to keep two things separate: two variables which are uncorrelated (\(\rho = 0\)) and two variables with a weak correlation (\(\rho \ne 0\) but small).

In either case we would find an r close to 0 (but never exactly 0!). Telling the two cases apart might be impossible, especially for small data sets.

App - correlation

run.app(correlation)

This app illustrates the correlation coefficient.

Move the slider around to see scatterplots of correlated variables for different values of the correlation.

Include a few outliers and see how that affects the “look” of the scatterplot and the sample correlation coefficient.

On the Histogram tab we can study the effect of changing \(\rho\) and/or n on the sample correlation r.


Back to the draft

So, how about the draft? Well, we found r = -0.226. But of course the question is whether -0.226 is close enough to 0 to conclude that all went well. Actually, the question really is whether the corresponding parameter \(\rho = 0\)! Let’s do a simulation:

Simulation for the 1970’s Military Draft

Doing a simulation means teaching the computer to repeat the essential part of an experiment many times. Here the experiment is the draft. What are the important features of this experiment?

  • there are the numbers 1-366 in the order from 1 to 366 (in “Day.of.Year”)

  • there are the numbers 1-366 in some random order (in “Draft.Number”)

In R we can do this as follows:

  • get the numbers in Day.of.Year in random order with the sample command:
sample(Day.of.Year)
##   [1] 169 349 260 145  43  87 311 157 179 248 271  56 301 335 208 177  67
##  [18]   1 306  47   8 325 101  57 137 359 148 176 287  19 187 231 220 221
##  [35] 341 321 133 280 249 175 285 152   3 236 353  61 207 173 107 203 263
##  [52] 224 110  41 122 289 250 235 295 348 186   9 103  85  36  14 112 344
##  [69]  52 185 118 282 193  97  58  34 360 314 162 211 310 164   4 257 218
##  [86] 108  27 354 229 230 181  37  94 200 247 189  66 174  76 256 362 111
## [103] 350  25 217 138 269 106 308 190  90 242 165 305 151 105  20 132 239
## [120] 339 342  60 283 159 201 332 116 119  91 318 294 219  35 327 227 226
## [137]  81 238 262 316 212  78 120 121 243 267 291  75 246  13 184  46 240
## [154]  30  69 328  54 330 102 195 351  77  65 366  92  64 286 228 258 338
## [171] 178  59 276  39 153 266   6 163 129 206 126 156 264  72 104  18 234
## [188] 259 134 154 270 140  11 281  62 299 297 334  63 113 128 361 136  86
## [205]  82  68 292 273  48  44 223 170  23  79 278 205 160 255   2 326  12
## [222]  74 357 356 331 192 265  24 196 204 127 312 171 333 143  10 355 322
## [239] 155 345 202 320  15 293 210  53 284 194  50 209 365  70 182  80 213
## [256]  95 275 149 347 244 225 150 141 237  73  31  29  83 253   5 317 298
## [273] 251  26 337 180 191 309 364 125 117 252 319  88 290  16 147 144  89
## [290] 167 245 146 198 277 199 358  40  51 268 158  32 135 340 168 115 329
## [307] 109  45 183 343 188 172  55  93 254  98 272 124  22 324 130  84  42
## [324]  49 216 166 131  21 261 274  38 161  96 232 215 323  99  28 363 214
## [341] 100 313 296 336 139  71 307 114 123 288 304 241 300 279  17 142 233
## [358] 315  33 352 222   7 197 346 302 303
  • and then calculate the correlation with
cor(Day.of.Year, sample(Day.of.Year)) 
## [1] 0.01648229

Now of course we should do this many times:

cor(Day.of.Year, sample(Day.of.Year)) 
## [1] -0.08308161
cor(Day.of.Year, sample(Day.of.Year)) 
## [1] -0.06702214
cor(Day.of.Year, sample(Day.of.Year)) 
## [1] 0.01222493
cor(Day.of.Year, sample(Day.of.Year)) 
## [1] 0.1354924
cor(Day.of.Year, sample(Day.of.Year)) 
## [1] -0.01182866

So far none of our five tries has produced a correlation as far from 0 as 0.226. But maybe we just need to repeat the experiment many more times. Here is how:

r <- rep(0, 10000)
for(i in 1:10000)
  r[i] <- cor(Day.of.Year, sample(Day.of.Year))  
hplot(r)

length(r[abs(r)>0.226])
## [1] 0

As you can see, none of the 10000 simulations had a sample correlation as far from 0 as -0.226! So either

  • The draft went fine, but something extremely unlikely happened (something with a probability less than 1 in 10000)

  • Something went wrong in the draft.

A probability of less than 1 in 10000 is generally considered too unlikely, so we will conclude that something did go wrong.
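The simulation loop can also be written in one line with replicate. A self-contained sketch (assumptions: 1:366 stands in for Day.of.Year, set.seed makes the run reproducible, and base R's hist replaces the course function hplot):

```r
set.seed(1970)
# 10000 draws of the correlation between the dates in order
# and the dates in a random order
r <- replicate(10000, cor(1:366, sample(366)))
hist(r)
mean(abs(r) > 0.226)  # proportion at least as extreme as the observed -0.226
```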

So the next time you see a sample correlation coefficient r = -0.226, can you again conclude that the corresponding population correlation coefficient \(\rho \ne 0\)? Unfortunately no! For example, say that instead of using the day of the year (1-366) the military had used the day of the month (1-31). Now we redo the simulation:

r <- rep(0, 10000)
for(i in 1:10000)
  r[i] <- cor(1:31, sample(1:31)) 
hplot(r)

length(r[abs(r)>0.226])
## [1] 2142

As you can see, now quite a few of the simulations (2142 out of 10000, about 21%) had a sample correlation as far from 0 as -0.226, so such a value would not be unusual at all.
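The percentage found in the simulation can be anticipated without simulating: if the two variables are independent, the sample correlation from n pairs is approximately normal with mean 0 and standard deviation \(1/\sqrt{n-1}\) (a standard approximation, used here as an assumption). A quick sketch of this back-of-the-envelope check:

```r
n <- 31
sd_r <- 1 / sqrt(n - 1)           # approximate sd of r under independence
p31 <- 2 * pnorm(-0.226 / sd_r)   # P(|r| > 0.226) for n = 31
round(p31, 3)                     # roughly 0.22, matching the simulation
p366 <- 2 * pnorm(-0.226 * sqrt(365))  # same calculation for n = 366
round(p366, 6)                    # essentially 0
```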

In a while we will learn more about doing simulations, as well as how to decide whether something is or is not statistically significant.

Correlation vs. Causation

Say we have found correlation between variables “x” and “y”. How can we understand and interpret that relationship?

One possible explanation is a Cause-Effect relationship. This implies that if we can “change” x we expect a change in y.

Example

x = “hours you study for an exam”

y = “score on exam”

Example

Say we have the following data: for one year in some city we have data on fires that happened during the year. Specifically we recorded

x = “Number of fireman responding to a fire”

y = “damages done by the fire”

Say there is a positive correlation between x and y (and in real life there will be!).

Now if the correlation is due to a cause-effect relationship, then changing x means changing y. Clearly we want a small y (little or no damage), and because the correlation is positive we can get that by making sure x is small. So never call the fire brigade!

If this does not make sense, there has to be another explanation for the positive correlation:

Under the confounding variable explanation there is a third variable z (here: the size of the fire) that influences both x and y. If all correlations are positive we find:

small z leads to small x and small y, so we get pairs of small (x, y)

large z leads to large x and large y, so we get pairs of large (x, y)

Finally cor(x, y) is positive!
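The confounding mechanism is easy to mimic with made-up numbers: below, z (standing in for the size of the fire) drives both x and y, which are otherwise unrelated noise, and yet x and y end up strongly positively correlated. All numbers here are invented for illustration:

```r
set.seed(17)
z <- runif(100, 0, 10)         # confounder, e.g. size of the fire
x <- z + rnorm(100, 0, 1)      # "number of firemen", driven by z plus noise
y <- 2 * z + rnorm(100, 0, 2)  # "damage", also driven by z plus noise
round(cor(x, y), 3)            # clearly positive, although x does not cause y
```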

[Online Resource: Bizarre Correlations](http://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations)


Please note: saying x causes y is not the same as saying x by itself determines y.

There are usually many other factors besides x that influence y, maybe even some more important than x.

Example

x = “hours you study for an exam”

y = “score on exam”

but there are also many other factors that determine your score in an exam such as

  • general ability
  • previous experience
  • being healthy on the day of the exam
  • exam anxiety
  • having a hang-over
  • etc.

Case Study: Smoking and Lung Cancer

There have been hundreds of studies all over the world that have shown a correlation between smoking rates and lung cancer deaths, usually with correlations of about 0.5 to 0.7. And yet, strictly speaking none of these studies has proven that smoking causes lung cancer, because all of them were observational studies, not clinical trials.

The only fully satisfactory way to establish causation is a randomized comparative experiment, for example a clinical trial. An observational study is always somewhat suspect because we never know about hidden biases. Nevertheless, even with only observational studies the evidence for cause-effect can be quite strong:

Things to look for when trying to establish causation:

  • correlation is strong - the correlation between smoking and lung cancer is very strong

  • correlation is consistent over many experiments - many studies of different kinds of people in different countries over a long time period all have shown this correlation

  • higher doses are associated with stronger responses - people who smoke more have a higher chance of lung cancer

  • the cause comes before the response in time - lung cancer develops after years of smoking. The number of men dying of lung cancer rose as smoking became more common, with a lag of about 30 years. Lung cancer kills more men than any other form of cancer. Lung cancer was rare among women until women started to smoke. Lung cancer in women rose along with smoking, again with a lag of about 30 years, and has now passed breast cancer as the leading cause of cancer deaths among women.

  • the cause is plausible - lab experiments on animals show that the tars in cigarette smoke cause cancer.

Case Study: Old Faithful Geyser

Old Faithful is a cone geyser located in Yellowstone National Park in Wyoming, United States.

head(faithful)
##   Eruptions Waiting.Time
## 1     3.600           79
## 2     1.800           54
## 3     3.333           74
## 4     2.283           62
## 5     4.533           85
## 6     2.883           55
attach(faithful)
splot(Eruptions, Waiting.Time)

round(cor(Eruptions, Waiting.Time), 3)
## [1] 0.901

If the data set looks like this (just two quantitative columns), then you can simply find

round(cor(faithful), 3)
##              Eruptions Waiting.Time
## Eruptions        1.000        0.901
## Waiting.Time     0.901        1.000

Notice the curious hole in the middle of the scatterplot; it has geological reasons.
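The same numbers can be reproduced with the faithful data set that ships with base R (there the columns are named eruptions and waiting rather than Eruptions and Waiting.Time):

```r
data(faithful)  # built-in version of the Old Faithful data
round(cor(faithful), 3)
# correlation between eruption duration and waiting time
round(cor(faithful$eruptions, faithful$waiting), 3)
```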