Generally, when we have two or more variables we are interested in their relationship. We want to investigate two questions:
- Is there a relationship?
- If there is a relationship, can we describe it?
If both variables are quantitative, we can answer the first question by finding the correlation and the second by doing a regression.
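In R, the two questions correspond to the functions cor and lm. A minimal sketch with made-up simulated data (the seed and the numbers are invented for illustration):

```r
set.seed(17)                      # arbitrary seed, for reproducibility
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20)      # y depends linearly on x, plus noise

cor(x, y)        # question 1: is there a relationship? (here: strong, positive)
coef(lm(y ~ x))  # question 2: describe it with a least-squares regression line
```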
Here are the gold medal winning performances (in inches) in the men’s long jump for the modern Olympic Games.
longjump
## City Year LongJump
## 1 Athens 1896 249.7500
## 2 Paris 1900 282.8750
## 3 St.Louis 1904 289.0000
## 4 London 1908 294.5000
## 5 Stockholm 1912 299.2500
## 6 Antwerp 1920 281.5000
## 7 Paris 1924 293.1250
## 8 Amsterdam 1928 304.7500
## 9 LosAngeles 1932 300.7500
## 10 Berlin 1936 317.3125
## 11 London 1948 308.0000
## 12 Helsinki 1952 298.0000
## 13 Melbourne 1956 308.2500
## 14 Rome 1960 319.7500
## 15 Tokyo 1964 317.7500
## 16 MexicoCity 1968 350.5000
## 17 Munich 1972 324.5000
## 18 Montreal 1976 328.5000
## 19 Moscow 1980 336.2500
## 20 LosAngeles 1984 336.2500
## 21 Seoul 1988 343.2500
## 22 Barcelona 1992 341.5000
## 23 Atlanta 1996 334.7500
## 24 Sydney 2000 336.7500
## 25 Athens 2004 338.1880
## 26 Beijing 2008 328.3460
## 27 London 2012 327.1719
## 28 Rio de Janeiro 2016 329.9213
Both Year and LongJump are quantitative. Neither is very interesting by itself; the real interest is in their relationship. What does the year tell us about the length of the jump?
Here we usually start by drawing a scatterplot.
attach(longjump)
splot(LongJump, Year)
In 1970, Congress instituted a random selection process for the military draft. All 366 possible birth dates were placed in plastic capsules in a rotating drum and were selected one by one. The first date drawn from the drum received draft number one and eligible men born on that date were drafted first. In a truly random lottery there should be no relationship between the date and the draft number.
Question: was the draft really “random”?
head(draft[, 4:5])
## Day.of.Year Draft.Number
## 1 1 305
## 2 2 159
## 3 3 251
## 4 4 215
## 5 5 101
## 6 6 224
Let’s have a look at the scatterplot of “Day.of.Year” and “Draft.Number”:
attach(draft)
splot(Draft.Number, Day.of.Year)
It certainly does not appear that there is a relationship between “Day of the Year” and “Draft Number”, but is this really true? What we want is a number that can tell us if there is a relationship between two quantitative variables, and if so how strong it is. Consider the following two examples:
Clearly in the case on the left we have a much stronger relationship than on the right. For example, if I knew x = 0.5, then on the left I could reasonably guess that y is between 0.4 and 0.6, whereas on the right I could only guess 0.1 to 1.0.
The most popular choice for such a number is Pearson’s correlation coefficient r, which we can find with the R command cor:
cor(Draft.Number, Day.of.Year)
## [1] -0.2260414
Correlations are usually rounded to three digits:
round(cor(Draft.Number, Day.of.Year), 3)
## [1] -0.226
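By definition, r is (up to the divisor n - 1) the sum of the products of the standardized x and y values. A quick sketch verifying this against cor, using made-up numbers:

```r
x <- c(1, 3, 4, 6, 8)                    # made-up data
y <- c(2, 3, 5, 8, 9)
n <- length(x)
r <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))
c(byhand = r, builtin = cor(x, y))       # the two agree
```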
The correlation coefficient, like the mean, median, standard deviation, Q1 etc., comes in two versions: as a statistic computed from a sample and as a parameter describing the whole population. In the first case we use the symbol r, in the second case we use \(\rho\).
- always \(-1 \le r \le 1\)
- r close to 0 means a very weak or even no correlation (relationship)
- r = -1 or r = 1 means a perfect linear correlation (that is, in the scatterplot the dots form a straight line)
- r > 0 means a positive relationship (as x gets bigger, y gets bigger)
- r < 0 means a negative relationship (as x gets bigger, y gets smaller)
- r treats x and y symmetrically, that is cor(x, y) = cor(y, x)
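These properties are easy to check directly (made-up numbers):

```r
x <- 1:10
cor(x, 3 + 2 * x)            # perfect increasing line: r = 1
cor(x, 5 - 2 * x)            # perfect decreasing line: r = -1
c(cor(x, x^3), cor(x^3, x))  # symmetry: both orders give the same value
```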
Pearson’s correlation coefficient only measures the strength of a linear relationship; it does not work if the relationship is nonlinear.
Here is an example:
All four of these have about the same “strength” of relationship.
BUT
## cor(x, y1) = 0.986
## cor(x, y2) = 0.937
## cor(x, y3) = 0.88
## cor(x, y4) = -0.111
Pearson’s correlation coefficient is only useful for the first case.
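An extreme version of this: with made-up data where y is completely determined by x, but the relationship is not linear, r can still be essentially 0:

```r
x <- seq(-1, 1, length = 101)   # made-up data, symmetric around 0
y <- x^2                        # perfect relationship, but not linear
round(cor(x, y), 3)             # essentially 0
```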
Another situation where Pearson’s correlation coefficient does not work is when there are outliers in the data set. Even a single outlier can completely change the correlation coefficient:
## Correlation between x1 and y1: -0.154
## Correlation between x2 and y2: 0.948
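A sketch of how a single outlier can do this, with simulated data (the seed is arbitrary):

```r
set.seed(111)            # arbitrary seed
x <- rnorm(50)           # 50 pairs with no relationship at all
y <- rnorm(50)
round(cor(x, y), 3)      # typically close to 0
x1 <- c(x, 10)           # add a single outlier at (10, 10)
y1 <- c(y, 10)
round(cor(x1, y1), 3)    # now clearly positive
```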
It is important to keep two things separate: a situation with two variables which are uncorrelated (\(\rho = 0\)) and two variables with a weak correlation (\(\rho \ne 0\) but small).
In either case we would find an r close to 0 (but never exactly 0!). Finding out which case we are in might be impossible, especially for small data sets.
run.app(correlation)
This app illustrates the correlation coefficient:

- Move the slider around to see scatterplots of correlated variables for different values of the correlation.
- Include a few outliers and see how that affects the “look” of the scatterplot and the sample correlation coefficient.
- On the Histogram tab we can study the effect of changing \(\rho\) and/or n on the sample correlation r.
Back to the draft
So, how about the draft? Well, we found r = -0.226. But of course the question is whether -0.226 is close to 0, close enough to conclude that all went well. Actually, the question really is whether the corresponding parameter \(\rho=0\)! Let’s do a simulation:
Doing a simulation means teaching the computer to repeat the essential part of an experiment many times. Here the experiment is the draft. What are the important features of this experiment?
- there are the numbers 1-366 in the order from 1 to 366 (in “Day.of.Year”)
- there are the numbers 1-366 in some random order (in “Draft.Number”)
In R we can do this as follows:
sample(Day.of.Year)
## [1] 169 349 260 145 43 87 311 157 179 248 271 56 301 335 208 177 67
## [18] 1 306 47 8 325 101 57 137 359 148 176 287 19 187 231 220 221
## [35] 341 321 133 280 249 175 285 152 3 236 353 61 207 173 107 203 263
## [52] 224 110 41 122 289 250 235 295 348 186 9 103 85 36 14 112 344
## [69] 52 185 118 282 193 97 58 34 360 314 162 211 310 164 4 257 218
## [86] 108 27 354 229 230 181 37 94 200 247 189 66 174 76 256 362 111
## [103] 350 25 217 138 269 106 308 190 90 242 165 305 151 105 20 132 239
## [120] 339 342 60 283 159 201 332 116 119 91 318 294 219 35 327 227 226
## [137] 81 238 262 316 212 78 120 121 243 267 291 75 246 13 184 46 240
## [154] 30 69 328 54 330 102 195 351 77 65 366 92 64 286 228 258 338
## [171] 178 59 276 39 153 266 6 163 129 206 126 156 264 72 104 18 234
## [188] 259 134 154 270 140 11 281 62 299 297 334 63 113 128 361 136 86
## [205] 82 68 292 273 48 44 223 170 23 79 278 205 160 255 2 326 12
## [222] 74 357 356 331 192 265 24 196 204 127 312 171 333 143 10 355 322
## [239] 155 345 202 320 15 293 210 53 284 194 50 209 365 70 182 80 213
## [256] 95 275 149 347 244 225 150 141 237 73 31 29 83 253 5 317 298
## [273] 251 26 337 180 191 309 364 125 117 252 319 88 290 16 147 144 89
## [290] 167 245 146 198 277 199 358 40 51 268 158 32 135 340 168 115 329
## [307] 109 45 183 343 188 172 55 93 254 98 272 124 22 324 130 84 42
## [324] 49 216 166 131 21 261 274 38 161 96 232 215 323 99 28 363 214
## [341] 100 313 296 336 139 71 307 114 123 288 304 241 300 279 17 142 233
## [358] 315 33 352 222 7 197 346 302 303
cor(Day.of.Year, sample(Day.of.Year))
## [1] 0.01648229
Now of course we should do this many times:
cor(Day.of.Year, sample(Day.of.Year))
## [1] -0.08308161
cor(Day.of.Year, sample(Day.of.Year))
## [1] -0.06702214
cor(Day.of.Year, sample(Day.of.Year))
## [1] 0.01222493
cor(Day.of.Year, sample(Day.of.Year))
## [1] 0.1354924
cor(Day.of.Year, sample(Day.of.Year))
## [1] -0.01182866
So far we have not seen a correlation as far from 0 as -0.226. But maybe we just need to repeat this many more times. Here is how:
r <- rep(0, 10000)
for(i in 1:10000)
r[i] <- cor(Day.of.Year, sample(Day.of.Year))
hplot(r)
length(r[abs(r)>0.226])
## [1] 0
As you can see, none of the 10000 simulations had a sample correlation as far from 0 as -0.226! So either
- the draft went fine, but something extremely unlikely happened (something with a probability of less than 1 in 10000), or
- something went wrong in the draft.
A probability of less than 1 in 10000 is generally considered too unlikely, so we will conclude that something did go wrong.
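As an aside, the same simulation can be written more compactly with replicate. A sketch (1:366 stands in for Day.of.Year, and the seed is arbitrary):

```r
set.seed(2020)                               # arbitrary seed
day <- 1:366                                 # stands in for Day.of.Year
r <- replicate(10000, cor(day, sample(day)))
mean(abs(r) > 0.226)                         # proportion as extreme as the draft
```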
So the next time you see a sample correlation coefficient r = -0.226, can you again conclude that the corresponding population correlation coefficient \(\rho \ne 0\)? Unfortunately, no! For example, say that instead of using the day of the year the military had used the day of the month (1-31). Now we do the simulation:
r <- rep(0, 10000)
for(i in 1:10000)
r[i] <- cor(1:31, sample(1:31))
hplot(r)
length(r[abs(r)>0.226])
## [1] 2142
As you can see, now quite a few of the simulations (about 21%) had a sample correlation as far from 0 as -0.226, so such a value would not be unusual at all.
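The reason for the difference is the sample size: for randomly paired data the standard deviation of r is approximately \(1/\sqrt{n-1}\), so the same r = -0.226 is more than 4 standard deviations from 0 when n = 366, but barely more than 1 standard deviation when n = 31. A quick check:

```r
round(1/sqrt(366 - 1), 3)   # n = 366: sd of r is about 0.052
round(1/sqrt(31 - 1), 3)    # n = 31:  sd of r is about 0.183
```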
In a while we will learn more about doing simulations, as well as how to decide whether something is or is not statistically significant.
Say we have found a correlation between variables “x” and “y”. How can we understand and interpret that relationship?
One possible explanation is a cause-effect relationship. This implies that if we can “change” x, we expect a change in y. For example:
x = “hours you study for an exam”
y = “score on exam”
Say we have the following data: for one year in some city we have information on the fires that happened during the year. Specifically, we recorded
x = “number of firemen responding to a fire”
y = “damage done by the fire”
Say there is a positive correlation between x and y (and in real life there will be!).
Now if the correlation is due to a cause-effect relationship, then changing x means changing y. Clearly we want a small y (little or no damage), and because the correlation is positive we can get that by making sure x is small. So never call the fire brigade!
Since this clearly does not make sense, there has to be another explanation for the positive correlation: a confounding variable z that drives both x and y (here, for example, the size of the fire). Under the confounding variable explanation we find (if all correlations are positive):
- a small z leads to a small x and a small y, so we get pairs of small (x, y)
- a large z leads to a large x and a large y, so we get pairs of large (x, y)

and so cor(x, y) is positive!
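We can mimic this mechanism with simulated data. In this sketch (everything here is made up) z plays the role of the size of the fire; x and y are each driven by z but have no direct link, and still their correlation is clearly positive:

```r
set.seed(42)                    # arbitrary seed
z <- runif(200, 1, 10)          # confounder, say the size of the fire
x <- 2 * z + rnorm(200)         # "number of firemen": driven by z
y <- 5 * z + rnorm(200, 0, 5)   # "damage": driven by z, not by x
round(cor(x, y), 3)             # clearly positive, without any cause-effect
```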
[Online Resource: Bizarre Correlations](http://www.buzzfeed.com/kjh2110/the-10-most-bizarre-correlations)
Please note: saying x causes y is not the same as saying x by itself determines y.
There are usually many other factors besides x that influence y, maybe even some more important than x.
x = “hours you study for an exam”
y = “score on exam”
but there are also many other factors that determine your score on an exam, such as your general ability, how much sleep you got the night before, how hard the exam is, and so on.
There have been hundreds of studies all over the world that have shown a correlation between smoking rates and lung cancer deaths, usually with correlations of about 0.5 to 0.7. And yet, strictly speaking, none of these studies has shown that smoking causes lung cancer, because all of them were observational studies, not clinical trials.
The only perfectly satisfactory way to establish causation is a randomized experiment, for example a clinical trial. An observational study is always somewhat suspect because we never know about hidden biases. Nevertheless, even using only observational studies the evidence for cause-effect can be quite strong:
Things to look for when trying to establish a causation:
- the correlation is strong: the correlation between smoking and lung cancer is very strong
- the correlation is consistent over many experiments: many studies of different kinds of people in different countries over a long time period have all shown this correlation
- higher doses are associated with stronger responses: people who smoke more have a higher chance of lung cancer
- the cause comes before the response in time: lung cancer develops after years of smoking. The number of men dying of lung cancer rose as smoking became more common, with a lag of about 30 years. Lung cancer kills more men than any other form of cancer. Lung cancer was rare among women until women started to smoke. Lung cancer in women rose along with smoking, again with a lag of about 30 years, and has now passed breast cancer as the leading cause of cancer deaths among women.
- the cause is plausible: lab experiments on animals show that nicotine causes cancer.
Old Faithful is a cone geyser located in Yellowstone National Park in Wyoming, United States.
head(faithful)
## Eruptions Waiting.Time
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55
attach(faithful)
splot(Eruptions, Waiting.Time)
round(cor(Eruptions, Waiting.Time), 3)
## [1] 0.901
If the data set is like this (just two columns), then you can simply find
round(cor(faithful), 3)
## Eruptions Waiting.Time
## Eruptions 1.000 0.901
## Waiting.Time 0.901 1.000
Notice the curious hole in the middle of the scatterplot; it has geological causes.
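As an aside, R also ships a built-in version of this data set, also called faithful, with lowercase column names eruptions and waiting; it gives the same correlation:

```r
data(faithful)   # R's built-in version; columns are eruptions and waiting
round(cor(faithful$eruptions, faithful$waiting), 3)   # 0.901
```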