Quantitative Predictor - Quantitative Response: Pearson’s Correlation Coefficient

The method discussed here was originally developed by Karl Pearson.

Case Study: The 1970’s Military Draft

In 1970, Congress instituted a random selection process for the military draft. All 366 possible birth dates were placed in plastic capsules in a rotating drum and were selected one by one. The first date drawn from the drum received draft number one and eligible men born on that date were drafted first. In a truly random lottery there should be no relationship between the date and the draft number.

CBS TV Broadcast

Basic question: Did the 1970 draft work the way it was supposed to?

head(draft[, 4:5])
##   Day.of.Year Draft.Number
## 1           1          305
## 2           2          159
## 3           3          251
## 4           4          215
## 5           5          101
## 6           6          224

Type of variables:

Day of Year: Values 1, 2, 3, …, 366 are numerical, therefore quantitative

Draft Number: Values 305, 159, 251, 215, … are numerical, therefore quantitative

two quantitative variables → correlation


Whenever we want to study the relationship between two quantitative variables we should start with the scatterplot. But before we do this, let’s consider what we expect to see. The draft was designed as a lottery to make it fair, that is, any man in the US should have had the same chance to be picked (or not!). In terms of their birthdays, each of the 366 days should have had the same chance of getting picked early, sometime in the middle, or late. So some of the days in January should have a small Draft Number, some a large one, and some in the middle. So in the scatterplot, on the left (January = small Day number) we should see some dots on the bottom (small Draft Number), some in the middle, and some on top. And exactly the same should be true for any other month:

attach(draft)
splot(Draft.Number, Day.of.Year)
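The `splot` routine appears to come from the course’s own package. If it is not available, a plain base-R scatterplot shows the same picture (a sketch, assuming the `draft` data frame has been loaded as above):

```r
# Base-R equivalent of the splot call above, assuming the draft data
# frame with columns Day.of.Year and Draft.Number is in the workspace.
plot(draft$Day.of.Year, draft$Draft.Number,
     xlab = "Day of Year", ylab = "Draft Number", pch = 19)
```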

so far, so good.

Now, this graph shows that there is no obvious relationship between the variables, but we need a bit more than that: we want there to be no relationship at all, that is, we want Day of Year and Draft Number to be independent. A graph such as this one is not quite enough. So let’s calculate a statistic that measures the relationship between two quantitative variables, namely Pearson’s correlation coefficient:

cor(Draft.Number, Day.of.Year)
## [1] -0.2260414

Recall some of the properties of Pearson’s correlation coefficient:

  • always -1 \(\le\) r \(\le\) 1

  • r close to 0 means very small or even no correlation (relationship)

  • r close to \(\pm\) 1 means a very strong correlation

  • r = -1 or r = 1 means a perfect linear correlation (that is, in the scatterplot the dots form a straight line)

  • r < 0 means a negative relationship (as x gets bigger, y gets smaller)

  • r > 0 means a positive relationship (as x gets bigger, y gets bigger)

  • r treats x and y symmetrically, that is cor(x, y) = cor(y, x)
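These properties can be verified directly from the definition of r, which is the sum of products of deviations from the means, scaled by the standard deviations. A small sketch with made-up vectors (not the draft data):

```r
# Pearson's r from its definition, compared with R's built-in cor().
# x and y are small illustrative vectors, not the draft data.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)
n <- length(x)
r <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))
r                  # same value as cor(x, y)
cor(x, y)
cor(y, x)          # symmetry: cor(x, y) = cor(y, x)
cor(x, 2 * x + 1)  # a perfect linear relationship gives r = 1
```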

r is a statistic (a number calculated from a sample), so it has a corresponding parameter (a number describing a population). The parameter is usually denoted by \(\rho\). If the lottery worked and was fair, then we should have \(\rho=0\). So the question becomes: if r = -0.226, could we still have \(\rho=0\)? Again this is answered by a hypothesis test:

pearson.cor(Draft.Number, Day.of.Year, rho.null = 0) 

## p value of test H0: rho=0 vs. Ha: rho <> 0:  0.000
  1. Parameter of interest: Pearson’s correlation coefficient \(\rho\)
  2. Method of analysis: test based on normal theory
  3. Assumptions of Method: relationship is linear, there are no outliers
  4. \(\alpha\) = 0.05
  5. H0: \(\rho = 0\) (no relationship between “Day of Year” and “Draft Number”)
  6. Ha: \(\rho \ne 0\) (some relationship between “Day of Year” and “Draft Number”)
  7. p = 0.000
  8. p < \(\alpha\), so we reject H0: there is some relationship between “Day of Year” and “Draft Number”; something went wrong in the 1970 draft.
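The `pearson.cor` routine appears to be a course function. Base R’s `cor.test` runs the same normal-theory test, using the statistic \(t = r\sqrt{n-2}/\sqrt{1-r^2}\) (a sketch, assuming the draft columns are still attached as above):

```r
# Base-R equivalent of the test above. cor.test computes
# t = r * sqrt(n - 2) / sqrt(1 - r^2) and the two-sided p value,
# using the attached Draft.Number and Day.of.Year columns.
cor.test(Draft.Number, Day.of.Year)
```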

How about the assumptions? We can check them using the marginal plot, which looks just fine.

Here are some cases where Pearson’s correlation coefficient would not work:

Also very important is the fact that Pearson’s correlation coefficient works only for linear relationships:

App: correlation and correlation2

These apps illustrate the correlation coefficient.

correlation What to do:

  • Move the slider around to see different cases of the scatterplot of correlated variables.

  • Include a few outliers and see how that affects the “look” of the scatterplot and the sample correlation coefficient.

  • On the Histogram tab, set \(\rho\) = -0.23 and observe that we need a sample size of about 60 to have some reasonable chance to reject the null hypothesis of no correlation.
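The sample-size claim (n of about 60 for \(\rho\) = -0.23) can be checked with a quick power simulation, here a sketch that generates bivariate normal data directly rather than using the app:

```r
# Rough power simulation: with true rho = -0.23 and n = 60, how often
# does a 5%-level test of H0: rho = 0 reject?
set.seed(2022)
rho <- -0.23; n <- 60; B <- 2000
reject <- replicate(B, {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)  # cor(x, y) centered near rho
  cor.test(x, y)$p.value < 0.05
})
mean(reject)  # estimated power, roughly 0.4 for these settings
```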

correlation2 What to do

Click inside the graph and watch the correlation.


So, now that we know that there is indeed a relationship between Day of Year and Draft Number, can we visualize it in some way? Here is an idea: let’s look at the boxplot of Draft Number by Month:

bplot(Draft.Number, Month, 
  new_order = "Size")

and here we can see that there is a tendency for the Draft Numbers to be lower for months later in the year.

Note

if we simply run

bplot(Draft.Number, Month)

the routine arranges the boxes alphabetically. Here we want them arranged in order. We can always draw the graph in any order we want with the new_order argument. Another reasonable order would be by month:

bplot(Draft.Number, Month, 
  new_order = 
        c(5, 4, 8, 1, 9, 7, 6, 2, 12, 11, 10, 3))
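In base R the same reordering can be done by resetting the factor levels before drawing the boxplots. This sketch assumes the Month column holds the abbreviated month names (swap `month.abb` for `month.name` if full names are used):

```r
# Base-R sketch of the same idea: put the Month factor levels in
# calendar order, then draw the boxplots of Draft Number by Month.
draft$Month <- factor(draft$Month, levels = month.abb)  # "Jan", "Feb", ...
boxplot(Draft.Number ~ Month, data = draft,
        xlab = "Month", ylab = "Draft Number")
```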


Again, notice the similarities and the differences between this analysis and those we have done before: in each case we had the basic question of whether or not there is a relationship between two variables, and in each case we did a hypothesis test with the null hypothesis

H0: there is no relationship

but then we used different methods depending on the type of data:

  • Categorical Predictor - Categorical Response: Chi-square test for independence

  • Categorical Predictor - Quantitative Response: ANOVA

  • Quantitative Predictor - Quantitative Response: Pearson’s Correlation Coefficient

For the last two there are even more similarities: for each of these methods there was some assumption of normal distributions.

Case Study: The 1971 Military Draft

Let’s see what happened the year after:

splot(Draft.Number.1971, Day.of.Year)

so there is no hint of a problem here (but again, that is what we thought before as well).

And the test:

pearson.cor(Draft.Number.1971, Day.of.Year, rho.null = 0)

## p value of test H0: rho=0 vs. Ha: rho <> 0:  0.7861
  1. Parameter of interest: Pearson’s correlation coefficient \(\rho\)
  2. Method of analysis: test based on normal theory
  3. Assumptions of Method: relationship is linear, there are no outliers
  4. \(\alpha\)=0.05
  5. H0: \(\rho\) = 0 (no relationship between “Day 1971” and “Draft Number 1971”)
  6. Ha: \(\rho\) \(\ne\) 0 (some relationship between “Day 1971” and “Draft Number 1971”)
  7. p = 0.7861
  8. p > \(\alpha\), so we fail to reject H0: there is no relationship between “Day 1971” and “Draft Number 1971”. The marginal plot shows no outliers and no nonlinear relationship, so the assumptions are ok.

The same command can also be used to find a confidence interval. This is done when the rho.null argument is left off:

pearson.cor(Draft.Number.1971, Day.of.Year, conf.level = 90)

## A 90% confidence interval for the 
## population correlation coefficient is ( -0.072, 0.1 )
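Base R’s `cor.test` reports a matching interval, based on Fisher’s z transformation. Note that there `conf.level` is given as a proportion (a sketch, assuming the 1971 columns are attached as above):

```r
# Base-R equivalent of the confidence interval above; cor.test uses
# Fisher's z transformation and takes conf.level as a proportion.
cor.test(Draft.Number.1971, Day.of.Year, conf.level = 0.9)$conf.int
```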