Chapter 7: Correlation and Regression

Correlation Test

We can now return to the case of two quantitative variables and the question of whether or not they are related. Specifically, we have a test of

\(H_0: \rho =0\) (no relationship) vs \(H_a: \rho \ne 0\) (some relationship)

The assumptions of the test are that the relationship is linear and that there are no outliers. We can use the mplot command to check them. The command to find the p value is pearson.cor.

Case Study: The 1970’s Military Draft

We have previously used simulation to see that a sample correlation r=-0.226 is very unusual (for n=366). Now we can do the formal test:

  1. Parameter: Pearson’s correlation coefficient \(\rho\)
  2. Method: Test for Pearson’s correlation coefficient \(\rho\)
  3. Assumptions: relationship is linear and that there are no outliers.
  4. \(\alpha = 0.05\)
  5. \(H_0: \rho =0\) (no relationship between Day of Year and Draft Number)
  6. \(H_a: \rho \ne 0\) (some relationship between Day of Year and Draft Number)
  7. p = 0.0000
pearson.cor(Draft.Number, Day.of.Year, rho.null = 0)
## p value of test H0: rho=0 vs. Ha: rho <> 0:  0.000
  1. \(p=0.0000 <\alpha = 0.05\), so we reject the null hypothesis,
  2. There is a statistically significant relationship between Day of Year and Draft Number.

Assumptions: boxplots and scatterplot show no outliers. No non-linear relationship.

Case Study: Alcohol vs. Tobacco Expenditure

Data from a British government survey of household spending may be used to examine the relationship between household spending on tobacco products and alcoholic beverages. The numbers are the average expenditure for each of the 11 regions of England.

The marginal plot shows one outlier:

attach(alcohol)
mplot(Tobacco, Alcohol)

This is Northern Ireland, observation # 11. Eliminating this observations show no more outliers, and a linear relationship:

mplot(Tobacco[-11], Alcohol[-11])

So the test is:

  1. Parameter: Pearson’s correlation coefficient \(\rho\)
  2. Method: Test for Pearson’s correlation coefficient \(\rho\)
  3. Assumptions: relationship is linear and that there are no outliers.
  4. \(\alpha = 0.05\)
  5. \(H_0: \rho =0\) (no relationship between Alcohol and Tobacco)
  6. \(H_a: \rho \ne0\) (some relationship between Alcohol and Tobacco)
  7. \(p = 0.0072\)
pearson.cor(Tobacco[-11], Alcohol[-11], rho.null = 0)
## p value of test H0: rho=0 vs. Ha: rho <> 0:  0.0072
  1. \(p=0.0072 < \alpha = 0.05\), so we reject the null hypothesis,
  2. There is a statistically significant relationship between Alcohol and Tobacco

Note: Running the test with Northern Ireland would have given the wrong answer:

pearson.cor(Tobacco, Alcohol, rho.null = 0)
## p value of test H0: rho=0 vs. Ha: rho <> 0:  0.5087

p value of test for now is: 0.5087 > 0.05

Example below is the data from the following experiment: on 30 consecutive days we recorded the number of sales in store. During this time an add campaign was run. Find a \(90\%\) confidence interval for the correlation of Day and Sales.

Day Sales
1 48
2 47
3 52
4 50
5 48
6 53
7 49
8 49
9 54
10 55
11 54
12 55
13 51
14 57
15 47
16 46
17 55
18 56
19 47
20 54
21 45
22 53
23 54
24 56
25 58
26 55
27 57
28 52
29 51
30 54
attach(sales)
mplot(Day, Sales)
pearson.cor(Day, Sales, conf.level = 90)

## A 90% confidence interval for the 
## population correlation coefficient is ( 0.086, 0.617 )