Quantitative Predictor - Quantitative Response: Simple Linear Regression

Case Study: Wine Consumption and Heart Disease

The data give, for 19 developed countries, the wine consumption (liters of wine per person per year) and the deaths from heart disease (per 100,000 people). They are taken from David Moore: The Active Practice of Statistics.

attach(wine)
wine
##           Country Wine.Consumption Heart.Disease.Deaths
## 1       Australia              2.5                  211
## 2         Austria              3.9                  167
## 3         Belgium              2.9                  131
## 4          Canada              2.4                  191
## 5         Denmark              2.9                  220
## 6         Finland              0.8                  297
## 7          France              9.1                   71
## 8         Iceland              0.8                  211
## 9         Ireland              0.7                  300
## 10          Italy              7.9                  107
## 11    Netherlands              1.8                  167
## 12    New Zealand              1.9                  266
## 13         Norway              0.8                  227
## 14          Spain              6.5                   86
## 15         Sweden              1.6                  207
## 16    Switzerland              5.8                  115
## 17 United Kingdom              1.3                  285
## 18  United States              1.2                  199
## 19        Germany              2.7                  172

Basic Question: What is the relationship between wine consumption and heart disease?

With two quantitative variables we usually start with a scatterplot, but for that we first need to decide which of our variables is the predictor and which is the response.

Often this will be clear from the question (“predict A from B”); here it is not. But we do have a before-after case (people drink wine for many years; how does that affect their chances of heart disease later in life?), so it seems reasonable to choose

predictor x = wine consumption
response y = heart disease

and so we have

splot(Heart.Disease.Deaths, Wine.Consumption) 

and clearly we have a strong (negative) relationship. If it is not clear enough we can check Pearson’s correlation coefficient:

cor(Wine.Consumption, Heart.Disease.Deaths)
## [1] -0.8428127
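
To see where this number comes from, here is a minimal sketch in base R that computes Pearson's correlation coefficient by hand and compares it with cor(). It assumes the two columns are typed in as plain vectors from the table above (instead of using attach(wine)); the names dx, dy and r.by.hand are made up for this illustration.

```r
# Data typed in from the table above (19 countries).
Wine.Consumption <- c(2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
                      1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7)
Heart.Disease.Deaths <- c(211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
                          167, 266, 227, 86, 207, 115, 285, 199, 172)

# Deviations of each variable from its mean
dx <- Wine.Consumption - mean(Wine.Consumption)
dy <- Heart.Disease.Deaths - mean(Heart.Disease.Deaths)

# Pearson's r: sum of cross-products divided by the
# square root of the product of the two sums of squares
r.by.hand <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r.by.hand
cor(Wine.Consumption, Heart.Disease.Deaths)  # built-in, should agree
```

A value close to -1 means the points lie close to a line that slopes downward.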

So, how can we describe the relationship? For quantitative data this means finding a model, that is, an equation.

Examples

  • heart disease = 260-10*wine consumption
  • heart disease = 250-7*wine consumption
  • heart disease = 250-10*wine consumption+1.2*wine consumption^2
  • heart disease = 260-50*log(wine consumption)

etc.

Having such a model would allow us among other things to predict a likely y value for a case with a known x value.

Example

Say the model is heart disease = 260-10*wine consumption and we know that for a certain country (not in the dataset) wine consumption is 3.7 liters. Then according to our model the heart disease rate should be about

\[ \text{heart disease} = 260-10*3.7 = 223 \]
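
The same calculation can be sketched in R. The function name deaths.from.wine is hypothetical (it is not one of the course routines); it simply encodes the example model 260-10*wine consumption.

```r
# Hypothetical helper encoding the example model
# heart disease = 260 - 10 * wine consumption
deaths.from.wine <- function(wine.consumption) {
  260 - 10 * wine.consumption
}

# Prediction for a country with a wine consumption of 3.7 liters
deaths.from.wine(3.7)

# The function is vectorized, so several countries can be done at once
deaths.from.wine(c(1, 3.7, 8))
```

Any of the other example models could be swapped in by changing the body of the function.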

How do we find an equation? Well, finding some equation is easy:

splot(Heart.Disease.Deaths, Wine.Consumption)

Clearly the red line is not very good (too flat). The green one is better but still a bit too flat. But how about the orange and blue ones? Both look reasonably good.

Is there a way to find a line that is “best”? The answer is yes. To understand how, we need the following:

Let’s concentrate for a moment on the third line, which has the equation

\[ \text{Heart Disease} = 270-24*\text{Wine Consumption} \]

or, for short, \(y = 270-24x\)

The United States has a wine consumption of \(x = 1.2\) liters and a heart disease rate of \(y = 199\). Now if we did not know the heart disease rate we could use the equation and find

\[ y = 270-24x = 270-24*1.2 = 241.2 \approx 241 \]

Now we have 2 y’s:

  • the one in the data (\(y = 199\))
  • the one from the equation (\(y = 241\))

Let's distinguish between them by calling the first the observed value and the second the fitted value.

Think of it in these terms: the fitted value is our guess, the observed value is the truth. So the difference between them is the error in our guess. We call this the residual:

\[ \epsilon = \text{observed} - \text{fitted} = 199-241.2 = -42.2 \]

The residual is negative because the line \(y=270-24x\) overestimates the heart disease rate in the US by about \(42\).

If the line described the data perfectly, all the residuals would be 0.

This was done for the US, but of course we could do the same for all the countries in the dataset:

Country         Consumption  Deaths   Fits  Residuals
Australia               2.5     211  210.0        1.0
Austria                 3.9     167  176.4       -9.4
Belgium                 2.9     131  200.4      -69.4
Canada                  2.4     191  212.4      -21.4
Denmark                 2.9     220  200.4       19.6
Finland                 0.8     297  250.8       46.2
France                  9.1      71   51.6       19.4
Iceland                 0.8     211  250.8      -39.8
Ireland                 0.7     300  253.2       46.8
Italy                   7.9     107   80.4       26.6
Netherlands             1.8     167  226.8      -59.8
New Zealand             1.9     266  224.4       41.6
Norway                  0.8     227  250.8      -23.8
Spain                   6.5      86  114.0      -28.0
Sweden                  1.6     207  231.6      -24.6
Switzerland             5.8     115  130.8      -15.8
United Kingdom          1.3     285  238.8       46.2
United States           1.2     199  241.2      -42.2
Germany                 2.7     172  205.2      -33.2
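
A table like this can be rebuilt with a few lines of base R. The sketch below assumes the data are typed in by hand into a data frame (here also called wine, with shortened column names chosen for this illustration) and uses the line y = 270-24x. Residuals are computed as observed minus fitted, which is the sign convention the table uses.

```r
# Data typed in from the case study; column names are shortened for this sketch
wine <- data.frame(
  Country = c("Australia", "Austria", "Belgium", "Canada", "Denmark",
              "Finland", "France", "Iceland", "Ireland", "Italy",
              "Netherlands", "New Zealand", "Norway", "Spain", "Sweden",
              "Switzerland", "United Kingdom", "United States", "Germany"),
  Consumption = c(2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
                  1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7),
  Deaths = c(211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
             167, 266, 227, 86, 207, 115, 285, 199, 172)
)

# Fitted values from the line y = 270 - 24x
wine$Fits <- 270 - 24 * wine$Consumption

# Residual = observed - fitted
wine$Residuals <- wine$Deaths - wine$Fits

wine
```

A positive residual means the line guesses too low for that country, a negative one that it guesses too high.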

So for each country our line makes an error. What we need is a way to combine these into one overall error. The most popular method is to find the sum of the squares of the residuals:

\[ RSS = \sum \epsilon^2 \]

In the case of our line we find

\[ RSS = 1.0^2+(-9.4)^2+\dots+(-33.2)^2 = 25269.8 \]

In the same way we can find an RSS for any line:

  • y = 260-10x , RSS = 71893
  • y = 280-20x , RSS = 40738
  • y = 260-23x , RSS = 24399.7
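
Here is a minimal sketch of how such an RSS can be computed in base R. The function rss and its argument names are hypothetical (they are not part of the course routines), and the data are typed in by hand from the case study.

```r
# Data typed in from the case study
Wine.Consumption <- c(2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
                      1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7)
Heart.Disease.Deaths <- c(211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
                          167, 266, 227, 86, 207, 115, 285, 199, 172)

# Hypothetical helper: RSS of the line y = intercept + slope * x
rss <- function(intercept, slope, x, y) {
  fitted <- intercept + slope * x   # fitted values from the line
  sum((y - fitted)^2)               # sum of squared residuals
}

# The line y = 270 - 24x studied above
rss(270, -24, Wine.Consumption, Heart.Disease.Deaths)

# The line y = 260 - 23x
rss(260, -23, Wine.Consumption, Heart.Disease.Deaths)
```

Trying a few more intercepts and slopes by hand shows how quickly the RSS grows as a line moves away from the cloud of points.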

Notice that the first two, which we said were not so good, have a higher RSS. So it seems that the lower the RSS, the better. Is there a line with the smallest possible RSS? The answer is again yes: it is found by the method of Least Squares, for which we have the routine:

slr(Heart.Disease.Deaths, Wine.Consumption)
## The least squares regression equation is: 
##  Heart.Disease.Deaths  = 260.563 - 22.969 Wine.Consumption 
## R^2 = 71.03%
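
For a single predictor the least squares line can also be written down directly: the slope is the sum of cross-products of the deviations divided by the sum of squared deviations of x, and the line passes through the point of means. The sketch below applies these standard formulas in base R. It is an illustration, not a description of how slr works internally; the names x, y, slope and intercept are chosen for this example.

```r
# Data typed in from the case study
x <- c(2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
       1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7)      # wine consumption
y <- c(211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
       167, 266, 227, 86, 207, 115, 285, 199, 172)       # heart disease deaths

# Slope: cross-products of deviations over squared deviations of x
slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: the least squares line goes through (mean(x), mean(y))
intercept <- mean(y) - slope * mean(x)

round(c(intercept = intercept, slope = slope), 3)
```

Up to rounding, these should be the two numbers in the equation printed by slr.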

The least squares regression equation is:

\[ \text{Heart.Disease.Deaths} = 260.563 - 22.969 \text{Wine.Consumption} \]

This is very close to the last of our equations.

What is its RSS? It is not part of the output, but I can tell you it is 24391.
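
If you want to check this number yourself, one way is base R's lm() function, which fits the same kind of least squares line. This is a sketch that assumes slr fits an ordinary least squares line (as its output suggests); the data are again typed in by hand, and the names fit, rss.ls and r.squared are chosen for this example.

```r
# Data typed in from the case study
Wine.Consumption <- c(2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
                      1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7)
Heart.Disease.Deaths <- c(211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
                          167, 266, 227, 86, 207, 115, 285, 199, 172)

# Least squares fit with base R: response ~ predictor
fit <- lm(Heart.Disease.Deaths ~ Wine.Consumption)

coef(fit)                            # intercept and slope

# RSS of the least squares line: sum of its squared residuals
rss.ls <- sum(residuals(fit)^2)
rss.ls

# R^2 as a proportion (multiply by 100 for a percentage)
r.squared <- summary(fit)$r.squared
r.squared
```

No other straight line through these data can have a smaller RSS than rss.ls.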

A nice graph to visualize the model is the scatterplot with the least squares regression line added, called the fitted line plot:

splot(Heart.Disease.Deaths, Wine.Consumption, add.line=1)
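
A similar graph can be sketched with base R graphics alone, using plot() for the scatterplot and abline() to add the fitted line. This is only an approximation of what splot draws. Note that plot() takes the predictor first, whereas the splot calls above list the response first. The plot is written to a temporary PDF file (plot.file, a name chosen for this example) so that the code also runs in a non-interactive session; in an interactive session the pdf() and dev.off() lines can be dropped.

```r
# Data typed in from the case study
Wine.Consumption <- c(2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
                      1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7)
Heart.Disease.Deaths <- c(211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
                          167, 266, 227, 86, 207, 115, 285, 199, 172)

# Least squares fit
fit <- lm(Heart.Disease.Deaths ~ Wine.Consumption)

# Draw into a temporary PDF file
plot.file <- tempfile(fileext = ".pdf")
pdf(plot.file)

# Scatterplot: predictor on the x axis, response on the y axis
plot(Wine.Consumption, Heart.Disease.Deaths,
     xlab = "Wine consumption (liters per person per year)",
     ylab = "Heart disease deaths (per 100,000)")

# Add the least squares regression line
abline(fit)

invisible(dev.off())  # close the PDF device
```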