Data for 19 developed countries on wine consumption (liters of wine per person per year) and deaths from heart disease (per 100000 people). (taken from David Moore: The Active Practice of Statistics)
attach(wine)
wine
## Country Wine.Consumption Heart.Disease.Deaths
## 1 Australia 2.5 211
## 2 Austria 3.9 167
## 3 Belgium 2.9 131
## 4 Canada 2.4 191
## 5 Denmark 2.9 220
## 6 Finland 0.8 297
## 7 France 9.1 71
## 8 Iceland 0.8 211
## 9 Ireland 0.7 300
## 10 Italy 7.9 107
## 11 Netherlands 1.8 167
## 12 New Zealand 1.9 266
## 13 Norway 0.8 227
## 14 Spain 6.5 86
## 15 Sweden 1.6 207
## 16 Switzerland 5.8 115
## 17 United Kingdom 1.3 285
## 18 United States 1.2 199
## 19 Germany 2.7 172
Basic Question: What is the relationship between wine consumption and heart disease?
With two quantitative variables we will usually start with a scatterplot, but for that we need to decide which of our variables is the predictor and which is the response.
Often this will be clear from the question (“predict A from B”); here it is not. But we do have a before-after case (people drink wine for many years; how does that affect their chances of heart disease later in life?), so it seems reasonable to choose
predictor x = wine consumption
response y = heart disease
and so we have
splot(Heart.Disease.Deaths, Wine.Consumption)
and clearly we have a strong (negative) relationship. If it is not clear enough we can check Pearson’s correlation coefficient:
cor(Wine.Consumption, Heart.Disease.Deaths)
## [1] -0.8428127
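To see what `cor` is doing, here is a minimal sketch in base R that evaluates Pearson's formula directly and compares it with the built-in. The vector names `wine_consumption` and `heart_deaths` are introduced here only for illustration; they hold the values printed above so that the block runs without the attached `wine` data frame.

```r
# Values copied from the data listing above (hypothetical vector names)
wine_consumption <- c(2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
                      1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7)
heart_deaths <- c(211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
                  167, 266, 227, 86, 207, 115, 285, 199, 172)

# Deviations of each variable from its mean
dx <- wine_consumption - mean(wine_consumption)
dy <- heart_deaths - mean(heart_deaths)

# Pearson's r: summed cross-deviations, scaled by the root of the
# product of the summed squared deviations
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_manual
cor(wine_consumption, heart_deaths)  # built-in, should agree with r_manual
```

A value close to -1 means the points lie close to a downward-sloping line, which is what the scatterplot suggests.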
So, how can we describe the relationship? For quantitative data this means finding a model, that is an equation
Examples
etc.
Having such a model would allow us among other things to predict a likely y value for a case with a known x value.
Example
Say the model is \(\text{heart disease} = 260-10*\text{wine consumption}\), and we know that for a certain country (not in the dataset) wine consumption is 3.7. Then according to our model the heart disease rate should be about
\[ \text{heart disease} = 260-10*3.7 = 223 \]
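The same calculation can be written as a small R function. This is only a sketch of the example model above; the function name `predict_heart_disease` is made up for illustration.

```r
# Example model from the text: heart disease = 260 - 10 * wine consumption
# (predict_heart_disease is a hypothetical helper name)
predict_heart_disease <- function(wine_consumption) {
  260 - 10 * wine_consumption
}

predict_heart_disease(3.7)            # 223, as computed above
predict_heart_disease(c(1, 3.7, 8))   # works on a whole vector of x values
```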
How do we find an equation? Well to find some equation is easy:
splot(Heart.Disease.Deaths, Wine.Consumption)
Clearly the red line is not very good (too flat), and the green one is better but still a bit too flat. But how about the orange and blue ones? Both look reasonably good.
Is there a way to find a line that is “best”? The answer is yes. In order to understand how, we need the following:
Let’s concentrate for a moment on the third line, which has the equation
\[ \text{Heart Disease} = 270-24*\text{Wine Consumption} \]
or short \(y = 270-24x\)
The United States has a wine consumption of \(x = 1.2\) liters and a heart disease rate of \(y = 199\). Now if we did not know the heart disease rate we could use the equation and find
\[ y = 270-24x = 270-24*1.2 = 241.2 \approx 241 \]
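Now we have two y’s: the value \(y = 199\) that was actually recorded and the value \(y = 241\) computed from the line.
Let’s distinguish between them by calling the first the observed value and the second one the fitted value.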
Think of it in these terms: the fitted value is our guess, the observed value is the truth. So the difference between them is the error in our guess. We call this the residual:
\[ \epsilon = \text{observed} - \text{fitted} = 199-241 = -42 \]
The residual is negative because the line \(y=270-24x\) overestimates the heart disease rate in the US by about \(42\).
If the line perfectly described the data, the residuals would all be 0.
This was done for the US, but of course we could do the same for all the countries in the dataset:
Country | Consumption | Deaths | Fits | Residuals |
---|---|---|---|---|
Australia | 2.5 | 211 | 210.0 | 1.0 |
Austria | 3.9 | 167 | 176.4 | -9.4 |
Belgium | 2.9 | 131 | 200.4 | -69.4 |
Canada | 2.4 | 191 | 212.4 | -21.4 |
Denmark | 2.9 | 220 | 200.4 | 19.6 |
Finland | 0.8 | 297 | 250.8 | 46.2 |
France | 9.1 | 71 | 51.6 | 19.4 |
Iceland | 0.8 | 211 | 250.8 | -39.8 |
Ireland | 0.7 | 300 | 253.2 | 46.8 |
Italy | 7.9 | 107 | 80.4 | 26.6 |
Netherlands | 1.8 | 167 | 226.8 | -59.8 |
New Zealand | 1.9 | 266 | 224.4 | 41.6 |
Norway | 0.8 | 227 | 250.8 | -23.8 |
Spain | 6.5 | 86 | 114.0 | -28.0 |
Sweden | 1.6 | 207 | 231.6 | -24.6 |
Switzerland | 5.8 | 115 | 130.8 | -15.8 |
United Kingdom | 1.3 | 285 | 238.8 | 46.2 |
United States | 1.2 | 199 | 241.2 | -42.2 |
Germany | 2.7 | 172 | 205.2 | -33.2 |
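The table can be reproduced in a few lines of base R. The sketch below assumes the candidate line \(y = 270-24x\) and takes each residual as observed minus fitted, which is the sign used in the table. The names `country`, `wine_consumption`, `heart_deaths`, `resids` and `fit_table` are chosen here for illustration and are not part of the original code.

```r
# Data copied from the listing at the top (hypothetical vector names)
country <- c("Australia", "Austria", "Belgium", "Canada", "Denmark",
             "Finland", "France", "Iceland", "Ireland", "Italy",
             "Netherlands", "New Zealand", "Norway", "Spain", "Sweden",
             "Switzerland", "United Kingdom", "United States", "Germany")
wine_consumption <- c(2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
                      1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7)
heart_deaths <- c(211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
                  167, 266, 227, 86, 207, 115, 285, 199, 172)

# Fitted values from the candidate line y = 270 - 24x
fits <- 270 - 24 * wine_consumption

# Residual = observed - fitted (negative when the line overestimates)
resids <- heart_deaths - fits

fit_table <- data.frame(Country = country,
                        Consumption = wine_consumption,
                        Deaths = heart_deaths,
                        Fits = round(fits, 1),
                        Residuals = round(resids, 1))
fit_table
```

The row for the United States shows the fitted value 241.2 and the residual -42.2 from the worked example.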
So for each country our line makes an error. What we need is a way to find an overall error. The most popular method is to find the sum of squares of the residuals:
\[ RSS = \sum \epsilon^2 \]
In the case of our line we find
\[ RSS = 1.0^2+(-9.4)^2+\dots+(-33.2)^2 = 25269.8 \]
In the same way we can find an RSS for any line:
Notice that the first two, which we said were not so good, have a higher RSS. So it seems that the lower the RSS, the better. Is there a line with the smallest RSS possible? The answer is again yes: it is found with the method of Least Squares, for which we have the routine:
slr(Heart.Disease.Deaths, Wine.Consumption)
## The least squares regression equation is:
## Heart.Disease.Deaths = 260.563 - 22.969 Wine.Consumption
## R^2 = 71.03%
So the least squares regression equation is
\[ \text{Heart.Disease.Deaths} = 260.563 - 22.969*\text{Wine.Consumption} \]
which is very close to the last of our equations.
What is its RSS? It is not part of the output, but I can tell you it is 24391.
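For readers without the `slr` routine, the following sketch uses base R's `lm` function, which I assume fits the same least squares line, and then computes the RSS by hand. It also computes the RSS of the line \(y = 270-24x\) for comparison. The vector names and the helper `rss_line` are hypothetical and introduced only for this example.

```r
# Data copied from the listing at the top (hypothetical vector names)
wine_consumption <- c(2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
                      1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7)
heart_deaths <- c(211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
                  167, 266, 227, 86, 207, 115, 285, 199, 172)

# Least squares fit: response ~ predictor
fit <- lm(heart_deaths ~ wine_consumption)
coef(fit)                  # intercept and slope of the fitted line
summary(fit)$r.squared     # R^2 as a fraction rather than a percentage

# RSS of the least squares line: sum of squared residuals
rss_ls <- sum(resid(fit)^2)

# RSS of an arbitrary line y = a + b*x (rss_line is a hypothetical helper)
rss_line <- function(a, b) {
  sum((heart_deaths - (a + b * wine_consumption))^2)
}
rss_hand <- rss_line(270, -24)

c(least_squares = rss_ls, line_270_minus_24x = rss_hand)
```

The least squares value comes out below the 25269.8 found for \(y = 270-24x\), as it must, since no other line can have a smaller RSS.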
A nice graph to visualize the model is the scatterplot with the least squares regression line, called the fitted line plot:
splot(Heart.Disease.Deaths, Wine.Consumption, add.line=1)
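A similar picture can be drawn with base R graphics alone. The sketch below is an assumption about what the fitted line plot shows, not the code behind `splot`: it plots the points, overlays the `lm` line with `abline`, and writes the result to a temporary PDF so that it also runs in a non-interactive session. The vector names and `out_file` are hypothetical.

```r
# Data copied from the listing at the top (hypothetical vector names)
wine_consumption <- c(2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
                      1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7)
heart_deaths <- c(211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
                  167, 266, 227, 86, 207, 115, 285, 199, 172)

fit <- lm(heart_deaths ~ wine_consumption)

# Write the plot to a temporary PDF file
out_file <- file.path(tempdir(), "wine_fitted_line.pdf")
pdf(out_file)
plot(wine_consumption, heart_deaths, pch = 19,
     xlab = "Wine consumption (liters per person per year)",
     ylab = "Heart disease deaths (per 100000 people)")
abline(fit, col = "blue", lwd = 2)  # least squares regression line
invisible(dev.off())                # close the device so the file is complete

out_file  # path of the saved plot
```

In an interactive session the `pdf` and `dev.off` calls can be dropped to draw straight to the screen.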