Models with one continuous predictor

Simple Regression

In this chapter we will discuss so-called linear models. These are models of the form

\[ Y=\beta_0 +\beta_1 x +\epsilon \]

At first it seems rather restrictive to only consider linear models, but in fact they are quite general. What matters is that the model is linear in the parameters, and a transformation can often achieve this. For example,

\[ \begin{aligned} &Y = a x^b\\ &\log Y = \log (ax^b) = \log a + b \log x\\ \end{aligned} \]

is also a linear model, with response \(\log Y\), predictor \(\log x\) and parameters \(\log a\) and \(b\)!
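As a quick illustration (this example is not from the text; the values \(a=2\) and \(b=0.7\) are made up), such a power-law model can be fit with lm after taking logs of simulated data:

set.seed(111)
xs <- runif(50, 1, 10)
ys <- 2*xs^0.7*exp(rnorm(50, 0, 0.1))   # Y = a*x^b with multiplicative noise
fit.log <- lm(log(ys) ~ log(xs))
cf <- coef(fit.log)
round(c(exp(cf[1]), cf[2]), 2)          # should be close to a=2 and b=0.7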

Least Squares Regression

Example: Wine Consumption and Heart Disease

We have data on wine consumption (liters per person per year) and deaths from heart disease (per 100,000 people), by country:

kable.nice(wine)
Country Wine.Consumption Heart.Disease.Deaths
Australia 2.5 211
Austria 3.9 167
Belgium 2.9 131
Canada 2.4 191
Denmark 2.9 220
Finland 0.8 297
France 9.1 71
Iceland 0.8 211
Ireland 0.7 300
Italy 7.9 107
Netherlands 1.8 167
New Zealand 1.9 266
Norway 0.8 227
Spain 6.5 86
Sweden 1.6 207
Switzerland 5.8 115
United Kingdom 1.3 285
United States 1.2 199
Germany 2.7 172
plt <- ggplot(wine, 
       aes(Wine.Consumption, Heart.Disease.Deaths)) +
  geom_point() +
  labs(x="Wine Consumption", y="Deaths")
plt

We want to fit a linear model, that is a straight line. We will use the method of least squares. The idea is as follows:

say the model is

\[ \text{heart disease} = 260-10\times \text{ wine consumption} \]

and for a certain country (not necessarily one in the data set) we know that the wine consumption is 3.7, then according to our model the heart disease rate should be about

\[ \text{heart disease} = 260-10\times 3.7 = 223 \]

How do we find an equation? Well, finding some equation is easy: we can simply draw a few candidate lines over the scatterplot.
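Here is one way to draw such candidate lines (which intercept and slope belongs to which color is my guess, reusing the candidate equations discussed below):

plt +
  geom_abline(intercept = 280, slope = -10, color="red") +
  geom_abline(intercept = 260, slope = -20, color="green") +
  geom_abline(intercept = 260, slope = -23, color="orange") +
  geom_abline(intercept = 270, slope = -24, color="blue")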

Clearly the red line is not very good (too flat), and the green one is better but still a bit too flat. But how about the orange and blue ones? Both look reasonably good.

Is there a way to find a line that is “best”? The answer is yes. In order to understand how, we need to do the following:

Let’s concentrate for a moment on the third line, which has the equation

\[ \text{heart disease} = 270-24\times \text{ wine consumption} \]

or, for short, \(y = 270-24x\).

The United States has a wine consumption of \(x = 1.2\) liters and a heart disease rate of \(y = 199\). Now if we did not know the heart disease rate we could use the equation and find

\[ y = 270-24x = 270-24\times 1.2 = 241.2 \]

Now we have two values of \(y\):

  • the one in the data (\(y = 199\))
  • the one from the equation (\(y = 241.2\))

We can distinguish between them by calling the first one the observed value and the second one the fitted value.

Think of it in these terms: the fitted value is our guess, the observed value is the truth. So the difference between them is the error in our guess. We call this the residual:

\[ \epsilon = \text{observed} - \text{fitted} = 199-241.2 = -42.2 \]

The negative residual shows that the line \(y=270-24x\) overestimates the heart disease rate in the US, by about \(42\).

If the line described the data perfectly, the residuals would all be 0.

This was done for the US, but of course we could do the same for all the countries in the data set:

fits <- 270-24*wine[,2]                   # fitted values for every country
out <- cbind(wine, fits, wine[, 3]-fits)  # residuals = observed - fitted
colnames(out)[2:5] <- c("Consumption", "Deaths", "Fits", "Residuals")
kable.nice(out[1:5, ])
Country Consumption Deaths Fits Residuals
Australia 2.5 211 210.0 1.0
Austria 3.9 167 176.4 -9.4
Belgium 2.9 131 200.4 -69.4
Canada 2.4 191 212.4 -21.4
Denmark 2.9 220 200.4 19.6

So for each country our line makes an error. What we need is a way to measure the overall error. The idea of least squares is to use the sum of the squared residuals: \[ \text{RSS} = \sum \epsilon_i^2 \] For our line we find

sum(out$Residuals^2)
## [1] 25269.84
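The same calculation can be wrapped in a small helper function (not part of the original code) so that the RSS of any candidate line is easy to compute:

rss <- function(beta0, beta1) {
  # RSS of the line y = beta0 + beta1*x for the wine data
  fits <- beta0 + beta1*wine$Wine.Consumption
  sum((wine$Heart.Disease.Deaths - fits)^2)
}
rss(270, -24)   # reproduces the value above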

In the same way we can find an RSS for any line:

  • y = 280-10x , RSS = 71893
  • y = 260-20x , RSS = 40738
  • y = 260-23x , RSS = 24399.7

The first two, which we said were not so good, have a much higher RSS, so it seems that the lower the RSS, the better. Is there a line with the smallest possible RSS? The answer is again yes: it is found by the method of least squares, for which R has the routine lm:

fit <- lm(Heart.Disease.Deaths~Wine.Consumption,
    data=wine)
round(fit$coef, 2)
##      (Intercept) Wine.Consumption 
##           260.56           -22.97

The least squares regression equation is:

\[ \text{heart disease} = 260.56 - 22.97 \text{ wine consumption} \]

This is very close to the last of the lines we tried above.
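As an aside (not in the original text), the fitted model can also be used directly for prediction, for example for the hypothetical country considered earlier with a wine consumption of 3.7:

predict(fit, newdata=data.frame(Wine.Consumption=3.7))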

What is its RSS? It is not part of the printed output, but it is 24391, smaller than that of any of the lines above.
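It can also be computed from the fit object (this snippet is not in the original code):

sum(resid(fit)^2)   # residual sum of squares of the least squares fit
deviance(fit)       # for lm objects this returns the same value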

A nice graph to visualize the model is the scatterplot with the least squares regression line, called the fitted line plot:

plt +
  geom_smooth(method = "lm", se=FALSE)

Alternatives to Least Squares

Instead of minimizing the sum of squares we could also have

  • minimized the largest absolute residual
  • minimized the sum of the absolute residuals
  • some other figure of merit.

Historically least squares was used mostly because it could be done analytically:

\[ \begin{aligned} &\frac{d}{d \beta_0} \sum \left(y_i - \beta_0 - \beta_1 x_i\right)^2 = \\ &(-2)\sum \left(y_i - \beta_0 - \beta_1 x_i\right) = \\ &(-2) \left( \sum y_i - n\beta_0 - \beta_1 \sum x_i\right) = 0\\ &\hat{\beta}_0 = \bar{y}-\hat{\beta}_1 \bar{x} \end{aligned} \] and similarly for \(\beta_1\).
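As a quick sanity check (this snippet is not part of the original code), the resulting closed-form estimates, \(\hat{\beta}_1=\text{cov}(x,y)/\text{var}(x)\) and \(\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\), reproduce the lm coefficients for the wine data:

xw <- wine$Wine.Consumption
yw <- wine$Heart.Disease.Deaths
slope.ls <- cov(xw, yw)/var(xw)               # closed-form slope
intercept.ls <- mean(yw) - slope.ls*mean(xw)  # closed-form intercept
round(c(intercept.ls, slope.ls), 2)           # should match coef(fit) above

These days we can use pretty much any criterion we want: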

# sum of absolute residuals of the line beta[1] + beta[2]*x
fit.abs <- function(beta) 
  sum(abs(wine$Heart.Disease.Deaths
          -beta[1]-beta[2]*wine$Wine.Consumption))
round(nlm(fit.abs, c(260, -23))$estimate, 2)  # numerical minimization
## [1] 239.01 -18.46
plt +
  geom_smooth(method = "lm", se=FALSE) +
  geom_abline(slope = -18.46, 
              intercept = 239, color="red")

This is often called \(L_1\) regression. It is implemented in the robustbase package:

library(robustbase)
X <- cbind(1, wine$Wine.Consumption)
lmrob.lar(X, wine$Heart.Disease.Deaths)$coefficients
## [1] 239.00000 -18.46154

which uses a much better algorithm based on the simplex method.

One way to understand the difference between these two criteria is the following: let's use least squares and least absolute error to estimate the mean of a single variable!

So we have the model

\[ Y=\beta_0 + \epsilon \]

Using least squares (now with \(\beta_1=0\)) yields, as above, \(\hat{\beta}_0=\bar{y}\), the sample mean. What does the absolute error criterion give? It can be shown that it leads to the median!
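A quick numerical illustration (this snippet is not in the original): minimizing the two criteria for a single sample recovers the mean and the median, respectively, up to the optimizer's tolerance:

set.seed(222)
z <- rexp(25)    # a skewed sample, so mean and median differ noticeably
b.ls <- optimize(function(b) sum((z-b)^2), range(z))$minimum
b.l1 <- optimize(function(b) sum(abs(z-b)), range(z))$minimum
round(c(b.ls, mean(z), b.l1, median(z)), 3)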

Just as the median is a more robust estimator than the mean (that is, it depends much less on outliers), \(L_1\) estimation is more robust than least squares.

Example: An artificial data set

x <- 1:20
y1 <- x + rnorm(20, 0, 1.5)   # data without an outlier
y2 <- y1 + rnorm(20, 0, 0.1)
y2[1] <- 20                   # introduce one outlier
df <- data.frame(x=c(x, x), y=c(y1, y2), 
          which=rep(c("Without Outlier", "With Outlier"), each=20))
lm1 <- coef(lm(y1~x))                           # least squares fits
lm2 <- coef(lm(y2~x))
l11 <- lmrob.lar(cbind(1, x), y1)$coefficients  # L1 fits
l12 <- lmrob.lar(cbind(1, x), y2)$coefficients
print(lm1)
## (Intercept)           x 
##  -0.9558222   1.0405002
ggplot(df, aes(x, y, color=which)) +
  geom_point() +
  geom_abline(intercept = lm1[1], slope = lm1[2], color="blue") +
  geom_abline(intercept = lm2[1], slope = lm2[2], color="blue") +
  geom_abline(intercept = l11[1], slope = l11[2], color="red") +
  geom_abline(intercept = l12[1], slope = l12[2], color="red")   

We see that the effect of the outlier is much larger on the least squares regression (blue) than on the \(L_1\) regression (red).


One feature of least squares regression that is sometimes (often?) undesirable is that it is not symmetric in x and y: if we solve the least squares equation of y on x for x, we do not get the least squares regression equation of x on y.

Example

We have the gold-medal winning performances in the 200 meter dash at the Olympic games for men and women:

df <- data.frame(Men=olympics$Time.Men[1:16],      
                 Women=olympics$Time.Women[1:16])
fitMF <- lm(Women~Men, data=df)
cMF <- round(coef(fitMF), 2)
plt1 <- ggplot(data=df, aes(Men, Women)) +
  geom_point() +
  geom_abline(intercept = cMF[1], 
              slope = cMF[2],
              color="blue")
fitFM <- lm(Men~Women, data=df)
cFM <- round(coef(fitFM), 2)
plt2 <- ggplot(data=df, aes(Women, Men)) +
  geom_point() +
  geom_abline(intercept = cFM[1], 
              slope = cFM[2],
              color="blue") 
library(grid)   # pushViewport/viewport/grid.layout come from the grid package
pushViewport(viewport(layout = grid.layout(1, 2)))
print(plt1 ,
  vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(plt2 ,
  vp=viewport(layout.pos.row=1, layout.pos.col=2))        

and there is no reason why we should prefer one over the other. But

cMF
## (Intercept)         Men 
##       -8.76        1.56

\[ \begin{aligned} &y = -8.76+1.56x\\ &x = (y+8.76)/1.56 = 5.62+0.64y\\ \end{aligned} \]

and that is not the same as

cFM
## (Intercept)       Women 
##       10.07        0.44

Here is an idea: let's use a criterion that is symmetric in x and y. To start, we standardize both variables:

\[ \begin{aligned} &m_x = \bar{x}\text{, }s_x=sd(x)\text{, }z_x=(x-m_x)/s_x\\ &m_y = \bar{y}\text{, }s_y=sd(y)\text{, }z_y=(y-m_y)/s_y \end{aligned} \] and now we minimize

\[ E(a, b) = \sum(z_y-a z_x-b)^2 + \sum(z_x-a z_y-b)^2 \]

e.fun <- function(par) {
  # criterion symmetric in x and y, computed on the standardized scale
  sum((zy-par[1]*zx-par[2])^2)+sum((zx-par[1]*zy-par[2])^2)
}
mx <- mean(df$Men); sx <- sd(df$Men); zx <- (df$Men-mx)/sx
my <- mean(df$Women); sy <- sd(df$Women); zy <- (df$Women-my)/sy
p <- as.numeric(optim(cMF, e.fun)$par)  # minimize, starting at the lm coefficients

Now \[ \begin{aligned} &\frac{y-m_y}{s_y} = a \frac{x-m_x}{s_x}+b\\ &y= m_y+s_y(a \frac{x-m_x}{s_x}+b) \\ &y=m_y+\frac{s_y}{s_x}(ax-am_x+bs_x) \\ &y=\frac{as_y}{s_x}x+ \left[ m_y+\frac{s_y}{s_x}(bs_x-am_x) \right] \\ \end{aligned} \] so

slope <- p[1]*sy/sx
intercept <- my+sy/sx*(p[2]*sx-p[1]*mx)
ggplot(data=df, aes(Men, Women)) +
  geom_point() +
  geom_abline(intercept = intercept, 
              slope = slope,
              color="blue") 

and now the other way around, which is easy to do because everything is symmetric in x and y:

\[ x=\frac{as_x}{s_y}y+ \left[ m_x+\frac{s_x}{s_y}(bs_y-am_y) \right] \]

slope <- p[1]*sx/sy
intercept <- mx+sx/sy*(p[2]*sy-p[1]*my)
ggplot(data=df, aes(Women, Men)) +
  geom_point() +
  geom_abline(intercept = intercept, 
              slope = slope,
              color="blue")

ANOVA

ANOVA can also be viewed as a linear model where the predictor variable is categorical. The main difference is that in ANOVA the model is judged via the likelihood ratio test rather than least squares, and that the main interest is in hypothesis testing rather than prediction.
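A minimal sketch of this view, using simulated data (the group labels and means here are made up for illustration):

set.seed(333)
groups <- factor(rep(c("A", "B", "C"), each=10))
y <- c(rnorm(10, 10), rnorm(10, 12), rnorm(10, 11))
fit.aov <- lm(y ~ groups)   # linear model with a categorical predictor
anova(fit.aov)              # F test of equal group means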