Let \(\pmb{y}=(y_1,..,y_n)\), \(\pmb{x}=(x_1,..,x_n)\), \(\pmb{\epsilon}=(\epsilon_1,..,\epsilon_n)\) and let \(\beta_0,\beta_1\) be numbers. Then a model of the form
\[y_i=\beta_0+\beta_1 x_i+\epsilon_i; i=1,..,n\] is called a simple linear regression model.
The word simple refers to the fact that there is only one predictor \(x\).
We assume that \(\pmb{y}\) and \(\pmb{\epsilon}\) are random vectors whereas \(\pmb{x}\) is fixed. We will consider the case where \(\pmb{X}\) is random later.
We make the following assumptions:

1. \(E[\epsilon_i]=0\) for all \(i\).
2. \(\text{var}(\epsilon_i)=\sigma^2\) for all \(i\), that is, the errors have a common variance.
3. \(\text{cov}(\epsilon_i,\epsilon_j)=0\) for all \(i\ne j\), that is, the errors are uncorrelated.
Notice that so far there is no assumption regarding the distribution of the \(\epsilon_i\)’s.
The method of least squares estimates parameters by minimizing
\[R(\beta_0,\beta_1)=\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2\] If we have
\[R(\hat{\beta_0},\hat{\beta_1})=\min \left\{R(\beta_0,\beta_1):\beta_0,\beta_1\in \mathbb{R} \right\}\]
then we write
\[\hat{y}_i=\hat{\beta_0}+\hat{\beta_1}x_i\] and these are called the fitted values. Also we write
\[\hat{\epsilon}_i=y_i-\hat{y}_i\] and these are called the residuals.
Note that
\[\hat{\pmb{\epsilon}}'\hat{\pmb{\epsilon}}=\sum_{i=1}^n (y_i-\hat{y}_i)^2=\sum_{i=1}^n (y_i-\hat{\beta_0}-\hat{\beta_1} x_i)^2\] Let’s find \(\hat{\beta_0},\hat{\beta_1}\):
\[ \begin{aligned} &0=\frac{\partial R(\beta_0,\beta_1)}{\partial \beta_0} = (-2)\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)=\\ &(-2)\left(\sum_{i=1}^n y_i- n\beta_0-\sum_{i=1}^n \beta_1 x_i\right) =\\ &(-2n)\left(\bar{y}- \beta_0- \beta_1 \bar{x}\right) \\ &0=\frac{\partial R(\beta_0,\beta_1)}{\partial \beta_1} = (-2)\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)x_i = \\ &(-2n)\left(\overline{xy}- \beta_0\bar{x}- \beta_1 \overline{x^2}\right) \\ &\beta_0+ \beta_1 \bar{x}=\bar{y}\\ &\beta_0\bar{x}+ \beta_1 \overline{x^2} = \overline{xy} \\ &(\bar{y}-\beta_1 \bar{x})\bar{x}+ \beta_1 \overline{x^2} = \overline{xy} \\ &\beta_1 = \frac{\overline{xy}-\bar{y}\bar{x}}{\overline{x^2}-\bar{x}^2}=\frac{\sum_{i=1}^nx_iy_i-n\bar{y}\bar{x}}{\sum_{i=1}^nx_i^2-n\bar{x}^2}\\ &\beta_0 = \bar{y}-\beta_1\bar{x} \end{aligned} \]
We have data from a study for 19 developed countries on wine consumption (liters of wine per person per year) and deaths from heart disease (per 100,000 people). (Taken from David Moore: The Active Practice of Statistics; the data set is part of Resma3.Rdata.)
Note that strictly speaking this is not an experiment as described above because here \(\pmb{X}\) was random and not fixed. It turns out (and we will later study) that most results hold for both cases.
kable.nice(wine, do.row.names = FALSE)
Country | Wine.Consumption | Heart.Disease.Deaths |
---|---|---|
Australia | 2.5 | 211 |
Austria | 3.9 | 167 |
Belgium | 2.9 | 131 |
Canada | 2.4 | 191 |
Denmark | 2.9 | 220 |
Finland | 0.8 | 297 |
France | 9.1 | 71 |
Iceland | 0.8 | 211 |
Ireland | 0.7 | 300 |
Italy | 7.9 | 107 |
Netherlands | 1.8 | 167 |
New Zealand | 1.9 | 266 |
Norway | 0.8 | 227 |
Spain | 6.5 | 86 |
Sweden | 1.6 | 207 |
Switzerland | 5.8 | 115 |
United Kingdom | 1.3 | 285 |
United States | 1.2 | 199 |
Germany | 2.7 | 172 |
ggplot(data=wine, aes(Wine.Consumption, Heart.Disease.Deaths)) +
geom_point()
Let’s find the least squares regression line:
xbar=mean(wine$Wine.Consumption)
x2bar=mean(wine$Wine.Consumption^2)
ybar=mean(wine$Heart.Disease.Deaths)
xybar=mean(wine$Wine.Consumption*wine$Heart.Disease.Deaths)
beta1=(xybar-xbar*ybar)/(x2bar-xbar^2)
beta0=ybar-beta1*xbar
round(c(beta0, beta1), 2)
## [1] 260.56 -22.97
ggplot(data=wine, aes(Wine.Consumption, Heart.Disease.Deaths)) +
geom_point() +
geom_abline(intercept = beta0, slope = beta1,
color="blue", size=1.5)
or we can let R do the work:
fit=lm(Heart.Disease.Deaths~Wine.Consumption, data=wine)
coef(fit)
## (Intercept) Wine.Consumption
## 260.56338 -22.96877
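The fitted values and residuals can also be extracted directly from the lm object. Here is a quick check that they match the definitions above (fit, beta0 and beta1 are the objects computed in the previous chunks):

yhat = fitted(fit)    # fitted values from lm
ehat = resid(fit)     # residuals from lm
all.equal(unname(yhat), beta0 + beta1*wine$Wine.Consumption)
all.equal(unname(ehat), wine$Heart.Disease.Deaths - unname(yhat))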
Note that the least squares regression line can always be found, even if the assumptions are not satisfied and the line is a bad model for the data:
x=1:100/10
y=(x-5)^2 + rnorm(100, 0, 5)
df=data.frame(x=x, y=y)
ggplot(data=df, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
Under the three assumptions above we have \(E[\hat{\beta}_0]=\beta_0\) and \(E[\hat{\beta}_1]=\beta_1\), that is, the least squares estimators are unbiased.
proof
\[ \begin{aligned} &E[\bar{y}] = E[\frac1n\sum y_i]=\frac1n\sum E[y_i]= \\ &\frac1n\sum ( \beta_0+\beta_1x_i) = \beta_0+\beta_1\bar{x}\\ \end{aligned} \]
\[ \begin{aligned} &E[\hat{\beta}_1] = E[\frac{\sum_{i=1}^nx_iy_i-n\bar{y}\bar{x}}{\sum_{i=1}^nx_i^2-n\bar{x}^2}] = \\ &\frac{\sum_{i=1}^nx_iE[y_i]-nE[\bar{y}]\bar{x}}{\sum_{i=1}^nx_i^2-n\bar{x}^2} = \\ &\frac{\sum_{i=1}^nx_i(\beta_0+\beta_1x_i)-n[\beta_0+\beta_1\bar{x}]\bar{x}}{\sum_{i=1}^nx_i^2-n\bar{x}^2} = \\ &\frac{n\beta_0 \bar{x} +\beta_1 \sum_{i=1}^nx_i^2-n\beta_0\bar{x}-n\beta_1\bar{x}^2}{\sum_{i=1}^nx_i^2-n\bar{x}^2} = \\ &\frac{\beta_1 (\sum_{i=1}^nx_i^2-n\bar{x}^2)}{\sum_{i=1}^nx_i^2-n\bar{x}^2} = \beta_1\\ \end{aligned} \]
the other parts are similar.
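A small simulation can illustrate the unbiasedness of \(\hat{\beta}_1\): we generate many data sets from the model with known parameters (the values b0=1, b1=2, sigma=3 below are chosen purely for illustration) and check that the average of the slope estimates is close to the true slope.

set.seed(111)
x = 1:20                 # fixed predictor values
b0 = 1; b1 = 2; sigma = 3    # true parameter values (for illustration only)
B = 1e4
slopes = replicate(B, {
  y = b0 + b1*x + rnorm(length(x), 0, sigma)   # data generated from the model
  (sum(x*y) - length(x)*mean(x)*mean(y))/(sum(x^2) - length(x)*mean(x)^2)
})
round(mean(slopes), 3)   # should be close to b1 = 2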
Until now we assumed that \(\sigma\) is known. If it is not, it also has to be estimated from the data. To do so note that
\[\sigma^2=E[\epsilon_i^2]\]
for \(i=1,..,n\), and that the residuals \(\hat{\epsilon}_i=y_i-\hat{y}_i\) are our estimates of the \(\epsilon_i\)'s. We could therefore estimate \(\sigma^2\) by the mean of the squared residuals, however it turns out to be better to use
\[s^2=\frac{\sum (y_i-\hat{\beta}_0-\hat{\beta}_1x_i)^2}{n-2}=\frac{\text{SSE}}{n-2}\]
because then \(s^2\) is an unbiased estimator of \(\sigma^2\).
We define the residual sum of squares, or error sum of squares, SSE by
\[\text{SSE}=\sum (y_i-\hat{\beta}_0-\hat{\beta}_1x_i)^2\]
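For the wine data from above we can compute s from the residuals of the lm fit and compare it to the "residual standard error" that lm reports (fit is the lm object computed earlier; summary(fit)$sigma is that reported value):

sse = sum(resid(fit)^2)                      # error sum of squares
s = sqrt(sse/(length(resid(fit)) - 2))       # estimate of sigma
c(s, summary(fit)$sigma)                     # the two should agree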
Notice that if \(\beta_1=0\) we have \(y_i=\beta_0+\epsilon_i\), which does not involve the \(x_i\)'s at all, so then \(y\) does not depend on \(x\). Therefore we might be interested in testing whether indeed \(\beta_1=0\).
In order to do a hypothesis test we need to make some assumptions about the distribution of the \(\epsilon_i\). The usual one is
\[\epsilon_i\sim N(0, \sigma)\]
If \(\epsilon_i\sim N(0, \sigma)\) for \(i=1,..,n\), then
\[t=\frac{\hat{\beta_1}}{s/\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}}\sim t(n-2, \delta)\]
where the non-centrality parameter is
\[\delta=\beta_1/\left[\sigma/\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\right]\]
proof we will show these results later in greater generality.
Therefore under the null hypothesis \(H_0: \beta_1=0\) we have \(t\sim t(n-2)\) and a two-sided test rejects the null if
\[|t|>t_{\alpha/2, n-2}\]
The p-value of the test is given by
\[p=2P(T>|t|)\]
where \(T\sim t(n-2)\).
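Because t has a noncentral t distribution when \(\beta_1\ne 0\), the power of the test can be computed from the noncentral t cdf in R. A minimal sketch, where the values of x, beta1 and sigma below are assumptions chosen purely for illustration:

x = 1:20; n = length(x)          # hypothetical design points
beta1 = 0.5; sigma = 2; alpha = 0.05   # illustrative values
delta = beta1/(sigma/sqrt(sum((x-mean(x))^2)))   # noncentrality parameter
crit = qt(1-alpha/2, n-2)                        # two-sided critical value
# power = P(|T| > crit) with T ~ t(n-2, delta)
pt(-crit, n-2, ncp=delta) + 1 - pt(crit, n-2, ncp=delta)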
The 1970s Military Draft
In 1970, Congress instituted a random selection process for the military draft. All 366 possible birth dates were placed in plastic capsules in a rotating drum and were selected one by one. The first date drawn from the drum received draft number one and eligible men born on that date were drafted first. In a truly random lottery there should be no relationship between the date and the draft number.
Here is a scatterplot of Draft.Number by Day.of.Year:
ggplot(data=draft, aes(Day.of.Year, Draft.Number)) +
geom_point()
There is not supposed to be a relationship between Day.of.Year and Draft.Number, so it makes sense to test \(H_0:\beta_1=0\).
x=draft$Day.of.Year
y=draft$Draft.Number
n=length(x)
beta1hat=(sum(x*y)-n*mean(x)*mean(y))/(sum(x^2)-n*mean(x)^2)   # slope estimate
beta0hat=mean(y)-beta1hat*mean(x)                              # intercept estimate
yhat=beta0hat+beta1hat*x                                       # fitted values
s2=sum((y-yhat)^2)/(n-2)                                       # estimate of sigma^2
TS=beta1hat/(sqrt(s2/sum((x-mean(x))^2)))                      # t statistic
c(TS, qt(0.95, n-2), 2*(1-pt(abs(TS), n-2)))   # statistic, critical value, p value
## [1] -4.427181e+00 1.649051e+00 1.263829e-05
Again, R can do it for us:
summary(lm(Draft.Number~Day.of.Year, data=draft))
##
## Call:
## lm(formula = Draft.Number ~ Day.of.Year, data = draft)
##
## Residuals:
## Min 1Q Median 3Q Max
## -210.837 -85.629 -0.519 84.612 196.157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 225.00922 10.81197 20.811 < 2e-16
## Day.of.Year -0.22606 0.05106 -4.427 1.26e-05
##
## Residual standard error: 103.2 on 364 degrees of freedom
## Multiple R-squared: 0.05109, Adjusted R-squared: 0.04849
## F-statistic: 19.6 on 1 and 364 DF, p-value: 1.264e-05
A \(100(1-\alpha)\%\) confidence interval for \(\beta_1\) is given by
\[\hat{\beta_1}\pm t_{\alpha/2, n-2}\frac{s}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}}\]
proof follows by inverting the hypothesis test in (6.1.6)
A 95% confidence interval for the slope in the wine data set is given by
x=wine$Wine.Consumption
y=wine$Heart.Disease.Deaths
n=length(x)
beta1hat=(sum(x*y)-n*mean(x)*mean(y))/(sum(x^2)-n*mean(x)^2)
beta0hat=mean(y)-beta1hat*mean(x)
yhat=beta0hat+beta1hat*x
sse = sum((y-yhat)^2)
s2hat=sse/(n-2)
denom=sum((x-mean(x))^2)
round(beta1hat+c(1, -1)*qt(0.05/2, n-2)*sqrt(s2hat/denom), 2)   # (lower, upper) limits
## [1] -30.47 -15.46
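The same interval can also be found with R's built-in confint function applied to an lm fit:

fit = lm(Heart.Disease.Deaths~Wine.Consumption, data=wine)
round(confint(fit, "Wine.Consumption", level = 0.95), 2)   # matches the interval above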
We will not discuss interval estimates and/or hypothesis tests for \(\beta_0\). These of course exist but do not play a large role in Statistics. If indeed \(H_0: \beta_0=0\) is true the model becomes \(y_i=\beta_1x_i+\epsilon_i\), which is called a no-intercept model. Whether such a model is appropriate for an experiment is usually better decided from the context of the experiment than from some statistical analysis.
We have data \(y_i\), the amount of damage done by tropical storms and hurricanes in year \(i\), and \(x_i\), the number of such storms that hit Puerto Rico in year \(i\). Clearly if \(x_i=0\) we immediately have \(y_i=0\), so a no-intercept model is appropriate.
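In R such a model is fit by removing the intercept from the formula, as in y ~ x - 1 (or equivalently y ~ 0 + x). A minimal sketch with made-up storm data, just to show the syntax:

# hypothetical data: storms per year and damage (in millions of dollars)
storms = c(0, 1, 3, 0, 2, 1, 4, 2)
damage = c(0, 12, 41, 0, 22, 9, 55, 30)
fit.ni = lm(damage ~ storms - 1)   # "- 1" drops the intercept
coef(fit.ni)                       # a single coefficient: the slope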
Define the error sum of squares SSE, the regression sum of squares SSR and the total sum of squares SST by
\[\text{SSE} = \sum (y_i-\hat{y}_i)^2\]
\[\text{SSR} = \sum (\hat{y}_i-\bar{y})^2\]
\[\text{SST} = \sum (y_i-\bar{y})^2\]
We have \(\text{SST}=\text{SSR}+\text{SSE}\).
The coefficient of determination \(r^2\) is given by
\[r^2=\frac{\text{SSR}}{\text{SST}} =\frac{\sum (\hat{y}_i-\bar{y})^2}{\sum (y_i-\bar{y})^2}\]
An intuitive explanation for the coefficient of determination is as follows: it is the proportion of variation in the data explained by the model.
Comments
\[r^2=\frac{s_{xy}^2}{s_{xx}s_{yy}}\]
where \(s_{xy}\) was defined in (5.3.14). Therefore \(r\) is also the absolute value of the sample correlation coefficient.
\[t=\frac{\hat{\beta}_1}{s/\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}}=\frac{\sqrt{n-2}\text{ }r}{\sqrt{1-r^2}}\]
If \(H_0: \beta_1=0\) is true, then \(t\sim t(n-2)\).
For the data in example (6.1.3) we find
x=wine$Wine.Consumption
y=wine$Heart.Disease.Deaths
n=length(x)
beta1hat=(sum(x*y)-n*mean(x)*mean(y))/(sum(x^2)-n*mean(x)^2)
beta0hat=mean(y)-beta1hat*mean(x)
yhat=beta0hat+beta1hat*x
ybar=mean(y)
ssr=sum((yhat-ybar)^2)
sst=sum((y-ybar)^2)
r2=ssr/sst
round(r2, 2)
## [1] 0.71
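Continuing with the objects from the last chunk, we can verify numerically that SST = SSR + SSE, that this value of \(r^2\) agrees with the "Multiple R-squared" reported by lm, and that the t statistic of the slope can be recovered from r:

sse = sum((y-yhat)^2)
c(sst, ssr + sse)                  # SST = SSR + SSE
fit = lm(Heart.Disease.Deaths~Wine.Consumption, data=wine)
c(r2, summary(fit)$r.squared)      # same r^2
r = sqrt(r2)                       # r = |sample correlation|
c(sqrt(n-2)*r/sqrt(1-r2),          # |t| from r
  abs(summary(fit)$coefficients["Wine.Consumption", "t value"]))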