Let \(\pmb{y}=(y_1,..,y_n)\), \(\pmb{x}=(x_1,..,x_n)\), \(\pmb{\epsilon}=(\epsilon_1,..,\epsilon_n)\) and let \(\beta_0,\beta_1\) be numbers. Then a model of the form
\[y_i=\beta_0+\beta_1 x_i+\epsilon_i; i=1,..,n\] is called a simple linear regression model.
The word simple refers to the fact that there is only one predictor \(x\).
We assume that \(\pmb{y}\) and \(\pmb{\epsilon}\) are random vectors whereas \(\pmb{x}\) is fixed. We will consider the case where \(\pmb{X}\) is random later.
We make the following assumptions:

1. \(E[\epsilon_i]=0\) for all \(i\).
2. \(\text{var}(\epsilon_i)=\sigma^2\) for all \(i\), that is, the errors have a common variance.
3. \(\text{cov}(\epsilon_i,\epsilon_j)=0\) for all \(i\ne j\), that is, the errors are uncorrelated.
Notice that so far there is no assumption regarding the distribution of the \(\epsilon_i\)’s.
The method of least squares estimates parameters by minimizing
\[R(\beta_0,\beta_1)=\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2\] If we have
\[R(\hat{\beta_0},\hat{\beta_1})=\min \left\{R(\beta_0,\beta_1):\beta_0,\beta_1\in \mathbb{R} \right\}\]
then we write
\[\hat{y}_i=\hat{\beta_0}+\hat{\beta_1}x_i\] and these are called the fitted values. Also we write
\[\hat{\epsilon}_i=y_i-\hat{y}_i\] and these are called the residuals.
Note that
\[\hat{\pmb{\epsilon}}'\hat{\pmb{\epsilon}}=\sum_{i=1}^n (y_i-\hat{y}_i)^2=\sum_{i=1}^n (y_i-\hat{\beta_0}-\hat{\beta_1} x_i)^2\] Let’s find \(\hat{\beta_0},\hat{\beta_1}\):
\[ \begin{aligned} &0=\frac{\partial R(\beta_0,\beta_1)}{\partial \beta_0} = (-2)\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)=\\ &(-2)\left(\sum_{i=1}^n y_i- n\beta_0-\sum_{i=1}^n \beta_1 x_i\right) =\\ &(-2n)\left(\bar{y}- \beta_0- \beta_1 \bar{x}\right) \\ &0=\frac{\partial R(\beta_0,\beta_1)}{\partial \beta_1} = (-2)\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)x_i = \\ &(-2n)\left(\overline{xy}- \beta_0\bar{x}- \beta_1 \overline{x^2}\right) \\ &\beta_0+ \beta_1 \bar{x}=\bar{y}\\ &\beta_0\bar{x}+ \beta_1 \overline{x^2} = \overline{xy} \\ &(\bar{y}-\beta_1 \bar{x})\bar{x}+ \beta_1 \overline{x^2} = \overline{xy} \\ &\beta_1 = \frac{\overline{xy}-\bar{y}\bar{x}}{\overline{x^2}-\bar{x}^2}=\frac{\sum_{i=1}^nx_iy_i-n\bar{y}\bar{x}}{\sum_{i=1}^nx_i^2-n\bar{x}^2}\\ &\beta_0 = \bar{y}-\beta_1\bar{x} \end{aligned} \]
We have data from a study for 19 developed countries on wine consumption (liters of wine per person per year) and deaths from heart disease (per 100,000 people). (Taken from David Moore: The Active Practice of Statistics; the data set is part of Resma3.Rdata.)
Note that strictly speaking this is not an experiment as described above because here \(\pmb{X}\) was random and not fixed. It turns out (and we will later study) that most results hold for both cases.
kable.nice(wine, do.row.names = FALSE)
Country | Wine.Consumption | Heart.Disease.Deaths |
---|---|---|
Australia | 2.5 | 211 |
Austria | 3.9 | 167 |
Belgium | 2.9 | 131 |
Canada | 2.4 | 191 |
Denmark | 2.9 | 220 |
Finland | 0.8 | 297 |
France | 9.1 | 71 |
Iceland | 0.8 | 211 |
Ireland | 0.7 | 300 |
Italy | 7.9 | 107 |
Netherlands | 1.8 | 167 |
New Zealand | 1.9 | 266 |
Norway | 0.8 | 227 |
Spain | 6.5 | 86 |
Sweden | 1.6 | 207 |
Switzerland | 5.8 | 115 |
United Kingdom | 1.3 | 285 |
United States | 1.2 | 199 |
Germany | 2.7 | 172 |
ggplot(data=wine, aes(Wine.Consumption, Heart.Disease.Deaths)) +
geom_point()
Let’s find the least squares regression line:
xbar=mean(wine$Wine.Consumption)
x2bar=mean(wine$Wine.Consumption^2)
ybar=mean(wine$Heart.Disease.Deaths)
xybar=mean(wine$Wine.Consumption*wine$Heart.Disease.Deaths)
beta1=(xybar-xbar*ybar)/(x2bar-xbar^2)
beta0=ybar-beta1*xbar
round(c(beta0, beta1), 2)
## [1] 260.56 -22.97
ggplot(data=wine, aes(Wine.Consumption, Heart.Disease.Deaths)) +
geom_point() +
geom_abline(intercept = beta0, slope = beta1,
color="blue", size=1.5)
or we can let R do the work:
fit=lm(Heart.Disease.Deaths~Wine.Consumption, data=wine)
coef(fit)
## (Intercept) Wine.Consumption
## 260.56338 -22.96877
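The fitted values and residuals can also be extracted directly from the lm object. Here is a quick check that they match the definitions above (fit, beta0 and beta1 are the objects computed in the previous chunks):

yhat = fitted(fit)    # fitted values from lm
ehat = resid(fit)     # residuals from lm
all.equal(unname(yhat), beta0 + beta1*wine$Wine.Consumption)
all.equal(unname(ehat), wine$Heart.Disease.Deaths - unname(yhat))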
Note that the least squares regression line can always be found, even if the assumptions are not satisfied and the line is a bad model for the data:
x=1:100/10
y=(x-5)^2 + rnorm(100, 0, 5)
df=data.frame(x=x, y=y)
ggplot(data=df, aes(x, y)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE)
Under the three assumptions above we have \(E[\hat{\beta}_0]=\beta_0\) and \(E[\hat{\beta}_1]=\beta_1\), that is, the least squares estimators are unbiased.
proof
\[ \begin{aligned} &E[\bar{y}] = E[\frac1n\sum y_i]=\frac1n\sum E[y_i]= \\ &\frac1n\sum ( \beta_0+\beta_1x_i) = \beta_0+\beta_1\bar{x}\\ \end{aligned} \]
\[ \begin{aligned} &E[\hat{\beta}_1] = E[\frac{\sum_{i=1}^nx_iy_i-n\bar{y}\bar{x}}{\sum_{i=1}^nx_i^2-n\bar{x}^2}] = \\ &\frac{\sum_{i=1}^nx_iE[y_i]-nE[\bar{y}]\bar{x}}{\sum_{i=1}^nx_i^2-n\bar{x}^2} = \\ &\frac{\sum_{i=1}^nx_i(\beta_0+\beta_1x_i)-n[\beta_0+\beta_1\bar{x}]\bar{x}}{\sum_{i=1}^nx_i^2-n\bar{x}^2} = \\ &\frac{n\beta_0 \bar{x} +\beta_1 \sum_{i=1}^nx_i^2-n\beta_0\bar{x}-n\beta_1\bar{x}^2}{\sum_{i=1}^nx_i^2-n\bar{x}^2} = \\ &\frac{\beta_1 (\sum_{i=1}^nx_i^2-n\bar{x}^2)}{\sum_{i=1}^nx_i^2-n\bar{x}^2} = \beta_1\\ \end{aligned} \]
the other parts are similar.
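A small simulation can illustrate the unbiasedness of \(\hat{\beta}_1\): we generate many data sets from the model with known parameters (the values b0=1, b1=2, sigma=3 below are chosen purely for illustration) and check that the average of the slope estimates is close to the true slope.

set.seed(111)
x = 1:20                 # fixed predictor values
b0 = 1; b1 = 2; sigma = 3    # true parameter values (for illustration only)
B = 1e4
slopes = replicate(B, {
  y = b0 + b1*x + rnorm(length(x), 0, sigma)   # data generated from the model
  (sum(x*y) - length(x)*mean(x)*mean(y))/(sum(x^2) - length(x)*mean(x)^2)
})
round(mean(slopes), 3)   # should be close to b1 = 2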
Until now we assumed that \(\sigma\) is known. If it is not, it also has to be estimated from the data. To do so note that
\[\sigma^2=E[\epsilon_i^2]\]
for \(i=1,..,n\), and that the residuals \(\hat{\epsilon}_i=y_i-\hat{y}_i\) are our estimates of the \(\epsilon_i\)'s. We could therefore estimate \(\sigma^2\) by the mean of the squared residuals, however it turns out to be better to use
\[s^2=\frac{\sum (y_i-\hat{\beta}_0-\hat{\beta}_1x_i)^2}{n-2}=\frac{\text{SSE}}{n-2}\]
because then \(s^2\) is an unbiased estimator of \(\sigma^2\).
We define the residual sum of squares, or error sum of squares, SSE by
\[\text{SSE}=\sum (y_i-\hat{\beta}_0-\hat{\beta}_1x_i)^2\]
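For the wine data from above we can compute s from the residuals of the lm fit and compare it to the "residual standard error" that lm reports (fit is the lm object computed earlier; summary(fit)$sigma is that reported value):

sse = sum(resid(fit)^2)                      # error sum of squares
s = sqrt(sse/(length(resid(fit)) - 2))       # estimate of sigma
c(s, summary(fit)$sigma)                     # the two should agree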
Notice that if \(\beta_1=0\) we have \(y_i=\beta_0+\epsilon_i\), which does not involve the \(x_i\)'s at all, so then \(y\) does not depend on \(x\). Therefore we might be interested in testing whether indeed \(\beta_1=0\).
In order to do a hypothesis test we need to make some assumptions about the distribution of the \(\epsilon_i\). The usual one is
\[\epsilon_i\sim N(0, \sigma)\]
If \(\epsilon_i\sim N(0, \sigma)\) for \(i=1,..,n\), then
\[t=\frac{\hat{\beta_1}}{s/\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}}\sim t(n-2, \delta)\]
where the non-centrality parameter is
\[\delta=\beta_1/\left[\sigma/\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\right]\]
proof we will show these results later in greater generality.
Therefore under the null hypothesis \(H_0: \beta_1=0\) we have \(t\sim t(n-2)\) and a two-sided test rejects the null if
\[|t|>t_{\alpha/2, n-2}\]
The p-value of the test is given by
\[p=2P(T>|t|)\]
where \(T\sim t(n-2)\).
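Because t has a noncentral t distribution when \(\beta_1\ne 0\), the power of the test can be computed from the noncentral t cdf in R. A minimal sketch, where the values of x, beta1 and sigma below are assumptions chosen purely for illustration:

x = 1:20; n = length(x)          # hypothetical design points
beta1 = 0.5; sigma = 2; alpha = 0.05   # illustrative values
delta = beta1/(sigma/sqrt(sum((x-mean(x))^2)))   # noncentrality parameter
crit = qt(1-alpha/2, n-2)                        # two-sided critical value
# power = P(|T| > crit) with T ~ t(n-2, delta)
pt(-crit, n-2, ncp=delta) + 1 - pt(crit, n-2, ncp=delta)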
The 1970s Military Draft
In 1970, Congress instituted a random selection process for the military draft. All 366 possible birth dates were placed in plastic capsules in a rotating drum and were selected one by one. The first date drawn from the drum received draft number one and eligible men born on that date were drafted first. In a truly random lottery there should be no relationship between the date and the draft number.
Here is a scatterplot of Draft.Number by Day.of.Year:
ggplot(data=draft, aes(Day.of.Year, Draft.Number)) +
geom_point()
There is not supposed to be a relationship between Day.of.Year and Draft.Number, so it makes sense to test \(H_0:\beta_1=0\).
x=draft$Day.of.Year
y=draft$Draft.Number
n=length(x)
beta1hat=(sum(x*y)-n*mean(x)*mean(y))/(sum(x^2)-n*mean(x)^2)   # slope estimate
beta0hat=mean(y)-beta1hat*mean(x)                              # intercept estimate
yhat=beta0hat+beta1hat*x                                       # fitted values
s2=sum((y-yhat)^2)/(n-2)                                       # estimate of sigma^2
TS=beta1hat/(sqrt(s2/sum((x-mean(x))^2)))                      # t statistic
c(TS, qt(0.95, n-2), 2*(1-pt(abs(TS), n-2)))   # statistic, critical value, p value
## [1] -4.427181e+00 1.649051e+00 1.263829e-05
Again, R can do it for us:
summary(lm(Draft.Number~Day.of.Year, data=draft))
##
## Call:
## lm(formula = Draft.Number ~ Day.of.Year, data = draft)
##
## Residuals:
## Min 1Q Median 3Q Max
## -210.837 -85.629 -0.519 84.612 196.157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 225.00922 10.81197 20.811 < 2e-16
## Day.of.Year -0.22606 0.05106 -4.427 1.26e-05
##
## Residual standard error: 103.2 on 364 degrees of freedom
## Multiple R-squared: 0.05109, Adjusted R-squared: 0.04849
## F-statistic: 19.6 on 1 and 364 DF, p-value: 1.264e-05
A \(100(1-\alpha)\%\) confidence interval for \(\beta_1\) is given by
\[\hat{\beta_1}\pm t_{\alpha/2, n-2}\frac{s}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}}\]
proof follows by inverting the hypothesis test in (6.1.6)
A 95% confidence interval for the slope in the wine data set is given by
x=wine$Wine.Consumption
y=wine$Heart.Disease.Deaths
n=length(x)
beta1hat=(sum(x*y)-n*mean(x)*mean(y))/(sum(x^2)-n*mean(x)^2)
beta0hat=mean(y)-beta1hat*mean(x)
yhat=beta0hat+beta1hat*x
sse = sum((y-yhat)^2)
s2hat=sse/(n-2)
denom=sum((x-mean(x))^2)
round(beta1hat+c(1, -1)*qt(0.05/2, n-2)*sqrt(s2hat/denom), 2)   # (lower, upper) limits
## [1] -30.47 -15.46
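The same interval can also be found with R's built-in confint function applied to an lm fit:

fit = lm(Heart.Disease.Deaths~Wine.Consumption, data=wine)
round(confint(fit, "Wine.Consumption", level = 0.95), 2)   # matches the interval above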
We will not discuss interval estimates and/or hypothesis tests for \(\beta_0\). These of course exist but do not play a large role in Statistics. If indeed \(H_0: \beta_0=0\) is true the model becomes \(y_i=\beta_1x_i+\epsilon_i\), which is called a no-intercept model. Whether such a model is appropriate for an experiment is usually better decided from the context of the experiment than from some statistical analysis.
We have data \(y_i\), the amount of damage done by tropical storms and hurricanes in year \(i\), and \(x_i\), the number of such storms that hit Puerto Rico in year \(i\). Clearly if \(x_i=0\) we immediately have \(y_i=0\), so a no-intercept model is appropriate.
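In R such a model is fit by removing the intercept from the formula, as in y ~ x - 1 (or equivalently y ~ 0 + x). A minimal sketch with made-up storm data, just to show the syntax:

# hypothetical data: storms per year and damage (in millions of dollars)
storms = c(0, 1, 3, 0, 2, 1, 4, 2)
damage = c(0, 12, 41, 0, 22, 9, 55, 30)
fit.ni = lm(damage ~ storms - 1)   # "- 1" drops the intercept
coef(fit.ni)                       # a single coefficient: the slope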
Define the error sum of squares SSE, the regression sum of squares SSR and the total sum of squares SST by
\[\text{SSE} = \sum (y_i-\hat{y}_i)^2\]
\[\text{SSR} = \sum (\hat{y}_i-\bar{y})^2\]
\[\text{SST} = \sum (y_i-\bar{y})^2\]
We have \(\text{SST}=\text{SSR}+\text{SSE}\).
The coefficient of determination \(r^2\) is given by
\[r^2=\frac{\text{SSR}}{\text{SST}} =\frac{\sum (\hat{y}_i-\bar{y})^2}{\sum (y_i-\bar{y})^2}\]
An intuitive explanation for the coefficient of determination is as follows: it is the proportion of variation in the data explained by the model.
Comments
\[r^2=\frac{s_{xy}^2}{s_{xx}s_{yy}}\]
where \(s_{xy}\) was defined in (5.3.14). Therefore \(r\) is also the absolute value of the sample correlation coefficient.
\[t=\frac{\hat{\beta}_1}{s/\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}}=\frac{\sqrt{n-2}\text{ }r}{\sqrt{1-r^2}}\]
If \(H_0: \beta_1=0\) is true, then \(t\sim t(n-2)\).
For the data in example (6.1.3) we find
x=wine$Wine.Consumption
y=wine$Heart.Disease.Deaths
n=length(x)
beta1hat=(sum(x*y)-n*mean(x)*mean(y))/(sum(x^2)-n*mean(x)^2)
beta0hat=mean(y)-beta1hat*mean(x)
yhat=beta0hat+beta1hat*x
ybar=mean(y)
ssr=sum((yhat-ybar)^2)
sst=sum((y-ybar)^2)
r2=ssr/sst
round(r2, 2)
## [1] 0.71
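Continuing with the objects from the last chunk, we can verify numerically that SST = SSR + SSE, that this value of \(r^2\) agrees with the "Multiple R-squared" reported by lm, and that the t statistic of the slope can be recovered from r:

sse = sum((y-yhat)^2)
c(sst, ssr + sse)                  # SST = SSR + SSE
fit = lm(Heart.Disease.Deaths~Wine.Consumption, data=wine)
c(r2, summary(fit)$r.squared)      # same r^2
r = sqrt(r2)                       # r = |sample correlation|
c(sqrt(n-2)*r/sqrt(1-r2),          # |t| from r
  abs(summary(fit)$coefficients["Wine.Consumption", "t value"]))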