Consider the data set hubble. In 1929 Edwin Hubble published a paper showing a relationship between the distance of “extra-galactic nebulae” (galaxies) from Earth and their radial velocity away from it. His findings revolutionized astronomy. The “Hubble constant,” the slope of the regression of velocity (Y) on distance (X), is still a subject of research and debate. The data here are those Hubble published in his original paper.
Question: If it is true that there is a linear relationship between Velocity (Y) and Distance (X), what is the slope of the line?
If there is a linear relationship, there exist \(\beta_0\) and \(\beta_1\) such that
\[Y_i=\beta_0+\beta_1 X_i+\epsilon_i\]
i=1,..,n
where the \(\epsilon_i\) are (again) called the residuals.
In the problem above the main task is to find an interval estimate for \(\beta_1\). In other problems it might be to estimate Y for a specific value of x, to estimate E[Y] for some x, to test whether \(\beta_0\) or \(\beta_1\) equals zero (or some other value), and so on.
Another version of the regression problem is
\[Y_i=\beta_0+\beta_1 x_i+\epsilon_i\] i=1,..,n
that is, the x’s are not random but fixed; an example would be the income (y) and the number of years of service (x) of randomly selected employees of a company. In practice these two versions are usually treated the same way.
First we need a probability model. Again this will depend on the problem, but a common choice is to assume that (X,Y) is bivariate normal with parameters
(\(\mu_x\),\(\mu_y\),\(\sigma_x\),\(\sigma_y\),\(\rho\))
If we think in terms of predicting Y from a fixed value of X=x we need the conditional distribution of Y|X=x, which is
\[Y|X=x \sim N(\mu_y+\rho\sigma_y/\sigma_x(x-\mu_x),\sigma_y\sqrt{1-\rho^2})\]
Therefore we have
\[E[Y|X=x] = \mu_y+\rho\sigma_y/\sigma_x(x-\mu_x) = \mu_y-\mu_x\rho\sigma_y/\sigma_x+\rho\sigma_y/\sigma_x x\]
so we find that under this probability model we have a natural linear relationship between X and Y with
\[\beta_0=\mu_y-\mu_x\rho\sigma_y/\sigma_x\]
and
\[\beta_1=\rho\sigma_y/\sigma_x\]
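As a quick sanity check of this identity, we can simulate a large bivariate normal sample and compare the fitted slope to \(\rho\sigma_y/\sigma_x\). This is only a sketch with arbitrary parameter values; it uses mvrnorm from the MASS package.

library(MASS)
set.seed(111)
rho <- 0.7; mux <- 1; muy <- 2; sigx <- 2; sigy <- 3
Sigma <- matrix(c(sigx^2, rho*sigx*sigy,
                  rho*sigx*sigy, sigy^2), 2, 2)  # covariance matrix of (X,Y)
xy <- mvrnorm(1e5, mu=c(mux, muy), Sigma=Sigma)
coef(lm(xy[, 2] ~ xy[, 1]))[2]  # fitted slope, should be close to
rho*sigy/sigx                   # the theoretical beta_1 = 1.05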
Generally in a regression context the analysis is carried out using the conditional distribution of \((Y_1,..,Y_n)\) given \(X_1=x_1,..,X_n=x_n\), in which case we can consider the x’s as fixed and known. The probability model then becomes
\[Y_i=\beta_0+\beta_1 x_i+\epsilon_i\]
\(\epsilon_i \sim N(0,\sigma), i=1,..,n\)
Notice that we are assuming equal variance. If this is not reasonable, the analysis is still possible but somewhat more difficult.
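To see what a violation of the equal-variance assumption looks like, here is a small simulated sketch (arbitrary values): when the error standard deviation grows with x, the residuals fan out when plotted against the fitted values.

set.seed(222)
x <- runif(100, 0, 10)
y <- 1 + 2*x + rnorm(100, 0, x)  # error sd grows with x
fit0 <- lm(y ~ x)
plot(fitted(fit0), resid(fit0))  # funnel shape signals unequal variance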
In linear regression it is common to use the method of least squares for estimation, that is, to find \(\hat{\beta_0}\) and \(\hat{\beta_1}\) that minimize
\[\sum_{i=1}^n (y_i-\beta_0-\beta_1 x_i)^2\] Instead let’s find the maximum likelihood estimators of \(\beta_0\), \(\beta_1\) and \(\sigma\):
\[ \begin{aligned} &f(\pmb{y}\vert\beta_0,\beta_1,\sigma) = (2\pi\sigma^2)^{-n/2}\exp \left\{-\frac1{2\sigma^2}\sum\left[y_i-\beta_0-\beta_1 x_i\right]^2 \right\} \\ &l(\beta_0,\beta_1,\sigma) =-\frac{n}2\log(2\pi\sigma^2)-\frac1{2\sigma^2}\sum\left[y_i-\beta_0-\beta_1 x_i\right]^2 \\ &\frac{dl}{d\beta_0} =\frac1{\sigma^2}\sum\left[y_i-\beta_0-\beta_1 x_i\right]=\frac1{\sigma^2}\left[\sum y_i-n\beta_0-\beta_1\sum x_i\right]=0 \\ &n\hat{\beta}_0=\sum y_i-\hat{\beta}_1\sum x_i\\ &\text{ }\\ &\frac{dl}{d\beta_1} =\frac1{\sigma^2}\sum\left[y_i-\beta_0-\beta_1 x_i\right]x_i=\frac1{\sigma^2}\left[\sum x_iy_i-\beta_0\sum x_i-\beta_1\sum x_i^2\right]=0 \\ &\hat{\beta}_0\sum x_i=\sum x_iy_i-\hat{\beta}_1\sum x_i^2 \end{aligned} \] and this system of equations has the solution
\[\hat{\beta}_1=\frac{\sum x_iy_i-(\sum x_i)(\sum y_i)/n}{\sum x_i^2-(\sum x_i)^2/n}\\\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\]
which are of course the standard least squares regression estimates!
For \(\sigma^2\) we find
\[ \begin{aligned} &\frac{dl}{d\sigma^2} =-\frac{n}{2\sigma^2}+\frac1{2(\sigma^2)^2}\sum\left[y_i-\beta_0-\beta_1 x_i\right]^2=0\\ &\widehat{\sigma^2} = \frac1{n}\sum\left[y_i-\hat{\beta}_0-\hat{\beta}_1 x_i\right]^2 \end{aligned} \]
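Note that this MLE divides by n and is therefore biased; the usual unbiased estimator divides by n-2 instead, and the latter is what lm reports as the residual standard error. A quick check on simulated data (arbitrary values):

set.seed(333)
x <- 1:10
y <- 2 + 3*x + rnorm(10)
fit0 <- lm(y ~ x)
mean(resid(fit0)^2)  # MLE of sigma^2: divides by n
sigma(fit0)^2        # lm's estimate: divides by n-2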
What are the sampling distributions? First we can write
\[\hat{\beta}_0 =\sum \left[\frac1n-\frac{\bar{x}(x_i-\bar{x})}{\sum x_i^2-(\sum x_i)^2/n}\right]y_i\] and so \(\hat{\beta}_0\) is a linear combination of normal rv’s, and therefore normal itself. Moreover
\[ \begin{aligned} &E[\hat{\beta}_0] =\sum \left[\frac1n-\frac{\bar{x}(x_i-\bar{x})}{\sum x_i^2-(\sum x_i)^2/n}\right]E[y_i] = \\ &\sum \left[\frac1n-\frac{\bar{x}(x_i-\bar{x})}{\sum x_i^2-(\sum x_i)^2/n}\right](\beta_0+\beta_1x_i) = \beta_0\\ &var(\hat{\beta}_0) =\sigma^2\frac{\sum x_i^2}{n(\sum x_i^2-(\sum x_i)^2/n)} \\ \end{aligned} \]
Similarly one can show that
\[\hat{\beta}_1\sim N\left(\beta_1,\sigma/\sqrt{\sum x_i^2-(\sum x_i)^2/n}\right)\\ (n-2)s^2/\sigma^2\sim \chi^2(n-2)\]
where \(s^2=\frac1{n-2}\sum\left[y_i-\hat{\beta}_0-\hat{\beta}_1x_i\right]^2\) is the bias-corrected variance estimate.
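These sampling distributions are easy to verify with a small Monte Carlo study; the following is a sketch with arbitrary parameter values:

set.seed(444)
x <- 1:20; n <- length(x)
beta0 <- 1; beta1 <- 2; sig <- 3
Sxx <- sum(x^2) - sum(x)^2/n
b1 <- replicate(1e4, {
  y <- beta0 + beta1*x + rnorm(n, 0, sig)
  coef(lm(y ~ x))[2]
})
c(mean(b1), beta1)        # simulated mean vs true slope
c(sd(b1), sig/sqrt(Sxx))  # simulated sd vs theoretical sd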
A confidence interval for the slope can be found as follows:
\[\hat{\beta}_1\pm t_{n-2,\alpha/2}\sqrt{\frac{s^2}{\sum x_i^2-(\sum x_i)^2/n}}\]
alpha <- 0.05
x <- hubble$Distance
y <- hubble$Velocity
n <- length(x)
xbar <- mean(x)
ybar <- mean(y)
Sxx <- sum(x^2) - sum(x)^2/n
Sxy <- sum(x*y) - sum(x)*sum(y)/n
beta1 <- Sxy/Sxx
beta0 <- ybar - beta1*xbar
fits <- beta0 + beta1*x
e <- y - fits
s2 <- sum(e^2)/(n-2)  # unbiased estimate of sigma^2
round(beta1 + c(-1, 1)*qt(1-alpha/2, n-2)*sqrt(s2/Sxx), 1)
## [1] 298.1 610.2
or with the R function lm:
fit <- lm(Velocity~Distance, data=hubble)
summary(fit)
##
## Call:
## lm(formula = Velocity ~ Distance, data = hubble)
##
## Residuals:
## Min 1Q Median 3Q Max
## -397.96 -158.10 -13.16 148.09 506.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40.78 83.44 -0.489 0.63
## Distance 454.16 75.24 6.036 4.48e-06
##
## Residual standard error: 232.9 on 22 degrees of freedom
## Multiple R-squared: 0.6235, Adjusted R-squared: 0.6064
## F-statistic: 36.44 on 1 and 22 DF, p-value: 4.477e-06
round(confint(fit)[2, ], 1)
## 2.5 % 97.5 %
## 298.1 610.2
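A quick plot of the data with the fitted regression line (a sketch using ggplot2, the same plotting approach used for the wine data below):

ggplot(data=hubble, aes(Distance, Velocity)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)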
Say \(x_1,..,x_n\) are fixed numbers and \(Y_i=\alpha+\beta x_i+\epsilon_i\), where \(\epsilon_i \sim N(0,\sigma)\). We want to predict \(Y\) for some fixed x. Specifically we want to find a \((1-\alpha)100\%\) confidence interval for \(Y=\alpha+\beta x+\epsilon\). (Note that \(\alpha\) does double duty here, as the intercept and as the error probability of the interval; which is meant should be clear from the context.)
Using least squares we find the estimators of \(\alpha\) and \(\beta\)
\[ \begin{aligned} &S_{xx} =\sum (x_i-\bar{x})^2 \\ &S_{xy} =\sum (x_i-\bar{x})(y_i-\bar{y}) \\ &\hat{\beta} =S_{xy}/S_{xx} \\ &\hat{\alpha} = \bar{y}-\hat{\beta}\bar{x}\\ \end{aligned} \]
From here one can show that for a fixed x (not necessarily one of the \(x_i\)’s) a \((1-\alpha)100\%\) confidence interval for \(\mu_x=E[Y]=\alpha+\beta x\) is given by
\[\hat{\alpha} +\hat{\beta}x\pm t_{n-2,\alpha/2}\hat{\sigma}\sqrt{\frac1n+\frac{(x-\bar{x})^2}{S_{xx}}}\]
where \(\hat{\sigma}^2=\frac{1}{n-2}\sum \left(y_i-\hat{\alpha} -\hat{\beta}x_i\right)^2\). So this is a confidence interval for the mean response at some given x. If we want a confidence interval for an individual response (usually called a prediction interval) we have to use
\[\hat{\alpha} +\hat{\beta}x\pm t_{n-2,\alpha/2}\hat{\sigma}\sqrt{1+\frac1n+\frac{(x-\bar{x})^2}{S_{xx}}}\]
Consider the wine data set from Resma3:
ggplot(data=wine, aes(Wine.Consumption, Heart.Disease.Deaths)) +
geom_point()
Say we want a 95% confidence interval for the heart disease death rate of a country with a wine consumption of 5 liters.
xnew <- 5
alpha <- 0.05  # for a 95% interval
x <- wine$Wine.Consumption
y <- wine$Heart.Disease.Deaths
n <- length(x)
sxx <- sum((x-mean(x))^2)
syy <- sum((y-mean(y))^2)
sxy <- sum((x-mean(x))*(y-mean(y)))
betahat <- sxy/sxx                       # slope estimate
alphahat <- mean(y) - betahat*mean(x)    # intercept estimate
yhat <- alphahat + betahat*x             # fitted values
sigmahat <- sqrt(sum((y-yhat)^2)/(n-2))  # residual standard error
round(c(n, alphahat, betahat, sigmahat), 2)
## [1] 19.00 260.56 -22.97 37.88
round(alphahat + betahat*xnew + c(-1, 1)*qt(1-alpha/2, n-2)*sigmahat*sqrt(1+1/n+(xnew-mean(x))^2/sxx), 1)
## [1] 62.4 229.0
or using R:
fit <- lm(y~x)
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.95 -25.91 -12.35 26.97 55.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 260.563 13.835 18.833 7.97e-13
## x -22.969 3.557 -6.457 5.91e-06
##
## Residual standard error: 37.88 on 17 degrees of freedom
## Multiple R-squared: 0.7103, Adjusted R-squared: 0.6933
## F-statistic: 41.69 on 1 and 17 DF, p-value: 5.913e-06
predict(fit,newdata=data.frame(x=5), se.fit=TRUE, interval="prediction")
## $fit
## fit lwr upr
## 1 145.7195 62.39922 229.0399
##
## $se.fit
## [1] 11.17192
##
## $df
## [1] 17
##
## $residual.scale
## [1] 37.87858
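For comparison, the interval for the mean response from the first formula above (without the extra 1 under the square root) can be found with interval="confidence"; it is considerably narrower than the prediction interval:

predict(fit, newdata=data.frame(x=5), interval="confidence")
# roughly (122.1, 169.3), compared to (62.4, 229.0) for the prediction interval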