Solution to Final Exam Linear Models ESMA 6616

Problem 1

Say we have a simple regression model \(y_i=\beta_0+\beta_1 x_i+\epsilon_i\), \(\pmb{\epsilon}\sim N(\pmb{0},\sigma^2\pmb{I})\).

Derive a test for

\[H_0:\beta_1=b\text{ vs. }H_a:\beta_1\ne b\]

Let’s rewrite the model as follows

\[ \begin{aligned} &z_i = y_i-bx_i = \\ &\beta_0+(\beta_1-b) x_i +\epsilon_i = \\ &\beta_0+\gamma_1 x_i +\epsilon_i \end{aligned} \]

Now a test

\[H_0:\beta_1=b\text{ vs. }H_a:\beta_1\ne b\]

is equivalent to

\[H_0:\gamma_1=0\text{ vs. }H_a:\gamma_1\ne 0\]

By (6.1.6) under the null hypothesis

\[t=\frac{\hat{\gamma_1}}{s/\sqrt{\sum (x_i-\bar{x})^2}}\sim t(n-2)\]

Now

\[ \begin{aligned} &\bar{z} = \frac1n \sum (y_i-bx_i) = \bar{y}-b\bar{x}\\ &\hat{\gamma_1} = \frac{\sum (x_i-\bar{x})(z_i-\bar{z})}{\sum (x_i-\bar{x})^2} =\\ &\frac{\sum (x_i-\bar{x})(y_i-bx_i-\bar{y}+b\bar{x})}{\sum (x_i-\bar{x})^2} = \\ &\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}-b\frac{\sum (x_i-\bar{x})^2}{\sum (x_i-\bar{x})^2} = \hat{\beta_1}-b \end{aligned} \]

and so

\[t=\frac{\hat{\beta_1}-b}{s/\sqrt{\sum (x_i-\bar{x})^2}}\sim t(n-2)\]

Use the test in a. to find a \((1-\alpha)100\%\) confidence interval for \(\beta_1\).

\[ \begin{aligned} &1-\alpha = P(\vert \frac{\hat{\beta_1}-b}{s/\sqrt{\sum (x_i-\bar{x})^2}}\vert <t_{\alpha/2, n-2}) = \\ &P(-t_{\alpha/2, n-2} < \frac{\hat{\beta_1}-b}{s/\sqrt{\sum (x_i-\bar{x})^2}} <t_{\alpha/2, n-2}) = \\ &P(-t_{\alpha/2, n-2}s/\sqrt{\sum (x_i-\bar{x})^2} < \hat{\beta_1}-b<t_{\alpha/2, n-2}s/\sqrt{\sum (x_i-\bar{x})^2}) = \\ &P(\hat{\beta_1}-t_{\alpha/2, n-2}s/\sqrt{\sum (x_i-\bar{x})^2} < b<\hat{\beta_1}+t_{\alpha/2, n-2}s/\sqrt{\sum (x_i-\bar{x})^2}) \end{aligned} \]

so a \((1-\alpha)100\%\) confidence interval for \(\beta_1\) is given by

\[\hat{\beta_1}\pm t_{\alpha/2, n-2}s/\sqrt{\sum (x_i-\bar{x})^2}\]

Problem 2

Say we have a regression model \(y_i=\beta_0+\beta_1 x_i+\beta_2 z_i+\epsilon_i\), \(\pmb{\epsilon}\sim N(\pmb{0},\sigma^2\pmb{I})\). Find a \((1-\alpha)100\%\) confidence interval for \(\delta=\beta_1-\beta_2\).

By (6.6.14) a test for

\[H_0: \pmb{a'\beta}=0\text{ vs. }H_0: \pmb{a'\beta\ne 0}\] can be based on \(F=\frac{\pmb{(a'\hat{\beta}})^2}{s^2\pmb{a'(X'X)^{-1}a}}\sim F(1,n-3)\). Using \(\pmb{a}=(1 -1)'\) we have \(F=\frac{\pmb{(\hat{\beta}_1-\hat{\beta}_0})^2}{s^2\pmb{a'(X'X)^{-1}a}}\sim F(1,n-3)\). Equivalently we can do the test based on

\(t=\frac{\pmb{\hat{\beta}_1-\hat{\beta}_0}}{s\sqrt{\pmb{a'(X'X)^{-1}a}}}\sim t(n-3)\)

Inverting this test gives a \((1-\alpha)100\%\) confidence interval for \(\delta=\beta_1-\beta_2\) of the form

\[\hat{\beta}_1-\hat{\beta}_0\pm t_{\alpha/2,n-3}s\sqrt{\pmb{a'(X'X)^{-1}a}}\]

Problem 3

In a drug trial participants were divided into three groups: No Drug, Placebo and Drug. The hours until marked improvement were noted. The data is

## $`No Drug`
##  [1] 17.7 18.5 19.0 19.1 19.5 19.7 19.7 19.8 20.1 20.2
## 
## $Placebo
## [1] 18.4 19.6 19.6 19.8 19.9 20.4 20.8 21.8
## 
## $Drug
##  [1] 21.2 21.7 21.8 22.8 23.2 23.2 23.3 23.3 23.4 23.4 23.6 25.7

Applying the summary to the aov command yields

summary(aov(Time~Treatment, data=df))

##             Df Sum Sq Mean Sq F value  Pr(>F)
## Treatment    2  85.73   42.86   42.94 4.1e-09
## Residuals   27  26.95    1.00

Write an R routine myaovsummary that takes data from ANY one-way ANOVA experiment and print the same output. Your routine should NOT use either the aov or lm commands.

myaovsummary <- function(x, y) {
   k=length(unique(x))
   ni=tapply(y, x, length)
   n=sum(ni)
   y..=sum(y)
   yi.=tapply(y, x, sum)
   sse=sum(y^2)-sum(yi.^2/ni)
   ssalpha=sum(yi.^2/ni)-y..^2/n
   FTS=(ssalpha/(k-1))/(sse/(n-k))
   out=data.frame(c(paste(k-1), n-k),
                  round(c(ssalpha, sse), 2), 
                  round(c(ssalpha/(k-1), sse/(n-k)), 2), 
                  c(round(FTS, 2), " "),
                  c(round(1-pf(FTS, k-1, k*(n-1)), 4)," "))
   colnames(out)=c("Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
   rownames(out)=c("Treatment", "Residuals")
   out                   
}

myaovsummary(df$Treatment, df$Time)

##           Df Sum Sq Mean Sq F value Pr(>F)
## Treatment  2  85.73   42.86   42.94      0
## Residuals 27  26.95    1.00

Problem 4

A pharmaceutical company set up an experiment in which patients with a common type of headache were treated with a new analgesic or pain reliever. The analgesic was given to each patient in one of four dosage levels: 2,5,7 or 10 grams. Then the time until noticeable relieve was recorded in minutes. In addition the sex and the blood pressure of each patient was recorded. The blood pressure groups where formed by comparing each patients diastolic and systolic pressure reading with historical data. Based on this comparison the patients are assigned to one of three types: low (0.25), medium (0.5), high (0.75) according to the respective quantiles of the historic data.

The data set is called headache and is part of Resma3.

Find best (statistically significantly) best dosage level(s) for each patient. Make sure your analysis is complete (that is check assumptions etc)

Here the response variable is Time. Dose is a quantitative variable, Sex and BP.Quan are categorical. We will analyze this as a regression problem with dummy variables. Note that Sex should be treated an unordered factor and BP.Quan as an ordered factor:

headache$BP.Quan = factor(headache$BP.Quan,
                          levels=c(0.25, 0.50, 0.75),
                          ordered=TRUE)
headache$Sex = factor(headache$Sex,
                          levels=0:1,
                          ordered=FALSE)

ggplot(data=headache, aes(Dose, Time, color=Sex)) +
   geom_point() +
   geom_smooth(method = "lm", se=FALSE)

There does not seem to be interaction

ggplot(data=headache, aes(Dose, Time, color=BP.Quan)) +
   geom_point() +
   geom_smooth(method = "lm", se=FALSE)

There seems to be some interaction

fit=lm(Time~Dose*Sex*BP.Quan, data=headache)
summary(fit)

## 
## Call:
## lm(formula = Time ~ Dose * Sex * BP.Quan, data = headache)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8088 -4.6103  0.3676  3.9375  7.0147 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)         51.32353    4.21236  12.184 4.08e-08
## Dose                -4.63725    0.63146  -7.344 8.93e-06
## Sex1                -0.33333    5.95718  -0.056   0.9563
## BP.Quan.L           17.08495    7.29602   2.342   0.0373
## BP.Quan.Q            1.89115    7.29602   0.259   0.7999
## Dose:Sex1            1.00000    0.89302   1.120   0.2847
## Dose:BP.Quan.L      -2.64125    1.09372  -2.415   0.0326
## Dose:BP.Quan.Q      -0.06004    1.09372  -0.055   0.9571
## Sex1:BP.Quan.L      -3.28597   10.31813  -0.318   0.7556
## Sex1:BP.Quan.Q      11.44296   10.31813   1.109   0.2892
## Dose:Sex1:BP.Quan.L -0.04159    1.54675  -0.027   0.9790
## Dose:Sex1:BP.Quan.Q -1.87314    1.54675  -1.211   0.2492
## 
## Residual standard error: 6.377 on 12 degrees of freedom
## Multiple R-squared:    0.9,  Adjusted R-squared:  0.8084 
## F-statistic:  9.82 on 11 and 12 DF,  p-value: 0.0002111

The only significant factors are Dose, BP.Quan and their interaction. Let’s check the assumptions of that model:

fit=lm(Time~Dose*BP.Quan, data=headache)
summary(fit)

## 
## Call:
## lm(formula = Time ~ Dose * BP.Quan, data = headache)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.677  -4.040  -1.500   4.228  14.294 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     51.1569     3.1456  16.263 3.31e-12
## Dose            -4.1373     0.4715  -8.774 6.42e-08
## BP.Quan.L       15.4420     5.4483   2.834  0.01100
## BP.Quan.Q        7.6126     5.4483   1.397  0.17933
## Dose:BP.Quan.L  -2.6620     0.8167  -3.259  0.00435
## Dose:BP.Quan.Q  -0.9966     0.8167  -1.220  0.23813
## 
## Residual standard error: 6.735 on 18 degrees of freedom
## Multiple R-squared:  0.8327, Adjusted R-squared:  0.7863 
## F-statistic: 17.92 on 5 and 18 DF,  p-value: 1.98e-06

df <- data.frame(Residuals=resid(fit), 
               Fits = fitted(fit),
               BP.Quan = headache$BP.Quan)
ggplot(data=df, aes(Fits, Residuals, color=BP.Quan)) +
              geom_point() +
              geom_smooth(method = "lm", se=FALSE)+
              geom_hline(yintercept = 0)

ggplot(data=df, aes(sample=Residuals)) +
             geom_qq() + geom_qq_line()

and so the residual vs. fits plot and the normal plot look good.

Which combinations ar best? Let’s again check the fitted line plot:

ggplot(data=headache, aes(Dose, Time, color=BP.Quan)) +
   geom_point() +
   geom_smooth(method = "lm", se=FALSE)

so on all levels of blood pressure the 10 mg pill is best. Is it statistically significantly better than the 7mg pill? Let’s find the prediction intervals:

bp=levels(headache$BP.Quan)
bp

## [1] "0.25" "0.5"  "0.75"

predict(fit, newdata=data.frame(Dose=c(2, 5, 7,10), 
                     BP.Quan=c(bp[1], bp[1],bp[1],bp[1])),
        interval="prediction")

##        fit        lwr      upr
## 1 38.02206 21.5190064 54.52511
## 2 30.03676 14.9309462 45.14258
## 3 24.71324  9.6074168 39.81905
## 4 16.72794  0.2248888 33.23099

all the intervals overlap, so there is no stat. significant difference

bp=levels(headache$BP.Quan)
bp

## [1] "0.25" "0.5"  "0.75"

predict(fit, newdata=data.frame(Dose=c(2, 5, 7,10), 
                     BP.Quan=c(bp[2], bp[2],bp[2],bp[2])),
        interval="prediction")

##        fit       lwr      upr
## 1 38.29412 21.791065 54.79717
## 2 28.32353 13.217711 43.42935
## 3 21.67647  6.570652 36.78229
## 4 11.70588 -4.797170 28.20893

Again all the intervals overlap, so there is no stat. significant difference

bp=levels(headache$BP.Quan)
bp

## [1] "0.25" "0.5"  "0.75"

predict(fit, newdata=data.frame(Dose=c(2, 5, 7,10), 
                     BP.Quan=c(bp[3], bp[3],bp[3],bp[3])),
        interval="prediction")

##          fit        lwr      upr
## 1 52.3308824  35.827830 68.83393
## 2 33.0514706  17.945652 48.15729
## 3 20.1985294   5.092711 35.30435
## 4  0.9191176 -15.583935 17.42217

Here the first and the last two intervals do not overlap, so there is some stat. significant difference. Using either 7 or 10 mg is better than using 2 mg.