Generalized Least Squares

Until now we have assumed that the y variables were independent. We now study the case where \(cov(\pmb{y})=\sigma^2\pmb{V}\). One common case is where the variance of the y’s increases (or decreases) as the x’s increase (or decrease). Another is where the x’s are time points, so that responses close together in time can be expected to be correlated.

So the model now is

\[\pmb{y}=\pmb{X\beta}+\pmb{\epsilon}\text{, }E[\pmb{y}]=\pmb{X\beta}\text{, }cov(\pmb{y})=\sigma^2\pmb{V}\]

where \(\pmb{X}\) is full-rank and \(\pmb{V}\) is a known positive definite matrix.

Notice that \(\pmb{V}\) is an \(n\times n\) matrix while we only have n observations, so estimating \(\pmb{V}\) from the data alone is not possible. Sometimes additional information on the structure of \(\pmb{V}\) is available, and then estimation becomes possible.
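
As an illustration of such structures (the numbers below are arbitrary choices, not from the text): the first situation leads to a diagonal \(\pmb{V}\), and for time-ordered observations one common choice is a correlation that decays with the distance in time, for example \(V_{ij}=\rho^{|i-j|}\). In R such matrices could be set up as follows:

n=5
x=1:n
V.diagonal=diag(x)                           # variance proportional to x
rho=0.5
V.timeseries=rho^abs(outer(1:n, 1:n, "-"))   # correlation rho^|i-j| for time-ordered data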

Estimation

Theorem (6.5.1)

Under the model above we have

  1. The BLUE estimator of \(\pmb{\beta}\) is

\[\pmb{\hat{\beta}} = (\pmb{X'V^{-1}X})^{-1}\pmb{X'V^{-1}y}\]

\[cov(\pmb{\hat{\beta}}) =\sigma^2 (\pmb{X'V^{-1}X})^{-1}\]

  2. An unbiased estimator of \(\sigma^2\) is

\[\hat{s}^2 = (\pmb{y}-\pmb{X\hat{\beta}})'\pmb{V}^{-1}(\pmb{y}-\pmb{X\hat{\beta}})/(n-k-1)\]

proof omitted
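
As a small illustration, here is a minimal R sketch (our own, not from the text) of these estimators for a given response y, design matrix X (with the intercept column included) and known V:

gls=function(y, X, V) {
   Vinv=solve(V)
   A=solve(t(X)%*%Vinv%*%X)
   beta.hat=A%*%t(X)%*%Vinv%*%y
   e=y-X%*%beta.hat
   # X has k+1 columns (intercept plus k predictors), so n-k-1 = nrow(X)-ncol(X)
   s2=as.vector(t(e)%*%Vinv%*%e)/(nrow(X)-ncol(X))
   list(beta.hat=as.vector(beta.hat), cov.beta.hat=s2*A, s2=s2)
}

Here cov.beta.hat estimates \(cov(\pmb{\hat{\beta}})=\sigma^2(\pmb{X'V^{-1}X})^{-1}\) by plugging in \(\hat{s}^2\) for \(\sigma^2\).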

Theorem (6.5.2)

Say \(\pmb{y}\sim N_n(\pmb{X\beta}, \sigma^2\pmb{V})\); then the maximum likelihood estimators are

\[\pmb{\hat{\beta}} = (\pmb{X'V^{-1}X})^{-1}\pmb{X'V^{-1}y}\]

\[\hat{s}^2 = (\pmb{y}-\pmb{X\hat{\beta}})'\pmb{V}^{-1}(\pmb{y}-\pmb{X\hat{\beta}})/n\]

proof omitted
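
As a sanity check (the simulated data, the choice \(\pmb{V}=diag(x_i)\) and the starting values are ours, just for illustration) we can maximize the normal log-likelihood numerically with optim and compare the result with the closed-form expressions; the two columns should agree up to the tolerance of the optimizer:

set.seed(111)
n=20
x=1:n
X=cbind(1, x)
V=diag(x)        # an assumed known V, just for this check
Vinv=diag(1/x)
y=as.vector(X%*%c(5, 2))+rnorm(n, 0, sqrt(2*x))   # true beta=(5,2), sigma^2=2
negloglik=function(theta) {
   beta=theta[1:2]
   sigma2=exp(theta[3])   # log-parametrization keeps sigma2 positive
   e=y-X%*%beta
   0.5*(n*log(2*pi*sigma2)+as.numeric(determinant(V)$modulus)+
        as.vector(t(e)%*%Vinv%*%e)/sigma2)
}
fit=optim(c(coef(lm(y~x)), 0), negloglik)
beta.hat=solve(t(X)%*%Vinv%*%X, t(X)%*%Vinv%*%y)
e=y-X%*%beta.hat
s2.hat=as.vector(t(e)%*%Vinv%*%e)/n
round(cbind(optim=c(fit$par[1:2], exp(fit$par[3])),
            formula=c(beta.hat, s2.hat)), 2)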

Example (6.5.3)

Recall the centered model

\[\pmb{y}=\begin{pmatrix} \pmb{j} & \pmb{X}_c \end{pmatrix} \begin{pmatrix} \pmb{\alpha} \\ \pmb{\beta}_1 \end{pmatrix}+\pmb{\epsilon}\]

with covariance pattern

\[ \pmb{\Sigma} = \sigma^2\begin{pmatrix} 1 & \rho & ... &\rho \\ \rho & 1 & ... & \rho\\ \vdots & \vdots & & \vdots \\ \rho & \rho & ... & 1 \end{pmatrix}= \sigma^2\left[(1-\rho)\pmb{I}+\rho \pmb{J} \right]=\sigma^2\pmb{V} \]

so all variables have the same variance \(\sigma^2\) and any pair has the same correlation \(\rho\).

Note

\[ \begin{aligned} &\pmb{X'V^{-1}X} = \\ &\begin{pmatrix} \pmb{j}' \\ \pmb{X}_c' \end{pmatrix}\pmb{V}^{-1}\begin{pmatrix} \pmb{j} & \pmb{X}_c \end{pmatrix} =\\ &\begin{pmatrix} \pmb{j}'\pmb{V}^{-1}\pmb{j} & \pmb{j}'\pmb{V}^{-1}\pmb{X}_c \\ \pmb{X}_c'\pmb{V}^{-1}\pmb{j} & \pmb{X}_c'\pmb{V}^{-1}\pmb{X}_c \end{pmatrix} \end{aligned} \]

We can find

\[\pmb{V}^{-1} = a(\pmb{I}-b\rho\pmb{J})\]

where \(a=1/(1-\rho)\) and \(b=1/[1+(n-1)\rho]\), which can be verified by multiplying out \(\pmb{V}\pmb{V}^{-1}\).
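
A quick numerical check of this inverse (the values n=4 and \(\rho=0.3\) are arbitrary):

n=4
rho=0.3
V=(1-rho)*diag(n)+rho*matrix(1, n, n)
a=1/(1-rho)
b=1/(1+(n-1)*rho)
Vinv=a*(diag(n)-b*rho*matrix(1, n, n))
round(V%*%Vinv, 10)   # should be the identity matrix

Now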

\[ \begin{aligned} &\pmb{X'V^{-1}X} = \\ \text{ }\\ &\begin{pmatrix} \pmb{j}'a(\pmb{I}-b\rho\pmb{J})\pmb{j} & \pmb{j}'a(\pmb{I}-b\rho\pmb{J})\pmb{X}_c \\ \pmb{X}_c'a(\pmb{I}-b\rho\pmb{J})\pmb{j} & \pmb{X}_c'a(\pmb{I}-b\rho\pmb{J})\pmb{X}_c \end{pmatrix}=\\ \text{ }\\ &\begin{pmatrix} a\pmb{j}'\pmb{j}-ab\rho\pmb{j}'\pmb{J}\pmb{j} & a\pmb{j}'\pmb{X}_c-ab\rho\pmb{j}'\pmb{J}\pmb{X}_c \\ a\pmb{X}_c'\pmb{j}-ab\rho\pmb{X}_c'\pmb{J}\pmb{j} & a\pmb{X}_c'\pmb{X}_c-ab\rho\pmb{X}_c'\pmb{J}\pmb{X}_c \end{pmatrix}=\\ \text{ }\\ &\begin{pmatrix} an-ab\rho n^2 & a\pmb{j}'\pmb{X}_c-ab\rho n\pmb{j}'\pmb{X}_c \\ a\pmb{X}_c'\pmb{j}-ab\rho n\pmb{X}_c'\pmb{j} & a\pmb{X}_c'\pmb{X}_c-ab\rho\pmb{X}_c'\pmb{J}\pmb{X}_c \end{pmatrix}=\\ \text{ }\\ &\begin{pmatrix} an(1-bn\rho) & a(1-bn\rho)\pmb{j}'\pmb{X}_c \\ a(1-bn\rho)\pmb{X}_c'\pmb{j} & a\pmb{X}_c'(\pmb{I}-b\rho\pmb{J})\pmb{X}_c \end{pmatrix} = \\ \text{ }\\ &\begin{pmatrix} bn & \pmb{0}' \\ \pmb{0} & a\pmb{X}_c'\pmb{X}_c \end{pmatrix} \end{aligned} \]

because \(\pmb{X}_c\) is the centered matrix, so \(\pmb{j'X}_c=\pmb{0}'\) and \(\pmb{X}_c'\pmb{JX}_c=\pmb{X}_c'\pmb{jj'X}_c=\pmb{0}\), and because \(an(1-bn\rho)=an(1-\rho)/[1+(n-1)\rho]=bn\). Also we have

\[\pmb{X'V^{-1}y} = \begin{pmatrix} bn\bar{y} \\ a\pmb{X}_c'\pmb{y} \end{pmatrix}\]

and so

\[\begin{pmatrix} \hat{\alpha} \\ \hat{\beta}_1 \end{pmatrix} = \begin{pmatrix} \bar{y} \\ (\pmb{X}_c'\pmb{X}_c)^{-1}\pmb{X}_c'\pmb{y} \end{pmatrix}\]

These are exactly the ordinary least squares estimators, so under this equicorrelation structure generalized and ordinary least squares coincide.
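
Here is a quick check with simulated data (n, \(\rho\) and the coefficients are arbitrary choices, not from the text) that generalized least squares with this \(\pmb{V}\) reproduces the ordinary least squares fit:

set.seed(222)
n=10
rho=0.4
x=1:n
X=cbind(1, x)
y=2+3*x+rnorm(n)
V=(1-rho)*diag(n)+rho*matrix(1, n, n)   # sigma^2 cancels, so it is omitted here
Vinv=solve(V)
beta.gls=solve(t(X)%*%Vinv%*%X, t(X)%*%Vinv%*%y)
cbind(gls=as.vector(beta.gls), ols=as.vector(coef(lm(y~x))))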

Weighted Regression

Example (6.5.4)

Suppose we have a simple regression problem of the form

\[y_i=\beta_0+\beta_1x_i+\epsilon_i\]

where \(var(y_i)=\sigma^2x_i\) and \(cov(y_i,y_j)=0\) for all \(i\ne j\). (This is an example of a weighted regression model.) So we have

\[ \pmb{V} = \begin{pmatrix} x_1 & 0 & ... & 0 \\ 0 & x_2 & ... & 0 \\ \vdots &\vdots & &\vdots \\ 0 & 0 & ... & x_n \end{pmatrix} \]

and

\[ \pmb{X} = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \\ \end{pmatrix} \]

and

\[ \pmb{V}^{-1} = \begin{pmatrix} 1/x_1 & 0 & ... & 0 \\ 0 & 1/x_2 & ... & 0 \\ \vdots &\vdots & &\vdots \\ 0 & 0 & ... & 1/x_n \end{pmatrix} \]

\[ \begin{aligned} &\pmb{X'V^{-1}X} = \\ &\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \\ \end{pmatrix}' \begin{pmatrix} 1/x_1 & 0 & ... & 0 \\ 0 & 1/x_2 & ... & 0 \\ \vdots &\vdots & &\vdots \\ 0 & 0 & ... & 1/x_n \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \\ \end{pmatrix} = \\ &\begin{pmatrix} 1/x_1 & 1/x_2 & ... & 1/x_n \\ 1 & 1 & ... & 1 \\ \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \\ \end{pmatrix} = \\ &\begin{pmatrix} \sum 1/x_i & n \\ n & \sum x_i \end{pmatrix} \\ \text{ }\\ &(\pmb{X'V^{-1}X})^{-1} = \\ &\frac{1}{(\sum 1/x_i)(\sum x_i)-n^2} \begin{pmatrix} \sum x_i & -n \\ -n & \sum 1/x_i \end{pmatrix} \\ \end{aligned} \]

\[ \begin{aligned} &\pmb{X'V^{-1}y} = \\ &\begin{pmatrix} 1/x_1 & 1/x_2 & ... & 1/x_n \\ 1 & 1 & ... & 1 \\ \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_n \\ \end{pmatrix} =\\ &\begin{pmatrix} \sum y_i/x_i \\ \sum y_i \end{pmatrix} \end{aligned} \]

and so

\[ \begin{aligned} &\pmb{\hat{\beta}} = (\pmb{X'V^{-1}X})^{-1}\pmb{X'V^{-1}y} =\\ &\text{}\\ &\frac{1}{(\sum 1/x_i)(\sum x_i)-n^2} \begin{pmatrix} \sum x_i & -n \\ -n & \sum 1/x_i \end{pmatrix} \begin{pmatrix} \sum y_i/x_i \\ \sum y_i \end{pmatrix} = \\ &\frac{1}{(\sum 1/x_i)(\sum x_i)-n^2} \begin{pmatrix} (\sum y_i/x_i)(\sum x_i)-n\sum y_i \\ (\sum 1/x_i)(\sum y_i)-n(\sum y_i/x_i) \end{pmatrix} \end{aligned} \]

Let’s do a numerical example using R: say \(x=1,...,5\) and \(y=100+10x+\epsilon\), where \(\epsilon\) has mean 0 and standard deviation \(5x\); then

x=1:5
y=cbind(100+10*x+rnorm(5, 0, 5*x))
plot(x,y)

V=diag(x)
V
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    2    0    0    0
## [3,]    0    0    3    0    0
## [4,]    0    0    0    4    0
## [5,]    0    0    0    0    5
X=cbind(1,x)
X
##        x
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
## [4,] 1 4
## [5,] 1 5
Vinf=diag(1/x)
Xp.Vinf.X = t(X)%*%Vinf%*%X
c(sum(1/x), sum(x))
## [1]  2.283333 15.000000
Xp.Vinf.X
##             x
##   2.283333  5
## x 5.000000 15
A=solve(Xp.Vinf.X)
c(sum(x), -5, sum(1/x))/((sum(1/x)*sum(x)-5^2))
## [1]  1.6216216 -0.5405405  0.2468468
A
##                       x
##    1.6216216 -0.5405405
## x -0.5405405  0.2468468
B=t(X)%*%Vinf%*%y
c(sum(y/x), sum(y))
## [1] 268.3518 604.1924
B
##       [,1]
##   268.3518
## x 604.1924
c(sum(y/x)*sum(x)-5*sum(y),
  sum(1/x)*sum(y)-5*sum(y/x))/((sum(1/x)*sum(x)-5^2))
## [1] 108.574538   4.087978
A%*%B
##         [,1]
##   108.574538
## x   4.087978

so this works very well!

We can also fit a weighted regression with the R routine lm and its weights argument. The weights are the inverses of the variances:

summary(lm(y~x, weights=1/x))
## 
## Call:
## lm(formula = y ~ x, weights = 1/x)
## 
## Weighted Residuals:
##       1       2       3       4       5 
##  0.3303 -1.7673  0.7326  3.1637 -2.4271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  108.575      3.261  33.298 5.95e-05
## x              4.088      1.272   3.213   0.0488
## 
## Residual standard error: 2.561 on 3 degrees of freedom
## Multiple R-squared:  0.7749, Adjusted R-squared:  0.6998 
## F-statistic: 10.33 on 1 and 3 DF,  p-value: 0.04883

Example (6.5.5)

Let’s do a simple simulation to see the difference between ordinary and weighted regression. We use a model similar to that of the last example:

gen.data=function(n=50, beta0=100, beta1=10, sigma2=1) {
   x=seq(1, 5, length=n)
   y=beta0+beta1*x+rnorm(n, 0, sigma2*x)
   cbind(x, y)
}
plot(gen.data())

Now we do the following: we generate a data set, estimate the parameters with both ordinary and weighted least squares, and repeat this 1000 times:

B=1000
a=matrix(0, B, 4)
for(i in 1:B) {
   x=gen.data()
   a[i, 1:2]=coef(lm(x[ ,2]~x[ ,1]))
   a[i, 3:4]=coef(lm(x[ ,2]~x[ ,1], weights=1/x[ ,1]))
}

Let’s see:

par(mfrow=c(2,2))
for(i in 1:4) 
   hist(a[, i], 50, main="")

apply(a, 2, mean)
## [1] 99.999575  9.999502 99.998931  9.999717
apply(a, 2, var)
## [1] 1.0577596 0.1791154 0.6136294 0.1233135

and so we see that both estimators appear to be unbiased, but the weighted least squares estimators have a smaller variance than the ordinary least squares estimators.