Generalized Least Squares

Correlated Response

In section 6.5 we discussed the case of correlated responses in the context of regression. We now return to this case in the general linear model.

In all the discussion so far (except in section 6.5) we have assumed a model of the form \(\pmb{y=X\beta+\epsilon}\) with \(cov(\pmb{\epsilon})=\sigma^2\pmb{I}\). We now consider the case where \(cov(\pmb{\epsilon})=\sigma^2\pmb{D}\), where \(\pmb{D}\) is a known positive-definite matrix.

Because \(\pmb{D}\) is positive-definite there exists a non-singular matrix \(\pmb{D}^{1/2}\) such that \(\pmb{D}^{1/2}\pmb{D}^{1/2}=\pmb{D}\), see (4.3.13). Now consider the transformed model

\[\pmb{y}^*=\pmb{D}^{-1/2}\pmb{y}=\pmb{D}^{-1/2}\pmb{X\beta}+\pmb{D}^{-1/2}\pmb{\epsilon}=\pmb{X}^*\pmb{\beta}+\pmb{\epsilon}^*\]

where \(\pmb{X}^*=\pmb{D}^{-1/2}\pmb{X}\) and \(\pmb{\epsilon}^*=\pmb{D}^{-1/2}\pmb{\epsilon}\). Then

\[cov(\pmb{D}^{-1/2}\pmb{\epsilon}) = \pmb{D}^{-1/2}\sigma^2\pmb{D}\pmb{D}^{-1/2} =\sigma^2\pmb{I}\]

and so this transformed model has uncorrelated errors with equal variance. Therefore we can apply the methods previously discussed.
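One way to compute such a symmetric square root in R is via the spectral decomposition. A minimal sketch, using an arbitrary 2×2 example matrix (not from the text):

D = matrix(c(2, 1, 1, 2), 2, 2)    # an example positive-definite matrix
e = eigen(D)
Dhalf = e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)  # symmetric square root
round(Dhalf %*% Dhalf - D, 10)     # should be a matrix of zeros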

Notice

\[ \begin{aligned} &(\pmb{y}^*-\pmb{X}^*\pmb{\hat{\beta}})'(\pmb{y}^*-\pmb{X}^*\pmb{\hat{\beta}}) = \\ &(\pmb{D}^{-1/2}\pmb{y}-\pmb{D}^{-1/2}\pmb{X}\pmb{\hat{\beta}})'(\pmb{D}^{-1/2}\pmb{y}-\pmb{D}^{-1/2}\pmb{X}\pmb{\hat{\beta}}) = \\ &(\pmb{y}-\pmb{X}\pmb{\hat{\beta}})'\pmb{D}^{-1}(\pmb{y}-\pmb{X}\pmb{\hat{\beta}}) \end{aligned} \]

and so the method of generalized least squares amounts to minimizing a quadratic form

\[(\pmb{y}-\pmb{X}\pmb{\hat{\beta}})'\pmb{D}^{-1}(\pmb{y}-\pmb{X}\pmb{\hat{\beta}})\]

The normal equations of the transformed model are

\[\pmb{X}^{*'}\pmb{X}^{*}\pmb{\hat{\beta}}=\pmb{X}^{*'}\pmb{y}^{*}\]

and so

\[ \begin{aligned} &(\pmb{D}^{-1/2}\pmb{X})'(\pmb{D}^{-1/2}\pmb{X})\pmb{\hat{\beta}}=(\pmb{D}^{-1/2}\pmb{X})'(\pmb{D}^{-1/2}\pmb{y}) \\ &\pmb{X}'\pmb{D}^{-1}\pmb{X}\pmb{\hat{\beta}}=\pmb{X}'\pmb{D}^{-1}\pmb{y} \\ &\pmb{\hat{\beta}}=[\pmb{X}'\pmb{D}^{-1}\pmb{X}]^{-}\pmb{X}'\pmb{D}^{-1}\pmb{y} \end{aligned} \]
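As a quick illustration, here is a minimal R sketch of this computation (the function name gls_fit is mine, and MASS::ginv is used as one possible generalized inverse):

library(MASS)                       # for ginv(), a generalized inverse
gls_fit = function(y, X, D) {
  Dinv = solve(D)
  c(ginv(t(X) %*% Dinv %*% X) %*% t(X) %*% Dinv %*% y)
}
# with D = diag(length(y)) this reduces to ordinary least squares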

Theorem (8.2.1)

In a model of the form \(\pmb{y=X\beta+\epsilon}\) with \(cov(\pmb{\epsilon})=\sigma^2\pmb{D}\), the BLUE of a linear function \(\pmb{a'\beta}\) is \(\pmb{a'\hat{\beta}}\) where \(\pmb{\hat{\beta}}=[\pmb{X}'\pmb{D}^{-1}\pmb{X}]^{-}\pmb{X}'\pmb{D}^{-1}\pmb{y}\).

Furthermore

\[var(\pmb{a'\hat{\beta}})=\sigma^2\pmb{a}'[\pmb{X}'\pmb{D}^{-1}\pmb{X}]^{-}\pmb{a}\] Also

\[\text{SSE} = \pmb{y'D^{-1}y-(y'D^{-1}X)(X'D^{-1}X)^{-}(X'D^{-1}y)}\] and, if \(\pmb{y}\) is normally distributed, \(\text{SSE}/\sigma^2\sim \chi^2(n-r)\) and is independent of the BLUEs.

proof follows from the discussion above.

Theorem (8.2.2)

For testing a hypothesis of the form \(H_0: \pmb{K\beta}=d\), where \(\pmb{K}\) is \(m\times p\) of rank m, the sum of squares is

\[\text{SSH} = \pmb{(K\hat{\beta}-d)'[K(X'D^{-1}X)^{-}K']^{-1}(K\hat{\beta}-d)}\] which has m degrees of freedom. The F-test is

\[F=\frac{\text{SSH}/m}{\text{SSE}/(n-r)}\sim F(m, n-r)\] proof omitted, but easy.
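A minimal sketch of how this test could be computed in R (the function and variable names are mine; m is taken as the number of rows of \(\pmb{K}\) and r as the rank of \(\pmb{X}\)):

gls_test = function(y, X, D, K, d) {
  Dinv = solve(D)
  XDXg = MASS::ginv(t(X) %*% Dinv %*% X)      # (X'D^-1 X)^-
  betahat = XDXg %*% t(X) %*% Dinv %*% y
  SSE = as.numeric(t(y) %*% Dinv %*% y -
          t(y) %*% Dinv %*% X %*% XDXg %*% t(X) %*% Dinv %*% y)
  SSH = as.numeric(t(K %*% betahat - d) %*%
          solve(K %*% XDXg %*% t(K)) %*% (K %*% betahat - d))
  n = length(y); m = nrow(K); r = qr(X)$rank
  FTS = (SSH/m)/(SSE/(n - r))
  c(F = FTS, p.value = 1 - pf(FTS, m, n - r))
}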

Weighted Least Squares

Example (8.2.3)

We have the following (artificial) data:
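The data frame df itself is created in a chunk that is not shown. A minimal sketch of how such data might be simulated, assuming 10 x values with 5 replicate observations each and a standard deviation that grows with x (the model, seed and coefficients below are my guesses, so the numerical output later in this example will not be reproduced exactly):

library(ggplot2)
set.seed(1234)
x = rep(1:10, each=5)                   # 10 x values, 5 replicates each
y = 9 + 2.4*x + rnorm(length(x), 0, x)  # sd of y grows with x (hypothetical choice)
df = data.frame(x=x, y=y)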

ggplot(data=df, aes(x, y)) +
  geom_point() 

so here we have a positive linear relationship, but as x increases so does the variance of y|x. Assuming that the observations are still uncorrelated we have a model of the form

\[y_i=\beta_0+\beta_1x_i+\epsilon_i\] and \(cov(\epsilon_i,\epsilon_j)=\sigma_i^2\delta_{ij}\). In other words the matrix \(\pmb{D}\) is

\[\pmb{D}=diag(\sigma_1^2,..,\sigma_n^2)\] so then

\[\pmb{D}^{-1}=diag(1/\sigma_1^2,..,1/\sigma_n^2)\]

Therefore the least squares criterion becomes

\[ \begin{aligned} &(\pmb{y}-\pmb{X}\pmb{\hat{\beta}})'\pmb{D}^{-1}(\pmb{y}-\pmb{X}\pmb{\hat{\beta}}) = \\ &(\pmb{y}-\pmb{X}\pmb{\hat{\beta}})'diag(1/\sigma_1^2,..,1/\sigma_n^2)(\pmb{y}-\pmb{X}\pmb{\hat{\beta}}) = \\ &\sum_{i=1}^n \left(y_i-(\pmb{X}\pmb{\hat{\beta})}_i\right)^2/\sigma_i^2 = \\ &\sum_{i=1}^n w_i\left(y_i-(\pmb{X}\pmb{\hat{\beta})}_i\right)^2 \\ \end{aligned} \]

where \(w_i=1/\sigma_i^2\) are called the weights, hence the name weighted least squares.

Example (8.2.4)

So, what are the weights in the example above? Ideally we would know these, but in practice we usually do not. Notice though that our data has multiple measurements at each x value, so we can estimate these:

w=1/tapply(y, x, var)  # one estimated weight per x value
round(w, 4)
##      1      2      3      4      5      6      7      8      9     10 
## 1.0172 0.6565 0.1395 0.0987 0.8321 0.0087 0.0070 0.0178 0.0121 0.0087

and so the weighted least squares estimators of the coefficients are

X=cbind(1, x)                  # design matrix
Dinv=diag(rep(w, each=5))      # D^{-1} = diag(w), one weight per observation
tmp=solve(t(X)%*%Dinv%*%X)
betahat= c(tmp%*%t(X)%*%Dinv%*%cbind(y))
betahat
## [1] 8.690910 2.375936

or

fit=lm(y~x, data=df, weights=rep(w, each=5))
coef(fit)
## (Intercept)           x 
##    8.690910    2.375936
ggplot(data=df, aes(x, y)) +
  geom_point() +
  geom_abline(intercept=betahat[1], slope=betahat[2], size=2, col="blue")


If there is one predictor we can also do this directly:

\[ \begin{aligned} &0=\frac{d }{d \beta_0}\sum_{i=1}^n w_i(y_i-\beta_0-\beta_1 x_i)^2 = \\ &(-2)\sum_{i=1}^n w_i(y_i-\beta_0-\beta_1 x_i)=\\ &(-2)\left(\sum_{i=1}^n w_iy_i- \beta_0\sum_{i=1}^n w_i-\beta_1\sum_{i=1}^n w_ix_i\right) =\\ &(-2n)\left(\overline{wy}- \beta_0\bar{w}- \beta_1 \overline{wx}\right) \\ &0=\frac{d}{d \beta_1}\sum_{i=1}^n w_i(y_i-\beta_0-\beta_1 x_i)^2 = \\ &(-2)\sum_{i=1}^n w_i(y_i-\beta_0-\beta_1 x_i)x_i = \\ &(-2n)\left(\overline{wxy}- \beta_0\overline{wx}- \beta_1 \overline{wx^2}\right) \\ &\beta_0\bar{w}+ \beta_1 \overline{wx}=\overline{wy}\\ &\beta_0\overline{wx}+ \beta_1 \overline{wx^2} = \overline{wxy} \\ &\beta_0 = (\overline{wy} -\beta_1 \overline{wx})/\bar{w}\\ &\beta_1 = \frac{\overline{wy}\times \overline{wx} -\bar{w}\times \overline{wxy}}{\overline{wx}^2- \bar{w}\times \overline{wx^2}} \\ \end{aligned} \]

where \(\bar{w}=\frac1n\sum_{i=1}^n w_i\), \(\overline{wx}=\frac1n\sum_{i=1}^n w_ix_i\), \(\overline{wy}=\frac1n\sum_{i=1}^n w_iy_i\) and so on.

Notice that in the above we did not actually know \(\pmb{D}\) but estimated it from the data. This is quite commonly done even though it violates the assumptions, and it turns out to be generally OK.

Example (8.2.5)

x=df$x
y=df$y
w=rep(w, each=5)  # expand to one weight per observation
bx=mean(x)
bw=mean(w)
bwx=mean(w*x)
bwy=mean(w*y)
bwxy=mean(w*x*y)
bwx2=mean(w*x^2)
betahat1=(bwx*bwy-bw*bwxy)/(bwx^2-bw*bwx2)
betahat0=(bwy-betahat1*bwx)/bw
round(c(betahat0, betahat1), 3)
## [1] 8.691 2.376

Sampling from a Small Population

Say we are selecting without replacement from a population that is so small that the dependence between the draws is not negligible. Let’s investigate the effect of this.

Notice that so far there is no linear model; this is a general problem.

Say the population is \(y_1,..,y_N\), then the population mean is \(\mu=\frac1N\sum y_i\) and the population variance is

\[\sigma^2=\frac1N\sum (y_i-\mu)^2 = \frac{(N-1)S^2}{N}\]

where \(S^2=\frac1{N-1}\sum (y_i-\mu)^2\).

Now say we draw a sample \(\{X_1,..,X_n\}\) of size n from the population such that each \(X_i\) is equally likely to be any of the \(y_j\)’s. This is called a simple random sample (SRS). So

\[ \begin{aligned} &E[X_i] = \sum_{i=1}^N y_i \frac1N = \mu\\ &E[X_i^2] = \sum_{i=1}^N y_i^2 \frac1N \\ &E[X_iX_j] =E\{E[X_iX_j|X_j]\}=\\ &E\{X_jE[X_i|X_j]\} = \\ &E\{X_j\frac{\sum_{i\ne j} y_i}{N-1} \} = \\ &\frac1{N-1}E\{X_j[\sum_i y_i -X_j]\} = \\ &\frac1{N-1}E\{X_j[N\mu -X_j]\} = \\ &\frac{N\mu}{N-1}E\{X_j\} - \frac{1}{N-1}E\{X_j^2\}= \\ &\frac{N\mu}{N-1}\mu - \frac{1}{N-1}\frac1N\sum_{i=1}^N y_i^2 = \\ &\frac{N\mu^2}{N-1} - \frac{1}{N(N-1)}\sum_{i=1}^N y_i^2 \\ &cov(X_i,X_j) =E[X_iX_j]-E[X_i]E[X_j] = \\ &\frac{N\mu^2}{N-1} - \frac{1}{N(N-1)}\sum_{i=1}^N y_i^2 -\mu^2 =\\ &\frac{\mu^2}{N-1} - \frac{1}{N(N-1)}\sum_{i=1}^N y_i^2 =\\ &- \frac{1}{N(N-1)} \left(\sum_{i=1}^N y_i^2 -N\mu^2\right)=\\ &-S^2/N \end{aligned} \]

and also

\[var(X_i) = (N-1)S^2/N\]

So we can set up a model

\[X_i=\mu+\epsilon_i\]

where

\[cov(\pmb{\epsilon})=S^2(\pmb{I_n}-\frac1N \pmb{J}_{nn})\]

and so we have a generalized least squares model with \(\pmb{D}=\pmb{I_n}-\frac1N \pmb{J}_{nn}\) and \(\sigma^2=S^2\).

Note that \(\pmb{D}^{-1}=(\pmb{I}-\frac1N \pmb{J})^{-1}=\pmb{I}+\frac1{N-n} \pmb{J}\), so

\[ \begin{aligned} &\hat{\mu} = [\pmb{X}'\pmb{D}^{-1}\pmb{X}]^{-}\pmb{X}'\pmb{D}^{-1}\pmb{y} =\\ &[\pmb{j}'(\pmb{I}+\frac1{N-n} \pmb{J})\pmb{j}]^{-}\pmb{j}'(\pmb{I}+\frac1{N-n} \pmb{J})\pmb{y} = \\ &[\pmb{j}'\pmb{j}+\frac1{N-n} \pmb{j}'\pmb{J}\pmb{j}]^{-}(\pmb{j'y}+\frac1{N-n}\pmb{j}' \pmb{J}\pmb{y}) = \\ &[n+\frac1{N-n} n^2]^{-}(n\bar{y}+\frac1{N-n}n^2\bar{y}) = \\ &[n+\frac1{N-n} n^2]^{-}(n+\frac1{N-n}n^2)\bar{y} = \bar{y}\\ \end{aligned} \]

so the BLUE of \(\mu\) is just the sample mean \(\bar{y}\)!
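A quick numerical check of the inverse formula \(\pmb{D}^{-1}=\pmb{I}+\frac1{N-n}\pmb{J}\) used above, with arbitrary example values N=20 and n=5 (the difference should be a matrix of zeros):

N = 20; n = 5
D = diag(n) - matrix(1, n, n)/N
round(solve(D) - (diag(n) + matrix(1, n, n)/(N - n)), 10)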

Also

\[ \begin{aligned} &var(\hat{\mu}) = \sigma^2[\pmb{X}'\pmb{D}^{-1}\pmb{X}]^{-} = S^2[\pmb{j}'\pmb{D}^{-1}\pmb{j}]^{-} =\\ &S^2\left[n+\frac{n^2}{N-n}\right]^{-1} = \frac{N-n}{Nn}S^2 = \frac{S^2}{n}\left(1-\frac{n}{N}\right) \end{aligned} \] The factor \(1-\frac{n}{N}\) is called the finite population correction.
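We can check this formula with a small simulation, using a hypothetical population of size N=50 and samples of size n=10 (sample() draws without replacement by default). The two numbers should be close:

set.seed(111)
N = 50; n = 10
pop = rnorm(N, 10, 3)                        # a made-up finite population
S2 = var(pop)                                # S^2, with divisor N-1
xbar = replicate(1e4, mean(sample(pop, n)))  # many simple random samples
c(simulation = var(xbar), formula = (N - n)/(N*n)*S2)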

Combining Experiments - Meta Analysis

Say we have k experiments, each taking measurements of the same quantity \(\theta\), and experiment i yields an unbiased estimator \(T_i\). Is it possible to combine these experiments into a super-experiment? This is part of a branch of statistics called meta-analysis.

We can write the model

\[T_i=\theta+\epsilon_i,\quad i=1,..,k\]

where \(cov(\pmb{\epsilon})=\pmb{D}\), the covariance matrix of the \(T_i\)’s (that is, we absorb \(\sigma^2\) into \(\pmb{D}\)). The BLUE of the combined experiment is therefore given by

\[\pmb{\hat{\theta}}=[\pmb{X}'\pmb{D}^{-1}\pmb{X}]^{-}\pmb{X}'\pmb{D}^{-1}\pmb{T}=[\pmb{j}'\pmb{D}^{-1}\pmb{j}]^{-}\pmb{j}'\pmb{D}^{-1}\pmb{T}\] with variance

\[var(\pmb{\hat{\theta}})=[\pmb{j}'\pmb{D}^{-1}\pmb{j}]^{-}\]

  • Independent experiments

If it can be assumed that the experiments are independent we have \(\pmb{D}=diag(\sigma^2_1,..,\sigma^2_k)\), so

\[ \begin{aligned} &\pmb{\hat{\theta}} = [\pmb{j}'diag(1/\sigma^2_1,..,1/\sigma^2_k)\pmb{j}]^{-1}\pmb{j}'diag(1/\sigma^2_1,..,1/\sigma^2_k)\pmb{T} = \\ &[\sum_i 1/\sigma_i^2]^{-1}\sum_i T_i/\sigma_i^2 = \\ &\frac{\sum_i T_i/\sigma_i^2}{\sum_i 1/\sigma_i^2} = \frac{\sum_i w_i T_i}{\sum_i w_i} \end{aligned} \]

where the weights w are defined as before. The variance is

\[var(\pmb{\hat{\theta}})=\frac{1}{\sum_i 1/\sigma_i^2}\]
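For example, with three hypothetical independent estimates and (assumed known) variances, the made-up numbers below being purely for illustration:

Ti = c(10.2, 9.8, 10.5)    # hypothetical estimates of theta
sig2 = c(0.5, 1.0, 2.0)    # their assumed variances
w = 1/sig2
c(estimate = sum(w*Ti)/sum(w), variance = 1/sum(w))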

Say we have two such experiments, one with variance \(\sigma_1^2=1\) and the other with variance \(\sigma_2^2=x\). Then the variance of the combined estimator is \(1/(1+1/x)\), plotted here as a function of \(x\):

curve(1/(1+1/x), 0, 10)

and so we always gain something but we gain the most if the second experiment has a small variance.

  • Another interesting case is where all the experiments have the same per-observation variance \(\sigma^2\) but different sample sizes \(n_i\), so that \(var(T_i)=\sigma^2/n_i\). We find

\[w_i=1/\sigma^2_i=1/(\sigma^2/n_i)=n_i/\sigma^2\]

and so

\[\pmb{\hat{\theta}} = \frac{\sum_i n_i/\sigma^2 T_i}{\sum_i n_i/\sigma^2} = \frac{\sum_i n_i T_i}{\sum_i n_i}\]

and finally, if all the sample sizes are the same, the BLUE is simply the mean of the \(T_i\)’s.

  • Case of two correlated experiments:

Say we have \(T_1\) and \(T_2\) with

\[\pmb{D} = \begin{pmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho \\\sigma_1\sigma_2\rho & \sigma_2^2 \end{pmatrix}\]

then

\[\pmb{D}^{-1} =\frac1{\sigma_1^2\sigma_2^2(1-\rho^2)} \begin{pmatrix} \sigma_2^2 & -\sigma_1\sigma_2\rho \\ -\sigma_1\sigma_2\rho & \sigma_1^2 \end{pmatrix}\] \[ \begin{aligned} &\pmb{j}'\pmb{D}^{-1}\pmb{j} = \frac1{\sigma_1^2\sigma_2^2(1-\rho^2)} \begin{pmatrix} 1 & 1 \end{pmatrix}\begin{pmatrix} \sigma_2^2 -\sigma_1\sigma_2\rho \\ \sigma_1^2 -\sigma_1\sigma_2\rho \end{pmatrix} = \\ &\frac1{\sigma_1^2\sigma_2^2(1-\rho^2)} \left[\sigma_2^2 -\sigma_1\sigma_2\rho + \sigma_1^2 -\sigma_1\sigma_2\rho\right] = \\ &\frac{\sigma_1^2 + \sigma_2^2 -2\sigma_1\sigma_2\rho}{\sigma_1^2\sigma_2^2(1-\rho^2)} \\ \end{aligned} \]

and so

\[ \begin{aligned} &\pmb{\hat{\theta}}=[\pmb{j}'\pmb{D}^{-1}\pmb{j}]^{-}\pmb{j}'\pmb{D}^{-1}\pmb{T} = \\ &\left[\frac{\sigma_1^2 + \sigma_2^2 -2\sigma_1\sigma_2\rho}{\sigma_1^2\sigma_2^2(1-\rho^2)}\right]^{-1}\pmb{j}'\frac1{\sigma_1^2\sigma_2^2(1-\rho^2)} \begin{pmatrix} \sigma_2^2 & -\sigma_1\sigma_2\rho \\ -\sigma_1\sigma_2\rho & \sigma_1^2 \end{pmatrix}\pmb{T} = \\ &\frac{\sigma_1^2\sigma_2^2(1-\rho^2)}{\sigma_1^2 + \sigma_2^2 -2\sigma_1\sigma_2\rho}\frac1{\sigma_1^2\sigma_2^2(1-\rho^2)} \pmb{j}'\begin{pmatrix} \sigma_2^2T_1 -\sigma_1\sigma_2\rho T_2 \\ -\sigma_1\sigma_2\rho T_1 + \sigma_1^2 T_2\end{pmatrix} = \\ &\frac{(\sigma_2^2-\sigma_1\sigma_2\rho)T_1+(\sigma_1^2-\sigma_1\sigma_2\rho)T_2 }{\sigma_1^2 + \sigma_2^2 -2\sigma_1\sigma_2\rho} = \\ &\frac{a_2 T_1+ a_1 T_2}{a_1+a_2} \\ \end{aligned} \]

where \(a_i=\sigma_i^2-\sigma_1\sigma_2\rho\)

The variance of the combined estimator is

\[var(\pmb{\hat{\theta}})=[\pmb{j}'\pmb{D}^{-1}\pmb{j}]^{-} = \frac{\sigma_1^2\sigma_2^2(1-\rho^2)}{\sigma_1^2 + \sigma_2^2 -2\sigma_1\sigma_2\rho}\]

Let’s investigate this a bit. To do so we fix \(\sigma_1=\sigma_2=1\), so we have

\[var(\pmb{\hat{\theta}})= \frac{1-\rho^2}{2-2\rho}=\frac{(1-\rho)(1+\rho)}{2(1-\rho)}=\frac{1+\rho}{2}\]

Note that \(-1\le \rho\le 1\), so \(0\le \frac{1+\rho}{2}\le 1 = var(T_i)\). Combining the two experiments never does worse than a single one, and the best case is two highly negatively correlated experiments! If the two experiments are independent, \(\rho=0\) and the variance is 1/2.
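Finally, a quick numerical check that the closed form agrees with the general GLS formula, using arbitrary made-up values for \(\sigma_1\), \(\sigma_2\), \(\rho\), \(T_1\) and \(T_2\) (the two numbers should agree):

s1 = 1.5; s2 = 2; rho = -0.4
Ti = c(10.3, 9.6)
D = matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2), 2, 2)
j = rep(1, 2)
a = c(s1^2, s2^2) - s1*s2*rho               # a1 and a2
c(gls = sum(solve(D, j)*Ti)/sum(solve(D, j)),
  closed.form = (a[2]*Ti[1] + a[1]*Ti[2])/sum(a))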