We have the following (artificial) data:
ggplot(data=df, aes(x, y)) +
geom_point()
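(The chunk that generated df is not shown here. A minimal sketch that produces data of this type, with 10 x values, 5 replicates at each x and a standard deviation that grows with x, could look like the following; the seed and the coefficients are made up and will not reproduce the numbers below.)
# hypothetical data generation, not the original chunk
set.seed(1111)
x=rep(1:10, each=5)            # 5 replicates at each x value
y=10+2.5*x+rnorm(50, 0, x)     # sd of the errors grows with x
df=data.frame(x=x, y=y)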
So here we have a positive linear relationship, but as x increases so does the variance of y|x. Assuming that the observations are still uncorrelated, we have a model of the form
\[y_i=\beta_0+\beta_1x_i+\epsilon_i\] and \(cov(\epsilon_i,\epsilon_j)=\sigma_i^2\delta_{ij}\). In other words the matrix \(\pmb{D}\) is
\[\pmb{D}=diag(\sigma_1^2,..,\sigma_n^2)\] so then
\[\pmb{D}^{-1}=diag(1/\sigma_1^2,..,1/\sigma_n^2)\]
Therefore the least squares criterion becomes
\[ \begin{aligned} &(\pmb{y}-\pmb{X}\pmb{\hat{\beta}})'\pmb{D}^{-1}(\pmb{y}-\pmb{X}\pmb{\hat{\beta}}) = \\ &(\pmb{y}-\pmb{X}\pmb{\hat{\beta}})'diag(1/\sigma_1^2,..,1/\sigma_n^2)(\pmb{y}-\pmb{X}\pmb{\hat{\beta}}) = \\ &\sum_{i=1}^n \left(y_i-(\pmb{X}\pmb{\hat{\beta})}_i\right)^2/\sigma_i^2 = \\ &\sum_{i=1}^n w_i\left(y_i-(\pmb{X}\pmb{\hat{\beta})}_i\right)^2 \\ \end{aligned} \]
where \(w_i=1/\sigma_i^2\) are called the weights, hence the name weighted least squares.
So, what are the weights in the example above? Ideally we would know these, but in practice we usually do not. Notice though that our data has multiple measurements at each x value, so we can estimate these:
w=1/tapply(y, x, var)
round(w, 4)
## 1 2 3 4 5 6 7 8 9 10
## 1.0172 0.6565 0.1395 0.0987 0.8321 0.0087 0.0070 0.0178 0.0121 0.0087
and so the weighted least squares estimators of the coefficients are
X=cbind(1, x)
Dinv=diag(rep(w, each=5))
tmp=solve(t(X)%*%Dinv%*%X)
betahat= c(tmp%*%t(X)%*%Dinv%*%cbind(y))
betahat
## [1] 8.690910 2.375936
or
fit=lm(y~x, data=df, weights=rep(w, each=5))
coef(fit)
## (Intercept) x
## 8.690910 2.375936
ggplot(data=df, aes(x, y)) +
geom_point() +
geom_abline(intercept=betahat[1], slope=betahat[2], size=2, col="blue")
If there is only one predictor, we can also do this directly:
\[ \begin{aligned} &0=\frac{d }{d \beta_0}\sum_{i=1}^n w_i(y_i-\beta_0-\beta_1 x_i)^2 = \\ &(-2)\sum_{i=1}^n w_i(y_i-\beta_0-\beta_1 x_i)=\\ &(-2)\left(\sum_{i=1}^n w_iy_i- \beta_0\sum_{i=1}^n w_i-\beta_1\sum_{i=1}^n  w_ix_i\right) =\\ &(-2n)\left(\overline{wy}- \beta_0\bar{w}- \beta_1 \overline{wx}\right) \\ &0=\frac{d}{d \beta_1}\sum_{i=1}^n w_i(y_i-\beta_0-\beta_1 x_i)^2 = \\ &(-2)\sum_{i=1}^n w_i(y_i-\beta_0-\beta_1 x_i)x_i = \\ &(-2n)\left(\overline{wxy}- \beta_0\overline{wx}- \beta_1 \overline{wx^2}\right) \\ &\beta_0\bar{w}+ \beta_1 \overline{wx}=\overline{wy}\\ &\beta_0\overline{wx}+ \beta_1 \overline{wx^2} = \overline{wxy} \\ &\beta_0 = (\overline{wy} -\beta_1 \overline{wx})/\bar{w}\\ &\beta_1 = \frac{\overline{wy}\times \overline{wx} -\bar{w}\times \overline{wxy}}{\overline{wx}^2- \bar{w}\times \overline{wx^2}} \\ \end{aligned} \]
where \(\bar{w}=\frac1n\sum_i w_i\), \(\overline{wx}=\frac1n\sum_i w_ix_i\), and so on for the other averages.
Notice that in the above we did not actually know \(\pmb{D}\) but estimated it from the data. This is quite commonly done even though it violates the assumptions. In practice it turns out to be generally OK.
x=df$x
y=df$y
w=rep(w, each=5)
bx=mean(x)
bw=mean(w)
bwx=mean(w*x)
bwy=mean(w*y)
bwxy=mean(w*x*y)
bwx2=mean(w*x^2)
betahat1=(bwx*bwy-bw*bwxy)/(bwx^2-bw*bwx2)
betahat0=(bwy-betahat1*bwx)/bw
round(c(betahat0, betahat1), 3)
## [1] 8.691 2.376
Say we are selecting without replacement from a population that is so small, relative to the sample, that the effect of not replacing the objects is not negligible. Let's investigate the effect of this.
Notice that so far there is no linear model; this is a general problem.
Say the population is \(y_1,..,y_N\), then the population mean is \(\mu=\frac1N\sum y_i\) and the population variance is
\[\sigma^2=\frac1N\sum (y_i-\mu)^2 = \frac{(N-1)S^2}{N}\]
where \(S^2=\frac1{N-1}\sum (y_i-\bar{y})^2\).
Now say we draw a sample \(\{X_1,..,X_n\}\) of size \(n\) from the population such that \(X_i\) is equally likely to be any of the \(y_j\)'s. This is called a simple random sample (SRS). So
\[ \begin{aligned} &E[X_i] = \sum_{j=1}^N y_j \frac1N = \mu\\ &E[X_i^2] = \sum_{j=1}^N y_j^2 \frac1N \\ &E[X_iX_j] =E\{E[X_iX_j|X_j]\}=\\ &E\{X_jE[X_i|X_j]\} = \\ &E\left\{X_j\frac{\sum_{l=1}^N y_l-X_j}{N-1} \right\} = \\ &\frac1{N-1}E\{X_j[N\mu -X_j]\} = \\ &\frac{N\mu}{N-1}E\{X_j\} - \frac{1}{N-1}E\{X_j^2\}= \\ &\frac{N\mu}{N-1}\mu - \frac{1}{N-1}\frac1N\sum_{l=1}^N y_l^2 = \\ &\frac{N\mu^2}{N-1} - \frac{1}{N(N-1)}\sum_{l=1}^N y_l^2 \\ &cov(X_i,X_j) =E[X_iX_j]-E[X_i]E[X_j] = \\ &\frac{N\mu^2}{N-1} - \frac{1}{N(N-1)}\sum_{l=1}^N y_l^2 -\mu^2 =\\ &\frac{\mu^2}{N-1} - \frac{1}{N(N-1)}\sum_{l=1}^N y_l^2 =\\ &- \frac{1}{N(N-1)} \left(\sum_{l=1}^N y_l^2 -N\mu^2\right)=\\ &- \frac{(N-1)S^2}{N(N-1)} = -S^2/N \end{aligned} \]
and also
\[var(X_i) = (N-1)S^2/N\]
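These formulas are easy to check by simulation (this check is not part of the original derivation, and the population below is made up):
# numerical check of var(X_i) and cov(X_i, X_j) under SRS
set.seed(2222)
pop=rnorm(20, 10, 3)                      # made-up population of size N=20
N=length(pop)
S2=var(pop)                               # S^2, i.e. divisor N-1
smp=t(replicate(1e5, sample(pop, 2)))     # many draws of (X_1, X_2) without replacement
round(c(var(smp[,1]), (N-1)*S2/N), 3)     # should agree
round(c(cov(smp[,1], smp[,2]), -S2/N), 3) # should agree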
So we can set up a model
\[X_i=\mu+\epsilon_i\]
where
\[cov(\pmb{\epsilon})=S^2(\pmb{I_n}-\frac1N \pmb{J}_{nn})\]
and so we have a generalized least squares model with \(\pmb{D}=\pmb{I_n}-\frac1N \pmb{J}_{nn}\) and \(\sigma^2=S^2\).
Note that \(\pmb{D}^{-1}=(\pmb{I}-\frac1N \pmb{J})^{-1}=\pmb{I}+\frac1{N-n} \pmb{J}\), so
\[ \begin{aligned} &\hat{\mu} = [\pmb{X}'\pmb{D}^{-1}\pmb{X}]^{-}\pmb{X}'\pmb{D}^{-1}\pmb{y} =\\ &[\pmb{j}'(\pmb{I}+\frac1{N-n} \pmb{J})\pmb{j}]^{-}\pmb{j}'(\pmb{I}+\frac1{N-n} \pmb{J})\pmb{y} = \\ &[\pmb{j}'\pmb{j}+\frac1{N-n} \pmb{j}'\pmb{J}\pmb{j}]^{-}(\pmb{j'y}+\frac1{N-n}\pmb{j}' \pmb{J}\pmb{y}) = \\ &[n+\frac1{N-n} n^2]^{-}(n\bar{y}+\frac1{N-n}n^2\bar{y}) = \\ &[n+\frac1{N-n} n^2]^{-}(n+\frac1{N-n}n^2)\bar{y} = \bar{y}\\ \end{aligned} \]
so the BLUE of \(\mu\) is the sample mean \(\bar{y}\)!
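The inverse identity used above is easy to verify numerically (a quick check, not part of the original text; the values of n and N are arbitrary):
# check that (I - J/N)^{-1} = I + J/(N-n)
n=4; N=10
D=diag(n)-matrix(1, n, n)/N
round(solve(D)-(diag(n)+matrix(1, n, n)/(N-n)), 10)   # should be the zero matrix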
Also
\[ \begin{aligned} &var(\hat{\mu}) = \sigma^2[\pmb{X}'\pmb{D}^{-1}\pmb{X}]^{-} = S^2[\pmb{j}'\pmb{D}^{-1}\pmb{j}]^{-} =\\ &S^2\left[n+\frac{n^2}{N-n}\right]^{-1} = \frac{N-n}{Nn}S^2 = \frac{N-n}{N}\frac{S^2}{n} \end{aligned} \] The factor \(\frac{N-n}{N}\), by which the usual variance \(S^2/n\) is multiplied, is called the finite population correction.
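We can also verify the variance formula by simulation (again not part of the original text, with a made-up population):
# simulation check of var(mu-hat) = (N-n)/(N*n) * S^2
set.seed(3333)
pop=runif(25, 0, 100)                        # made-up population of size N=25
N=length(pop); n=5
xbar=replicate(1e5, mean(sample(pop, n)))    # SRS means
round(c(var(xbar), (N-n)/(N*n)*var(pop)), 3) # should agree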
Say we have k experiments, each taking measurements of the same quantity \(\theta\). Experiment i has an unbiased estimator \(T_i\). Is it possible to combine these experiments into a super-experiment? This is part of a branch of statistics called meta-analysis.
We can write the model
\[T_i=\theta+\epsilon_i\]
where \(cov(\pmb{\epsilon})=\pmb{D}\) holds the variances of, and possibly the covariances between, the individual estimators. The BLUE of the combined experiment is therefore given by
\[\pmb{\hat{\theta}}=[\pmb{X}'\pmb{D}^{-1}\pmb{X}]^{-}\pmb{X}'\pmb{D}^{-1}\pmb{T}=[\pmb{j}'\pmb{D}^{-1}\pmb{j}]^{-}\pmb{j}'\pmb{D}^{-1}\pmb{T}\] with variance
\[var(\pmb{\hat{\theta}})=[\pmb{j}'\pmb{D}^{-1}\pmb{j}]^{-}\]
If it can be assumed that the experiments are independent we have \(\pmb{D}=diag(\sigma^2_1,..,\sigma^2_k)\), so
\[ \begin{aligned} &\pmb{\hat{\theta}} = [\pmb{j}'diag(1/\sigma^2_1,..,1/\sigma^2_k)\pmb{j}]^{-1}\pmb{j}'diag(1/\sigma^2_1,..,1/\sigma^2_k)\pmb{T} = \\ &[\sum_i 1/\sigma_i^2]^{-1}\sum_i T_i/\sigma_i^2 = \\ &\frac{\sum_i T_i/\sigma_i^2}{\sum_i 1/\sigma_i^2} = \frac{\sum_i w_i T_i}{\sum_i w_i} \end{aligned} \]
where the weights w are defined as before. The variance is
\[var(\pmb{\hat{\theta}})=\frac{1}{\sum_i 1/\sigma_i^2}\]
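As a small illustration (the numbers here are made up), suppose three independent experiments report estimates \(T_i\) with variances \(\sigma_i^2\); then the combined estimate and its variance are
# combining three hypothetical independent experiments
Ti=c(5.1, 4.8, 5.4)            # reported estimates T_i (made up)
sig2=c(0.20, 0.05, 0.50)       # their variances (made up)
w=1/sig2
c(sum(w*Ti)/sum(w), 1/sum(w))  # combined estimate and its variance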
Say we have two such experiments, one with \(\sigma_1^2=1\). Then the variance of the combined experiment, as a function of the second variance \(\sigma_2^2=x\), is
curve(1/(1+1/x), 0, 10)
and so we always gain something but we gain the most if the second experiment has a small variance.
If each \(T_i\) is itself the mean of \(n_i\) independent observations with common variance \(\sigma^2\), then \(\sigma_i^2=\sigma^2/n_i\), so
\[w_i=1/\sigma^2_i=1/(\sigma^2/n_i)=n_i/\sigma^2\]
and so
\[\pmb{\hat{\theta}} = \frac{\sum_i n_i/\sigma^2 T_i}{\sum_i n_i/\sigma^2} = \frac{\sum_i n_i T_i}{\sum_i n_i}\]
and finally, if all the sample sizes are the same, the BLUE is simply the mean of the \(T_i\)'s.
Now say we have two correlated estimators \(T_1\) and \(T_2\) with
\[\pmb{D} = \begin{pmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho \\\sigma_1\sigma_2\rho & \sigma_2^2 \end{pmatrix}\]
then
\[\pmb{D}^{-1} =\frac1{\sigma_1^2\sigma_2^2(1-\rho^2)} \begin{pmatrix} \sigma_2^2 & -\sigma_1\sigma_2\rho \\ -\sigma_1\sigma_2\rho & \sigma_1^2 \end{pmatrix}\] \[ \begin{aligned} &\pmb{j}'\pmb{D}^{-1}\pmb{j} = \frac1{\sigma_1^2\sigma_2^2(1-\rho^2)} \begin{pmatrix} 1 & 1 \end{pmatrix}\begin{pmatrix} \sigma_2^2 -\sigma_1\sigma_2\rho \\ \sigma_1^2 -\sigma_1\sigma_2\rho \end{pmatrix} = \\ &\frac1{\sigma_1^2\sigma_2^2(1-\rho^2)} \left[\sigma_2^2 -\sigma_1\sigma_2\rho + \sigma_1^2 -\sigma_1\sigma_2\rho\right] = \\ &\frac{\sigma_1^2 + \sigma_2^2 -2\sigma_1\sigma_2\rho}{\sigma_1^2\sigma_2^2(1-\rho^2)} \\ \end{aligned} \]
and so
\[ \begin{aligned} &\pmb{\hat{\theta}}=[\pmb{j}'\pmb{D}^{-1}\pmb{j}]^{-}\pmb{j}'\pmb{D}^{-1}\pmb{T} = \\ &\left[\frac{\sigma_1^2 + \sigma_2^2 -2\sigma_1\sigma_2\rho}{\sigma_1^2\sigma_2^2(1-\rho^2)}\right]^{-1}\frac1{\sigma_1^2\sigma_2^2(1-\rho^2)}\pmb{j}' \begin{pmatrix} \sigma_2^2 & -\sigma_1\sigma_2\rho \\ -\sigma_1\sigma_2\rho & \sigma_1^2 \end{pmatrix}\pmb{T} = \\ &\frac{\sigma_1^2\sigma_2^2(1-\rho^2)}{\sigma_1^2 + \sigma_2^2 -2\sigma_1\sigma_2\rho}\frac1{\sigma_1^2\sigma_2^2(1-\rho^2)} \pmb{j}'\begin{pmatrix} \sigma_2^2T_1 -\sigma_1\sigma_2\rho T_2 \\ -\sigma_1\sigma_2\rho T_1 + \sigma_1^2 T_2\end{pmatrix} = \\ &\frac{(\sigma_2^2-\sigma_1\sigma_2\rho)T_1+(\sigma_1^2-\sigma_1\sigma_2\rho)T_2 }{\sigma_1^2 + \sigma_2^2 -2\sigma_1\sigma_2\rho} = \\ &\frac{a_2 T_1+ a_1 T_2}{a_1+a_2} \\ \end{aligned} \]
where \(a_i=\sigma_i^2-\sigma_1\sigma_2\rho\).
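We can check this closed form against the matrix formula numerically (illustrative values, not from the text):
# matrix formula vs the closed form (a2*T1 + a1*T2)/(a1 + a2)
s1=1.5; s2=1; rho=0.3                                   # made-up standard deviations and correlation
T12=c(4.9, 5.2)                                         # made-up estimates T_1, T_2
D=matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2), 2, 2)
j=rep(1, 2)
a=c(s1^2, s2^2)-s1*s2*rho
c(solve(t(j)%*%solve(D)%*%j)%*%t(j)%*%solve(D)%*%T12,   # GLS/BLUE formula
  (a[2]*T12[1]+a[1]*T12[2])/sum(a))                     # closed form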
The variance of the combined estimator is
\[var(\pmb{\hat{\theta}})=[\pmb{j}'\pmb{D}^{-1}\pmb{j}]^{-} = \frac{\sigma_1^2\sigma_2^2(1-\rho^2)}{\sigma_1^2 + \sigma_2^2 -2\sigma_1\sigma_2\rho}\]
Let’s investigate this a bit. To do so we fix \(\sigma_1=\sigma_2=1\), so we have
\[var(\pmb{\hat{\theta}})= \frac{1-\rho^2}{2-2\rho}=\frac{(1-\rho)(1+\rho)}{2(1-\rho)}=\frac{1+\rho}{2}\]
Note that \(-1\le \rho\le 1\), so \(0\le \frac{1+\rho}{2}\le 1 = var(T_i)\). There is always an improvement (unless \(\rho=1\)), and the best case is two highly negatively correlated experiments! If the two experiments are independent, \(\rho=0\) and the variance is 1/2.
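As before, we can draw this as a function of \(\rho\) (a small addition):
curve((1+x)/2, -1, 1, xlab=expression(rho), ylab="variance of combined estimator")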