
Sums of Squares, Mean and Variance of Quadratic Forms

Sums of Squares

Example (5.3.1)

Let \pmb{x}=(x_1,..,x_n)' be a random sample from some population with mean \mu and standard deviation \sigma. Then the total sum of squares is given by \sum_{i=1}^n x_i^2. Let \bar{x}=\frac1n\sum_{i=1}^n x_i be the sample mean, then

\begin{aligned} &\sum_{i=1}^n (x_i-\bar{x})^2 = \sum_{i=1}^n (x_i^2-2x_i\bar{x}+\bar{x}^2) = \\ &\sum_{i=1}^n x_i^2-\sum_{i=1}^n 2x_i\bar{x}+\sum_{i=1}^n \bar{x}^2 = \\ &\sum_{i=1}^n x_i^2-2\bar{x}\sum_{i=1}^n x_i+n\bar{x}^2 = \\ &\sum_{i=1}^n x_i^2-2\bar{x}n\bar{x}+n\bar{x}^2 = \sum_{i=1}^n x_i^2-n\bar{x}^2 \end{aligned}

and so we find that the total sum of squares can be partitioned into a sum of squares about the mean and the sum of squares due to the mean:

\sum_{i=1}^n x_i^2 = \sum_{i=1}^n (x_i-\bar{x})^2+n\bar{x}^2

This can also be written as a quadratic form:

\sum_{i=1}^n x_i^2 = \pmb{x}'\pmb{x} = \pmb{x}'\pmb{I}\pmb{x}

Recall that \pmb{j}=(1 \; ... \; 1)' and

\pmb{J}=\begin{pmatrix} 1&...&1 \\ \vdots & &\vdots \\ 1&...&1 \end{pmatrix}

then

\bar{x}=\frac1n\pmb{j}'\pmb{x} and

n\bar{x}^2=n\left(\frac1n\pmb{j}'\pmb{x}\right)^2=\frac1n\pmb{x}'\pmb{j}\pmb{j}'\pmb{x}=\frac1n\pmb{x}'\pmb{J}\pmb{x}=\pmb{x}'\left(\frac1n\pmb{J}\right)\pmb{x}

and so

\sum_{i=1}^n x_i^2 = \pmb{x}'\left(\pmb{I}-\frac1n\pmb{J}\right)\pmb{x}+\pmb{x}'\left(\frac1n\pmb{J}\right)\pmb{x}
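This partition is easy to check numerically. A minimal R sketch (the vector x below is illustrative, not from the text):

```r
# verify sum(x_i^2) = x'(I - J/n)x + x'(J/n)x for an illustrative vector x
x <- c(2, 5, 3, 8, 1)
n <- length(x)
I <- diag(n)
J <- matrix(1, n, n)                          # matrix of ones
about_mean <- drop(t(x) %*% (I - J/n) %*% x)  # sum of squares about the mean
due_mean <- drop(t(x) %*% (J/n) %*% x)        # sum of squares due to the mean
c(sum(x^2), about_mean + due_mean, due_mean, n * mean(x)^2)
```

The first two entries agree, as do the last two.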

Theorem (5.3.2)

  1. \pmb{I}=\left(\pmb{I}-\frac1n\pmb{J}\right)+\frac1n\pmb{J}
  2. \pmb{I}, \pmb{I}-\frac1n\pmb{J} and \frac1n\pmb{J} are idempotent
  3. \left(\pmb{I}-\frac1n\pmb{J}\right)\left(\frac1n\pmb{J}\right)=\pmb{O}

proof

  1. obvious

\begin{aligned} &\left(\pmb{I}-\frac1n\pmb{J}\right)^2=\left(\pmb{I}-\frac1n\pmb{J}\right)\left(\pmb{I}-\frac1n\pmb{J}\right) = \\ &\pmb{I}-\frac1n\pmb{J}-\frac1n\pmb{J}+\left(\frac1n\pmb{J}\right)^2 = \\ &\pmb{I}-\frac2n\pmb{J}+\frac1{n^2}\pmb{J}\pmb{J}=\pmb{I}-\frac2n\pmb{J}+\frac1{n^2}(n\pmb{J}) = \\ &\pmb{I}-\frac2n\pmb{J}+\frac1n\pmb{J}=\pmb{I}-\frac1n\pmb{J} \end{aligned}

this proof also shows that \left(\frac1n\pmb{J}\right)^2=\frac1n\pmb{J}.

\left(\pmb{I}-\frac1n\pmb{J}\right)\left(\frac1n\pmb{J}\right)=\frac1n\pmb{J}-\frac1{n^2}\pmb{J}\pmb{J}=\frac1n\pmb{J}-\frac1n\pmb{J}=\pmb{O}
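The three claims can likewise be checked numerically in R (n = 6 is an arbitrary illustrative choice):

```r
# check (5.3.2) numerically: I - J/n and J/n are idempotent, their product is O
n <- 6
I <- diag(n)
J <- matrix(1, n, n)
A <- I - J/n
B <- J/n
max(abs(A %*% A - A))  # 0 up to rounding: A is idempotent
max(abs(B %*% B - B))  # 0 up to rounding: B is idempotent
max(abs(A %*% B))      # 0 up to rounding: A B = O
```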

Mean and Variance of Quadratic Forms

Theorem (5.3.3)

If \pmb{X} is a random vector with mean \pmb{\mu} and covariance matrix \pmb{\Sigma} and if \pmb{A} is a symmetric matrix of constants, then

E[\pmb{X}'\pmb{A}\pmb{X}]=tr(\pmb{A}\pmb{\Sigma})+\pmb{\mu}'\pmb{A}\pmb{\mu}

proof

Note that \pmb{X}'\pmb{A}\pmb{X} is a scalar, so always \pmb{X}'\pmb{A}\pmb{X}=tr(\pmb{X}'\pmb{A}\pmb{X})

\begin{aligned} &\pmb{\Sigma}=E[\pmb{X}\pmb{X}']-\pmb{\mu}\pmb{\mu}' \Rightarrow E[\pmb{X}\pmb{X}']=\pmb{\Sigma}+\pmb{\mu}\pmb{\mu}' \\ &E[\pmb{X}'\pmb{A}\pmb{X}]=E[tr(\pmb{X}'\pmb{A}\pmb{X})] \overset{\text{(by 4.2.11)}}{=} E[tr(\pmb{A}\pmb{X}\pmb{X}')] = \\ &tr(E[\pmb{A}\pmb{X}\pmb{X}'])=tr(\pmb{A}E[\pmb{X}\pmb{X}'])=tr(\pmb{A}[\pmb{\Sigma}+\pmb{\mu}\pmb{\mu}']) = \\ &tr(\pmb{A}\pmb{\Sigma}+\pmb{A}\pmb{\mu}\pmb{\mu}')=tr(\pmb{A}\pmb{\Sigma})+tr(\pmb{A}\pmb{\mu}\pmb{\mu}') = \\ &tr(\pmb{A}\pmb{\Sigma})+tr(\pmb{\mu}'\pmb{A}\pmb{\mu})=tr(\pmb{A}\pmb{\Sigma})+\pmb{\mu}'\pmb{A}\pmb{\mu} \end{aligned}

Example (5.3.4)

Let X\sim N(\mu,\sigma^2); \pmb{A}=(a), a>0, then \pmb{\Sigma}=(\sigma^2)

\begin{aligned} &E[\pmb{x}'\pmb{A}\pmb{x}]=E[aX^2]=aE[X^2]=a(var(X)+E[X]^2) = \\ &a(\sigma^2+\mu^2)=a\sigma^2+a\mu^2=tr(\pmb{A}\pmb{\Sigma})+\pmb{\mu}'\pmb{A}\pmb{\mu} \end{aligned}

Example (5.3.5)

Say \pmb{X}=\begin{pmatrix} X \\ Y \end{pmatrix}\sim N\left(\begin{pmatrix} 1 \\ 2 \end{pmatrix},\begin{pmatrix} 3&1 \\ 1&2 \end{pmatrix}\right) and we want to find E[X^2+2XY+2Y^2].

Direct solution:

\begin{aligned} &E[X^2+2XY+2Y^2]=E[X^2]+2E[XY]+2E[Y^2] = \\ &var(X)+E[X]^2+2(cov(X,Y)+E[X]E[Y])+2(var(Y)+E[Y]^2) = \\ &3+1^2+2(1+1\cdot 2)+2(2+2^2)=4+6+12=22 \end{aligned}

or using the formula above: say \pmb{A}=\begin{pmatrix} 1&1 \\ 1&2 \end{pmatrix}, then

\begin{aligned} &\pmb{X}'\pmb{A}\pmb{X}=\begin{pmatrix} X&Y \end{pmatrix}\begin{pmatrix} 1&1 \\ 1&2 \end{pmatrix}\begin{pmatrix} X \\ Y \end{pmatrix}=\begin{pmatrix} X&Y \end{pmatrix}\begin{pmatrix} X+Y \\ X+2Y \end{pmatrix} = \\ &X(X+Y)+Y(X+2Y)=X^2+2XY+2Y^2 \\ &\pmb{A}\pmb{\Sigma}=\begin{pmatrix} 1&1 \\ 1&2 \end{pmatrix}\begin{pmatrix} 3&1 \\ 1&2 \end{pmatrix}=\begin{pmatrix} 4&3 \\ 5&5 \end{pmatrix};\quad tr(\pmb{A}\pmb{\Sigma})=4+5=9 \\ &\pmb{\mu}'\pmb{A}\pmb{\mu}=\begin{pmatrix} 1&2 \end{pmatrix}\begin{pmatrix} 1&1 \\ 1&2 \end{pmatrix}\begin{pmatrix} 1 \\ 2 \end{pmatrix}=\begin{pmatrix} 1&2 \end{pmatrix}\begin{pmatrix} 3 \\ 5 \end{pmatrix}=13 \\ &E[\pmb{X}'\pmb{A}\pmb{X}]=tr(\pmb{A}\pmb{\Sigma})+\pmb{\mu}'\pmb{A}\pmb{\mu}=9+13=22 \end{aligned}
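In the spirit of the R checks further below, the value 22 can be confirmed by simulation. This sketch stays in base R by generating the bivariate normal through the Cholesky factor instead of mvtnorm::rmvnorm:

```r
# simulate N(mu, Sigma) draws via the Cholesky factor: if R'R = Sigma and
# z ~ N(0, I), then mu + R'z ~ N(mu, Sigma); rows of z %*% R do this in bulk
set.seed(111)
mu <- c(1, 2)
Sigma <- matrix(c(3, 1, 1, 2), 2, 2)
z <- matrix(rnorm(2e5 * 2), ncol = 2)
x <- sweep(z %*% chol(Sigma), 2, mu, "+")
m <- mean(x[, 1]^2 + 2 * x[, 1] * x[, 2] + 2 * x[, 2]^2)
m  # should be close to 22
```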

Example (5.3.5a)

Say \pmb{X}=(X_1,..,X_n)' has a multivariate normal distribution with mean vector \pmb{\mu} and covariance matrix \pmb{\Sigma}. We want to find E\left[\sum_{i=1}^n X_i^2\right]. Of course we have

\begin{aligned} &E\left[\sum_{i=1}^n X_i^2\right]=\sum_{i=1}^n E[X_i^2]=\sum_{i=1}^n (var(X_i)+E[X_i]^2) = \\ &\sum_{i=1}^n var(X_i)+\sum_{i=1}^n E[X_i]^2=\sum_{i=1}^n \sigma_{ii}+\sum_{i=1}^n \mu_i^2 = \\ &tr(\pmb{\Sigma})+\pmb{\mu}'\pmb{\mu}=tr(\pmb{I}\pmb{\Sigma})+\pmb{\mu}'\pmb{I}\pmb{\mu} \end{aligned}

but \sum_{i=1}^n X_i^2=\pmb{X}'\pmb{X}=\pmb{X}'\pmb{I}\pmb{X}, so the same result follows directly from (5.3.3).

Definition (5.3.6)

Let \pmb{x}=(x_1,..,x_n)' be a sample from a random vector \pmb{X}. The sample variance is defined by

s^2=\frac1{n-1}\sum_{i=1}^n (x_i-\bar{x})^2

By (5.3.1) we have

(n-1)s^2=\pmb{x}'\left(\pmb{I}-\frac1n\pmb{J}\right)\pmb{x}

Say E[X_1]=\mu and var(X_1)=\sigma^2. In a random sample the X_i's are independent and identically distributed, so E[\pmb{X}]=\mu\pmb{j} and cov(\pmb{X})=\sigma^2\pmb{I}. Set

\pmb{A}=\pmb{I}-\frac1n\pmb{J}, \pmb{\Sigma}=\sigma^2\pmb{I} and \pmb{\mu}=\mu\pmb{j}, therefore

\begin{aligned} &E\left[\sum_{i=1}^n (X_i-\bar{X})^2\right]=E[(n-1)s^2]=E\left[\pmb{X}'\left(\pmb{I}-\frac1n\pmb{J}\right)\pmb{X}\right] \overset{\text{(by 5.3.3)}}{=} \\ &tr\left[\left(\pmb{I}-\frac1n\pmb{J}\right)\sigma^2\pmb{I}\right]+\mu\pmb{j}'\left(\pmb{I}-\frac1n\pmb{J}\right)\mu\pmb{j} = \\ &\sigma^2 tr\left[\pmb{I}-\frac1n\pmb{J}\right]+\mu^2\left(\pmb{j}'\pmb{j}-\frac1n\pmb{j}'\pmb{J}\pmb{j}\right) = \\ &\sigma^2\left(n-\frac{n}{n}\right)+\mu^2\left(n-\frac1n n^2\right)=\sigma^2(n-1) \end{aligned}

and so

E[s^2]=\frac1{n-1}E\left[\sum_{i=1}^n (X_i-\bar{X})^2\right]=\sigma^2

and we see that s^2 is an unbiased estimator of \sigma^2.
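A quick simulation sketch of this unbiasedness (mu = 5, sigma = 2 and n = 10 are illustrative choices):

```r
# estimate E[s^2] by simulation, computing s^2 as the quadratic form
# x'(I - J/n)x / (n - 1)
set.seed(222)
n <- 10
A <- diag(n) - matrix(1, n, n)/n   # I - J/n
s2 <- replicate(2e4, {
  x <- rnorm(n, mean = 5, sd = 2)  # random sample with sigma^2 = 4
  drop(t(x) %*% A %*% x) / (n - 1)
})
mean(s2)  # close to sigma^2 = 4
```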

Theorem (5.3.7)

Let \pmb{X}\sim N_p(\pmb{\mu},\pmb{\Sigma}), then the moment generating function of \pmb{X}'\pmb{A}\pmb{X} is given by

\psi(t)=\vert \pmb{I}-2t\pmb{A}\pmb{\Sigma} \vert^{-1/2}\exp\left\{-\pmb{\mu}'\left[\pmb{I}-(\pmb{I}-2t\pmb{A}\pmb{\Sigma})^{-1}\right]\pmb{\Sigma}^{-1}\pmb{\mu}/2\right\}

proof

\begin{aligned} &\psi(t)=c_1\int ..\int \exp\{t\pmb{x}'\pmb{A}\pmb{x}\}\exp\{-(\pmb{x}-\pmb{\mu})'\pmb{\Sigma}^{-1}(\pmb{x}-\pmb{\mu})/2\}d\pmb{x} = \\ &c_1\int ..\int \exp\{-[(\pmb{x}-\pmb{\mu})'\pmb{\Sigma}^{-1}(\pmb{x}-\pmb{\mu})-2t\pmb{x}'\pmb{A}\pmb{x}]/2\}d\pmb{x} = \\ &c_1\int ..\int \exp\{-[\pmb{x}'(\pmb{I}-2t\pmb{A}\pmb{\Sigma})\pmb{\Sigma}^{-1}\pmb{x}-2\pmb{\mu}'\pmb{\Sigma}^{-1}\pmb{x}+\pmb{\mu}'\pmb{\Sigma}^{-1}\pmb{\mu}]/2\}d\pmb{x} \end{aligned}

where

c_1=1/[(2\pi)^{p/2}\vert \pmb{\Sigma} \vert^{1/2}]

Now if t is close to 0, \pmb{I}-2t\pmb{A}\pmb{\Sigma} is nonsingular. Let \pmb{\theta}'=\pmb{\mu}'(\pmb{I}-2t\pmb{A}\pmb{\Sigma})^{-1} and \pmb{V}^{-1}=(\pmb{I}-2t\pmb{A}\pmb{\Sigma})\pmb{\Sigma}^{-1}, then we have

\psi(t)=c_1c_2\int ..\int c_3\exp\{-(\pmb{x}-\pmb{\theta})'\pmb{V}^{-1}(\pmb{x}-\pmb{\theta})/2\}d\pmb{x}

where

c_2=(2\pi)^{p/2}\vert \pmb{V} \vert^{1/2}\exp\{-(\pmb{\mu}'\pmb{\Sigma}^{-1}\pmb{\mu}-\pmb{\theta}'\pmb{V}^{-1}\pmb{\theta})/2\}

and

c3=1/[(2π)p/2|VV|1/2]

The integral is one because it is integrating out a multivariate normal, and therefore

\psi(t)=c_1c_2=\frac{\vert \pmb{V} \vert^{1/2}}{\vert \pmb{\Sigma} \vert^{1/2}}\exp\{-(\pmb{\mu}'\pmb{\Sigma}^{-1}\pmb{\mu}-\pmb{\theta}'\pmb{V}^{-1}\pmb{\theta})/2\}

and replacing the terms yields the result.

Example (5.3.7a)

Say \pmb{X}\sim N_p(\pmb{0},\pmb{I}) and let Y=\sum_{i=1}^p X_i^2. Then Y=\pmb{X}'\pmb{X}=\pmb{X}'\pmb{I}\pmb{X}, and so the mgf is given by

\begin{aligned} &\psi_Y(t)=\vert \pmb{I}-2t\pmb{A}\pmb{\Sigma} \vert^{-1/2}\exp\{-\pmb{\mu}'[\pmb{I}-(\pmb{I}-2t\pmb{A}\pmb{\Sigma})^{-1}]\pmb{\Sigma}^{-1}\pmb{\mu}/2\} = \\ &\vert \pmb{I}-2t\pmb{I} \vert^{-1/2}\exp\{-\pmb{0}'[\pmb{I}-(\pmb{I}-2t\pmb{I})^{-1}]\pmb{0}/2\} = \\ &\vert (1-2t)\pmb{I} \vert^{-1/2}=(1-2t)^{-p/2} \end{aligned}

which is the mgf of a \chi^2 distribution with p degrees of freedom.
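A simulation sketch of this fact (p = 5 is an illustrative choice): the mean and variance of Y should match those of a chi-square with p degrees of freedom, namely p and 2p.

```r
# Y = X'X for X ~ N_p(0, I): compare simulated mean/variance with p and 2p
set.seed(333)
p <- 5
y <- replicate(5e4, sum(rnorm(p)^2))
c(mean(y), var(y))  # close to (5, 10)
```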

Theorem (5.3.8)

Say \pmb{X}\sim N_p(\pmb{\mu},\pmb{\Sigma}), then

var(\pmb{X}'\pmb{A}\pmb{X})=2tr[(\pmb{A}\pmb{\Sigma})^2]+4\pmb{\mu}'\pmb{A}\pmb{\Sigma}\pmb{A}\pmb{\mu}

proof

From probability theory we know that for any rv Y with mgf ψ we have

E[Y^k]=\frac{d^k\psi(t)}{dt^k}\Big\vert_{t=0}

Therefore we have

\begin{aligned} &\frac{d\log \psi(t)}{dt}=\frac{\psi'(t)}{\psi(t)} \\ &\frac{d^2\log \psi(t)}{dt^2}=\frac{\psi''(t)\psi(t)-\psi'(t)^2}{\psi(t)^2} \end{aligned}

and since \psi(0)=1, \psi'(0)=E[Y] and \psi''(0)=E[Y^2], evaluating the second derivative at t=0 gives E[Y^2]-E[Y]^2=var(Y).

By (5.3.7) we have

m(t) = \log \psi_{\pmb{X}'\pmb{A}\pmb{X}}(t)=-\frac12 \log \vert \pmb{C} \vert -\frac12 \pmb{\mu}'(\pmb{I}-\pmb{C}^{-1})\pmb{\Sigma}^{-1}\pmb{\mu}

where \pmb{C}=\pmb{I}-2t\pmb{A}\pmb{\Sigma}.

For the first term we find

\begin{aligned} &\frac{d^2\log \vert C\vert}{dt^2} = \\ &\frac{d}{dt}\frac{d \vert C \vert/dt}{\vert C \vert}=\\ &\frac{d^2 \vert C \vert/dt^2\vert C \vert-(d \vert C \vert/dt)^2}{\vert C \vert^2} = \\ &\frac1{\vert \pmb{C} \vert}\left[ \frac{d^2\vert \pmb{C} \vert}{dt^2} \right]-\frac1{\vert \pmb{C} \vert^2}\left[ \frac{d\vert \pmb{C} \vert}{dt} \right]^2 \end{aligned}

For the second term, note that \frac{\partial \pmb{C}}{\partial t}=-2\pmb{A}\pmb{\Sigma}. By (4.3.28) we have

\frac{\partial \pmb{A}^{-1}}{\partial x} = -\pmb{A}^{-1}\frac{\partial \pmb{A}}{\partial x}\pmb{A}^{-1} and so

\frac{\partial \pmb{C}^{-1}}{\partial t} = -\pmb{C}^{-1}\frac{\partial \pmb{C}}{\partial t}\pmb{C}^{-1}=2\pmb{C}^{-1}\pmb{A\Sigma}\pmb{C}^{-1}

For the second derivative we find

\begin{aligned} &\frac{\partial^2 \pmb{C}^{-1}}{\partial t^2} = \\ &\frac{d}{dt}\left\{2\pmb{C}^{-1}\pmb{A\Sigma}\pmb{C}^{-1}\right\} = \\ &2\frac{d}{dt}\left\{\pmb{C}^{-1}\right\}\pmb{A\Sigma}\pmb{C}^{-1} +2\pmb{C}^{-1}\pmb{A\Sigma}\frac{d}{dt}\left\{\pmb{C}^{-1}\right\} = \\ &4\pmb{C}^{-1}\pmb{A\Sigma}\pmb{C}^{-1}\pmb{A\Sigma}\pmb{C}^{-1} +4\pmb{C}^{-1}\pmb{A\Sigma}\pmb{C}^{-1}\pmb{A\Sigma}\pmb{C}^{-1} = \\ &8\pmb{C}^{-1}\left\{\pmb{A\Sigma}\pmb{C}^{-1}\pmb{A\Sigma}\right\}\pmb{C}^{-1} \\ \end{aligned}

so now we have

\begin{aligned} &m''(t) = \frac{d^2}{dt^2}\left\{ -\frac12 \ln \vert \pmb{C} \vert -\frac12 \pmb{\mu}'(\pmb{I}-\pmb{C}^{-1})\pmb{\Sigma}^{-1}\pmb{\mu}\right\} = \\ &-\frac12 \frac1{\vert \pmb{C} \vert}\left[ \frac{d^2\vert \pmb{C} \vert}{dt^2} \right]+\frac12 \frac1{\vert \pmb{C} \vert^2}\left[ \frac{d\vert \pmb{C} \vert}{dt} \right]^2 -\frac12\pmb{\mu}'\left(-8\pmb{C}^{-1}\left\{\pmb{A\Sigma}\pmb{C}^{-1}\pmb{A\Sigma}\right\}\pmb{C}^{-1}\right)\pmb{\Sigma}^{-1}\pmb{\mu} = \\ &-\frac12 \frac1{\vert \pmb{C} \vert}\left[ \frac{d^2\vert \pmb{C} \vert}{dt^2} \right]+\frac12 \frac1{\vert \pmb{C} \vert^2}\left[ \frac{d\vert \pmb{C} \vert}{dt} \right]^2 +4\pmb{\mu}'\pmb{C}^{-1}\left\{\pmb{A\Sigma}\pmb{C}^{-1}\pmb{A\Sigma}\right\}\pmb{C}^{-1}\pmb{\Sigma}^{-1}\pmb{\mu} \\ \end{aligned}

If t=0 we have \pmb{C}=\pmb{I} and the second term becomes 4\pmb{\mu}'\pmb{A\Sigma A}\pmb{\mu}, as required.

For the first term recall that the determinant of a matrix is equal to the product of the eigenvalues, so if the eigenvalues of \pmb{A}\pmb{\Sigma} are \lambda_1,..,\lambda_p we have

\begin{aligned} &\vert \pmb{C} \vert = \prod_{i=1}^p (1-2t\lambda_i) = \\ &1-2t\sum_{i=1}^p \lambda_i +4t^2\sum_{i<j} \lambda_i \lambda_j -+...+(-1)^p 2^p t^p \prod_{i=1}^p \lambda_i;\quad \vert \pmb{C} \vert_{t=0} =1\\ &\frac{d \vert \pmb{C} \vert}{dt} = -2\sum_{i=1}^p \lambda_i +8t\sum_{i<j} \lambda_i \lambda_j +\text{ higher order terms in }t\\ &\frac{d \vert \pmb{C} \vert}{dt}\Big\vert_{t=0} = -2\sum_{i=1}^p \lambda_i = -2tr(\pmb{A}\pmb{\Sigma})\\ &\frac{d^2 \vert \pmb{C} \vert}{dt^2} = 8\sum_{i<j} \lambda_i \lambda_j +\text{ higher order terms in }t\\ &\frac{d^2 \vert \pmb{C} \vert}{dt^2}\Big\vert_{t=0} =8\sum_{i<j} \lambda_i \lambda_j \end{aligned}

so the first term at t=0 is

\begin{aligned} &-\frac12 \frac1{\vert \pmb{C} \vert}\left[ \frac{d^2\vert \pmb{C} \vert}{dt^2}\Big\vert_{t=0} \right]+\frac12 \frac1{\vert \pmb{C} \vert^2}\left[ \frac{d\vert \pmb{C} \vert}{dt}\Big\vert_{t=0} \right]^2 = \\ &-\frac12\, 8\sum_{i<j} \lambda_i \lambda_j+\frac12 [-2tr(\pmb{A}\pmb{\Sigma})]^2 =\\ &2[tr(\pmb{A}\pmb{\Sigma})]^2-4\sum_{i<j} \lambda_i \lambda_j \end{aligned}

and indeed, since [tr(\pmb{A\Sigma})]^2=\left[\sum_{i=1}^p\lambda_i\right]^2=\sum_{i=1}^p\lambda_i^2+2\sum_{i<j}\lambda_i\lambda_j and tr([\pmb{A\Sigma}]^2)=\sum_{i=1}^p\lambda_i^2, we find

2[tr(\pmb{A}\pmb{\Sigma})]^2-4\sum_{i<j} \lambda_i \lambda_j = 2\,tr([\pmb{A}\pmb{\Sigma}]^2)

so the first term contributes 2tr([\pmb{A\Sigma}]^2), as required.
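The eigenvalue identity behind this step can be checked numerically (the symmetric A and positive definite Sigma below are illustrative):

```r
# with lambda the eigenvalues of A Sigma:
# 2*tr(A Sigma)^2 - 4*sum_{i<j} lambda_i lambda_j = 2*tr((A Sigma)^2)
A <- matrix(c(2, 1, 0, 1, 3, 1, 0, 1, 2), 3, 3)
Sigma <- matrix(c(2, 0.5, 0, 0.5, 1, 0.2, 0, 0.2, 1.5), 3, 3)
M <- A %*% Sigma
lam <- Re(eigen(M)$values)  # real since A is symmetric, Sigma positive definite
pairsum <- sum(combn(3, 2, function(ij) prod(lam[ij])))  # sum over i < j
lhs <- 2 * sum(diag(M))^2 - 4 * pairsum
rhs <- 2 * sum(diag(M %*% M))
c(lhs, rhs)  # the two agree
```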

Example (5.3.9)

Let X\sim N(\mu, \sigma^2); \pmb{A}=(a), a>0, then \pmb{\Sigma}=(\sigma^2). Let Z\sim N(0,1) and recall that Z^2\sim \chi^2(1), and therefore E[Z]=0, E[Z^2]=1, E[Z^3]=0 and E[Z^4]=3 (=2+1^2, the variance plus squared mean of a \chi^2(1)). So

\begin{aligned} &var(x'Ax) = var(aX^2)=a^2var(X^2)= a^2(E[X^4]-E[X^2]^2)\\ &E[X^2] = var(X)+E[X]^2 = \sigma^2+\mu^2 \\ &E[X^4] = E[(\sigma Z+\mu)^4]=\\ &\sigma^4 E[Z^4]+4\sigma^3 E[Z^3]\mu+6\sigma^2 E[Z^2]\mu^2+4\sigma E[Z]\mu^3+\mu^4 =\\ &\sigma^4\times 3+4\sigma^3\times 0\mu+6\sigma^2\times 1\mu^2+4\sigma\times0 \mu^3+\mu^4 =\\ &3\sigma^4+6\sigma^2 \mu^2+\mu^4\\ &var(x'Ax) = a^2\left(3\sigma^4+6\sigma^2 \mu^2+\mu^4-(\sigma^2+\mu^2)^2\right) = \\ &a^2\left(3\sigma^4+6\sigma^2 \mu^2+\mu^4-\sigma^4-2\sigma^2\mu^2-\mu^4\right) =\\ &a^2\left(2\sigma^4+4\sigma^2 \mu^2\right) = 2a^2\sigma^2\left(\sigma^2+2 \mu^2\right) \end{aligned}

but also

\begin{aligned} &2tr((\pmb{A}\pmb{\Sigma})^2)+4\pmb{\mu}'\pmb{A}\pmb{\Sigma}\pmb{A}\pmb{\mu} =\\ &2a^2\sigma^4+4\mu a\sigma^2a\mu = 2a^2\sigma^2(\sigma^2+2\mu^2) \end{aligned}
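As with the earlier examples, this can be checked by simulation (a = 2, mu = 1, sigma^2 = 3 are illustrative values, for which the formula gives 120):

```r
# compare the simulated variance of a*X^2 with 2 a^2 sigma^2 (sigma^2 + 2 mu^2)
set.seed(444)
a <- 2; mu <- 1; sig2 <- 3
x <- rnorm(5e5, mean = mu, sd = sqrt(sig2))
c(var(a * x^2), 2 * a^2 * sig2 * (sig2 + 2 * mu^2))  # both close to 120
```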

Example (5.3.10)

Say \pmb{X} = \begin{pmatrix} X \\ Y \end{pmatrix}\sim N( \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 3&1 \\ 1&2 \end{pmatrix}) and we want to find var(X^2+2XY+2Y^2).

Let’s try this directly:

\begin{aligned} &var(X^2+2XY+2Y^2) = \\ &var(X^2)+var(2XY)+var(2Y^2)+\\ &2\left[ cov(X^2, 2XY)+cov(X^2, 2Y^2)+cov(2XY, 2Y^2) \right] = \\ &var(X^2)+4var(XY)+4var(Y^2)+\\ &4\left[ cov(X^2, XY)+cov(X^2, Y^2)+2cov(XY, Y^2) \right] \end{aligned}

and many of these terms are not easy to calculate. For example, to find var(XY) we would need the distribution of the product of two correlated normal random variables. But we can use the formula above:

recall that \pmb{A}=\begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix} and \pmb{A}\pmb{\Sigma}=\begin{pmatrix} 4&3 \\ 5&5 \end{pmatrix}, so

\begin{aligned} &(\pmb{A}\pmb{\Sigma})^2= \begin{pmatrix} 4&3 \\ 5&5 \end{pmatrix} \begin{pmatrix} 4&3 \\ 5&5 \end{pmatrix}= \begin{pmatrix} 31&27 \\ 45&40 \end{pmatrix}\\ &tr((\pmb{A}\pmb{\Sigma})^2)=31+40=71 \end{aligned}

\begin{aligned} &\pmb{\mu}'\pmb{A}\pmb{\Sigma}\pmb{A}\pmb{\mu} = \\ & \begin{pmatrix} 1& 2 \end{pmatrix} \begin{pmatrix} 4&3 \\ 5&5 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} 1\\ 2 \end{pmatrix}=\\ & \begin{pmatrix} 1& 2 \end{pmatrix} \begin{pmatrix} 7&10 \\ 10&15 \end{pmatrix} \begin{pmatrix} 1\\ 2 \end{pmatrix}=107 \\ &var(\pmb{X}'\pmb{A}\pmb{X}) = 2tr[(\pmb{A}\pmb{\Sigma})^2]+4\pmb{\mu}'\pmb{A}\pmb{\Sigma}\pmb{A}\pmb{\mu} = \\ &2\times 71+4\times 107 = 570 \end{aligned}

let’s check this with R:

library(mvtnorm)
x=rmvnorm(2e5, c(1, 2), matrix(c(3, 1, 1, 2), 2, 2))
var(x[,1]^2+2*x[, 1]*x[, 2]+2*x[, 2]^2)
## [1] 570.5925

Theorem (5.3.11)

Say \pmb{X}\sim N_p( \pmb{\mu}, \pmb{\Sigma}), then

cov(\pmb{X},\pmb{X}'\pmb{A}\pmb{X}) = 2\pmb{\Sigma}\pmb{A}\pmb{\mu}

proof

by definition

\begin{aligned} &cov(\pmb{X},\pmb{X}'\pmb{A}\pmb{X})=\\ &E[(\pmb{X}-E[\pmb{X}])(\pmb{X}'\pmb{A}\pmb{X}-E[\pmb{X}'\pmb{A}\pmb{X}])] = \\ &E[(\pmb{X}-\pmb{\mu})(\pmb{X}'\pmb{A}\pmb{X}-tr(\pmb{A}\pmb{\Sigma})-\pmb{\mu}'\pmb{A}\pmb{\mu})] = \\ &E[(\pmb{X}-\pmb{\mu})\left((\pmb{X}-\pmb{\mu})'\pmb{A}(\pmb{X}-\pmb{\mu})-tr(\pmb{A}\pmb{\Sigma})+2(\pmb{X}-\pmb{\mu})'\pmb{A}\pmb{\mu}\right)] = \\ &E[(\pmb{X}-\pmb{\mu})(\pmb{X}-\pmb{\mu})'\pmb{A}(\pmb{X}-\pmb{\mu})]-E[(\pmb{X}-\pmb{\mu})tr(\pmb{A}\pmb{\Sigma})]+2E[(\pmb{X}-\pmb{\mu})(\pmb{X}-\pmb{\mu})'\pmb{A}\pmb{\mu}] = \\ &E[(\pmb{X}-\pmb{\mu})(\pmb{X}-\pmb{\mu})'\pmb{A}(\pmb{X}-\pmb{\mu})]-E[\pmb{X}-\pmb{\mu}]tr(\pmb{A}\pmb{\Sigma})+2E[(\pmb{X}-\pmb{\mu})(\pmb{X}-\pmb{\mu})']\pmb{A}\pmb{\mu} = \\ &\pmb{0}-\pmb{0}+2\pmb{\Sigma}\pmb{A}\pmb{\mu} \end{aligned}

the first term is 0 because it is the third central moment of a multivariate normal.
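A simulation sketch of (5.3.11), reusing the bivariate normal of (5.3.10); again base R via the Cholesky factor rather than mvtnorm. Here 2\pmb{\Sigma}\pmb{A}\pmb{\mu}=(28, 26)':

```r
# check cov(X, X'AX) = 2 Sigma A mu by simulation
set.seed(555)
mu <- c(1, 2)
Sigma <- matrix(c(3, 1, 1, 2), 2, 2)
A <- matrix(c(1, 1, 1, 2), 2, 2)
z <- matrix(rnorm(4e5), ncol = 2)
x <- sweep(z %*% chol(Sigma), 2, mu, "+")
q <- rowSums((x %*% A) * x)        # the quadratic form X'AX for each draw
c(cov(x[, 1], q), cov(x[, 2], q))  # close to 2*Sigma %*% A %*% mu = (28, 26)
```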

Corollary (5.3.12)

Say \pmb{X}\sim N_p( \pmb{\mu}, \pmb{\Sigma}), and let \pmb{B} be an n\times p matrix of constants, then

cov(\pmb{BX},\pmb{X}'\pmb{A}\pmb{X}) = 2\pmb{B\Sigma}\pmb{A}\pmb{\mu}

Theorem (5.3.13)

Let \pmb{V}= \begin{pmatrix} \pmb{X} \\ \pmb{Y} \end{pmatrix} be partitioned in the usual way, with \pmb{\Sigma}_{xy} a p\times q matrix. Let \pmb{A} be a q\times p matrix of constants, then

E[\pmb{y}'\pmb{Ax}]=tr(\pmb{A\Sigma}_{xy})+ \pmb{\mu}'_y \pmb{A\mu}_x

proof similar to the proof of (5.3.3)

Example (5.3.13a)

Say E[X]=\mu_x, var(X)=\sigma_x^2, E[Y]=\mu_y, var(Y)=\sigma_y^2 and cor(X,Y)=\rho. If \pmb{A}=(a) then \pmb{x'Ay}=axy, \pmb{\Sigma}_{xy}=\sigma_x\sigma_y\rho and so

\begin{aligned} &E[aXY] =aE[XY]=a\left(cov(X,Y)+E[X]E[Y]\right)=a\sigma_x\sigma_y\rho+a\mu_x\mu_y \\ &tr(\pmb{A\Sigma}_{xy})+ \pmb{\mu}'_y \pmb{A\mu}_x = a\sigma_x\sigma_y\rho+a\mu_x\mu_y \end{aligned}

Definition (5.3.14)

We have a random sample

\begin{pmatrix} X \\ Y \end{pmatrix}= \left(\begin{pmatrix} x_1 \\ y_1 \end{pmatrix},..,\begin{pmatrix} x_n \\ y_n \end{pmatrix} \right)

from some bivariate distribution with means \mu_x and \mu_y, variances \sigma_x^2, \sigma_y^2. A standard estimator of the population covariance \sigma_{xy} is the sample covariance defined by

s_{xy}=\frac1{n-1}\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})

We find

\begin{aligned} &s_{xy}=\frac1{n-1}\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})=\\ &\frac1{n-1}\left(\sum_{i=1}^n x_iy_i-n\bar{x}\bar{y}\right)= \\ &\frac1{n-1}\pmb{x}'[\pmb{I}-(1/n)\pmb{J}]\pmb{y} \end{aligned}

Since (x_i, y_i) is independent of (x_j, y_j) for i\ne j, we can define a random vector \pmb{V} = \begin{pmatrix} \pmb{X} \\ \pmb{Y} \end{pmatrix} with E[\pmb{V}] = \begin{pmatrix} \mu_x\pmb{j} \\ \mu_y\pmb{j} \end{pmatrix} and covariance matrix

\pmb{\Sigma}= \begin{pmatrix} \sigma_x^2\pmb{I} &\sigma_{xy}\pmb{I} \\ \sigma_{xy}\pmb{I} & \sigma_y^2\pmb{I} \end{pmatrix}

Let \pmb{A}=\pmb{I}-(1/n)\pmb{J} and so

\begin{aligned} &E[\pmb{x}'[\pmb{I}-(1/n)\pmb{J}]\pmb{y}] = \\ &tr\left[(\pmb{I}-(1/n)\pmb{J})\sigma_{xy}\pmb{I}\right]+\mu_y\pmb{j}'(\pmb{I}-(1/n)\pmb{J})\mu_x\pmb{j} = \\ &\sigma_{xy}tr[\pmb{I}-(1/n)\pmb{J}]+\mu_x\mu_y(\pmb{j}'\pmb{j}-(1/n)\pmb{j}'\pmb{j}\pmb{j}'\pmb{j}) = \\ &\sigma_{xy}(n-1)+0=(n-1)\sigma_{xy} \end{aligned}

and so s_{xy} is an unbiased estimator of \sigma_{xy}.

Let’s check with R:

library(mvtnorm)
S=matrix(c(2, 0.5, 0.5, 4), 2, 2)
x=rmvnorm(1e5, c(1, 2), S)
cov(x)
##          [,1]     [,2]
## [1,] 1.996217 0.488235
## [2,] 0.488235 4.025144
sum( (x[,1]-mean(x[,1]))*(x[,2]-mean(x[,2])))/(nrow(x)-1)
## [1] 0.488235