Covariance and Correlation

Definition (1.8.1)

Say \(X\) and \(Y\) are two random variables. Then the covariance is defined by

\[cov(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]\]

and the correlation of \(X\) and \(Y\) is defined by

\[\rho_{XY} = cor(X,Y) = cov(X,Y)/(\sigma_X \sigma_Y )\]

Note

\[cov(X,X) = var(X)\]

Note

Obviously, if \(cov(X,Y)=0\), then

\(\rho_{XY} =cor(X,Y)=cov(X,Y)/(\sigma_X \sigma_Y )=0\)

as well.

As with the variance we have a simpler formula for actual calculations:

Theorem (1.8.2)

\[cov(X,Y) = E[XY] - E[X]E[Y]\]

proof omitted
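The identity is easy to verify numerically; here is a minimal sketch with simulated data (the normal distributions and the linear relation are arbitrary choices), showing that the defining formula and the shortcut agree when applied to sample moments:

# both formulas applied to sample moments of simulated data
x=rnorm(1000)
y=x+rnorm(1000)
c(mean((x-mean(x))*(y-mean(y))), mean(x*y)-mean(x)*mean(y))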

Example (1.8.3)

Let \(X\) and \(Y\) be the sum and absolute value of the difference of two rolls of a die. What is the covariance of \(X\) and \(Y\)?

So we have

\[\mu_X = E[X] = 2*1/36 + 3*2/36 + ... + 12*1/36 = 7.0\]

\[\mu_Y = E[Y] = 0*6/36 + 1*10/36 + ... + 5*2/36 = 70/36\]

\[E[XY] = \sum_{x,y} xyf(x,y) = 2*0*1/36 + 3*1*2/36 + ... + 12*0*1/36 = 490/36\]

and so

\[cov(X,Y) = E[XY]-E[X]E[Y] = 490/36 - 7.0*70/36 = 0\]
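Let's check by enumerating all 36 equally likely rolls in R (the data frame is just one convenient way to do this):

# all 36 equally likely rolls of two dice
d=expand.grid(d1=1:6, d2=1:6)
x=d$d1+d$d2          # sum
y=abs(d$d1-d$d2)     # absolute difference
# should give 7, 70/36 and (essentially) 0
c(mean(x), mean(y), mean(x*y)-mean(x)*mean(y))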

Note that we previously saw that the \(X\) and \(Y\) of this example are not independent, so here we have an example showing that a covariance of 0 does not imply independence! It does work the other way around, though:

Theorem (1.8.4)

If \(X\) and \(Y\) are independent, then \(cov(X,Y) = 0\).

proof (in the case of \(X\) and \(Y\) continuous):

\[ \begin{aligned} &E[XY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xyf(x,y) d(x,y)=\\ &\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xyf_X(x)f_Y(y) d(x,y)= \\ &\left( \int_{-\infty}^{\infty} xf_X(x) dx\right)\left( \int_{-\infty}^{\infty} yf_Y(y) dy\right) = E[X]E[Y]\\ &\\ &cov(X,Y)=E[XY]-E[X]E[Y]=E[X]E[Y]-E[X]E[Y]=0 \end{aligned} \]


We saw above that \(E[X+Y] = E[X] + E[Y]\). How about \(var(X+Y)\)?

Theorem (1.8.5)

\[var(X+Y) = var(X)+var(Y)+2cov(X,Y)\] proof

\[ \begin{aligned} &var(X+Y) = \\ &E\left[(X+Y)^2\right]-\left(E\left[X+Y\right]\right)^2 = \\ &E\left[X^2+2XY+Y^2\right]-\left(E[X]+E[Y]\right)^2 = \\ &E[X^2]+2E[XY]+E[Y^2]-\left(E[X]^2+2E[X]E[Y]+E[Y]^2\right) = \\ &\left(E[X^2]-E[X]^2\right)+\left(E[Y^2]-E[Y]^2\right)+2\left(E[XY]-E[X]E[Y]\right)=\\ &var(X)+var(Y)+2cov(X,Y) \end{aligned} \] and if \(X \perp Y\) we have

\[var(X+Y) = var(X) + var(Y)\]
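Let's check the theorem with some simulated (and deliberately correlated) data; the specific distributions are an arbitrary choice:

# check var(X+Y) = var(X)+var(Y)+2cov(X,Y) numerically
n=1e5
x=rnorm(n)
y=0.5*x+rnorm(n)     # X and Y are correlated
c(var(x+y), var(x)+var(y)+2*cov(x,y))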

Example (1.8.5a)

This theorem can be used for what are called variance reduction methods. As an example, say we want to find \(I=\int_0^1 e^{x} dx\). Now this integral is of course easy:

\[\int_0^1 e^{x} dx=e-1=1.71828\]

but let’s use it as an example anyway.

The idea of simulation is this: let \(X\sim U[0,1]\), then

\[E[e^{X}]=\int_0^1 e^{x}\times1 dx = I\] so if we generate many \(X_i\)’s we should find \(\frac1n \sum e^{X_i}\approx I\).

n=100
y=exp(runif(n))
mean(y)
## [1] 1.736462

But this simulation has an error:

\[ \begin{aligned} &var(\frac1n \sum e^{X_i}) = \frac1n var(e^{X_1}) =\\ &\frac1n \left(\int_0^1 e^{2x} dx - I^2\right) =\\ &\frac1n \left(\frac12 (e^{2}-1) - (e-1)^2\right) =0.242/n\\ \end{aligned} \]

Now if this error is too large we can of course use a larger n. But what if in a more complicated case this is not possible? Instead we can do this: let \(U\sim U[0,1]\) and set \(X_1=e^U\), \(X_2=e^{1-U}\). Of course \(1-U\sim U[0,1]\), so again \(E[X_2]=E[e^{1-U}]=I\). But

\[ \begin{aligned} &cov(X_1,X_2) = cov(e^U,e^{1-U})= \\ &E[e^Ue^{1-U}]-E[e^U]E[e^{1-U}] = \\ & e-(e-1)^2 = -0.234\\ \end{aligned} \]

and so

\[ \begin{aligned} &var(\frac{e^U+e^{1-U}}2) = \\ &\frac14 var(e^U)+\frac14 var(e^{1-U})+\frac12 cov(e^U,e^{1-U}) = \\ &\frac14 0.242+\frac14 0.242+\frac12(-0.234) = 0.004\\ &var\left(\frac1{n/2}\sum_{i=1}^{n/2} \frac{e^{U_i}+e^{1-U_i}}2\right) = \\ &\frac2{n} var\left( \frac{e^{U_1}+e^{1-U_1}}2\right) = \frac{0.008}{n}\\ \end{aligned} \] Let’s check:

c(var(y)/n, 0.242/n)
## [1] 0.002359735 0.002420000
u=runif(n/2)
z=(exp(u)+exp(1-u))/2
mean(z)
## [1] 1.706106
c(var(z)/(n/2), 0.008/n)
## [1] 4.981067e-05 8.000000e-05

and we actually needed only half as many uniforms! In fact

0.242/0.008
## [1] 30.25

and so we would need 30 uniforms using the direct simulation for every one using this method to achieve the same precision.
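As an aside, the two constants 0.242 and -0.234 used above can be computed directly from the formulas:

# exact values of the variance and covariance constants
c((exp(2)-1)/2-(exp(1)-1)^2, exp(1)-(exp(1)-1)^2)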

Example (1.8.6)

Consider again the example from before: we have continuous rv’s \(X\) and \(Y\) with joint density \(f(x,y)=8xy\), \(0 \le x<y \le 1\). Find the covariance and the correlation of \(X\) and \(Y\).

We have seen before that \(f_Y(y)=4y^3\), \(0<y<1\), so

\[E[Y]= \int_{- \infty}^\infty yf_Y (y)dy = \int _0^1y4y^3dy = \frac45 y^5|_0 ^1 = \frac45\]

Now

\[ \begin{aligned} &f_X(x) = \int_{-\infty}^{\infty}f(x,y)dy = \int_x^18xy dy=\\ &4xy^2|_x^1 = 4x(1-x^2),0<x<1\\ &E[X] =\int_0^1 x4x(1-x^2)dx=\int_0^1 4x^2-4x^4dx= \\ &\frac43x^3-\frac45 x^5|_0^1 = \frac43-\frac45=\frac8{15} \end{aligned} \] \[ \begin{aligned} &E[XY] =\int_0^1\int_x^1 xy8xy dydx=\\ &\int_0^1 8x^2 \left(\int_x^1 y^2 dy\right)dx=\\ &\int_0^1 8x^2 \left(\frac13 y^3|_x^1\right) dx=\\ &\int_0^1 \frac83(x^2-x^5) dx=\\ &\frac89x^3-\frac49x^6|_0^1=\frac89-\frac49=\frac49 \end{aligned} \]

and so

\[cov(X,Y)=4/9-8/15\times 4/5 = 12/675\]

Also

\[ \begin{aligned} &E[X^2] =\int_0^1 x^2 4x(1-x^2)dx=\frac13 \\ &var(X) =\frac13-(\frac8{15})^2=\frac{11}{225} \\ &E[Y^2] =\int_0^1y^2 4y^3dy=\frac23 \\ &var(Y) =\frac23-(\frac45)^2=\frac2{75}\\ &\rho=cor(X,Y)=\frac{cov(X,Y)}{\sqrt{var(X)var(Y)}}=\frac{12/675}{\sqrt{11/225\times2/75}}=0.492 \end{aligned} \]
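Let's check via simulation. One way to sample from this density (a sketch, using inverse-cdf sampling, which is not part of the example above) is to draw \(Y\) from its cdf \(F_Y(y)=y^4\) and then \(X|Y=y\) from the conditional cdf \((x/y)^2\), which follows from \(f(x,y)/f_Y(y)=2x/y^2\):

# simulate from f(x,y)=8xy, 0<x<y<1, via the marginal of Y and the conditional of X|Y=y
n=1e5
y=runif(n)^(1/4)       # inverse of F_Y(y)=y^4
x=y*sqrt(runif(n))     # inverse of the conditional cdf (x/y)^2
# compare to cov(X,Y)=12/675 and cor(X,Y)=0.492
c(cov(x,y), 12/675, cor(x,y))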

Example (1.8.6a)

Say \(X\) is a random variable with \(P(X=x)=\frac13\), \(x=1,2,3\), and \(Y|X=x\sim U[0,x]\). Then

\[ \begin{aligned} &f(x,y) = f_X(x)f_{Y|X=x}(y|x) = \frac13\frac1xI_{[0,x]}(y)\\ &f_Y(y) = \sum_{x=1}^3 f(x,y) = \frac13\left(I_{[0,1]}(y)+\frac12I_{[0,2]}(y)+\frac13I_{[0,3]}(y)\right)\\ &E[X] = \sum_{x=1}^3 xf_X(x) = \sum_{x=1}^3 x\frac13 = (1+2+3)\frac13=2\\ &E[Y] = \int_{-\infty}^{\infty} yf_Y(y)dy=\\ &\int_0^3 y\frac13\left(I_{[0,1]}(y)+\frac12I_{[0,2]}(y)+\frac13I_{[0,3]}(y)\right)dy=\\ &\frac13\left(\int_0^1 y dy +\frac12\int_0^2 y dy+\frac13\int_0^3 y dy\right)=\\ &\frac13\left(\frac12+ \frac12 2+\frac13 \frac92\right)=\frac13\left(\frac12+1+\frac32\right)=1\\ \end{aligned} \]

\[ \begin{aligned} &E[XY]=\sum_{x=1}^3\int_{-\infty}^{\infty}xy f(x,y)dy=\\ &\frac13\left(\int_{0}^{1}y dy+2\frac12\int_{0}^{2}y dy+3\frac13\int_{0}^{3}y dy\right)=\\ &\frac13\left(\frac{y^2}2|_{0}^{1}+\frac{y^2}2|_{0}^{2}+\frac{y^2}2|_{0}^{3}\right)=\\ &\frac13\left(\frac{1}2+\frac{4}2+\frac{9}2\right)=\frac{7}{3}\\ &cov(X,Y)=E[XY]-E[X]E[Y]=\frac{7}{3}-2\times 1=\frac{1}{3} \end{aligned} \] Let’s check:

n=1e4
x=sample(1:3, size=n,replace=TRUE)
y=runif(n,0,x)
c(mean(x), mean(y), mean(x*y), cov(x,y))
## [1] 2.0010000 0.9943981 2.3211551 0.3313976

Example (1.8.7)

Say \((X,Y)\) is a discrete random vector with joint pdf \(f\) given by

\[ \begin{aligned} &f(0,0) =a \\ &f(0,1) =b \\ &f(1,0) =c \\ &f(1,1) =d \end{aligned} \] where a,b,c and d are numbers such that f is a pdf, that is \(a,b,c,d\ge 0\) and \(a+b+c+d=1\). Note that this is the most general case of a discrete random vector where \(X\) and \(Y\) just take two values.

What can be said in this generality?

Now the marginals of \(X\) and Y are given by

\[f_X(0)=a+b\text{, }f_X(1)=c+d\]

\[f_Y(0)=a+c, f_Y(1)=b+d\]

so

\[EX = 0\times (a+b)+1\times(c+d) = c+d\]

\[EY = 0\times(a+c)+1\times(b+d) = b+d\]

also

\[EXY = 0 \times 0 \times a + 1 \times0 \times b + 0 \times1 \times c + 1 \times1 \times d = d\]

and so

\[ \begin{aligned} &cov(X,Y) = \\ & d-(c+d)(b+d) = \\ & d-cb-cd-bd-d^2 = \\ & d-bc-(c+b)d-d^2 = \\ & d-bc-(1-a-d)d-d^2 = \\ & d-bc-d+ad+d^2-d^2 = \\ & ad-bc \end{aligned} \]
so \(X\) and \(Y\) are uncorrelated iff \(ad-bc=0\).

Of course

\[ \begin{vmatrix} a & b \\ c & d \end{vmatrix}=ad-bc \] is the determinant of this matrix.

When are \(X\) and Y independent? For that we need

\[f(x,y)=f_X(x)f_Y(y)\] for all x and y, so we need

\[ \begin{aligned} &a=(a+b)(a+c) \\ &b=(a+b)(b+d) \\ &c=(c+d)(a+c) \\ &d=(c+d)(b+d) \end{aligned} \]

but \[ \begin{aligned} &a = (a+b)(a+c) = \\ &a^2+(c+b)a+bc = \\ &a^2+(1-a-d)a+bc = \\ &a-ad+bc \end{aligned} \]

so the first equation reads \(a=a-ad+bc\), which holds iff

\[ad-bc=0\]

Similarly we find that each of the other three equations holds iff \(ad-bc=0\). So

\[X \perp Y \text{ iff } ad-bc=0\]

and here we have a case where \(X \perp Y\) iff \(cov(X,Y)=0\).

Notice that any random variable taking just two values can be written as \(rX+s\) with \(r\ne 0\) and \(X\) taking the values 0 and 1, and both independence and zero covariance are preserved under such transformations. So the result above does not depend on the fact that \(X\) and \(Y\) take the values 0 and 1, although the proof is much easier this way.
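Let's check the formula \(cov(X,Y)=ad-bc\) for one arbitrarily chosen set of probabilities (cc is used instead of c to avoid masking R's c function):

# an arbitrary choice of a, b, c, d (nonnegative, summing to 1)
a=0.1; b=0.2; cc=0.3; d=0.4
EX=cc+d; EY=b+d; EXY=d
c(EXY-EX*EY, a*d-b*cc)     # both should be the same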


If you know \(cov(X,Y)=2.37\), what does this tell you? Not much, really, except \(X\) and \(Y\) are not independent. But if I tell you \(cor(X,Y)=0.89\), that tells you more:

Theorem (1.8.8)

  1. \(|\rho_{XY} | \le 1\)
  2. \(\rho_{XY} = \pm 1\) iff there exist \(a \ne 0\) and b such that \(P(X=aY+b)=1\)

proof

  1. Consider the function

\[h(t) = E\left[\left\{(X- \mu_X )t+(Y- \mu_Y )\right\}^2\right]\]

Now \(h(t)\) is the expectation of a non-negative function, so \(h(t) \ge 0\) for all t. Also

\[ \begin{aligned} &h(t) = E\left[(X- \mu_X )^2t^2+2(X- \mu_X )(Y- \mu_Y )t+(Y- \mu_Y )^2\right]=\\ &E\left[(X- \mu_X )^2\right]t^2+2E\left[(X- \mu_X )(Y- \mu_Y )\right]t+E\left[(Y- \mu_Y )^2\right] = \\ &var(X)t^2+2cov(X,Y)t+var(Y) \end{aligned} \] Now because \(h(t)\ge 0\) we know that either the parabola is entirely above the x-axis (\(h(t)>0\)) or just touches it (\(h(t)=0\) for some t). In either case the discriminant cannot be positive, and therefore

\[ \begin{aligned} &[2cov(X,Y)]^2-4var(X)var(Y)\le 0 \\ &\\ &cov(X,Y)^2\le var(X)var(Y) \\ & \frac{cov(X,Y)^2}{var(X)var(Y)} \le 1 \\ &\\ &|cor(X,Y)|= \frac{|cov(X,Y)|}{\sqrt{var(X)var(Y)}}\le 1 \end{aligned} \]

  2. Continuing with the argument above we see that \(|\rho_{XY} |=1\) iff

\[cov(X,Y)^2= var(X)var(Y)\]

that is, iff \(h(t)\) has a single root. Now

\[\left[(X- \mu _X )t+(Y- \mu _Y )\right]^2 \ge 0\]

for all \(t\), and the expectation of a non-negative random variable is 0 iff that random variable is 0 with probability 1. Therefore

\[h(t)=0 \text{ iff }P\left(\left[(X- \mu _X )t+(Y- \mu _Y )\right]^2=0\right)=1\]

This is the same as

\[P\left((X- \mu _X )t+(Y- \mu _Y )=0\right)=1\]

so \(P(X=aY+b)=1\) with \(a=-1/t\) and \(b=\mu_X+\mu_Y/t\), where \(t\) is the single root of \(h(t)\) (note that \(t\ne 0\) because \(|\rho_{XY}|=1\) implies \(cov(X,Y)\ne 0\)).
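Here is a quick numerical illustration with simulated data (the distributions and the line \(2x+3\) are arbitrary choices): any correlation lies in \([-1,1]\), and an exact linear relationship gives correlation \(\pm 1\):

# |cor| is at most 1, and equals 1 for an exact linear relationship
n=1e4
x=rnorm(n)
y=x+rnorm(n)
c(cor(x,y), cor(x, 2*x+3))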

This theorem is also a direct consequence of a very famous inequality in mathematics. To state it in some generality we need the following

Definition (1.8.9)

Let \(V\) be some vector space. A mapping \(\langle .,.\rangle :V^2\rightarrow \mathbb{R}\) is an inner product on V if for \(x,y,z \in V\) and \(a \in \mathbb{R}\)

  1. \(\langle x,x\rangle \ge 0\), with \(\langle x,x\rangle =0\) iff \(x=0\)
  2. \(\langle x,y\rangle =\overline{\langle y,x\rangle }\)
  3. \(\langle ax+y, z\rangle =a\langle x,z\rangle +\langle y,z\rangle\)

where the bar denotes the complex conjugate (which can be ignored for real-valued inner products).

A vector space with an inner product is called an inner product space.

Often we also write \(\langle x,x\rangle=||x||^2\) and then \(||x||\) is called the norm.

Example (1.8.10)

  1. Euclidean space \(\mathbb{R}^n\) with \(\langle x,y\rangle= \sum x_i y_i\)

  2. the space \(C[a,b]\) of continuous functions on a finite interval with

\[\langle f,g\rangle= \int f(x)g(x)dx\]

Note that in an inner product space we have a version of the Pythagorean theorem: if x and y are such that

\[\langle x,y\rangle=0\]

they are said to be orthogonal, and then we have

\[ \begin{aligned} &||x+y||^2 = \\ &\langle x+y,x+y\rangle = \\ &\langle x,x\rangle +\langle x,y\rangle +\langle y,x\rangle +\langle y,y\rangle = \\ &||x||^2 +||y||^2 \end{aligned} \]

Theorem (Cauchy-Schwarz)

Say \(x\) and \(y\) are any two vectors of an inner product space. Then

\[|\langle x,y\rangle| \le ||x||\times||y||\]

and “=” holds iff \(x=ay\) for some \(a \in \mathbb{R}\) (or \(y=0\)).

The Cauchy-Schwarz inequality is one of the most important results in mathematics. It has a great many consequences; for example, the general formulation of the Heisenberg uncertainty principle in quantum mechanics is derived using the Cauchy–Schwarz inequality in the Hilbert space of quantum observables. For more on the inequality see https://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality.

Example (1.8.11)

Let \(X\) and \(Y\) be random variables, and define \(\langle X,Y\rangle=E[XY]\). Then \(\langle X,Y\rangle\) is an inner product on the space of square-integrable random variables. Moreover, if \(E[X]=\mu\) and \(E[Y]=\nu\), then by Cauchy-Schwarz

\[ \begin{aligned} &cov(X,Y)^2 = \\ &(E[(X-\mu)(Y-\nu)])^2 = \\ &\left(\langle X- \mu ,Y-\nu\rangle\right)^2 \text{ } \le \\ &\langle X- \mu ,X- \mu \rangle\langle Y-\nu,Y-\nu\rangle = \\ &E[(X- \mu)^2]E[(Y-\nu)^2] =\\ &var(X)var(Y) \\ \end{aligned} \]

and so

\[ \begin{aligned} &\rho_{XY}^2 = cor(X,Y)^2 = \\ &\frac{cov(X,Y)^2}{var(X)var(Y)} \le \\ &\frac{var(X)var(Y)}{var(X)var(Y)} = 1\\ \end{aligned} \]

and we have “=” iff \(P(aX+b=Y)=1\)


It is one of the fascinating features in mathematics that a theorem is sometimes easier to prove in greater generality:

proof (Cauchy-Schwarz)

Let \(x\) and \(y\) be two vectors in an inner product space. If \(y=0\) the inequality is true (and in fact an equality), so assume \(y \ne 0\). Let

\[z=x-\frac{\langle x,y\rangle}{\langle y,y \rangle}y\]

then

\[ \begin{aligned} &\langle z,y \rangle = \langle x-\frac{\langle x,y\rangle}{\langle y,y \rangle}y,y \rangle = \\ &\langle x,y \rangle - \frac{\langle x,y\rangle}{\langle y,y \rangle}\langle y,y \rangle = 0 \end{aligned} \] and so \(z\) and \(y\) are orthogonal. Therefore by the Pythagorean theorem

\[ \begin{aligned} &||x||^2 = ||z+\frac{\langle x,y\rangle}{\langle y,y \rangle}y ||^2 =\\ &||z||^2+||\frac{\langle x,y\rangle}{\langle y,y \rangle} y ||^2 = \\ &||z||^2+\left(\frac{\langle x,y\rangle}{\langle y,y \rangle}\right)^2|| y ||^2 \ge \\ &\frac{(\langle x,y\rangle)^2}{||y||^4}|| y ||^2=\frac{(\langle x,y\rangle)^2}{||y||^2} \end{aligned} \]

and “=” iff \(z=0\), in which case \(x=ay\).

The heart of the proof of course is to consider the vector \(z\), and it might not seem to you that this is an obvious thing to do. However, \(z\) defined as above appears in many different calculations in different areas of mathematics. For example it is part of the famous Gram-Schmidt procedure in linear algebra. In essence, \(\frac{\langle x,y\rangle}{\langle y,y \rangle}y\) is the orthogonal projection of \(x\) onto the subspace spanned by \(y\), and \(z\) is the part of \(x\) orthogonal to that subspace.
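Here is a small numerical illustration in \(\mathbb{R}^3\) with the Euclidean inner product (the two vectors are arbitrary): \(z\) is orthogonal to \(y\), and \(\langle x,y\rangle^2\) does not exceed \(||x||^2||y||^2\):

# orthogonal projection and Cauchy-Schwarz for two arbitrary vectors in R^3
x=c(1,2,3); y=c(1,0,1)
z=x-sum(x*y)/sum(y*y)*y     # z as in the proof
# <z,y> should be 0, and <x,y>^2 should be at most ||x||^2 ||y||^2
c(sum(z*y), sum(x*y)^2, sum(x*x)*sum(y*y))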


A word of caution about covariance and correlation: they are designed to measure linear relationships, and only those. Consider the following:

Example (1.8.12)

let \(X \sim U[-1,1]\), and let \(Y=X^2\). Then \(E[X]=0\) and

\[ \begin{aligned} &E[Y] = E[X^2] = \\ &var(X)+(E[X])^2 = \\ &var(X) = (1-(-1))^2/12 = 4/12 = 1/3 \end{aligned} \] But

\[E[XY] = E[X^3] = \frac{1^4-(-1)^4}{4\times(1-(-1))} = 0\]

so \(cov(X,Y)=0-0\times 1/3 = 0\).
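Let's check:

# X uniform on [-1,1] and Y=X^2: uncorrelated but completely dependent
n=1e5
x=runif(n, -1, 1)
y=x^2
c(cov(x,y), cor(x,y))     # both should be close to 0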

So here is a case of two uncorrelated rv’s, but if we know \(X\) we know exactly what \(Y\) is! Correlation is only a sensible measure of linear relationships, not any others.

So as we said above, if you know \(cov(X,Y)=2.37\), that does not tell you much. But if you know \(cor(X,Y)=0.89\) and if there is a linear relationship between \(X\) and \(Y\), we know that it is a strong positive one.

Theorem (1.8.13)

The correlation is scale-invariant, that is if \(a \ne 0\) and b are any numbers, then

\[cor(aX+b,Y)=sign(a)cor(X,Y)\]

proof

\[ \begin{aligned} &cov(aX+b,Y) = E \left[ (aX+b)Y \right] - E[aX+b]E[Y] = \\ &E \left[ aXY+bY \right] - (aE[X]+b)E[Y] = \\ &aE[XY]+bE[Y] -aE[X]E[Y]-bE[Y] = \\ &a\left(E[XY]-E[X]E[Y]\right)=a \times cov(X,Y)\\ &cor(aX+b,Y)= \frac{cov(aX+b, Y)}{\sqrt{var(aX+b)var(Y)}} = \\ &\frac{acov(X, Y)}{\sqrt{a^2var(X)var(Y)}} =\\ &sign(a)cor(X,Y) \end{aligned} \]

So, for example, the correlation between the ocean temperature and the wind speed of a hurricane is the same whether the temperature is measured in Fahrenheit or Centigrade.
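Let's check with some made-up temperature and wind speed data (the numbers are purely illustrative; only the conversion \(F=\frac95 C+32\) matters):

# correlation is unchanged by the change of scale from Centigrade to Fahrenheit
n=1e4
tempC=runif(n, 20, 30)                  # made-up ocean temperatures in Centigrade
speed=50+5*(tempC-25)+rnorm(n, 0, 10)   # made-up wind speeds
c(cor(tempC, speed), cor(9/5*tempC+32, speed))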