
Covariance and Correlation

Definition (1.8.1)

Say X and Y are two random variables. Then the covariance is defined by

$$\text{cov}(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]$$

and the correlation of X and Y is defined by

$$\rho_{XY}=\text{cor}(X,Y)=\frac{\text{cov}(X,Y)}{\sigma_X\sigma_Y}$$

Note

$$\text{cov}(X,X)=\text{var}(X)$$

Note

Obviously, if $\text{cov}(X,Y)=0$, then

$$\rho_{XY}=\text{cor}(X,Y)=\frac{\text{cov}(X,Y)}{\sigma_X\sigma_Y}=0$$

as well.

As with the variance we have a simpler formula for actual calculations:

Theorem (1.8.2)

$$\text{cov}(X,Y)=E[XY]-E[X]E[Y]$$

proof omitted

Example (1.8.3)

Let X and Y be the sum and absolute value of the difference of two rolls of a die. What is the covariance of X and Y?

So we have

$$\mu_X=E[X]=2\cdot\tfrac{1}{36}+3\cdot\tfrac{2}{36}+...+12\cdot\tfrac{1}{36}=7.0$$

$$\mu_Y=E[Y]=0\cdot\tfrac{6}{36}+1\cdot\tfrac{10}{36}+...+5\cdot\tfrac{2}{36}=\tfrac{70}{36}$$

$$E[XY]=\sum_{x,y}xy\,f(x,y)=2\cdot 0\cdot\tfrac{1}{36}+3\cdot 1\cdot\tfrac{2}{36}+...=\tfrac{490}{36}$$

and so

$$\text{cov}(X,Y)=E[XY]-E[X]E[Y]=\tfrac{490}{36}-7.0\cdot\tfrac{70}{36}=0$$
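We can check this with a quick simulation (a sketch; the R code and its variable names are mine, in the spirit of the checks used later in these notes):

n=1e5
d1=sample(1:6, size=n, replace=TRUE)  # first roll
d2=sample(1:6, size=n, replace=TRUE)  # second roll
x=d1+d2  # sum of the two rolls
y=abs(d1-d2)  # absolute value of the difference
round(c(mean(x), mean(y), mean(x*y), cov(x, y)), 3)
# should be close to 7, 70/36 = 1.944, 490/36 = 13.611 and 0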

Note that we previously saw that the X and Y in the example above are not independent, so here we have an example where a covariance of 0 does not imply independence! It does work the other way around, though:

Theorem (1.8.4)

If X and Y are independent, then cov(X,Y)=0.

proof (in the case of X and Y continuous):

$$
\begin{aligned}
E[XY] &= \iint xy\,f(x,y)\,d(x,y) = \iint xy\,f_X(x)f_Y(y)\,d(x,y) \\
&= \left(\int xf_X(x)\,dx\right)\left(\int yf_Y(y)\,dy\right) = E[X]E[Y] \\
\text{cov}(X,Y) &= E[XY]-E[X]E[Y] = E[X]E[Y]-E[X]E[Y] = 0
\end{aligned}
$$


We saw above that E[X+Y]=E[X]+E[Y]. How about var(X+Y)?

Theorem (1.8.5)

$$\text{var}(X+Y)=\text{var}(X)+\text{var}(Y)+2\text{cov}(X,Y)$$

proof

$$
\begin{aligned}
\text{var}(X+Y) &= E[(X+Y)^2]-(E[X+Y])^2 \\
&= E[X^2+2XY+Y^2]-(E[X]+E[Y])^2 \\
&= E[X^2]+2E[XY]+E[Y^2]-\left(E[X]^2+2E[X]E[Y]+E[Y]^2\right) \\
&= \left(E[X^2]-E[X]^2\right)+\left(E[Y^2]-E[Y]^2\right)+2\left(E[XY]-E[X]E[Y]\right) \\
&= \text{var}(X)+\text{var}(Y)+2\text{cov}(X,Y)
\end{aligned}
$$

and if $X\perp Y$ we have

$$\text{var}(X+Y)=\text{var}(X)+\text{var}(Y)$$
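Here is a quick numerical illustration of the theorem (a sketch; the choice of distributions is mine, and the identity also holds exactly for the sample versions of variance and covariance):

n=1e4
x=rnorm(n)
y=x+rnorm(n)  # y is correlated with x, so cov(x,y) is not 0
c(var(x+y), var(x)+var(y)+2*cov(x,y))
# the two numbers agree (up to rounding)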

Example (1.8.5a)

This theorem can be used for what are called variance reduction methods. As an example, say we want to find $I=\int_0^1 e^x\,dx$. Now this integral is of course easy:

$$\int_0^1 e^x\,dx=e-1=1.71828$$

but let’s use it as an example anyway.

The idea of simulation is this: let $X\sim U[0,1]$, then

$$E[e^X]=\int_0^1 e^x\times 1\,dx=I$$

so if we generate many $X_i$'s we should find $\frac{1}{n}\sum e^{X_i}\approx I$.

n=100
y=exp(runif(n))
mean(y)
## [1] 1.736462

But this simulation has an error:

$$\text{var}\left(\frac{1}{n}\sum e^{X_i}\right)=\frac{1}{n}\text{var}(e^{X_1})=\frac{1}{n}\left(\int_0^1 e^{2x}\,dx-I^2\right)=\frac{1}{n}\left(\frac{1}{2}(e^2-1)-(e-1)^2\right)=0.242/n$$

Now if this error is too large we can of course use a larger n. But what if in a more complicated case this is not possible? Instead we can do this: let $U\sim U[0,1]$ and set $X_1=e^U$, $X_2=e^{1-U}$. Of course $1-U\sim U[0,1]$, so again $E[X_2]=E[e^{1-U}]=I$. But

$$\text{cov}(X_1,X_2)=\text{cov}(e^U,e^{1-U})=E[e^Ue^{1-U}]-E[e^U]E[e^{1-U}]=e-(e-1)^2=-0.234$$

and so

$$
\begin{aligned}
\text{var}\left(\frac{e^U+e^{1-U}}{2}\right) &= \frac{1}{4}\text{var}(e^U)+\frac{1}{4}\text{var}(e^{1-U})+\frac{1}{2}\text{cov}(e^U,e^{1-U}) \\
&= \frac{1}{4}\cdot 0.242+\frac{1}{4}\cdot 0.242+\frac{1}{2}\cdot(-0.234)=0.004 \\
\text{var}\left(\frac{1}{n/2}\sum_{i=1}^{n/2}\frac{e^{U_i}+e^{1-U_i}}{2}\right) &= \frac{2}{n}\text{var}\left(\frac{e^{U_1}+e^{1-U_1}}{2}\right)=\frac{0.008}{n}
\end{aligned}
$$

Let's check:

c(var(y)/n, 0.242/n)
## [1] 0.002359735 0.002420000
u=runif(n/2)
z=(exp(u)+exp(1-u))/2
mean(z)
## [1] 1.706106
c(var(z)/(n/2), 0.008/n)
## [1] 4.981067e-05 8.000000e-05

and we actually needed only half as many uniforms! Actually

0.242/0.008
## [1] 30.25

and so we would need 30 uniforms using the direct simulation for every one using this method to achieve the same precision.

Example (1.8.6)

Consider again the example from before: we have continuous rv's X and Y with joint density $f(x,y)=8xy$, $0\le x<y\le 1$. Find the covariance and the correlation of X and Y.

We have seen before that $f_Y(y)=4y^3$, $0<y<1$, so

$$E[Y]=\int yf_Y(y)\,dy=\int_0^1 y\cdot 4y^3\,dy=\tfrac{4}{5}y^5\Big|_0^1=\tfrac{4}{5}$$

Now

$$
\begin{aligned}
f_X(x) &= \int f(x,y)\,dy = \int_x^1 8xy\,dy = 4xy^2\Big|_x^1 = 4x(1-x^2),\;0<x<1 \\
E[X] &= \int_0^1 x\cdot 4x(1-x^2)\,dx = \int_0^1 4x^2-4x^4\,dx = \frac{4}{3}x^3-\frac{4}{5}x^5\Big|_0^1 = \frac{4}{3}-\frac{4}{5}=\frac{8}{15} \\
E[XY] &= \int_0^1\int_x^1 xy\cdot 8xy\,dy\,dx = \int_0^1 8x^2\left(\int_x^1 y^2\,dy\right)dx = \int_0^1 8x^2\left(\frac{1}{3}y^3\Big|_x^1\right)dx \\
&= \int_0^1 \frac{8}{3}\left(x^2-x^5\right)dx = \frac{8}{9}x^3-\frac{4}{9}x^6\Big|_0^1 = \frac{8}{9}-\frac{4}{9}=\frac{4}{9}
\end{aligned}
$$

and so

$$\text{cov}(X,Y)=\frac{4}{9}-\frac{8}{15}\times\frac{4}{5}=\frac{12}{675}$$

Also

$$
\begin{aligned}
E[X^2] &= \int_0^1 x^2\cdot 4(x-x^3)\,dx = \frac{1}{3} \\
\text{var}(X) &= \frac{1}{3}-\left(\frac{8}{15}\right)^2 = \frac{11}{225} \\
E[Y^2] &= \int_0^1 y^2\cdot 4y^3\,dy = \frac{2}{3} \\
\text{var}(Y) &= \frac{2}{3}-\left(\frac{4}{5}\right)^2 = \frac{2}{75} \\
\rho = \text{cor}(X,Y) &= \frac{\text{cov}(X,Y)}{\sqrt{\text{var}(X)\text{var}(Y)}} = \frac{12/675}{\sqrt{11/225\times 2/75}} = 0.492
\end{aligned}
$$
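Let's check by simulation (a sketch; the sampling scheme below, via the marginal of Y and the conditional distribution of X given Y, is mine):

n=1e5
y=runif(n)^(1/4)  # Y has cdf y^4 on (0,1), so use the inverse cdf method
x=y*sqrt(runif(n))  # given Y=y, X has cdf (x/y)^2 on (0,y)
round(c(cov(x, y), 12/675, cor(x, y), 0.492), 3)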

Example (1.8.6a)

Say X is a random variable with $P(X=x)=\frac{1}{3}$, $x=1,2,3$, and $Y|X=x\sim U[0,x]$. Then

$$
\begin{aligned}
f(x,y) &= f_X(x)f_{Y|X=x}(y|x) = \frac{1}{3}\cdot\frac{1}{x}I_{[0,x]}(y) \\
f_Y(y) &= \sum_{x=1}^3 f(x,y) = \frac{1}{3}\left(I_{[0,1]}(y)+\frac{1}{2}I_{[0,2]}(y)+\frac{1}{3}I_{[0,3]}(y)\right) \\
E[X] &= \sum_{x=1}^3 xf_X(x) = \sum_{x=1}^3 x\cdot\frac{1}{3} = (1+2+3)\cdot\frac{1}{3} = 2 \\
E[Y] &= \int yf_Y(y)\,dy = \int_0^3 \frac{y}{3}\left(I_{[0,1]}(y)+\frac{1}{2}I_{[0,2]}(y)+\frac{1}{3}I_{[0,3]}(y)\right)dy \\
&= \frac{1}{3}\left(\int_0^1 y\,dy+\frac{1}{2}\int_0^2 y\,dy+\frac{1}{3}\int_0^3 y\,dy\right) = \frac{1}{3}\left(\frac{1}{2}+\frac{1}{2}\cdot 2+\frac{1}{3}\cdot\frac{9}{2}\right)=1
\end{aligned}
$$

$$
\begin{aligned}
E[XY] &= \sum_{x=1}^3\int xy\,f(x,y)\,dy = \frac{1}{3}\left(\int_0^1 y\,dy+2\cdot\frac{1}{2}\int_0^2 y\,dy+3\cdot\frac{1}{3}\int_0^3 y\,dy\right) \\
&= \frac{1}{3}\left(\frac{y^2}{2}\Big|_0^1+\frac{y^2}{2}\Big|_0^2+\frac{y^2}{2}\Big|_0^3\right) = \frac{1}{3}\left(\frac{1}{2}+\frac{4}{2}+\frac{9}{2}\right)=\frac{7}{3} \\
\text{cov}(X,Y) &= E[XY]-E[X]E[Y] = \frac{7}{3}-2\times 1=\frac{1}{3}
\end{aligned}
$$

Let's check:

n=1e4
x=sample(1:3, size=n,replace=TRUE)
y=runif(n,0,x)
c(mean(x), mean(y), mean(x*y), cov(x,y))
## [1] 2.0010000 0.9943981 2.3211551 0.3313976

Example (1.8.7)

Say (X,Y) is a discrete rv with joint pdf f given by

$$f(0,0)=a,\quad f(0,1)=b,\quad f(1,0)=c,\quad f(1,1)=d$$

where a, b, c and d are numbers such that f is a pdf, that is $a,b,c,d\ge 0$ and $a+b+c+d=1$. Note that this is the most general case of a discrete random vector where X and Y just take two values.

What can be said in this generality?

Now the marginals of X and Y are given by

$$f_X(0)=a+b,\quad f_X(1)=c+d$$

$$f_Y(0)=a+c,\quad f_Y(1)=b+d$$

so

$$E[X]=0\times(a+b)+1\times(c+d)=c+d$$

$$E[Y]=0\times(a+c)+1\times(b+d)=b+d$$

also

$$E[XY]=0\times 0\times a+0\times 1\times b+1\times 0\times c+1\times 1\times d=d$$

and so

$$
\begin{aligned}
\text{cov}(X,Y) &= d-(c+d)(b+d) = d-cb-cd-bd-d^2 \\
&= d-bc-(c+b)d-d^2 = d-bc-(1-a-d)d-d^2 \\
&= d-bc-d+ad+d^2-d^2 = ad-bc
\end{aligned}
$$

so X and Y are uncorrelated iff $ad-bc=0$.
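We can check the formula for one specific choice of a, b, c and d (a sketch; the numbers are mine):

a=0.1; b=0.2; c=0.3; d=0.4
n=1e5
# sample pairs (x,y) from the joint pdf f, encoded as two-character strings
xy=sample(c("00", "01", "10", "11"), size=n, replace=TRUE, prob=c(a, b, c, d))
x=as.numeric(substr(xy, 1, 1))  # first digit is x
y=as.numeric(substr(xy, 2, 2))  # second digit is y
c(cov(x, y), a*d-b*c)
# both should be close to 0.1*0.4-0.2*0.3 = -0.02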

Of course

$$\begin{vmatrix}a&b\\c&d\end{vmatrix}=ad-bc$$

is the determinant of this matrix.

When are X and Y independent? For that we need

$$f(x,y)=f_X(x)f_Y(y)$$

for all x and y, so we need

$$
\begin{aligned}
a &= (a+b)(a+c) \\
b &= (a+b)(b+d) \\
c &= (c+d)(a+c) \\
d &= (c+d)(b+d)
\end{aligned}
$$

but

$$a=(a+b)(a+c)=a^2+(c+b)a+bc=a^2+(1-a-d)a+bc=a-ad+bc$$

or

$$ad-bc=0$$

Similarly we find that each of the other three equations holds iff $ad-bc=0$. So

$$X\perp Y \text{ iff } ad-bc=0$$

and here we have a case where $X\perp Y$ iff $\text{cov}(X,Y)=0$.

Notice that if $X\perp Y$ then also $rX+s\perp Y$ for any $r,s$ with $r\ne 0$, so the above does not depend on the fact that X and Y take values 0 and 1, although the proof is much easier this way.


If you know cov(X,Y)=2.37, what does this tell you? Not much, really, except X and Y are not independent. But if I tell you cor(X,Y)=0.89, that tells you more:

Theorem (1.8.8)

  1. $|\rho_{XY}|\le 1$
  2. $\rho_{XY}=\pm 1$ iff there exist $a\ne 0$ and $b$ such that $P(X=aY+b)=1$

proof

  1. Consider the function

$$h(t)=E\left[\left\{(X-\mu_X)t+(Y-\mu_Y)\right\}^2\right]$$

Now h(t) is the expectation of a non-negative function, so $h(t)\ge 0$ for all t. Also

$$
\begin{aligned}
h(t) &= E\left[(X-\mu_X)^2t^2+2(X-\mu_X)(Y-\mu_Y)t+(Y-\mu_Y)^2\right] \\
&= E[(X-\mu_X)^2]t^2+2E[(X-\mu_X)(Y-\mu_Y)]t+E[(Y-\mu_Y)^2] \\
&= \text{var}(X)t^2+2\text{cov}(X,Y)t+\text{var}(Y)
\end{aligned}
$$

Now because $h(t)\ge 0$ we know that either the parabola is entirely above the x-axis ($h(t)>0$) or just touches it ($h(t)=0$ for some t). In either case the discriminant can not be positive, and therefore

$$
\begin{aligned}
[2\text{cov}(X,Y)]^2-4\text{var}(X)\text{var}(Y) &\le 0 \\
\text{cov}(X,Y)^2 &\le \text{var}(X)\text{var}(Y) \\
\frac{\text{cov}(X,Y)^2}{\text{var}(X)\text{var}(Y)} &\le 1 \\
|\text{cor}(X,Y)| = \frac{|\text{cov}(X,Y)|}{\sqrt{\text{var}(X)\text{var}(Y)}} &\le 1
\end{aligned}
$$

  2. Continuing with the argument above we see that $|\rho_{XY}|=1$ iff

$$\text{cov}(X,Y)^2=\text{var}(X)\text{var}(Y)$$

that is if h(t) has a single root. But then

$$\left[(X-\mu_X)t+(Y-\mu_Y)\right]^2\ge 0$$

for all t and we have

$$h(t)=0 \text{ iff } P\left(\left[(X-\mu_X)t+(Y-\mu_Y)\right]^2=0\right)=1$$

This is the same as

$$P\left((X-\mu_X)t+(Y-\mu_Y)=0\right)=1$$

so $P(Y=aX+b)=1$ with $a=-t$ and $b=\mu_Xt+\mu_Y$, where t is the single root of h(t); solving for X gives the form stated in the theorem.

This theorem is also a direct consequence of a very famous inequality in mathematics. To state it in some generality we need the following

Definition (1.8.9)

Let V be some vector space. A mapping $\langle .,.\rangle: V^2\rightarrow\mathbb{R}$ is an inner product on V if for $x,y,z\in V$ and $a\in\mathbb{R}$

  1. $\langle x,x\rangle\ge 0$
  2. $\langle x,y\rangle=\overline{\langle y,x\rangle}$
  3. $\langle ax+y,z\rangle=a\langle x,z\rangle+\langle y,z\rangle$

where the bar denotes the complex conjugate.

A vector space with an inner product is called an inner product space.

Often we also write $\langle x,x\rangle=||x||^2$ and then $||x||$ is called the norm.

Example (1.8.10)

  1. Euclidean space $\mathbb{R}^n$ with $\langle x,y\rangle=\sum x_iy_i$

  2. the space of continuous functions C with

$$\langle f,g\rangle=\int f(x)g(x)\,dx$$

Note that in an inner product space we have a version of the Pythagorean theorem: if x and y are such that

$$\langle x,y\rangle=0$$

they are said to be orthogonal, and then we have

$$||x+y||^2=\langle x+y,x+y\rangle=\langle x,x\rangle+\langle x,y\rangle+\langle y,x\rangle+\langle y,y\rangle=||x||^2+||y||^2$$

Theorem (Cauchy-Schwarz)

Say x and y are any two vectors of an inner product space, then

$$|\langle x,y\rangle|\le||x||\times||y||$$

and "=" holds iff $x=ay$ for some $a\in\mathbb{R}$ (or $y=0$).
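Here is a quick numerical illustration in Euclidean space (a sketch; the vectors are arbitrary):

x=rnorm(10)
y=rnorm(10)
c(abs(sum(x*y)), sqrt(sum(x^2))*sqrt(sum(y^2)))
# the first number is never larger than the second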

The Cauchy-Schwarz inequality is one of the most important results in mathematics. It has a great many consequences; for example, the general formulation of the Heisenberg uncertainty principle in Quantum Mechanics is derived using the Cauchy-Schwarz inequality in the Hilbert space of quantum observables. For more on the inequality see https://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality.

Example (1.8.11)

Let X and Y be some rv's, and define $\langle X,Y\rangle=E[XY]$. Then $\langle X,Y\rangle$ is an inner product on the space of square-integrable random variables. Moreover if $E[X]=\mu$ and $E[Y]=\nu$, by Cauchy-Schwarz

$$
\begin{aligned}
\text{cov}(X,Y)^2 &= \left(E[(X-\mu)(Y-\nu)]\right)^2 = \langle X-\mu,Y-\nu\rangle^2 \\
&\le \langle X-\mu,X-\mu\rangle\langle Y-\nu,Y-\nu\rangle = E[(X-\mu)^2]E[(Y-\nu)^2] = \text{var}(X)\text{var}(Y)
\end{aligned}
$$

and so

$$\rho_{XY}^2=\text{cor}(X,Y)^2=\frac{\text{cov}(X,Y)^2}{\text{var}(X)\text{var}(Y)}\le\frac{\text{var}(X)\text{var}(Y)}{\text{var}(X)\text{var}(Y)}=1$$

and we have “=” iff P(aX+b=Y)=1


It is one of the fascinating features in mathematics that a theorem is sometimes easier to prove in greater generality:

proof (Cauchy-Schwarz)

Let x and y be two vectors in an inner product space. If $y=0$ the inequality is true (and in fact an equality), so assume $y\ne 0$. Let

$$z=x-\frac{\langle x,y\rangle}{\langle y,y\rangle}y$$

then

$$\langle z,y\rangle=\left\langle x-\frac{\langle x,y\rangle}{\langle y,y\rangle}y,\,y\right\rangle=\langle x,y\rangle-\frac{\langle x,y\rangle}{\langle y,y\rangle}\langle y,y\rangle=0$$

and so z and y are orthogonal. Therefore by the Pythagorean theorem

$$
\begin{aligned}
||x||^2 &= \left\|z+\frac{\langle x,y\rangle}{\langle y,y\rangle}y\right\|^2 = ||z||^2+\left\|\frac{\langle x,y\rangle}{\langle y,y\rangle}y\right\|^2 \\
&= ||z||^2+\frac{\langle x,y\rangle^2}{\langle y,y\rangle^2}||y||^2 \ge \frac{\langle x,y\rangle^2}{||y||^4}||y||^2 = \frac{\langle x,y\rangle^2}{||y||^2}
\end{aligned}
$$

and “=” iff z=0, in which case x=ay.

The heart of the proof of course is to consider the vector z, and it might not seem to you that this is an obvious thing to do. However, z defined as above appears in many different calculations in different areas of mathematics. For example it is part of the famous Gram-Schmidt procedure in linear algebra. In essence $x-z$ is the orthogonal projection of x onto the subspace spanned by y, and z is the component of x orthogonal to that subspace.


A little bit of care is needed with covariance and correlation: they are designed to measure linear relationships. Consider the following:

Example (1.8.12)

Let $X\sim U[-1,1]$, and let $Y=X^2$. Then $E[X]=0$ and

$$E[Y]=E[X^2]=\text{var}(X)+(E[X])^2=\text{var}(X)=\frac{(1-(-1))^2}{12}=\frac{4}{12}=\frac{1}{3}$$

But

$$E[XY]=E[X^3]=\frac{(1^4-(-1)^4)/4}{1-(-1)}=0$$

so $\text{cov}(X,Y)=0-0\times 1/3=0$.
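Again, a quick check by simulation:

n=1e5
x=runif(n, -1, 1)
y=x^2  # Y is completely determined by X
c(cov(x, y), cor(x, y))
# both are close to 0 even though Y is a function of X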

So here is a case of two uncorrelated rv’s, but if we know X we know exactly what Y is! Correlation is only a sensible measure of linear relationships, not any others.

So as we said above, if you know cov(X,Y)=2.37, that does not tell you much. But if you know cor(X,Y)=0.89 and if there is a linear relationship between X and Y, we know that it is a strong positive one.

Theorem (1.8.13)

The correlation is scale-invariant, that is if $a\ne 0$ and $b$ are any numbers, then

$$\text{cor}(aX+b,Y)=\text{sign}(a)\text{cor}(X,Y)$$

proof

$$
\begin{aligned}
\text{cov}(aX+b,Y) &= E[(aX+b)Y]-E[aX+b]E[Y] \\
&= E[aXY+bY]-(aE[X]+b)E[Y] \\
&= aE[XY]+bE[Y]-aE[X]E[Y]-bE[Y] \\
&= a\left(E[XY]-E[X]E[Y]\right) = a\times\text{cov}(X,Y) \\
\text{cor}(aX+b,Y) &= \frac{\text{cov}(aX+b,Y)}{\sqrt{\text{var}(aX+b)\text{var}(Y)}} = \frac{a\,\text{cov}(X,Y)}{\sqrt{a^2\text{var}(X)\text{var}(Y)}} = \text{sign}(a)\text{cor}(X,Y)
\end{aligned}
$$

so for example the correlation between the ocean temperature and the wind speed of a hurricane is the same whether the temperature is measured in Fahrenheit or Centigrade.
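A quick check (a sketch; the data below is simulated, not real hurricane data):

n=1000
temp.C=runif(n, 20, 30)  # ocean temperature in Centigrade
wind=50+3*temp.C+rnorm(n, 0, 10)  # wind speed, positively related to temperature
temp.F=9/5*temp.C+32  # the same temperatures in Fahrenheit
c(cor(temp.C, wind), cor(temp.F, wind))
# the two correlations are identical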