Definition
The expectation (or expected value) of a random variable g(X) is defined by \(Eg(X)=\sum_x g(x)f(x)\) if X is discrete and \(Eg(X)=\int_{-\infty}^{\infty} g(x)f(x)\,dx\) if X is continuous, where f is the density of X.
Say X is the sum of two dice. What is EX? What is EX²?
we have
x | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|
P(X=x) | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36 |
so
EX = 2×1/36 + 3×2/36 + 4×3/36 + 5×4/36 + 6×5/36 + 7×6/36 + 8×5/36 + 9×4/36 + 10×3/36 + 11×2/36 + 12×1/36 = 7
EX² = 2²×1/36 + 3²×2/36 + 4²×3/36 + 5²×4/36 + 6²×5/36 + 7²×6/36 + 8²×5/36 + 9²×4/36 + 10²×3/36 + 11²×2/36 + 12²×1/36 = 54.83
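As a quick sanity check (not part of the original notes), here is a short Python sketch that enumerates all 36 equally likely outcomes and computes both expectations exactly:

```python
from fractions import Fraction

# enumerate all 36 equally likely outcomes of two fair dice
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]

EX = sum(Fraction(i + j, 36) for i, j in outcomes)          # E[X]
EX2 = sum(Fraction((i + j) ** 2, 36) for i, j in outcomes)  # E[X^2]

print(EX, float(EX))    # 7 7.0
print(EX2, float(EX2))  # 329/6, about 54.83
```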
We roll a fair die until the first time we get a six. What is the expected number of rolls?
We saw that \(f(x) = \frac{1}{6}\left(\frac{5}{6}\right)^{x-1}\) if x \(\in\) {1,2,..}. Here we just have g(x)=x, so
$$EX = \sum_{x=1}^{\infty} x\,\frac{1}{6}\left(\frac{5}{6}\right)^{x-1}$$
How do we compute this sum? Here is a “standard” trick:
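A sketch of the trick (differentiate the geometric series term by term, for 0&lt;q&lt;1):

$$\sum_{x=1}^{\infty} x\,q^{x-1}=\sum_{x=1}^{\infty}\frac{d}{dq}\,q^{x}=\frac{d}{dq}\,\frac{q}{1-q}=\frac{1}{(1-q)^{2}}$$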
and so we find \(EX = \frac{1}{6}\cdot\frac{1}{(1-5/6)^2} = \frac{1}{6}\cdot 36 = 6\).
This is a special example of a geometric rv, that is a discrete rv X with density \(f(x)=p(1-p)^{x-1}\), x = 1, 2, …. Note that if we replace 1/6 above with p, we can show that EX = 1/p.
X is said to have a uniform [A,B] distribution if f(x)=1/(B-A) for A&lt;x&lt;B, 0 otherwise. We denote a uniform [A,B] rv by X~U[A,B].
Find \(EX^k\) (this is called the k-th moment of X).
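A sketch of the computation, integrating the definition directly:

$$E[X^k]=\int_A^B \frac{x^k}{B-A}\,dx=\frac{B^{k+1}-A^{k+1}}{(k+1)(B-A)}$$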
Some special expectations are the mean of X, defined by
μ=EX
and the variance defined by
σ² = Var(X) = E[(X-μ)²]
Related to the variance is the standard deviation σ, the square root of the variance.
Proposition

For any random variables X and Y and any numbers a and b we have E[aX+bY] = aEX + bEY.
Proposition
Let X and Y be rv’s and g and h functions on \(\mathbb{R}\). Then if X\(\perp\)Y we have
E[g(X)h(Y)] = E[g(X)]×E[h(Y)]
There is a useful way to “link” probabilities and expectations via the indicator function \(I_A\), defined as \(I_A(x)=1\) if \(x\in A\) and \(I_A(x)=0\) otherwise,
because with this we have for a continuous r.v. X with density f:
$$E[I_A(X)] = \int_{-\infty}^{\infty} I_A(x)f(x)\,dx = \int_A f(x)\,dx = P(X\in A)$$
Definition
The covariance of two r.v. X and Y is defined by
\(cov(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]\)
The correlation of X and Y is defined by
\(cor(X,Y)=cov(X,Y)/(\sigma_X\sigma_Y)\)
Note cov(X,X) = Var(X)
Proposition
cov(X,Y) = E(XY) - (EX)(EY)
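One way to see this is to expand the definition and use linearity of expectation:

$$cov(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]=E[XY]-\mu_X E[Y]-\mu_Y E[X]+\mu_X\mu_Y=E[XY]-(EX)(EY)$$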
Take the example of the sum (X) and the absolute value of the difference (Y) of two rolls of a die. What is the covariance of X and Y?
We have (table entries are counts out of the 36 equally likely outcomes):
x\y | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
2 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 2 | 0 | 0 | 0 | 0 |
4 | 1 | 0 | 2 | 0 | 0 | 0 |
5 | 0 | 2 | 0 | 2 | 0 | 0 |
6 | 1 | 0 | 2 | 0 | 2 | 0 |
7 | 0 | 2 | 0 | 2 | 0 | 2 |
8 | 1 | 0 | 2 | 0 | 2 | 0 |
9 | 0 | 2 | 0 | 2 | 0 | 0 |
10 | 1 | 0 | 2 | 0 | 0 | 0 |
11 | 0 | 2 | 0 | 0 | 0 | 0 |
12 | 1 | 0 | 0 | 0 | 0 | 0 |
so
μX = EX = 2*1/36 + 3*2/36 + … + 12*1/36 = 7
μY = EY = 0*6/36 + 1*10/36 + … + 5*2/36 = 70/36
EXY = 2*0*1/36 + 2*1*0/36 + … + 12*5*0/36 = 490/36
and so cov(X,Y) = EXY - EX*EY = 490/36 - 7*70/36 = 0
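Here is a short Python check (not in the original notes) that enumerates the 36 outcomes directly:

```python
from fractions import Fraction

# X = sum of the two dice, Y = absolute value of the difference
pairs = [(i + j, abs(i - j)) for i in range(1, 7) for j in range(1, 7)]

EX = sum(Fraction(x, 36) for x, y in pairs)
EY = sum(Fraction(y, 36) for x, y in pairs)
EXY = sum(Fraction(x * y, 36) for x, y in pairs)

print(EX, EY, EXY)    # 7, 35/18 (= 70/36), 245/18 (= 490/36)
print(EXY - EX * EY)  # 0, so cov(X,Y) = 0
```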
Note that we previously saw that X and Y are not independent, so here we have an example showing that a covariance of 0 does not imply independence! It does work the other way around, though:
Proposition
If X and Y are independent, then cov(X,Y) = 0
Say the rv (X,Y) has joint density f(x,y)=c if 0&lt;x&lt;y&lt;1, 0 otherwise. Find the correlation of X and Y.
We have previously done a more general problem (with \(0<x<y^p<1\)) and saw there that c = p+1 = 2 and \(f_Y(y)=2y\), 0&lt;y&lt;1.
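The needed moments can then be computed directly (a sketch of the calculation):

$$E[X]=\int_0^1\!\!\int_0^y 2x\,dx\,dy=\int_0^1 y^2\,dy=\frac13,\qquad E[Y]=\int_0^1 y\cdot 2y\,dy=\frac23$$

$$E[X^2]=\int_0^1\!\!\int_0^y 2x^2\,dx\,dy=\frac16,\qquad E[Y^2]=\int_0^1 2y^3\,dy=\frac12,\qquad E[XY]=\int_0^1\!\!\int_0^y 2xy\,dx\,dy=\frac14$$

so \(Var(X)=\frac16-\frac19=\frac1{18}\), \(Var(Y)=\frac12-\frac49=\frac1{18}\), \(cov(X,Y)=\frac14-\frac13\cdot\frac23=\frac1{36}\), and therefore \(cor(X,Y)=\frac{1/36}{\sqrt{1/18}\sqrt{1/18}}=\frac12\).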
Proposition
Var(X+Y) = VarX + VarY + 2Cov(X,Y)
and if X\(\perp\)Y then
Var(X+Y) = VarX + VarY
This formula is the basis of what are called variance-reduction methods. If we can find a rv Y which is negatively correlated with X then the variance of X+Y might be smaller than the variance of X alone.
The above formulas generalize easily to more than two random variables
Proposition
Let \(X_1, \dots, X_n\) be rv's; then
\(E\left[\sum_{i=1}^n X_i\right]=\sum_{i=1}^n E[X_i]\) and \(Var\left(\sum_{i=1}^n X_i\right)=\sum_{i=1}^n Var(X_i)+2\sum_{i<j} Cov(X_i,X_j)\)
At a party n people put their hats in the center of the room where the hats are mixed together. Each person then randomly selects a hat. We are interested in the mean and the variance of the number of people who get their own hat.
Let this number be X, and let’s write X = X1+..+Xn, where Xi is 1 if the ith person selects their own hat and 0 if they do not.
Now the ith person is equally likely to select any of the n hats, so P(Xi=1)=1/n, and so
EXi = 0×(n-1)/n +1×1/n =1/n
There is an even simpler way of doing this: Xi is an indicator rv, and so EXi = P(Xi=1) = 1/n
For the variance we have
EXi² = 0²×(n-1)/n + 1²×1/n = 1/n
and so
VarXi = EXi² - (EXi)² = 1/n - (1/n)² = (n-1)/n².
Also
EXiXj = P(Xi×Xj=1) =
P(Xi = 1, Xj=1) =
P(Xi = 1) × P(Xj=1|Xi=1) =
1/n × 1/(n-1)
again because XiXj is an indicator rv. So
Cov(Xi, Xj) =
EXiXj - EXi×EXj =
1/[n(n-1)] - (1/n)² =
1/[n²(n-1)]
Finally
EX = E(X1+..+Xn) = EX1+..+EXn = n×1/n =1
and
\(Var(X) = \sum_{i=1}^n Var(X_i) + 2\sum_{i<j} Cov(X_i,X_j) = n\cdot\frac{n-1}{n^2} + 2\binom{n}{2}\cdot\frac{1}{n^2(n-1)} = \frac{n-1}{n} + \frac{1}{n} = 1\)
It is interesting to see that E[X] = Var(X) =1, independent of n! Let’s make sure we got this right and check a few simple cases:
n=1: there is just one person and one hat, so P(X=1)=1, so E[X]=1, but Var(X) = E[(X-1)²] = 0, so actually something is wrong
What is it?
How about n=2? now there are two people and they either both get their hats or neither does (they get each others hats). So
P(X1=0,X2=0) = P(X1=1,X2=1) = 1/2
P(X1=0,X2=1) = P(X1=1,X2=0) = 0
so
E[X1+X2] = 2*P(X1=1,X2=1) = 2*1/2 = 1
E[(X1+X2)²] = 2²*P(X1=1,X2=1) = 4*1/2 = 2
Var(X1+X2) = E[(X1+X2)²] - (E[X1+X2])² = 2 - 1² = 1
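A quick Monte Carlo check of E[X] = Var(X) = 1 (a sketch, not part of the original notes; the function name `hat_matches` is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def hat_matches(n, reps=200_000):
    """Simulate reps rounds of n people picking hats at random and
    return the sample mean and variance of the number of matches."""
    counts = np.array([(rng.permutation(n) == np.arange(n)).sum()
                       for _ in range(reps)])
    return counts.mean(), counts.var()

for n in (2, 3, 10):
    print(n, hat_matches(n))   # both numbers should be close to 1
```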
Say X|Y=y is a conditional r.v. with density (pdf) fX|Y=y. As was stated earlier, conditional rv’s are also just rv’s, so they have expectations as well, given by \(E[X|Y=y]=\sum_x x\,f_{X|Y=y}(x|y)\) in the discrete case and \(E[X|Y=y]=\int x\,f_{X|Y=y}(x|y)\,dx\) in the continuous case.
We can think of π(y) = E[X|Y=y] as a function of y, that is if we know the joint density f(x,y) then for a fixed y we can compute π(y). But y is the realization of the random variable Y, so Z = π(Y) = E[X|Y] is a random variable as well.
Remember we do not have an object “X|Y”, only “X|Y=y”, but now we do have an object E[X|Y]
An urn contains 2 white and 3 black balls. A random sample of size 2 is chosen. Let X denote the number of white balls in the sample. An additional ball is then drawn from the remaining three. Let Y equal 1 if that ball is white and 0 otherwise.
For example f(0,0) = P(X=0,Y=0) = 3/5*2/4*1/3 = 1/10. The complete density is given by:
y\x | 0 | 1 | 2 |
---|---|---|---|
0 | 1/10 | 2/5 | 1/10 |
1 | 1/5 | 1/5 | 0 |
The marginals are given by
x | 0 | 1 | 2 |
---|---|---|---|
fX(x) | 3/10 | 3/5 | 1/10 |
and
y | 0 | 1 |
---|---|---|
fY(y) | 3/5 | 2/5 |
The conditional distribution of X|Y=0 is
x | 0 | 1 | 2 |
---|---|---|---|
fX|Y=0(x|0) | 1/6 | 2/3 | 1/6 |
and so E[X|Y=0] = 0*1/6+1*2/3+2*1/6 = 1.0
The conditional distribution of X|Y=1 is
x | 0 | 1 | 2 |
---|---|---|---|
fX|Y=1(x|1) | 1/2 | 1/2 | 0 |
and so E[X|Y=1] = 0*1/2+1*1/2+2*0 = 1/2
Finally the r.v. Z = E[X|Y] has density
z | 1/2 | 1 |
---|---|---|
fZ(z) | 2/5 | 3/5 |
with this we can find E[Z] = E[E[X|Y]] = 1*3/5+1/2*2/5 = 4/5
Theorem
E[X] = E[E[X|Y]]
and
Var(X) = E[Var(X|Y)] + Var[E(X|Y)]
We found EZ = E[E[X|Y]] = 4/5. Now E[X] = 0*3/10 + 1*3/5 + 2*1/10 = 4/5, so the two agree.
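Here is a small Python check of the urn example (not part of the original notes), enumerating the joint density from the table above and verifying E[X] = E[E[X|Y]]:

```python
from fractions import Fraction as F

# joint density f(x,y) of the urn example, taken from the table above
f = {(0, 0): F(1, 10), (1, 0): F(2, 5), (2, 0): F(1, 10),
     (0, 1): F(1, 5),  (1, 1): F(1, 5), (2, 1): F(0)}

EX = sum(x * p for (x, y), p in f.items())                        # E[X] directly
fY = {y: sum(p for (x, yy), p in f.items() if yy == y) for y in (0, 1)}
EXgivenY = {y: sum(x * p for (x, yy), p in f.items() if yy == y) / fY[y]
            for y in (0, 1)}                                      # E[X|Y=y]
EEXY = sum(EXgivenY[y] * fY[y] for y in (0, 1))                   # E[E[X|Y]]

print(EX, EXgivenY, EEXY)   # 4/5, {0: 1, 1: 1/2}, 4/5
```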
Let’s go back to the example above, where we had a continuous rv with joint pdf f(x,y) = 6x, 0≤x≤y≤1, 0 otherwise
Now \(f_Y(y)=\int_0^y 6x\,dx=3y^2\) for 0≤y≤1, so \(f_{X|Y=y}(x|y)=\frac{6x}{3y^2}=\frac{2x}{y^2}\) for 0≤x≤y, and
$$E[X|Y=y]=\int_0^y x\,\frac{2x}{y^2}\,dx=\frac{2y}{3}$$
and therefore
$$E[E[X|Y]]=E\left[\tfrac{2}{3}Y\right]=\tfrac{2}{3}\int_0^1 y\cdot 3y^2\,dy=\tfrac{2}{3}\cdot\tfrac{3}{4}=\tfrac{1}{2}$$
which agrees with the direct calculation \(E[X]=\int_0^1\int_0^y x\cdot 6x\,dx\,dy=\tfrac12\).
Let’s have another look at the hat matching problem. Suppose that those choosing their own hats leave, while the others put their hats back into the center and do the exercise again. This process continues until everybody has his or her own hat. Find E[Rn], where Rn is the number of rounds needed.
Given that in each round on average one person gets their own hat and then leaves, we might suspect that E[Rn]=n. Let’s prove this by induction on n.
Let n=1, that is, there is just one person, and clearly they pick up their own hat, so E[R1]=1.
Assume E[Rk]=k \(\forall\) k<n. Let M be the number of matches that occur in the first round. Clearly M\(\in\) {0,1, .. ,n}
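Here is a sketch of the induction step, conditioning on M and using E[M]=1 from above. For m ≥ 1 the remaining n-m people simply start over, so by the induction hypothesis \(E[R_n\mid M=m]=1+E[R_{n-m}]=1+n-m\), while \(E[R_n\mid M=0]=1+E[R_n]\). Then

$$\begin{aligned}
E[R_n] &= \sum_{m=0}^{n} E[R_n\mid M=m]\,P(M=m)\\
&= (1+E[R_n])\,P(M=0)+\sum_{m\ge 1}(1+n-m)\,P(M=m)\\
&= 1+E[R_n]P(M=0)+n\,(1-P(M=0))-E[M]\\
&= E[R_n]P(M=0)+n\,(1-P(M=0))
\end{aligned}$$

Since P(M=0) &lt; 1 we can cancel the factor 1-P(M=0) and get \(E[R_n]=n\),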
and we are done.
Note that we solved this problem without ever finding P(M=0), which is also a non-trivial problem. Here is how it is done:
Of course the probability depends on n, so let’s use the notation
pn =P(M=0 | there are n people)
We will condition on whether or not the first person selects their own hat. Call the event “first person selects their own hat” E. Then
\(p_n = P(M=0) = P(M=0|E)P(E) + P(M=0|E^c)P(E^c)\)
Now
P(M=0|E)=0
because E means at least one person got their own hat, and so
\(p_n = P(M=0|E^c)\cdot\frac{n-1}{n}\)
Ec means the first person selects a hat that does not belong to them. So now there are n-1 hats in the center, one of which belongs to the first person. There are still n-1 people to pick hats, one of which has no hat in the center because the first person took it.
So P(M=0|Ec) is the probability of no matches when n-1 people select from a set of n-1 hats that does not contain the hat of one of them. This can happen in either of two mutually exclusive ways:
• there are no matches and the extra person does not select the extra hat. This has probability \(p_{n-1}\). (Think of the extra hat as belonging to the extra person.)
• there are no matches and the extra person does select the extra hat. This has probability \(\frac{1}{n-1}p_{n-2}\), because the extra person needs to choose the extra hat (probability \(\frac{1}{n-1}\)), and then there are n-2 people and their n-2 hats left.
So now we have
\(P(M=0|E^c) = p_{n-1} + \frac{1}{n-1}p_{n-2}\) and
\(p_n = \frac{n-1}{n}p_{n-1} + \frac{1}{n}p_{n-2}\)
This is called a recursive relationship. We can solve it via induction as follows:
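A sketch of the argument: rewrite the recursion as a telescoping difference and induct on n. From the recursion,

$$p_n-p_{n-1}=-\frac{1}{n}\,(p_{n-1}-p_{n-2})$$

With \(p_1=0\) and \(p_2=1/2\) this gives \(p_n-p_{n-1}=\frac{(-1)^n}{n!}\), and summing up

$$p_n=\sum_{k=2}^{n}\frac{(-1)^k}{k!}=\sum_{k=0}^{n}\frac{(-1)^k}{k!}\;\longrightarrow\;e^{-1}\approx 0.368$$

so the probability of at least one match is \(1-p_n\approx 1-e^{-1}\approx 0.632\) for any n that is not very small,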
so no matter how many people are present, there is always about a 63% chance that somebody gets their own hat.
Definition

The moment generating function (mgf) of a rv X is defined by \(\psi(t) = E[e^{tX}]\).
Let X be 1 with probability p and 0 with probability q=1-p. Find ψ(t)
\(\psi(t) = E[e^{tX}] = e^{t\cdot 0}q + e^{t\cdot 1}p = q + pe^t\)
X is called a Bernoulli rv. with success parameter p, denoted by X~Ber(p)
Say X~Exp(λ). Find ψ(t)
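A sketch of the computation, and of how the moments appear by differentiating at t = 0:

$$\psi(t)=\int_0^\infty e^{tx}\,\lambda e^{-\lambda x}\,dx=\frac{\lambda}{\lambda-t},\qquad t<\lambda$$

$$\psi'(0)=\frac{\lambda}{(\lambda-t)^2}\Big|_{t=0}=\frac{1}{\lambda}=E[X],\qquad \psi''(0)=\frac{2\lambda}{(\lambda-t)^3}\Big|_{t=0}=\frac{2}{\lambda^2}=E[X^2]$$

and in general \(\psi^{(k)}(0)=E[X^k]\),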
which explains the name moment generating function, although actually finding moments in this way is like killing a fly with a cannon! The main usefulness of moment generating functions is as a tool for proving theorems, and usually uses the following
Proposition

If \(X_1, \dots, X_n\) are independent, then \(\psi_{X_1+\dots+X_n}(t) = \prod_{i=1}^n \psi_{X_i}(t)\).
If X1, .., Xn are independent random variables with the same pdf (density) f we say X1, .., Xn are iid.
Say X1, .., Xn are iid Ber(p). Let X=∑Xi, then \(\psi_X(t) = \prod_{i=1}^n (q+pe^t) = (q+pe^t)^n\)
Theorem
If ψ(t) exists in an open neighborhood of 0, then it determines f.
The proof is much too complicated for us.
A discrete rv X is said to have a Poisson distribution with rate λ if it has density \(f(x) = \frac{\lambda^x}{x!}e^{-\lambda}\), x = 0, 1, 2, … Now
$$\psi(t) = \sum_{x=0}^{\infty} e^{tx}\,\frac{\lambda^x}{x!}e^{-\lambda} = e^{-\lambda}\sum_{x=0}^{\infty}\frac{(\lambda e^t)^x}{x!} = e^{-\lambda}e^{\lambda e^t} = e^{\lambda(e^t-1)}$$
Say X1, .., Xn are iid Poisson with rate λ, then
$$\psi_{\sum X_i}(t) = \prod_{i=1}^n e^{\lambda(e^t-1)} = e^{n\lambda(e^t-1)}$$
but this is again the moment generating function of a Poisson rv, this one with rate nλ. So by the uniqueness theorem we have shown that ∑Xi has a Poisson distribution with rate nλ.
Proposition
Let X be a rv with mgf ψ(t). Then Y = aX+b has mgf \(\psi_Y(t) = e^{bt}\psi(at)\)
Proof

\(\psi_Y(t) = E[e^{t(aX+b)}] = e^{bt}E[e^{(at)X}] = e^{bt}\psi(at)\)