A random variable (r.v.) X is a real-valued function from the sample space S into \(\mathbb{R}\). For any set of real numbers \(A \subseteq \mathbb{R}\) we define the probability
\[P(X \in A) = P(X^{-1}(A))\]
where \(X^{-1}(A) = \{\omega \in S: X(\omega) \in A\}\) is the set of all points in S that X maps into A.
Say we flip a fair coin three times. Let X be the number of “heads” in these three flips.
Now \(S=\{(H,H,H), (H,H,T), \dots, (T,T,T)\}\).
X maps S into \(\mathbb{R}\), for example \(X((H,H,H))=3\) and \(X((H,H,T))=2\).
What is P(X=2)?
\(P(X=2)\) =
\(P(X^{-1}(\{2\}))\) =
P( all the outcomes in S that are mapped onto 2 ) =
\(P(\{(H,H,T), (H,T,H), (T,H,H)\}) = 3/8\)
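The calculation above can be checked by brute force. The following is a minimal sketch in R that lists all outcomes of three flips, applies X to each and counts; the names `S`, `X` and `p2` are chosen here for illustration only.

```r
# All 2^3 equally likely outcomes of three flips of a fair coin
S <- expand.grid(flip1 = c("H", "T"), flip2 = c("H", "T"),
                 flip3 = c("H", "T"), stringsAsFactors = FALSE)
# X maps each outcome to its number of heads
X <- rowSums(S == "H")
# P(X = 2) is the fraction of outcomes that X maps onto 2
p2 <- mean(X == 2)
p2
```

Because all outcomes are equally likely, the probability of the preimage is just its share of the sample space.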
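There are two basic types of r.v.’s: discrete r.v.’s, which take at most countably many values (like the number of heads above), and continuous r.v.’s, which take uncountably many values.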
Consider the following experiment: we randomly select a point in the interval [A,B] for some A<B, and let X be the point chosen. All points in [A,B] are allowed, so X takes uncountably many values and is therefore a continuous random variable. By “randomly” we mean that the probability that the chosen point lies in some interval depends only on the length of that interval. Clearly
\(1 = P(A<X<B)\)
Let \(A<a<b<B\). Now the interval (a,b) has length b-a, the interval (A,B) has length B-A, and we have
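\[b-a = \frac{b-a}{B-A}(B-A) = c(B-A)\]
where \(c=(b-a)/(B-A)\). Because the probability of an interval depends only on its length, it is proportional to the length, and therefore
\[P(a<X<b) = cP(A<X<B) = c\cdot 1 = (b-a)/(B-A)\]
A random variable with these probabilities is said to have a uniform distribution on [A,B], one of the standard distributions. We often use the following notation: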
\[X \sim U[A,B]\]
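As a quick sanity check of the formula \(P(a<X<b)=(b-a)/(B-A)\), here is a small simulation sketch in R. The values of A, B, a and b, the seed and the sample size are arbitrary choices made for this illustration.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
A <- 2; B <- 5                   # the interval we sample from
a <- 2.5; b <- 4                 # the sub-interval of interest
x <- runif(1e5, A, B)            # 100000 draws from U[A,B]
p_sim <- mean(a < x & x < b)     # fraction of draws that land in (a,b)
p_exact <- (b - a) / (B - A)     # ratio of the lengths
c(simulated = p_sim, exact = p_exact)
```

The two numbers should agree up to simulation error, which for this sample size is of the order of 0.002.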
There are some technical difficulties when defining probabilities on a sample space like \(\mathbb{R}\): it turns out to be impossible to define them for every subset of \(\mathbb{R}\) without getting logical contradictions. That sets which cannot be assigned a probability actually exist may be a bit of a surprise. For more on this topic see https://en.wikipedia.org/wiki/Non-measurable_set.
The solution is to define a \(\sigma\)-algebra on the sample space and then define probabilities only for the sets in that \(\sigma\)-algebra.
The most commonly used \(\sigma\)-algebra is the Borel \(\sigma\)-algebra, which is the smallest \(\sigma\)-algebra that contains all intervals of the type \((a, b), (a, b], [a, b)\) and \([a, b]\), where a and b can be \(\pm \infty\). It is closed under complements, countable unions and countable intersections.
All of this belongs to the branch of mathematics called measure theory. In what follows we will (mostly) ignore these technical difficulties.
In the example of the uniform random variable above we defined probabilities only for intervals. It turns out that this is all that is needed: the probabilities of the intervals determine the probabilities of all sets in the Borel \(\sigma\)-algebra, which is the \(\sigma\)-algebra generated by the intervals. There are however also sets on the real line that are not in the Borel \(\sigma\)-algebra. For those sets it may not be possible to define a probability.
Almost everything to do with r.v.’s has to be done twice, once for discrete and once for continuous r.v.’s. This separation is only artificial; it goes away once a more general definition of “integral” is used (Riemann-Stieltjes or Lebesgue).
The (cumulative) distribution function (cdf) of a r.v. X is defined by
\[F(x)=P(X \le x) \text{ for all } x \in \mathbb{R}\]
\(X \sim U[A,B]\)
\(x<A\): \(F(x) = P(X\le x) = 0\)
\(A<x<B: F(x) = P(X\le x) = P(A\le X\le x) = (x-A)/(B-A)\)
\(x>B: F(x) = P(X\le x) = P(A\le X\le B) = 1\)
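The three cases can be written as one vectorized R function. This is a sketch only; `F_unif` is a name made up here, and `punif` is the uniform cdf built into R, used as a reference.

```r
# cdf of U[A,B], following the three cases above
F_unif <- function(x, A, B) {
  ifelse(x < A, 0, ifelse(x > B, 1, (x - A) / (B - A)))
}
x <- c(-1, 0, 0.5, 2, 3, 4)
# compare with R's built-in uniform cdf on [0,3]
same <- all.equal(F_unif(x, 0, 3), punif(x, 0, 3))
same
```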
Let F be the cdf of some random variable X. Then
cdf’s are standard functions on \(\mathbb{R}\)
\(0 \le F(x) \le 1\)
cdf’s are non-decreasing
cdf’s are right-continuous
\[\lim_{x \rightarrow - \infty} F(x) = 0\]
\[\lim_{x \rightarrow \infty} F(x) = 1\]
proof
The defining feature of a function (as opposed to a relation) is that any value x is mapped to no more than one y. In other words f(x) is unique. Now probabilities are unique so \(F(x) = P(X \le x) = P(\{\omega \in S:X(\omega) \le x\})\) is unique.
\(0 \le P(X \le x) = P(\{\omega \in S:X(\omega) \le x\}) \le 1\) from axiom 1
If \(x \le y\) then \[\{\omega \in S:X(\omega) \le x\} \subseteq \{\omega \in S:X(\omega ) \le y\}\]
and so
\[F(x) = P(X \le x) \le P(X \le y) = F(y)\]
Note that for any real number x and any h>0 the intervals \((-\infty, x]\) and \((x,x+h]\) are disjoint, so the sets \(\{\omega \in S: X(\omega)\le x\}\) and \(\{\omega \in S: x< X(\omega)\le x+h\}\) are disjoint. Now
\[ \begin{aligned} &\lim_{h\downarrow 0} F(x+h) = \\ &\lim_{h\downarrow 0} P(X\le x+h) = \\ &\lim_{h\downarrow 0} P\left(\left\{\omega: X(\omega)\le x+h\right\}\right) = \\ &\lim_{h\downarrow 0} \left[P\left(\left\{\omega: X(\omega)\le x\right\}\right)+P\left(\left\{\omega: x<X(\omega)\le x+h\right\}\right) \right]= \\ &P\left(\left\{\omega: X(\omega)\le x\right\}\right)+\lim_{h\downarrow 0} P\left(\left\{\omega: x<X(\omega)\le x+h\right\}\right)= \\ &F(x)+\lim_{h\downarrow 0} P\left(\left\{\omega: x<X(\omega)\le x+h\right\}\right) \end{aligned} \] Define the set \(A_{x,h}=\left\{\omega: x<X(\omega)\le x+h\right\}\). Note that if \(h<g\) then \(A_{x,h}\subseteq A_{x,g}\), and so the sets decrease as \(h\downarrow 0\). Therefore by (1.2.8)
\[\lim_{h\downarrow 0} A_{x,h}=\bigcap_h A_{x,h} = \emptyset\]
The intersection is the empty set because otherwise there would exist an \(\omega\) with \(x<X(\omega) \le x+h\) for all \(h>0\), a contradiction. So \(\lim_{h\downarrow 0} P(A_{x,h}) = P(\emptyset) = 0\), and therefore \(\lim_{h\downarrow 0} F(x+h) = F(x)\).
Let X be a continuous random variable, then for any \(-\infty<a<b<\infty\) we have
\[ \begin{aligned} &P(a\le X \le b) = \\ &P(a\le X < b) = \\ &P(a< X \le b) = \\ &P(a< X < b) \end{aligned} \]
proof
consider the sets \(A_n=\{\omega: X(\omega)\le b-\frac1{n}\}\). Now \(A_n\) is an increasing sequence and \(\lim_{n \rightarrow \infty} A_n = \bigcup_n A_n = \{\omega: X(\omega)< b\}\). Because X is continuous its cdf F is a continuous function (it is the integral of a density). Therefore
\[ \begin{aligned} &P(X< b) = \\ &P(\lim_{n \rightarrow \infty} A_n) = \\ &\lim_{n \rightarrow \infty} P(A_n) = \\ &\lim_{n \rightarrow \infty} P(X\le b-\frac1{n}) = \\ &\lim_{n \rightarrow \infty} F(b-\frac1{n}) = \\ &F(b) = P(X\le b) \end{aligned} \]
and so \(P(X<b)= P(X\le b)\).
The other equations follow similarly.
As a consequence we don’t have to worry about the equal sign when we deal with continuous random variables.
Another consequence is this: let X be a continuous r.v., then \(P(X=x)=0\) for all x. This is because
\[P(X=x)=P(X\le x)-P(X<x)=0\]
Let F be a function that is non-decreasing and right-continuous with \(\lim_{x \rightarrow -\infty} F(x)=0\) and \(\lim_{x \rightarrow \infty} F(x)=1\). Then there exists a random variable that has F as its cdf.
The proof is too deep for us. In essence one needs to define some sample space and probabilities that eventually have the desired distribution function. Kolmogorov showed in his original monograph how to do this using measure theory.
Let F be the cdf of a rv X. Then F has at most countably many points of discontinuity.
proof
F is non-decreasing, so any point of discontinuity is a jump point up. Let \(A_n\) be the set of all points where F jumps up by more than 1/n. Then \(|A_n|<n\) because \(0\le F\le 1\). Let A be the set of all jump points of F, then
\[A= \bigcup A_n\]
Now it is known that the countable union of countable sets is again countable, and therefore A is countable.
Note another consequence of this proof: for any \(\epsilon >0\) there are at most finitely many points where F jumps up by more than \(\epsilon\).
We roll a fair die until the first “6”. Let the rv X be the number of rolls. Find the cdf F of X.
Solution: note \(X \in \{1,2,3,...\}\)
Let \(A_i\) be the event “a six on the ith roll”, i=1,2,3, …. The rolls are independent, so
\[ \begin{aligned} &P(X=k) = P(A_1^c \cap A_2^c \cap ..\cap A_{k-1}^c \cap A_k) =\\ &P(A_1^c)P(A_2^c ) ..P(A_{k-1}^c )P( A_k) = (\frac56)^{k-1}\frac16 \\ &P(X \le k) = \sum_{i=1}^k P(X=i) = \\ &\sum_{i=1}^k (\frac56)^{i-1}\frac16 = \\ &\sum_{j=0}^{k-1} (\frac56)^j\frac16 = \\ &\frac16 \frac{1-(5/6)^{(k-1)+1}}{1-5/6} = 1-(5/6)^k \\ \end{aligned} \]
so for \(k \le x<k+1\), \(k=1,2,..\), we have \(F(x)=1-(5/6)^k\), and \(F(x)=0\) for \(x<1\).
Notice that the cdf is a step function. This is always the case for a discrete random variable.
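The closed form \(1-(5/6)^k\) can be checked numerically. The sketch below compares it with the running sum of the probabilities \(P(X=i)\) and with R's built-in `pgeom`. Note that `pgeom` counts the failures before the first success, so \(P(X\le k)\) corresponds to `pgeom(k - 1, 1/6)`; the variable names are chosen for this illustration.

```r
k <- 1:10
pmf <- (5/6)^(k - 1) * (1/6)        # P(X = k)
F_formula <- 1 - (5/6)^k            # closed form for P(X <= k)
# summing the probabilities up to k gives the same numbers
check_sum <- all.equal(cumsum(pmf), F_formula)
# R's geometric cdf, shifted by one because it counts failures
check_pgeom <- all.equal(pgeom(k - 1, 1/6), F_formula)
c(check_sum, check_pgeom)
```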
The probability density function (pdf) of a discrete r.v. X is defined by
\[f(x) = P(X=x)\]
The pdf of X in the example above is given by \(f(x) = \frac16 (\frac56)^{x-1}\) if \(x \in \{1,2,..\}\), 0 otherwise.
Note that it follows from the definition and the axioms that for any pdf f we have
\[ \begin{aligned} &f(x) \ge 0 \\ &\sum_x f(x) = 1 \end{aligned} \]
Say \(f(x)=c/x^2\), x=1,2,3,… is a pdf. Find c.
\(1=\sum_x f(x) = \sum_{i=1}^\infty c/i^2 = c\pi^2/6\)
so \(c=6/\pi^2\).
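A numerical check of this constant, as a rough sketch: we replace the infinite sum by a partial sum with N terms. The cutoff N is an arbitrary choice, and the neglected tail is roughly 1/N.

```r
N <- 1e6                          # number of terms kept (arbitrary cutoff)
partial <- sum(1 / (1:N)^2)       # partial sum of 1/i^2
c_num <- 1 / partial              # approximate normalizing constant
c(numeric = c_num, exact = 6 / pi^2)
```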
Say we have a coin that comes up heads with probability p. Let X be the number of heads in n flips of the coin. Find the pdf of X.
Let’s start with a couple of small n’s:
n=1:
S={H,T}
\(P(X=0)=P(T)=1-p\), \(P(X=1)=P(H)=p\)
n=2:
S={(H,H),(H,T),(T,H),(T,T)}
\(P(X=0) = P((T,T))=P(T)P(T) = (1-p)^2\)
\(P(X=1) = P(\{(H,T),(T,H)\}) = P(H)P(T)+P(T)P(H) = 2p(1-p)\)
\(P(X=2)=P((H,H))=P(H)P(H)=p^2\)
n=3:
S={(H,H,H),(H,H,T),(H,T,H),(H,T,T),(T,H,H),(T,H,T),(T,T,H),(T,T,T)}
\(P(X=0) = P((T,T,T))=P(T)P(T)P(T)=(1-p)^3\)
\(P(X=1) = P(\{(H,T,T),(T,H,T),(T,T,H)\}) = 3P(H)P(T)P(T) = 3p(1-p)^2\)
\(P(X=2) = P(\{(H,H,T),(H,T,H),(T,H,H)\}) = 3P(H)P(H)P(T) = 3p^2(1-p)\)
\(P(X=3)=P((H,H,H))=P(H)P(H)P(H)=p^3\)
Apparently for general n we have something like
\(P(X=k)=c_{n,k} p^k(1-p)^{n-k}\)
What is \(c_{n,k}\)? From the first few cases one might guess
\(c_{n,k}={{n}\choose{k}}\)
Let’s verify this using an induction proof. We already did the base case. Assume the statement is true for all \(j<n\). To show that it then is also true for n we condition on whether the last flip was a heads or a tails. Let \(X_n\) be the number of heads if we flip the coin n times. Now using the law of total probability on the outcome of the last flip we find
\[ \begin{aligned} &P(X_n=k) = \\ &P \left(X_{n-1}=k-1\vert H \right)P(H)+P \left(X_{n-1}=k\vert T \right)P(T) = \\ &{{n-1}\choose {k-1}}p^{k-1}(1-p)^{n-k}\times p+{{n-1}\choose {k}}p^{k}(1-p)^{n-1-k}\times (1-p)=\\ &\left[{{n-1}\choose {k-1}}+{{n-1}\choose {k}}\right]p^{k}(1-p)^{n-k}=\\ &\left[ \frac{(n-1)!}{(n-k)!(k-1)!}+\frac{(n-1)!}{(n-1-k)!k!}\right]p^{k}(1-p)^{n-k}=\\ &\left[ \frac{k}{n}\frac{n!}{(n-k)!k!}+ \frac{n-k}{n}\frac{n!}{(n-k)!k!}\right]p^{k}(1-p)^{n-k}=\\ &\left[ \frac{k}{n}+ \frac{n-k}{n}\right]\frac{n!}{(n-k)!k!}p^{k}(1-p)^{n-k}=\\ &\frac{n!}{(n-k)!k!}p^{k}(1-p)^{n-k} \end{aligned} \]
and we have verified our guess. By the way, this again is a standard probability distribution called the binomial distribution.
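The induction step is also a recipe for computing the probabilities. The sketch below starts from zero flips and applies the law of total probability once per flip, then compares the result with the formula and with R's built-in `dbinom`. The values of p and `n_max` are arbitrary, and the names are chosen for this illustration.

```r
p <- 0.3                 # probability of heads (arbitrary)
n_max <- 6               # number of flips (arbitrary)
f <- 1                   # pdf of X_0: with no flips P(X_0 = 0) = 1
for (n in 1:n_max) {
  # condition on the last flip:
  # heads moves k-1 to k, tails leaves k unchanged
  f <- p * c(0, f) + (1 - p) * c(f, 0)
}
k <- 0:n_max
formula <- choose(n_max, k) * p^k * (1 - p)^(n_max - k)
check_formula <- all.equal(f, formula)
check_dbinom <- all.equal(f, dbinom(k, n_max, p))
c(check_formula, check_dbinom)
```

After the loop the vector `f` holds \(P(X_n=k)\) for k=0,..,n, so its entries also sum to 1.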
f is the density of the continuous random variable X iff
\[F(x) = \int_{- \infty}^x f(t) dt\]
or (if the cdf is differentiable at x) \(f(x)=F'(x)\).
Again it follows from the definition and the axioms that for any pdf f we have
\[ \begin{aligned} &f(x) \ge 0 \\ &\int_{-\infty}^\infty f(x) dx = 1 \end{aligned} \]
\(X \sim U[A,B]\)
\(x<A\) or \(x>B\): f(x)=F’(x) = 0
\(A<x<B\): f(x)=F’(x) = d/dx[ (x-A)/(B-A)] = 1/(B-A)
at x=A and x=B f is not defined because F is not differentiable
Show that \(f(x)= \lambda \exp(-\lambda x)\) if \(x>0\), 0 otherwise defines a pdf, where \(\lambda>0\).
clearly \(f(x) \ge 0\) for all x. Now
\[ \begin{aligned} &\int_{-\infty}^\infty f(x) dx = \\ &\int_{0}^\infty \lambda \exp(-\lambda x) dx = \\ &- \exp(-\lambda x) |_0^\infty = 0-(-1) = 1\\ \end{aligned} \]
This r.v. X is called an exponential r.v. with rate \(\lambda\). We often write \(X \sim Exp(\lambda)\).
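The same check can be done numerically with `integrate`. This is a sketch for one arbitrarily chosen rate; `dexp` is the exponential density built into R and is used only as a reference.

```r
lambda <- 2                                      # arbitrary rate
f_exp <- function(x) ifelse(x > 0, lambda * exp(-lambda * x), 0)
total <- integrate(f_exp, 0, Inf)$value          # should be close to 1
x <- c(0.1, 0.5, 1, 3)
same_as_dexp <- all.equal(f_exp(x), dexp(x, rate = lambda))
c(total = total, same_as_dexp = same_as_dexp)
```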
Say \(f(x)=c/x^2\), \(x>1\) is a pdf. Find c.
\[1=\int_{-\infty}^\infty f(x)dx=\int_1^\infty c/x^2 dx = -c/x|_1^\infty=c\]
Say \(f(x)=cx\sin(\pi x)\), \(0 \le x \le 1\), 0 otherwise, is a pdf. Find c.
\[ \begin{aligned} &1=\int_{-\infty}^\infty f(x)dx = \\ &\int_0^1 cx\sin(\pi x)dx = \\ &c \left( x\left(-\frac{1}{\pi}\cos(\pi x)\right)\Big|_0^1 - \int_0^1 \left(-\frac{1}{\pi}\cos(\pi x)\right) dx\right) = \\ &c \left( \frac{1}{\pi}+ \frac{1}{\pi^2}\sin(\pi x)\Big|_0^1\right) = \frac{c}{\pi} \end{aligned} \] so \(c=\pi\).
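Both constants in the last two examples can be double-checked with numerical integration. In this sketch `I1` and `I2` are names made up here for the integrals of \(1/x^2\) on \((1,\infty)\) and of \(x\sin(\pi x)\) on [0,1]; the constant is one over the integral.

```r
I1 <- integrate(function(x) 1 / x^2, 1, Inf)$value
I2 <- integrate(function(x) x * sin(pi * x), 0, 1)$value
c(c_first = 1 / I1, c_second = 1 / I2, pi = pi)
```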
Say \(f(x)=c \exp(-x^2)\), \(x>0\) is a pdf. Find c.
Unfortunately f does not have an anti-derivative that can be written in closed form, so this is a tricky problem. Using numerical integration one can show
f <- function(x) exp(-x^2)
cc=integrate(f, 0, Inf)$value
1/cc
## [1] 1.128379
Later we will also see how this can be done analytically.
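For comparison, the well-known value of the Gaussian integral is \(\int_0^\infty \exp(-x^2)dx=\sqrt{\pi}/2\), which would give \(c=2/\sqrt{\pi}\). We use this fact here without deriving it; the sketch below only checks that it agrees with the numerical answer.

```r
c_numeric <- 1 / integrate(function(x) exp(-x^2), 0, Inf)$value
c_closed <- 2 / sqrt(pi)       # from the known value sqrt(pi)/2, not derived here
c(numeric = c_numeric, closed_form = c_closed)
```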
Linear functions, and slightly more generally polynomials, play a large role in most of mathematics. In probability theory however they are tricky. Say we want to model a density using a linear function \(f(x)=ax+b\), \(0<x<1\). For this to be a density we need
\(ax+b>0\) for all \(0<x<1\), so if \(a>0\), the function is increasing and its minimum is at 0, so we need \(b>0\). If \(a<0\), the function is decreasing and its minimum is at 1, so we need \(a+b>0\).
\(\int_0^1 (ax+b) dx =ax^2/2+bx|_0^1 = a/2+b=1\), or \(b=1-a/2\)
so if \(a>0\) we need \(b=1-a/2>0\) or \(a<2\). If \(a<0\) we need \(a+b=a+1-a/2=1+a/2>0\), or \(a>-2\).
so we find that (on [0,1]) \(f(x)=ax+1-a/2\) is a density if \(-2<a<2\).
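Here is a small numerical illustration of these two conditions, using \(b=1-a/2\) from the integral condition above. The function name `lin_density` and the grid of slopes are choices made for this sketch.

```r
# linear function on [0,1] with slope a and intercept b = 1 - a/2
lin_density <- function(x, a) a * x + 1 - a / 2
a_vals <- c(-1.9, -1, 0, 1, 1.9)          # slopes inside (-2, 2)
# area under the line on [0,1], should always be 1
areas <- sapply(a_vals, function(a)
  integrate(function(x) lin_density(x, a), 0, 1)$value)
# a line takes its minimum at an endpoint
mins <- sapply(a_vals, function(a) min(lin_density(c(0, 1), a)))
rbind(a = a_vals, area = areas, minimum = mins)
# outside the range the function dips below 0, e.g. a = 2.5 at x = 0
lin_density(0, 2.5)
```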
Although we usually deal with random variables that are either discrete or continuous, in real life they can be mixed:
Consider the following experiment: we flip a fair coin. If the coin comes up heads \(X\) takes the value \(1/2\), otherwise \(X\) takes the value \(U\), where \(U \sim U[0,1]\). Find the cdf of X.
Note that \(F_U(x)=x\).
First we have \[ \begin{aligned} &F(x) =0\text{ if }x<0 \\ &F(x) =1\text{ if }x\ge 1 \end{aligned} \] Now say \(0\le x<1/2\), then
\[ \begin{aligned} &F(x) =P(X\le x) = \\ &P(X\le x|H)P(H)+P(X\le x|T)P(T) = \\ &0\times \frac12+P(U\le x)\frac12 = \\ &F_U(x)\frac12 = x/2 \end{aligned} \] Finally say \(1/2\le x<1\), then
\[ \begin{aligned} &F(x) =P(X\le x) = \\ &P(X\le x|H)P(H)+P(X\le x|T)P(T) = \\ &1\times \frac12+P(U\le x)\frac12 = \\ &(1+F_U(x))\frac12 = (1+x)/2 \end{aligned} \]
Here is what this looks like:
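cdf <- function(x) {
  # cdf of the mixed r.v. X, vectorized over x
  y <- 0 * x                   # F(x) = 0 for x < 0
  y[x >= 1] <- 1               # F(x) = 1 for x >= 1
  lower <- 0 <= x & x < 1/2    # below the jump at 1/2
  upper <- 1/2 <= x & x < 1    # at and above the jump
  y[lower] <- x[lower] / 2
  y[upper] <- (1 + x[upper]) / 2
  y
}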
x=seq(-0.2, 1.2, 0.01)
plot(x, cdf(x), type="l")
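We can also simulate the experiment and compare the empirical cdf with the formulas derived above. This is a sketch only; the seed, the sample size, the evaluation points and the name `F_mixed` are choices made here.

```r
set.seed(2)                               # arbitrary seed
n <- 1e5
heads <- runif(n) < 0.5                   # fair coin flip
x_sim <- ifelse(heads, 0.5, runif(n))     # 1/2 on heads, U[0,1] on tails
# the cdf derived above, written as one expression
F_mixed <- function(x) {
  ifelse(x < 0, 0, ifelse(x >= 1, 1, ifelse(x < 0.5, x / 2, (1 + x) / 2)))
}
pts <- c(0.25, 0.5, 0.75)
empirical <- sapply(pts, function(t) mean(x_sim <= t))
rbind(empirical = empirical, exact = F_mixed(pts))
```

The jump of size 1/2 at \(x=1/2\) shows up as the difference between the first two columns.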
In probability theory we are often interested in collections (for example sequences) of random variables:
Let \(X_0 =0\), and for \(n \ge 1\) let the \(X_n\) be independent with \(P(X_n =1)=p\), \(P(X_n =-1)=q=1-p\). Let \(S_n = \sum_{k=0}^n X_k\) and define the event \(A_n = \{S_n =0\}\).
We want to find \(P(\{A_n \text{ i.o. }\})\). By Kolmogorov’s 0-1 law we know that it is either 0 or 1. But which is it?
Let \(P^{(n)} = P(A_n)\), then first of all \(P^{(2n+1)}=0\) for all n because we can never return to 0 in an odd number of jumps. Now
\[ \begin{aligned} &P^{(2n)} = P(\text{n jumps to the left and n jumps to the right}) = \\ &{{2n}\choose n}p^n(1-p)^n \end{aligned} \] because each jump to the right happens with probability p and each jump to the left with probability 1-p, and there are \({{2n}\choose n}\) “paths”.
Recall Stirling’s formula:
\[n!\approx n^{n+0.5}e^{-n}\sqrt{2\pi}\] so
\[ \begin{aligned} &{{2n}\choose n}p^n(1-p)^n = \\ & \frac{(2n)!}{(n!)^2}p^n(1-p)^n \approx \\ & \frac{(2n)^{2n+0.5}e^{-2n}\sqrt{2\pi}}{(n^{n+0.5}e^{-n}\sqrt{2\pi})^2}p^n(1-p)^n = \\ & \frac{(2n)^{2n+0.5}}{n^{2n+1}\sqrt{2\pi}}p^n(1-p)^n = \\ & \frac{2^{2n}}{\sqrt{\pi n}}p^n(1-p)^n = \frac{[4p(1-p)]^n}{\sqrt{\pi n}} \end{aligned} \]
Now \(4p(1-p)\le 1\) and \(4p(1-p)= 1\) iff \(p=1/2\). Therefore
\[\sum_{n=1}^\infty P^{(n)} \left\{ \begin{array}{ll} =\infty&\text{ if } p=1/2\\<\infty&\text{ if } p\ne 1/2\end{array} \right.\] and by the Borel-Cantelli lemmas (1.2.11) and (1.2.15) we have \(P(\{A_n \text{ i.o. }\})=1\) if p=1/2 and 0 otherwise.
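To get a feeling for these two cases, the sketch below compares the exact return probabilities with the Stirling approximation and looks at partial sums. It uses the fact that \({{2n}\choose n}p^n(1-p)^n\) is the binomial probability `dbinom(n, 2 * n, p)`; the function names, the cutoffs and the value p=0.6 are arbitrary choices for this illustration.

```r
# exact P^(2n) and its Stirling approximation
p_exact <- function(n, p) dbinom(n, 2 * n, p)
p_approx <- function(n, p) (4 * p * (1 - p))^n / sqrt(pi * n)
n <- c(10, 100, 1000)
ratio <- p_exact(n, 0.5) / p_approx(n, 0.5)   # should approach 1
ratio
# partial sums of P^(2n) up to N terms
partial_sum <- function(N, p) sum(p_exact(1:N, p))
sums <- rbind(
  p_half = sapply(c(100, 1000, 10000), partial_sum, p = 0.5),
  p_0.6  = sapply(c(100, 1000, 10000), partial_sum, p = 0.6)
)
colnames(sums) <- c("N=100", "N=1000", "N=10000")
sums
```

For p=1/2 the partial sums keep growing, roughly like \(\sqrt{N}\), whereas for p=0.6 they settle down quickly.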
This eventually leads to the study of stochastic processes.