Weak Law of Large Numbers WLLN
Let \(X_1 , X_2 , ...\) be a sequence of independent and identically distributed (iid) r.v.’s having mean \(\mu\). Then \(\bar{X}\) converges to \(\mu\) in probability.
proof (assuming in addition that \(var(X_i )=\sigma^2 < \infty\))
\[ \begin{aligned} &E[\bar{X}] =E[\frac1n\sum_{i=1}^n X_i] =\frac1n\sum_{i=1}^n E[X_i]=\mu\\ &var(\bar{X}) =var(\frac1n\sum_{i=1}^n X_i) =\frac1{n^2}\sum_{i=1}^n var(X_i)=\sigma^2/n\rightarrow 0 \end{aligned} \] so \(\bar{X} \rightarrow \mu\) in quadratic mean and therefore in probability.
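As a quick numerical check of the variance calculation (a small sketch of my own, with \(X_i\sim U[0,1]\) so that \(\mu=1/2\) and \(\sigma^2=1/12\)):

# compare the simulated variance of the sample mean with sigma^2/n
n <- 100
xbar <- replicate(10^4, mean(runif(n)))
round(c(var(xbar), (1/12)/n), 5)

Both numbers should be close to 0.00083.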
It is best to think of this (and other) limit theorems not as one theorem but as a family of theorems, all with the same conclusion but with different conditions. For example, there are weak laws even if the \(X_n\)’s are not independent, don’t have the same mean, and don’t even have finite standard deviations.
This theorem forms the basis of (almost) all simulation studies: say we want to find a parameter \(\theta\) of a population. We can generate data from a random variable \(X\) with pdf \(f(x|\theta)\) such that \(E[h(X)] = \theta\). Then by the law of large numbers \[\overline{h(X)}=\frac{h(X_1)+...+h(X_n)}{n}\rightarrow E[h(X)]=\theta\]
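As a small illustration (my own example, not from the text), take \(X\sim\) Exp(1) and \(h(x)=x^2\), so that \(\theta=E[X^2]=2\):

# Monte Carlo estimate of theta = E[X^2] for X ~ Exp(1); true value is 2
x <- rexp(10^5, 1)
mean(x^2)

The result should be close to 2.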
Strong Law of Large Numbers SLLN
Let \(X_1 , X_2 , ...\) be a sequence of independent and identically distributed (iid) r.v.’s having mean \(\mu\). Then \(\bar{X}\) converges to \(\mu\) almost surely.
proof requires some measure theory and can be found in most standard textbooks
In a game a player rolls 5 fair dice. He then moves his game piece along k fields on a board, where k is the smallest number on the dice plus the largest number on the dice. For example, if his dice show 2, 2, 3, 5, 5 he moves 2+5 = 7 fields. What is the mean number of fields \(\theta\) a player will move?
To do this analytically would be quite an exercise. To do it via simulation is easy. Using R we can do this as follows:
# simulate 10^5 games: each column holds the 5 dice of one game
x <- matrix(sample(1:6, size=5*10^5, replace=TRUE), nrow=5)
# fields moved = smallest die + largest die, per game
z <- apply(x, 2, min) + apply(x, 2, max)
round(mean(z), 1)
## [1] 7
As with all large-sample theorems, the question arises how good the approximation is for a finite n. Let’s investigate this a little bit: say \(X_1,..,X_n\) are iid Bernoulli \(\frac12\). Then by the WLLN \(\bar{X}_n\rightarrow \frac12\) in probability, so for any \(\epsilon>0\)
\[P(\vert \bar{X}_n-\frac12 \vert>\epsilon)\rightarrow0\]
\[ \begin{aligned} &P(\vert \bar{X}_n-\frac12 \vert>\epsilon) = \\ &P(\vert \sum_{i=1}^n X_i-n/2 \vert>n\epsilon) = \\ &P(\vert Y_n-n/2 \vert>n\epsilon) \\ \end{aligned} \]
where \(Y_n\sim Bin(n,1/2)\). Moreover
\[ \begin{aligned} &P(\vert Y_n-n/2 \vert>n\epsilon) \approx \text{ (by symmetry)}\\ &2P(Y_n>n(1/2+\epsilon)) \le \\ &2e^{-tn(1/2+\epsilon)}\psi_{Y_n}(t) = \\ &2e^{-tn(1/2+\epsilon)}(1/2+1/2e^{t})^n = \\ &2\exp\{-tn(1+2\epsilon)/2+n\log (1+e^{t})-n\log 2 \} = \\ &2\exp\{-n\left[t(1+2\epsilon)/2-\log (1+e^{t})+\log 2\right] \} \\ \end{aligned} \]
for all \(t>0\) by the Chernoff bound, where \(\psi_{Y_n}(t)=E[e^{tY_n}]\) is the moment generating function of \(Y_n\).
Let’s see what this looks like as a function of t:
n=50; eps=0.01
curve(-n*(x*(1+2*eps)/2-log(1+exp(x))+log(2)), -0.5, 0.5, ylab="", xlab="t")
so there is a unique minimum (where the bound is tightest), which we find by setting the derivative of the term in square brackets equal to zero:
\[ \begin{aligned} &\frac{d}{dt}\left\{ t(1+2\epsilon)/2-\log (1+e^{t}) +\log 2\right\} = \\ &(1+2\epsilon)/2 - e^t/(1+e^t) = 0\\ &(1+2\epsilon)(1+e^t) = 2e^t \\ &\frac{1+2\epsilon}{1-2\epsilon} = e^t \\ &t = \log\left\{ \frac{1+2\epsilon}{1-2\epsilon} \right\} \end{aligned} \]
log((1+2*eps)/(1-2*eps))
## [1] 0.04000533
and so we find
t = log((1+2*eps)/(1-2*eps))
t*(1+2*eps)/2-log((1+exp(t)))+log(2)
## [1] 0.0002000133
\[P(\vert \bar{X}_n-\frac12 \vert>\epsilon) \le 2e^{-0.0002n}\]
Let’s check. We of course have
\[ \begin{aligned} &P(\vert Y_n-n/2 \vert>n\epsilon) = \\ &1-P(n(0.5-\epsilon) \le Y_n \le n(0.5+\epsilon)) = \\ &1-\left(pbinom(n(0.5+\epsilon), n, 1/2) - pbinom(n(0.5-\epsilon), n, 1/2) \right) \\ \end{aligned} \]
n=100*1:200
y=1-(pbinom(n*(0.5+eps), n, 1/2) - pbinom(n*(0.5-eps), n, 1/2) )
plot(n, y, pch=".")
curve(2*exp(-0.0002*x), 1, 20000, add=TRUE)
so this seems to work!
Weierstrass Approximation Theorem
Let \(f\) be a continuous function on [0,1], and let \(\epsilon>0\). Then there exists a polynomial \(p\) such that
\[| f(x) - p(x) | < \epsilon\] for all \(x\in [0,1]\).
Note that this theorem is from real analysis; it has nothing whatsoever to do with probability theory. It is actually a special (early) version of one of the most famous theorems in real analysis, the Stone-Weierstrass theorem.
proof
We will consider polynomials of the form
\[p_n(x)=\sum_{k=0}^nf(\frac{k}{n}) {{n}\choose{k}} x^k(1-x)^{n-k}\]
which are called Bernstein polynomials after Sergei Bernstein. Note their connection to the binomial distribution as well as the beta distribution.
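Here is a short R sketch (my own illustration, not part of the text) of these polynomials, using dbinom for the binomial weights; the test function \(f(x)=\sin(2\pi x)\) is an arbitrary choice:

# Bernstein polynomial p_n(x) = sum over k of f(k/n)*dbinom(k, n, x)
f <- function(x) sin(2*pi*x)
bernstein <- function(x, n) sum(f((0:n)/n)*dbinom(0:n, n, x))
x <- seq(0, 1, length=250)
for(n in c(10, 50, 250))
  print(max(abs(sapply(x, bernstein, n=n) - f(x))))

The maximum error over the grid shrinks as n grows, as the theorem promises, although rather slowly.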
We will show that for any \(\epsilon >0\) there exists an \(n(\epsilon)\) such that
\[| f(x) - p_{n(\epsilon)} (x) | < \epsilon\]
for all \(x \in [0,1]\).
For each x, consider a sequence of Bernoulli trials \(\{X_n \}\) with success probability x, and let \(S_n = \sum_{k=1}^n X_k\). We know that \(S_n \sim Bin(n,x)\), so that
\[ \begin{aligned} &P(S_n=k) ={{n}\choose{k}} x^k(1-x)^{n-k} \\ &E[f(\frac{S_n}{n})] = \sum_{k=0}^nf(\frac{k}{n}) {{n}\choose{k}} x^k(1-x)^{n-k} = p_n(x) \end{aligned} \]
\(S_n\) is a sum of independent random variables with a finite variance, so by the weak law of large numbers
\[S_n /n \rightarrow E[X_1] = x\]
in probability. Since \(f\) is continuous and bounded, this implies \(p_n(x)=E[f(\frac{S_n}{n})] \rightarrow f(x)\), and we have shown pointwise convergence of \(p_n(x)\) to \(f(x)\). It remains to show uniform convergence. Let \(\delta>0\), then
\[ \begin{aligned} &|p_n(x)-f(x)| = \\ &| \sum_{k=0}^nf(\frac{k}{n}) {{n}\choose{k}} x^k(1-x)^{n-k}- \sum_{k=0}^nf(x) {{n}\choose{k}} x^k(1-x)^{n-k}| = \\ &|\sum_{k=0}^n\left[f(\frac{k}{n})-f(x)\right] {{n}\choose{k}} x^k(1-x)^{n-k}|\le\\ &\sum_{k=0}^n|f(\frac{k}{n})-f(x)| {{n}\choose{k}} x^k(1-x)^{n-k}=\\ &E \left[|f(\frac{S_n}{n})-f(x)| \right] = \\ &E \left[|f(\frac{S_n}{n})-f(x)|;|\frac{S_n}n-x|>\delta \right]+E \left[|f(\frac{S_n}{n})-f(x)|;|\frac{S_n}n-x|\le\delta \right] \end{aligned} \]
where we use the triangle inequality \(|a+b| \le |a| + |b|\).
Now f is continuous on a compact interval, so f is uniformly continuous. Therefore for any \(\epsilon >0\) there exists a \(\delta>0\) (independent of x and y) such that if \(| x - y | < \delta\) we have \(| f(x) - f(y) | < \epsilon/2\).
With this choice of \(\delta\) the second term above is bounded by \(\epsilon/2\).
On the other hand, again using the triangle inequality and the fact that [0,1] is a compact set, we have
\[| f(x) - f(y) | \le | f(x) |+| f(y) | \le\]
\[2\max\{|f(x)|:0 \le x \le 1\} =: M < \infty\]
and so the first term is bounded above by \(M P(|S_n /n-x|>\delta)\).
Finally we have \(E[S_n]=nx\) and \(var(S_n ) = nx(1-x) \le n/4\) for \(0 \le x \le 1\).
By Chebyshev’s inequality we have
\[ \begin{aligned} &P(|S_n /n-x|> \delta ) \le \\ &var(S_n/n )/ \delta ^2 = \\ &var(S_n )/ (n^2\delta ^2) = \\ & nx(1-x)/( \delta ^2n^2) \le \\ &1/(4 \delta ^2n)\\ \end{aligned} \]
and so the first term is bounded above by \(M/(4\delta^2 n)\), which does not depend on \(x\) and is smaller than \(\epsilon/2\) as soon as \(n > M/(2\delta^2\epsilon)\). For such \(n\) we have \(|p_n(x)-f(x)|<\epsilon\) for all \(x\in[0,1]\), and the proof is complete.
Let \(\{X_n ; n=1,2,..\}\) be a sequence of random variables and let \(S_n = X_1 +..+X_n\). What can be said about the convergence of the series \(\sum X_n\), that is, of the partial sums \(S_n\)? The most famous result is
Kolmogorov’s Three Series Theorem (1929)
Let \(\{X_n \}\) be independent and define for some fixed \(A>0\) a random variable \(Y_n\) as \(Y_n=X_n\) if \(|X_n|\le A\) and 0 otherwise. (Essentially, \(Y_n\) is \(X_n\) truncated at \(\pm A\)).
Then the series \(\sum X_n\) converges almost surely if and only if the following three series converge:

1. \(\sum_{n=1}^\infty P(|X_n|>A)\)
2. \(\sum_{n=1}^\infty E[Y_n]\)
3. \(\sum_{n=1}^\infty var(Y_n)\)
proof omitted
As we know
\[\sum 1/n = 1+\frac12 +\frac13 +\frac14 + ... = \infty\]
and
\[\sum(-1)^{n+1}/n = 1-\frac12 +\frac13 -\frac14 + ... = \log(2) < \infty\]
How about something in between, namely if in each term of the sum we choose -1 with probability p and 1 with probability 1-p? That is, let \(Z_n \sim\) Ber(p), independently of each other, and
\[X_n=(-1)^{Z_n}/n\]
In the three-series theorem let’s take \(A=1\), then \(Y_n =X_n\), and so the sum in 1 is 0.
For 2 we have
\[ \begin{aligned} &E[Y_n] = E[X_n]= E \left[ (-1)^{Z_n}/n \right] =\\ &\frac1n \left((-1)^0\times (1-p)+(-1)^1\times p \right) = \frac{1-2p}n\\ &\sum_{n=1}^\infty E[Y_n] = \sum_{n=1}^\infty \frac{1-2p}n = \left\{\begin{array}{lll}0&\text{ if }&p=1/2\\\pm\infty&\text{ if }&p\ne1/2 \end{array} \right. \end{aligned} \]
and so the sum can only converge if \(p=1/2\). To see whether it really does we need to check 3:
\[var(Y_n)=E[Y_n^2]=E[X_n^2]=E \left[\left( (-1)^{Z_n}/n\right)^2 \right]=\frac1{n^2}\] (here \(var(Y_n)=E[Y_n^2]\) because \(E[Y_n]=0\) for \(p=1/2\)) \[\sum_{n=1}^\infty var(Y_n) = \sum_{n=1}^\infty \frac1{n^2}<\infty\] and so indeed the sum converges almost surely for \(p=1/2\).
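One way to see this (a small sketch of my own): simulate one path of the partial sums with \(p=1/2\); it should settle down to a (random) limit.

# one path of the partial sums of sum (-1)^Z_n / n with Z_n ~ Ber(1/2)
N <- 10^4
z <- rbinom(N, 1, 1/2)
plot(1:N, cumsum((-1)^z/(1:N)), type="l", xlab="n", ylab="partial sum")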
Let \(Z_n\sim Ber(1/2)\) and \(X_n=(-1)^{Z_n}/\sqrt n\). Again we use \(A=1\), so \(P(X_n\ne Y_n)=0\). Also \(E[Y_n]=0\). Finally
\[var(Y_n)=E[Y_n^2]=E[X_n^2]=E \left[\left( (-1)^{Z_n}/\sqrt n\right)^2 \right]=\frac1{n}\] \[\sum_{n=1}^\infty var(Y_n) = \sum_{n=1}^\infty \frac1{n}=\infty\] and so this random sum does not converge almost surely, although the sum with fixed alternating signs, \(\sum (-1)^n/\sqrt{n}\), does converge.
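We can also check the variance calculation by simulation (again a sketch of my own): the standard deviation of \(\sum_{n=1}^N X_n\) should be close to \(\sqrt{\sum_{n=1}^N 1/n}\), which grows without bound as \(N\) increases.

# sd of the partial sum at N over many runs vs sqrt(sum of 1/n)
N <- 10^4
s <- replicate(1000, sum((-1)^rbinom(N, 1, 1/2)/sqrt(1:N)))
round(c(sd(s), sqrt(sum(1/(1:N)))), 2)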
As we all know, \(\sum \frac1{n}=\infty\). But what about \(\sum_{p\text{ prime}} \frac1{p}\)?
One of the most famous theorems in all of mathematics, the prime number theorem, says the following: let \(\pi(n)\) be the number of primes \(p\le n\), then
\[\pi(n) \approx \frac{n}{\log n}\]
Another way of stating this is that the probability that a randomly chosen integer of size about \(n\) is a prime is roughly \(\frac1{\log n }\).
This led Harald Cramér to consider a probabilistic model for the primes: define a sequence of independent random variables \(\{X_n\}\) with \(P(X_n=1)=1-P(X_n=0)=\frac1{\log n}\) for \(n\ge3\).
Now consider the sum \(S_n=\sum_{i=1}^n X_i\).
Let \(A=1\), so \(Y_n=X_n\)
\[ \begin{aligned} &\sum_n E[Y_n] = \\ &\sum_n \left(0\times(1-\frac1{\log n})+1\times\frac1{\log n}\right) = \\ &\sum_n \frac1{\log n} = \infty \end{aligned} \] and so by Kolmogorov’s three series theorem the sum \(\sum X_n\) does not converge almost surely; since the \(X_n\) are non-negative this means \(S_n\rightarrow \infty\) almost surely. This strongly suggests that
\[\sum_{p\text{ prime}} \frac1{p}=\infty\]
Actually, this was proved in 1737 by Leonhard Euler. However, Cramér was also able to prove the famous Riemann hypothesis for such “probabilistic primes”, whereas this is still an open problem for the “real” primes. Cramér’s work was the beginning of what is now known as probabilistic number theory.
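To finish, here is a small simulation of Cramér’s model (my own sketch): count the “probabilistic primes” up to \(N\) and compare with \(N/\log N\). The two should be of the same order of magnitude, with the simulated count running a bit higher since its expectation is \(\sum_{n\le N} 1/\log n\).

# Cramer's model: each n >= 3 is a "prime" with probability 1/log(n)
N <- 10^5
x <- rbinom(N-2, 1, 1/log(3:N))
c(sum(x), round(N/log(N)))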