Say we have a sequence of numbers \(a_n\). Then there is just one definition of a “limit”, namely
\(a_n \rightarrow a\) iff for every \(\epsilon >0\) there exists an \(n_\epsilon\) such that \(|a_n -a|<\epsilon\) for all \(n>n_\epsilon\).
say \(a_n =(1+1/n)^n\). Show that \(a_n \rightarrow e\)
Fix n, and let t be such that \(1 \le t \le 1+1/n\). Then
\[ \begin{aligned} &\frac1{1+\frac1n} \le \frac1t\le 1 \\ &\int_1^{1+\frac1n} \frac1{1+\frac1n} dt \le \int_1^{1+\frac1n}\frac1t dt\le \int_1^{1+\frac1n}1dt \\ &\frac1{1+\frac1n}\frac1n \le \log(t)|_1^{1+\frac1n}\le \frac1n \\ &\frac1{1+\frac1n}\frac1n \le \log(1+\frac1n)\le \frac1n \\ &\frac1{1+n} \le \log(1+\frac1n)\le \frac1n \\ &\exp(\frac1{1+n})\le 1+\frac1n\le \exp(\frac1n) \end{aligned} \]
so we have the two inequalities
\[e\le \left(1+\frac1n\right)^{n+1}\] and \[\left(1+\frac1n\right)^{n}\le e\]
which we can combine to
\[ \begin{aligned} &\frac{e}{1+\frac1n}\le \left(1+\frac1n\right)^{n}\le e \\ &\frac{n}{n+1}e\le \left(1+\frac1n\right)^{n}\le e \end{aligned} \] Therefore \[|a_n-e| =e-\left(1+\frac1n\right)^n\le e-\frac{n}{n+1}e=\frac{e}{n+1}\] and \(\frac{e}{n+1}<\epsilon\) exactly when \(n>\frac{e}{\epsilon}-1\). So fix an \(\epsilon >0\). Then if \(n>e/\epsilon-1\) we have
\[|(1+1/n)^n-e|<\epsilon\]
and therefore
\[(1+1/n)^n\rightarrow e\]
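As a quick numerical check of this bound (not part of the proof; the values of eps and n below are just illustrative choices):

# check the bound: for any eps, n > e/eps - 1 should give |(1+1/n)^n - e| < eps
eps=0.001
n=ceiling(exp(1)/eps)      # certainly larger than e/eps - 1
abs((1+1/n)^n - exp(1)) < eps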
Things already get a little more complicated if we go to sequences of functions. Here there are two ways in which they can converge:
Let \(f_n(x),f(x)\) be functions defined on some set \(S\), then
Pointwise Convergence: \(f_n (x)\rightarrow f(x)\) pointwise iff for every x in S and every \(\epsilon >0\) there exists an \(n_{\epsilon,x}\) such that \(|f_n (x)-f(x)|<\epsilon\) for all \(n>n_{\epsilon,x}\).
Uniform Convergence: \(f_n (x)\rightarrow f(x)\) uniformly iff for every \(\epsilon >0\) there exists an \(n_{\epsilon}\) such that \(|f_n (x)-f(x)|<\epsilon\) for all x in S and all \(n>n_{\epsilon}\).
Note that the definitions are almost the same, except that for uniform convergence n depends on \(\epsilon\) but does not depend on x.
uniform convergence implies pointwise convergence but not vice versa.
proof
uniform convergence implies pointwise convergence: obvious, simply take \(n_{\epsilon,x}=n_{\epsilon}\).
pointwise convergence does not imply uniform convergence:
say \(f_n (x)=x^n\), \(S=[0,1]\), \(f(x)=I_{1}(x)\), then \(f_n (x)\rightarrow f(x)\) pointwise but not uniformly.
say \(0<x<1\), then \(|f_n (x)-f(x)| = x^n < \epsilon\) for all \(n > n_{\epsilon,x} = \log(\epsilon)/\log(x)\) (the case \(x=0\) is trivial because then \(x^n=0\)).
say \(x=1\) then \(|f_n (x)-f(x)| = 0 < \epsilon\) for all \(n > n_{\epsilon,x} = 1\)
but fix \(\epsilon<\frac1{2e}\) and assume there is an \(n_\epsilon\) such that
\[|f_n(x)-f(x)|=x^n<\epsilon\] for all \(x\in [0,1)\) and all \(n>n_\epsilon\). Take any \(n>n_\epsilon\) and set
\[ \begin{aligned} &x = \frac1{1+\frac1n}=\frac{n}{n+1}\in[0,1)\\ &x^n = \left(1+\frac1n\right)^{-n} \ge \frac1e>\frac1{2e}>\epsilon \end{aligned} \]
a contradiction, so no single \(n_\epsilon\) works for all x and the convergence is not uniform.
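A short numerical illustration (a fine grid on [0,1) as a stand-in for the supremum; the grid size and the values of n are arbitrary): the largest value of \(x^n\) on \([0,1)\) stays close to 1 no matter how large n is.

# approximate sup over [0,1) of |x^n - 0| for several n; it never gets small
x=seq(0, 1-1e-6, length=1e5)
sapply(c(5, 50, 500, 5000), function(n) max(x^n))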
say \(f_n (x)=1+x/n\), \(x \in S=[A,B]\) where \(A<B\), \(f(x)=1\), then \(f_n (x)\rightarrow f(x)\) uniformly.
\[|f_n (x)-f(x)| = |1+x/n-1| = |x|/n \le \max(|A|,|B|)/n < \epsilon\]
whenever \(n > \max(|A|,|B|)/\epsilon\).
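In contrast to the previous example, here the error bound does not depend on x, which we can also see numerically (the choices of A, B and n are arbitrary):

# sup over [A,B] of |(1+x/n) - 1| = max(|A|,|B|)/n, which goes to 0
A=-2; B=3
x=seq(A, B, length=1000)
sapply(c(10, 100, 1000, 10000), function(n) max(abs(1+x/n-1)))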
Now when we go to probabilities it gets a bit more complicated still. Say we have a sequence of rv’s \(X_n\) with means \(\mu_n\) and cdf’s \(F_n\) , and a rv \(X\) with mean \(\mu\) and cdf \(F\). Then we have:
Convergence in Mean: \(X_n \rightarrow X\) in mean iff \(\mu_n \rightarrow \mu\).
Convergence in Quadratic Mean: \(X_n \rightarrow \mu\) in quadratic mean iff \(E[X_n ]\rightarrow \mu\) and \(var(X_n )\rightarrow 0\).
Convergence in Lp: \(X_n \rightarrow X\) in Lp iff \(E[|X_n -X|^p]\rightarrow 0\). Here \(p>0\).
Convergence in Distribution (Law): \(X_n \rightarrow X\) in law iff \(F_n (x)\rightarrow F(x)\) pointwise for all x where F is continuous.
Convergence in Probability: \(X_n \rightarrow X\) in probability iff for every \(\epsilon >0\), \(\lim_{n\rightarrow \infty} P(|X_n -X| \ge \epsilon )=0\).
Almost Sure Convergence: \(X_n \rightarrow X\) almost surely iff for every \(\epsilon >0\) \(P(\lim_{n\rightarrow \infty} |X_n -X|< \epsilon )=1\).
Remark
Convergence in quadratic mean implies that the limit is a constant, not a random variable.
Let \(X_n\) have density \(f_n (x)=nx^{n-1}, 0<x<1\) (or \(X_n\sim Beta(n,1)\)), and let \(X\) be such that \(P(X=1)=1\). Then
\[E[X_n] =\frac{n}{n+1}\rightarrow 1 = E[X]\]
so we have convergence in mean.
\[var[X_n] =\frac{n}{(n+1)^2(n+2)}\rightarrow 0\] and so we have convergence in quadratic mean.
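We can also check this by simulation (rbeta draws from the Beta(n,1) distribution; the sample size and the values of n are arbitrary):

# simulate X_n ~ Beta(n,1): sample means approach 1, sample variances approach 0
for(n in c(10, 100, 1000)) {
  x=rbeta(1e5, n, 1)
  cat("n =", n, " mean =", round(mean(x), 4), " variance =", round(var(x), 6), "\n")
}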
\[ \begin{aligned} &E[|X_n-X|^p] = E[|X_n-1|^p]\\ &\int_0^1 |x-1|^pnx^{n-1} dx = \\ &\int_0^1 (1-x)^pnx^{n-1} dx = \\ &n\int_0^1 x^{n-1}(1-x)^p dx = \\ &n\frac{\Gamma(n)\Gamma(p+1)}{\Gamma(n+p+1)}\int_0^1 \frac{\Gamma(n+p+1)}{\Gamma(n)\Gamma(p+1)}x^{n-1}(1-x)^{(p+1)-1} dx = \\ &n\frac{\Gamma(n)\Gamma(p+1)}{\Gamma(n+p+1)}=\\ &n\frac{(n-1)!\Gamma(p+1)}{(n+p)...(p+1)\Gamma(p+1)}=\\ &\frac{n!}{(n+p)...(p+1)}=\\ &\prod_{k=1}^n\frac{k}{k+p}=\\ &\exp \left\{\log\left[\prod_{k=1}^n\frac{k}{k+p}\right]\right\}=\\ &\exp \left\{\sum_{k=1}^n\log\frac{k}{k+p}\right\}=\\ &\exp \left\{\sum_{k=1}^n\log(1-\frac{p}{k+p})\right\}\approx\\ &\exp \left\{-\sum_{k=1}^n\frac{p}{k+p}\right\}\rightarrow 0 \end{aligned} \] where we used the Taylor approximation \(\log (1+x) \approx x\) and the divergence of the harmonic sum. So we have convergence in \(L^p\).
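As a sanity check of the formula \(E[|X_n-1|^p]=\prod_{k=1}^n k/(k+p)\), we can compare it with numerical integration for some (arbitrary) choices of n and p:

# E|X_n - 1|^p two ways: numerical integration vs the product formula
n=20; p=1.5
integrate(function(x) (1-x)^p*n*x^(n-1), 0, 1)$value
prod((1:n)/((1:n)+p))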
Note that \(F_X(x)=I_{[1, \infty)}(x)\). Now
\[ \begin{aligned} &x\le 0: F_{X_n}(x)=0=F_X(x) \\ &x\ge 1: F_{X_n}(x)=1=F_X(x) \\ &0<x<1: F_{X_n}(x) = x^n \rightarrow 0=F_X(x) \\ \end{aligned} \]
and so we have convergence in distribution.
\[ \begin{aligned} &P(\vert X_n-X\vert>\epsilon) = \\ &1-P(\vert X_n-1\vert<\epsilon) = \\ &1-P(-\epsilon< X_n-1<\epsilon) = \\ &1-P(1-\epsilon< X_n<1+\epsilon) = \\ &1-\left[F_{X_n}(1+\epsilon)-F_{X_n}(1-\epsilon) \right]=\\ &1-\left[1-F_{X_n}(1-\epsilon) \right]=\\ &F_{X_n}(1-\epsilon) = (1-\epsilon)^n\rightarrow 0 \end{aligned} \]
and so we have convergence in probability.
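Since \(F_{X_n}\) is the Beta(n,1) cdf we can also verify \(P(|X_n-1|>\epsilon)=(1-\epsilon)^n\) directly with pbeta (eps and the values of n are arbitrary):

# P(|X_n - 1| > eps) = F_{X_n}(1 - eps) = (1 - eps)^n, computed two ways
eps=0.1
n=c(10, 50, 100)
cbind(n, pbeta(1-eps, n, 1), (1-eps)^n)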
Finally, to check almost sure convergence we need the \(X_n\) as functions on a common probability space. On \([0,1]\) with the uniform distribution we can take \(X_n(\omega)=\omega^{1/n}\), since then \(P(X_n\le x)=P(\omega\le x^n)=x^n\). Define
\[ \begin{aligned} &A_{n,\epsilon} := \left\{\omega: \vert X_n(\omega)-1\vert<\epsilon \right\} = \\ &\left\{\omega: 1-\epsilon< X_n(\omega)<1+\epsilon \right\} = \\ &\left\{\omega: X_n(\omega)>1-\epsilon \right\} = \\ &\left\{\omega: \omega>(1-\epsilon)^n \right\} \\ \end{aligned} \]
Let \(m>n\); since \((1-\epsilon)^m<(1-\epsilon)^n\), we have \(A_{n,\epsilon}\subset A_{m,\epsilon}\), so the sequence of sets \(A_{n,\epsilon}\) is increasing in n (for fixed \(\epsilon\)). Therefore
\[P\left(\lim_{n\rightarrow\infty}A_{n,\epsilon}\right)=\lim_{n\rightarrow\infty}P(A_{n,\epsilon})=\lim_{n\rightarrow\infty}\left[1-(1-\epsilon)^n\right]= 1\] and we have convergence almost surely.
So \(X_n \rightarrow X\) in all the ways possible.
Notice that proofs of almost sure convergence usually involve returning to the definition of a random variable as a function from the sample space to \(\mathbb{R}\), and therefore usually require measure theory.
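With the representation \(X_n(\omega)=\omega^{1/n}\) used above we can also look at a single sample path, a small sketch of what almost sure convergence means (the values of n are arbitrary):

# one fixed omega: the sequence X_n(omega) = omega^(1/n) increases to 1
omega=runif(1)
round(omega^(1/c(1, 10, 100, 1000, 10000)), 5)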
Say \(X_n\) has density \(f_n(x)=(1+1/n) x^{1/n}, 0<x<1\) and \(X \sim U[0,1]\). Then
\[ \begin{aligned} &E[X_n] = \int_{-\infty}^{\infty} xf_n(x)dx = \\ &\int_0^1x(1+1/n) x^{1/n}dx=\\ &\frac{1+\frac1n}{2+\frac1n} x^{2+1/n}|_0^1 = \\ &\frac{n+1}{2n+1}\rightarrow \frac12 =E[X]\\ &\\ &E[X_n^2] = \int_{-\infty}^{\infty} x^2f_n(x)dx = \\ &\int_0^1x^2(1+1/n) x^{1/n}dx=\\ &\frac{1+\frac1n}{3+\frac1n} x^{3+1/n}|_0^1 = \\ &\frac{n+1}{3n+1}\rightarrow \frac13 \\ \end{aligned} \]
so
\[var[X_n]\rightarrow 1/3-(1/2)^2 = 1/12 \ne 0\]
so we don’t have convergence in quadratic mean. We do have convergence in distribution, though:
\[F_n(x)=x^{1+\frac1n}\rightarrow x=F(x)\] for all \(0<x<1\).
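A quick check of this pointwise convergence on a grid (the grid and the values of n are arbitrary):

# largest difference between F_n(x) = x^(1+1/n) and F(x) = x on (0,1)
x=seq(0.01, 0.99, 0.01)
sapply(c(10, 100, 1000), function(n) max(abs(x^(1+1/n)-x)))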
How about convergence in probability? As stated the question cannot be decided, because we would need the joint distribution of \(X_n\) and \(X\). If they are independent, \(X_n\) does not converge to \(X\) in probability:
\[ \begin{aligned} &P(|X_n-X|>\epsilon) = \\ &1-P(-\epsilon<X_n-X<\epsilon) = \\ &1-\int_0^1P(-\epsilon<x-X<\epsilon)f_n(x) dx \approx \\ &1-\int_0^12\epsilon f_n(x) dx =1-2\epsilon>0 \\ \end{aligned} \]
In fact this is essentially always true: if \(X_n\) and \(X\) are independent and \(X\) is not a constant, \(X_n\) can never converge to \(X\) in probability!
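A small simulation sketch of the independent case above (here \(X_n\) is drawn via the inverse of \(F_n\), i.e. \(X_n=U^{n/(n+1)}\) with \(U\sim U[0,1]\); eps, n and the number of repetitions are arbitrary):

# X_n and X independent: P(|X_n - X| > eps) stays near 1 - 2*eps
eps=0.1; n=1e4; B=1e5
xn=runif(B)^(n/(n+1))   # inverse cdf of F_n(x) = x^(1+1/n)
x=runif(B)
mean(abs(xn-x)>eps)     # roughly 1 - 2*eps = 0.8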
Unfortunately there is no simple hierarchy between the different modes of convergence. Here are some relationships:
convergence in quadratic mean implies convergence in probability.
convergence in probability implies convergence in distribution.
If the limit is a constant, convergence in distribution implies convergence in probability.
almost sure convergence implies convergence in probability, but not vice versa.
proof (in the order of the four statements above):
\[ \begin{aligned} &P(\vert X_n-\mu\vert < \epsilon) = \\ &P(\mu-\epsilon<X_n<\mu+\epsilon) = \\ &P(\mu-\mu_n-\epsilon<X_n-\mu_n<\mu-\mu_n+\epsilon) \ge \\ &P(\vert X_n-\mu_n\vert <\epsilon-m_n) \ge \\ &1-\frac{var[X_n]}{(\epsilon-m_n)^2}\rightarrow 1 \end{aligned} \]
by Chebyshev's inequality, where
\[m_n = |\mu_n-\mu|\rightarrow 0\]
The first inequality holds because \(|\mu-\mu_n|\le m_n\) implies \((-(\epsilon-m_n),\epsilon-m_n)\subset(\mu-\mu_n-\epsilon,\mu-\mu_n+\epsilon)\), and the last step applies once n is large enough that \(m_n<\epsilon\).
\[ \begin{aligned} &F(x-\epsilon)=P(X\le x-\epsilon) = \\ &P(X\le x-\epsilon,|X_n-X|<\epsilon)+P(X\le x-\epsilon,|X_n-X|\ge\epsilon) \le \\ &P(X\le x-\epsilon,X_n-\epsilon<X<X_n+\epsilon)+P(|X_n-X|\ge\epsilon) \le \\ &P(X_n< x)+P(|X_n-X|\ge\epsilon) \le \\ &F_n(x)+P(|X_n-X|\ge\epsilon) \end{aligned} \]
because \(X\le x-\epsilon\) and \(X>X_n-\epsilon\) together imply \(X_n<X+\epsilon\le x\). A similar calculation shows
\[F(x+\epsilon)\ge F_n(x)-P(|X_n-X|\ge\epsilon)\] and so
\[F(x-\epsilon)-F(x)- P(|X_n-X|\ge\epsilon)\le F_n(x)-F(x)\le F(x+\epsilon)-F(x)+ P(|X_n-X|\ge\epsilon)\]
Letting \(n\rightarrow\infty\) and then \(\epsilon\downarrow 0\), both bounds go to 0 at every x where F is continuous, and so \(F_n(x)\rightarrow F(x)\).
Next, say \(X_n\rightarrow c\) in distribution, so \(F_n(x)\rightarrow F(x) = I_{[c,\infty)}(x)\) for all \(x\ne c\). Then for any \(\epsilon>0\)
\[ \begin{aligned} &P(|X_n-c|>\epsilon) \le \\ &P(X_n\le c-\epsilon)+P(X_n>c+\epsilon) = \\ &F_n(c-\epsilon)+\left(1-F_n(c+\epsilon)\right)\rightarrow 0+(1-1)=0 \end{aligned} \]
Finally, here is the classical counterexample showing that convergence in probability does not imply almost sure convergence. Let \(X\sim U[0,1]\) and define the intervals
\[ \begin{aligned} &A_1=[0,1] \\ &A_2=[0,\frac12]\text{, }A_3=[\frac12,1] \\ &A_4=[0,\frac14]\text{, }A_5=[\frac14,\frac12]\text{, }A_6=[\frac12,\frac34]\text{, }A_7=[\frac34,1] \\ \end{aligned} \]
and so on. In general we have
\[A_{2^m+i}=[\frac{i}{2^m},\frac{i+1}{2^m}]\] \(m=0,1,2,...;i=0,...,2^m-1\)
Let \(Y_n=I_{A_n}(X)\) and let \(Y\equiv 0\), then
\[ \begin{aligned} &P(|Y_n-Y|>\epsilon) = \\ &P(Y_n>\epsilon) = \\ &P(Y_n=1) = \\ &P(I_{A_n}(X)=1) = \\ &P(X\in A_n)=\frac1{2^m}\rightarrow 0 \end{aligned} \]
and so \(Y_n \rightarrow Y\) in probability, because \(m\rightarrow\infty\) as \(n\rightarrow\infty\).
However, every \(x\) in \([0,1]\) lies in infinitely many of the \(A_n\), so \(Y_n(x)=1\) infinitely often and \(Y_n(x)=0\) infinitely often; therefore \(Y_n\) does not converge to \(Y\) almost surely (in fact it converges nowhere).
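Here is a small sketch of these intervals (the helper An below is just for illustration; it returns the endpoints of \(A_n\) using the formula above): the lengths, and therefore \(P(Y_n=1)\), shrink to 0, yet the intervals keep sweeping across all of \([0,1]\).

# endpoints of A_n for some n: A_(2^m+i) = [i/2^m, (i+1)/2^m]
An=function(n) {
  m=floor(log2(n)); i=n-2^m
  c(i/2^m, (i+1)/2^m)
}
t(sapply(c(1, 2, 5, 17, 100, 1000), An))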
If \(X_n \rightarrow 0\) in \(L^p\), then \(X_n \rightarrow 0\) in probability.
Note: it suffices to consider convergence to 0 here, because \(X_n \rightarrow X\) (in \(L^p\) or in probability) iff \(X_n -X\rightarrow 0\).
proof
Say \(X_n \rightarrow 0\) in \(L^p\), and let \(g(x)=|x|^p\), then by Chebyshev’s inequality
\[P(|X_n| \ge \epsilon ) \le E[|X_n|^p]/\epsilon ^p \rightarrow 0\]
and so \(X_n \rightarrow 0\) in probability.
Slutsky
Say \(X_n \rightarrow X\) in distribution and \(g\) is any continuous function, then \(g(X_n )\rightarrow g(X)\) in distribution.
without proof
From the above results it can be shown that if \(X_n \rightarrow X\) in distribution and \(Y_n \rightarrow c\) in distribution, then
\(X_n +Y_n \rightarrow X+c\) in distribution.
\(X_n Y_n \rightarrow cX\) in distribution
which is the most common version of Slutsky’s theorem.
In loose terms, the theorem states that if a rv converges to a constant, then it essentially behaves as a constant for addition and multiplication.
Note that the condition that \(Y_n\) converge to a constant is necessary.
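A small simulation sketch of Slutsky's theorem (the choices of n, the number of repetitions and \(Y_n=2+1/n\) are arbitrary): by the central limit theorem the standardized mean of uniforms is approximately N(0,1), and since \(Y_n\rightarrow 2\), the sum behaves like a N(2,1) and the product like a N(0,4) random variable.

# X_n: standardized mean of n uniforms (approximately N(0,1) by the CLT)
# Y_n = 2 + 1/n converges to the constant 2
B=1e4; n=50
xn=sqrt(12*n)*(rowMeans(matrix(runif(B*n), B, n))-0.5)
yn=2+1/n
round(c(mean(xn+yn), sd(xn+yn), mean(xn*yn), sd(xn*yn)), 2)  # about 2, 1, 0, 2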
Consider a number \(\omega \in [0,1]\), and let’s write the number using its decimal expansion as
\[\omega = 0.x_1 x_2 x_3 ...\]
Let \(T_n\) be the set of \(\omega\)’s with \(x_n>0\) and \(x_m=0\) for \(m>n\). Then \(T_n\) is a finite set. Let \(T\) be the set of \(\omega\)’s with a terminating expansion, then
\[T= \bigcup _n T_n\]
and so \(T\) is countable. If we consider the game "pick a number in [0,1] at random", we therefore have \(P(T)=0\): under the uniform distribution every single point has probability 0, and \(T\) is a countable union of such points.
Now let k be such that \(0 \le k \le 9\), and define \(\nu(\omega,n,k)\) to be the number of \(x_i\)’s among the first n digits in the expansion of \(\omega\) with \(x_i=k\). Define (if it exists)
\[\lim_{n\rightarrow\infty}\frac{\nu(\omega,n,k)}n=\phi_k(\omega)\]
so \(\phi_k(\omega)\) is the long run relative frequency of the digit k in the expansion of \(\omega\).
Intuitively it seems obvious that \(\phi_k(\omega)=1/10\) for all k for "most" \(\omega\)’s, but it is actually very simple to write down numbers for which this is not true: 0.1111…, 0.121212… etc.
Numbers for which \(\phi_k (\omega)=1/10\) for all k are called simply normal, and how many such numbers there are was an open question in Number Theory for a long time.
Borel
Let \(N\) be the set of simply normal numbers. Then \(P(N)=1\).
proof (outline)
Let \(\zeta_i\) be a random variable with \(P(\zeta_i =k)=1/10\) for \(k=0,1,...,9\). Let the sequence of \(\zeta_i\)’s be independent and set
\[\omega=0.\zeta_1 \zeta_2 \zeta_3 ...\]
Fix k and let \(S_n = \sum_{i=1}^n I(\zeta_i=k)\) be the number of the first n digits equal to k, so that \(S_n/n=\nu(\omega,n,k)/n\). Then by the strong law of large numbers
\[S_n /n\rightarrow 1/10\] almost surely
therefore \(P(\phi_k =1/10)=1\) for all k, and finally
\[P \left(\bigcap_{k=0}^9 \left[\phi_k=\frac1{10} \right] \right)=1\]
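A tiny simulation of the key step (not part of the proof; the number of digits is arbitrary): generate iid digits and look at their relative frequencies.

# iid digits, each value 0-9 with probability 1/10:
# the relative frequencies settle near 1/10, as the strong law predicts
z=sample(0:9, 1e5, replace=TRUE)
round(table(z)/length(z), 3)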
Considered as a problem in probability theory this seems almost trivial. 100 years ago, though, probability theory was considered a branch of mathematics hardly worthy of the name, and indeed before the advent of modern probability theory in the 1930’s it was a strange field of strange results. When Borel published his proof it came as a shock to many mathematicians that this strange field could yield interesting results in the purest of fields, Number Theory.
Note that although \(N\) has probability 1, \(N^c\) is still uncountable. For example, every number whose expansion never uses the digit 0 is in \(N^c\), and there are clearly uncountably many of these!
So, is \(\pi-e = 0.423310825131..\) simply normal? Even today there is no mathematical theory that would allow us to answer that question easily.
Let’s consider the following generalization: Let \(k_1 ..k_r\) be a "block" of r consecutive digits, let \(\nu(\omega,n,k_1 ..k_r )\) be the number of times the block occurs among the first n digits of \(\omega\) and let
\[\phi(\omega,k_1 ..k_r ) = \lim_{n\rightarrow\infty} \nu(\omega,n,k_1 ..k_r )/n\]
Consider the set \(A \subset [0,1]\) of all \(\omega\) such that \(\phi ( \omega,k_1 ..k_r ) = 1/10^r\) for all possible blocks \(k_1 ..k_r\).
Now the same reasoning as above shows that \(P(A)=1\).
A number for which this holds for every r, and also when base 10 is replaced by any other base, is called normal.
Consider what this implies: take your name and "code" it into a string of digits. Pick a real number \(\omega\) at random, then with probability 1 somewhere in the expansion of that number you will find your name!
It gets weirder:
you will not find it just once but over and over again.
let x be any real number and let \(\epsilon >0\); then this is true not only for the interval [0,1] but also for the interval \([x,x+\epsilon]\)
It does not matter how large r is! Take the complete works of Shakespeare, they are in there too!
Is this really true? Our math says yes, but so far no direct verification has been done.
Let’s do the following: we code our names and try to find them in \(\pi\)! First we need a lot of digits of \(\pi\); I have one million of them in a file:
tpi=scan("http://academic.uprm.edu/wrolke/esma6600/million-digits-pi.txt", what="char")
substr(tpi, 1, 1000)
## [1] "3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593344612847564823378678316527120190914564856692346034861045432664821339360726024914127372458700660631558817488152092096282925409171536436789259036001133053054882046652138414695194151160943305727036575959195309218611738193261179310511854807446237996274956735188575272489122793818301194912983367336244065664308602139494639522473719070217986094370277053921717629317675238467481846766940513200056812714526356082778577134275778960917363717872146844090122495343014654958537105079227968925892354201995611212902196086403441815981362977477130996051870721134999999837297804995105973173281609631859502445945534690830264252230825334468503526193118817101000313783875288658753320838142061717766914730359825349042875546873115956286388235378759375195778185778053217122680661300192787661119590921642019"
How do we code? Simple: a=01, b=02, .., i=09, j=10, .., s=19, t=20, .., z=26. So each letter needs two digits, but only the codes 01-26 are used. For the two-digit blocks 27-99 in the expansion we can use d mod 27!
Here is coding my first name:
namecode=function(nme) {
  # split the name into its individual letters
  lnme=strsplit(nme, "")[[1]]
  out=lnme
  for(i in seq_along(lnme)) {
    # position of the letter in the alphabet (a=1, ..., z=26)
    tmp=(1:26)[letters==lnme[i]]
    # pad with a leading 0 so that every code has exactly two digits
    if(tmp<10) out[i]=paste0("0", tmp)
    else out[i]=paste(tmp)
  }
  out
}
namecode("wolfgang")
## [1] "23" "15" "12" "06" "07" "01" "14" "07"
So now let’s break \(\pi\) up into pieces of two digits:
tpi=substring(tpi, 3, nchar(tpi)) # remove 3.
pi2=substring(tpi, seq(1, nchar(tpi), 2), seq(2, nchar(tpi), 2))
pi2[1:20]
## [1] "14" "15" "92" "65" "35" "89" "79" "32" "38" "46" "26" "43" "38" "32" "79"
## [16] "50" "28" "84" "19" "71"
# turn them into numbers, calculate modulo 27
pi3=as.numeric(pi2) %% 27
pi3[1:20]
## [1] 14 15 11 11 8 8 25 5 11 19 26 16 11 5 25 23 1 3 19 17
and now find the name:
find.name=function(nme) {
  # two-digit codes of the name, as numbers (so that "06" matches 6)
  code.name=as.numeric(namecode(nme))
  n=length(code.name)
  # slide a window of length n over the digit pairs of pi
  for(i in 1:(length(pi3)-n+1)) {
    if(all(code.name==pi3[i:(i+n-1)])) {
      cat("Found name in Position ", i)
      return(i)
    }
  }
  cat("Could not find name")
}
find.name("wo")
## Found name in Position 5650
## [1] 5650
To read more about normal numbers go here: Normal Number