Inequalities are very important in probability theory, both for the theory and for practical applications.
We start with a lemma that has nothing to do with probability:
let a and b be any positive numbers, and let p and q be any positive numbers with \(1/p+1/q=1\). Then
\[\frac{a^p}{p}+\frac{b^q}{q}\ge ab\]
with “=” iff \(a^p=b^q\).
proof
First note that \(1/p+1/q=1;p,q>0\) implies that \(p,q>1\).
Now we fix b, and consider the function g with
\[g(a)=\frac{a^p}{p}+\frac{b^q}{q}-ab\] Now
\[ \begin{aligned} &\frac{dg}{da} = a^{p-1}-b=0\rightarrow a=b^{1/(p-1)}\\ &\frac{d^2g}{da^2}|_{a=b^{1/(p-1)}} = (p-1)a^{p-2}|_{a=b^{1/(p-1)}} = (p-1)b^{(p-2)/(p-1)}>0 \end{aligned} \]
so \(g\) has a minimum at \(a^{p-1}=b\). Also
\[\frac{a^p}{p}+\frac{a^{q(p-1)}}{q}-aa^{p-1}=\frac{a^p}{p}+\frac{a^{p}}{q}-a^{p}=0\]
because \((p-1)q=p\). This follows from
\[1/p+1/q=1\rightarrow 1+p/q=p\rightarrow p/q=p-1\]
Moreover the minimum of \(g\) is unique because \(g\) is convex for all \(a>0\), so “=” holds iff \(a^{p-1}=b\), which is the same as \(a^p=b^q\).
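As a quick numerical sanity check of the lemma (a sketch, with arbitrarily chosen values of \(a\), \(b\) and \(p\); \(q\) is then determined by \(1/p+1/q=1\)):

a=2; b=3; p=3
q=1/(1-1/p)  # so that 1/p+1/q=1
c(a^p/p+b^q/q, a*b)  # the first number can never be smaller than the second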
Hölder’s Inequality
Let \(X\) and \(Y\) be any two rvs, and let p and q be as above. Then
\[|E[XY]| \le E[|XY|] \le (E[|X|^p])^{1/p}(E[|Y|^q])^{1/q}\]
proof
The first “\(\le\)” follows from
\[-|XY| \le XY \le |XY|\] For the second \(\le\) define
\[a=\frac{|X|}{(E[|X|^p])^{1/p}}\] \[b=\frac{|Y|}{(E[|Y|^q])^{1/q}}\]
It follows from the lemma that
\[\frac1p \frac{|X|^p}{E[|X|^p]}+\frac1q \frac{|Y|^q}{E[|Y|^q]}\ge \frac{|X|}{E[|X|^p]^{1/p}} \frac{|Y|}{(E[|Y|^q])^{1/q}}\]
and taking expectations we find
\[E\left[\frac1p \frac{|X|^p}{E[|X|^p]}+\frac1q \frac{|Y|^q}{E[|Y|^q]}\right]= \frac1p \frac{E[|X|^p]}{E[|X|^p]}+\frac1q \frac{E[|Y|^q]}{E[|Y|^q]}=\frac1p+\frac1q=1\] and so we have
\[1\ge E\left[\frac{|X|}{E[|X|^p]^{1/p}} \frac{|Y|}{(E[|Y|^q])^{1/q}}\right]= \frac{E[|XY|]}{E[|X|^p]^{1/p}(E[|Y|^q])^{1/q}} \\ E[|XY|]\le E[|X|^p]^{1/p}(E[|Y|^q])^{1/q}\]
This is actually a generalization of the Cauchy-Schwarz inequality we discussed earlier: if \(p=q=2\) we get
\[|E[XY]| \le E[|XY|] \le \sqrt{E[X^2]}\sqrt{E[Y^2]}\]
These inequalities are stated here in terms of expectations, but they hold in general for sums and integrals as well.
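As a quick illustration of the version for sums (a sketch, with two arbitrarily chosen vectors and \(p=3\)):

p=3; q=1/(1-1/p)
x=c(1, -2, 3, 0.5); y=c(2, 1, -1, 4)
c(sum(abs(x*y)), sum(abs(x)^p)^(1/p)*sum(abs(y)^q)^(1/q))  # first number is at most the second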
Some useful special cases are:
If we set \(Y=1\) we get
\[E[|X|] \le \left\{E\left[|X|^p\right]\right\}^{1/p}, 1<p< \infty\]
For any \(r>1\), if we replace \(|X|\) by \(|X|^r\), we get
\[E[|X|^r] \le \left\{E\left[|X|^{pr}\right]\right\}^{1/p}\]
and writing \(s=pr\) (which implies \(s>r\)) and then raising both sides to the power \(1/r\) we get
Liapunov’s Inequality
\[(E[|X|^r])^{1/r} \le (E[|X|^s])^{1/s}\]
for \(1<r<s< \infty\)
If a rv \(X\) has a finite \(k^{th}\) moment, it has a finite \(j^{th}\) moment for all \(j \le k\).
By Liapunov’s inequality we have
\[E[|X|^j] \le (E[|X|^k])^{j/k} < \infty\]
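For example (a quick check, assuming \(X\) is exponential with rate 1, so that \(E[X^r]=\Gamma(r+1)\)), the \(r^{th}\) root of the \(r^{th}\) moment is indeed nondecreasing in \(r\):

r=1:8
gamma(r+1)^(1/r)  # (E[X^r])^(1/r) for an exponential rv with rate 1; increasing in r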
The inequalities above are not really probability theory; they are inequalities from real analysis. Next we consider a type of inequality that is specific to probability theory:
Markov’s Inequality
If \(X\) takes on only non-negative values, then for any \(a>0\)
\[P(X>a)\le\frac{E[X]}{a}\] proof if \(X\) is a continuous rv with density \(f\), then
\[ \begin{aligned} &P(X>a) = \int_a^\infty f(x) dx\le\\ &\int_a^\infty \frac{x}{a} f(x) dx = \\ &\frac1a \int_a^\infty x f(x) dx \le \\ &\frac1a \int_0^\infty x f(x) dx = \\ &\frac1a \int_{-\infty}^\infty x f(x) dx = \frac{E[X]}{a} \\ \end{aligned} \]
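As a quick illustration (a sketch, assuming \(X\) is exponential with rate 1, so \(E[X]=1\) and \(P(X>a)=e^{-a}\)):

a=1:5
round(rbind(exact=exp(-a), markov=1/a), 4)  # exact tail vs the Markov bound E[X]/a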
Markov’s inequality is deceptively simple but, if used well, can be very powerful. To start, it implies what is perhaps the most famous inequality in probability:
Chebyshev’s Inequality
If \(X\) is a r.v. with mean \(\mu\) and variance \(\sigma^2\), then for any \(a>0\):
\[P(|X-\mu|>a\sigma)\le\frac{1}{a^2}\] proof
Clearly \((X-\mu)^2\) is a non-negative rv, so
\[ \begin{aligned} &P(|X-\mu|>a\sigma) = \\ &P((X-\mu)^2>a^2\sigma^2) \le \\ &\frac{E[(X-\mu)^2]}{a^2\sigma^2} = \\ &\frac{\sigma^2}{a^2\sigma^2}=\frac{1}{a^2} \end{aligned} \]
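A quick numerical comparison (a sketch, assuming \(X\sim N(0,1)\), so \(\mu=0\), \(\sigma=1\) and the exact tail probability is \(2(1-\Phi(a))\)):

a=1:4
round(rbind(exact=2*(1-pnorm(a)), chebyshev=1/a^2), 4)  # exact tail vs the Chebyshev bound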
In the proof we used Markov’s inequality with the function \(g(x) = (x-\mu)^2\). An easy extension is this:
Say \(X\) is a non-negative rv and assume \(E[X^{p}]<\infty\). Then
\[P(X \ge t)\le\frac{E[X^{p}]}{t^{p}}\]
proof
\[P(X\ge t) = P(X^{p}\ge t^{p}) \le \frac{E[X^{p}]}{t^{p}}\]
Notice that the \(p\) only appears on the right-hand side, so this inequality holds for all \(p\) for which \(E[X^{p}]<\infty\). Therefore it also holds for the \(p\) that minimizes the right-hand side.
Remarks
Often the theorem is stated in terms of the moments of \(X\): \[P(X\ge t) \le \frac{E[X^{k}]}{t^{k}}=\frac{\mu_k}{t^k}\]
Even though the theorem requires a non-negative random variable, it can often be used for others as well. For example, say \(X\) is a random variable with a density symmetric around 0 and we want a bound on \(P(X>t)\); then we have \(P(X>t)=\frac12 P(|X|>t)\). Now \(|X|\) is a non-negative rv and \(E[|X|^{2k}]=E[X^{2k}]\), so we can find the moment bounds for even moments. Another option is sometimes to use \(Y=X|X>0\).
Say \(X\sim N(0,1)\). We have seen previously that \(E[X^k]=0\) if k is odd and
\[E[X^{2k}] =\prod_{i=1}^{k}(2i-1)\]
Let’s find an upper bound for \(P(|X|>5)\). The moment bounds are
\[E[X^{2k}]/t^{2k}=\frac1{t^{2k}}\prod_{j=1}^{k}(2j-1)\]
We want a bound on \(P(|X|>5)\), so with \(t=5\) the bound becomes
\[\left[\prod_{j=1}^{k}(2j-1)\right]/5^{2k}=\frac1{25^{k}}\prod_{j=1}^{k}(2j-1)\]
What value of k minimizes this expression? Let’s call the bound \(g(k)\), then we have
\[ \begin{aligned} &g(k+1) = \frac1{25^{k+1}}\prod_{j=1}^{k+1}(2j-1) = \\ &\frac1{25}(2(k+1)-1) \frac1{25^{k}}\prod_{j=1}^{k}(2j-1) = \frac{2k+1}{25}g(k)\\ &\frac{g(k+1)}{g(k)} = \frac{2k+1}{25} > 1 \text{ iff } k>12 \end{aligned} \]
and so \(g\) is decreasing for \(k=1,..,12\), has \(g(13)=g(12)\), and is increasing from \(k=13\) on, so the minimum is attained at \(k=12\).
Or we could use R:
k=2*1:20  # the even powers 2k=2,4,..,40
up=rep(0, 20)
for(i in 1:20) up[i]=prod(2*(1:i)-1)/5^(2*i)  # moment bound E[X^(2i)]/5^(2i)
plot(k, log(up), xlab="2k")
kmin=k[up==min(up)][1]  # power 2k with the smallest bound
abline(v=kmin)
c(kmin, up[kmin/2])
## [1] 2.400000e+01 5.305529e-06
where we draw the graph on the log scale because it is easier to read.
So the upper bound has a minimum at \(2k=24\) with \(P(|X|>5)<5.3\times 10^{-6}\). Of course we can calculate this exactly:
2*(1-pnorm(5))
## [1] 5.733031e-07
Consider how much better this is than Chebyshev’s inequality, which would give
up[1]
## [1] 0.04
Note Inequalities for probabilities of the form \(P(|X-\mu|>t)\) are called concentration bounds.
There is also a way to extend Markov’s inequality using the moment generating function:
Chernoff Bounds
Let \(X\) be a rv with moment generating function \(\psi (t) = E[e^{tX}]\). Then for any \(a>0\)
\[P(X \ge a) \le e^{-ta} \psi (t)\] for all \(t>0\).
proof
For \(t>0\)
\[ \begin{aligned} &P(X \ge a) = \\ &P(e^{tX} \ge e^{ta}) \le \\ &\frac{E[e^{tX}]}{e^{ta}} = e^{-ta}\psi(t)\\ \end{aligned} \]
Again the \(t\) appears only on the right side of the inequality, and therefore it holds for all \(t>0\). We then have
\[P(X>a)\le \inf_{t>0} \left\{e^{-ta} \psi (t)\right\}\] We will call this the best Chernoff bound. In some textbooks this is what is called the Chernoff bound.
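If the minimization cannot be done analytically, it can be done numerically. As a sketch, assume \(X\) is exponential with rate 1, so \(\psi(t)=1/(1-t)\) for \(t<1\); a little calculus shows that the best Chernoff bound for \(P(X\ge a)\) is \(ae^{1-a}\), and a numerical minimization agrees:

a=10
bound=function(t) exp(-t*a)/(1-t)  # Chernoff bound for P(X>=a) when X is exponential with rate 1
c(optimize(bound, c(0, 0.99))$objective, a*exp(1-a), exp(-a))  # numerical minimum, formula, exact tail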
As we know, a random variable whose moment generating function is finite in an open neighborhood of 0 has all its moments, that is, \(E[|X|^k]< \infty\) for all \(k\). So this is a rather strong condition, and therefore it leads to very good bounds.
Say \(Z \sim N(0,1)\), then
\[ \begin{aligned} &P(Z\ge a) \le e^{-ta}\psi_Z(t) =\exp \left\{t^2/2-ta \right\} \\ &\frac{d}{dt} \exp \left\{t^2/2-ta \right\} =\exp \left\{t^2/2-ta \right\}(t-a) =0 \\ &t=a\\ &P(Z\ge a) \le \exp \left\{a^2/2-a\times a \right\} = e^{-a^2/2}\\ \end{aligned} \]
which is a very useful upper bound on the tail probabilities of a standard normal random variable.
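Here is a quick numerical comparison with the exact tail probabilities \(P(Z>a)=1-\Phi(a)\):

a=1:5
round(rbind(exact=1-pnorm(a), chernoff=exp(-a^2/2)), 6)  # exact P(Z>a) vs the bound exp(-a^2/2)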
Returning to the example above we now find an upper bound for \(P(|X|>5)\)
\[P(|X|>5)=2P(X>5)\le 2\exp(-5^2/2)\]
2*exp(-5^2/2)
## [1] 7.453306e-06
which is slightly worse than the moment bound. In fact, we have
Let \(X\) be a non-negative rv with a moment generating function \(\psi\) that is finite in an open interval around 0. Then for any \(a>0\)
\[\inf_{k} \left\{E[X^k]/a^{k}\right\}\le \inf_{t>0} \left\{e^{-ta}E[e^{tX}]\right\}\]
so the best moment bound is always at least as good as the best Chernoff bound.
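Continuing the exponential sketch from above (where \(E[X^k]=k!\)), for \(a=10\) the best moment bound is indeed somewhat smaller than the best Chernoff bound \(ae^{1-a}\):

a=10; k=1:30
c(moment=min(factorial(k)/a^k), chernoff=a*exp(1-a))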
Why would we use the Chernoff bounds if the moment bounds are always better? One reason is that the Chernoff bounds do not require non-negative random variables. Another is the following:
Let \(X_1,..,X_n\) be iid random variables with moment generating function \(\psi\), and let \(S_n=X_1+..+X_n\). Then for any \(a\)
\[P(S_n\ge a)\le e^{-ta}\psi(t)^n\]
for all \(t>0\).
proof
\[\psi_{S_n}(t)=\psi(t)^n\]
so Chernoff bounds extend easily to sums of random variables.
Say \(Z_1,...,Z_n\) are iid \(N(0,1)\), and \(S_n=Z_1+...+Z_n\). Then
\[ \begin{aligned} &P(S_n/\sqrt n\ge a) =\\ &P(S_n\ge \sqrt n a) \le \\ &e^{-t\sqrt n a}\psi_Z(t)^n =\\ &e^{-t\sqrt n a}(\exp \left\{t^2/2\right\})^n=\\ &\exp \left\{nt^2/2-t\sqrt n a\right\} \\ &\\ &\frac{d}{dt} \exp \left\{nt^2/2-t\sqrt n a\right\} =\\ &\exp \left\{nt^2/2-t\sqrt n a \right\}(nt-\sqrt n a) =0 \\ &\\ &t=a/\sqrt n\\ &\\ &P(S_n/\sqrt n\ge a) \le \\ &\exp \left\{n(a/\sqrt n)^2/2-(a/\sqrt n)\sqrt n\times a \right\} = e^{-a^2/2}\\ \end{aligned} \]
Of course we already knew this because \(S_n/\sqrt n\sim N(0,1)\) again.
For the last inequality, first recall the definition of a convex function:
A function g is said to be convex if for all x and y and \(0< \lambda <1\)
\[g( \lambda x+(1- \lambda )y) \le \lambda g(x)+(1- \lambda )g(y)\]
Jensen’s Inequality
For any rv \(X\), if g is a convex function we have
\[E[g(X)] \ge g(E[X])\]
proof
Let \(l(x)\) be a tangent line to \(g(x)\) at the point \((E[X], g(E[X]))\), and write \(l(x)=a+bx\) for some \(a\) and \(b\). Since \(g\) is convex, its graph lies above this tangent line, so
\[g(x) \ge a+bx\]
and so
\[E[g(X)] \ge E[a+bX]=a+bE[X]=l(E[X])=g(E[X])\]
For any rv \(X\), if g is a concave function we have
\[E[g(X)] \le g(E[X])\] proof
Of course if \(g\) is a concave function, then \(-g\) is convex and we have
\[E[g(X)] = -E[-g(X)] \le -(-g(E[X])) = g(E[X])\]
\(g(x)=x^2\) is convex, and so
\[E[X^2] \ge E[X]^2\]
which implies
\[var(X) = E[X^2]-E[X]^2 \ge 0\]
For \(x>0\), \(g(x)=1/x\) is convex, so for a positive rv \(X\) we have \(E[1/X] \ge 1/E[X]\)
\(g(x)=e^{tx}\) is convex for all \(t>0\), so
\[\psi(t) = E[e^{tX}] \ge e^{tE[X]}=e^{\mu t}\] Say \(Z\sim N(0,1)\), then
\[\psi(t) = e^{t^2/2}\ge e^{0t}=1\]
for all \(t>0\).
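A small simulation check of two of these examples (a sketch, assuming \(X\) is exponential with rate 1 and using Monte Carlo estimates of the expectations):

set.seed(1111)
x=rexp(1e5, 1)
c(mean(x^2), mean(x)^2)  # estimate of E[X^2] vs E[X]^2; first should be larger
c(mean(exp(x/3)), exp(mean(x)/3))  # estimate of psi(1/3) vs exp(E[X]/3); first should be larger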
Let \(x_1,..,x_n\) be positive real numbers. Their arithmetic mean is
\[\bar{x}=\frac1n \sum_{i=1}^n x_i\]
and their geometric mean is
\[\tilde{x}=\left( \prod_{i=1}^n x_i\right)^{1/n}\]
n=20
x=round(sort(runif(n, 0, 100)))
x
## [1] 5 11 15 17 18 19 23 24 44 49 52 54 66 68 71 72 75 82 82 85
sum(x)/n
## [1] 46.6
prod(x)^(1/n)
## [1] 36.29805
We can use Jensen’s inequality to prove a famous result from arithmetic:
arithmetic mean \(\ge\) geometric mean
We always have
\[\bar{x}\ge \tilde{x}\] proof consider the function \(g(x)=-\log x\), which is convex. Let \(X\) be a discrete uniform random variable on the set \(\{x_1,...,x_n\}\). Then by Jensen’s inequality
\[E[-\log X]\ge-\log E[X]\] however
\[ \begin{aligned} &E[X] = \sum_{i=1}^n x_i\frac1n = \bar{x}\\ &E[-\log X] = \sum_{i=1}^n (-\log x_i)\frac1n = -\log \tilde{x} \\ &-\log \tilde{x}\ge -\log \bar{x} \end{aligned} \]
and since the logarithm is increasing this gives \(\tilde{x}\le \bar{x}\).