Standard Probability Distributions

Discrete Distributions

Discrete Uniform

Let \(N \ge 2\) be an integer and consider the rv X that chooses a number from 1 to N with equal probability, that is

\[P(X=k)=1/N\text{ for }1 \le k \le N\]

Then

\[ \begin{aligned} &E[X] =\sum_{k=1}^N k\frac1N=\frac1N\frac{N(N+1)}2=\frac{N+1}2 \\ &E[X^2] =\sum_{k=1}^N k^2\frac1N=\frac1N\frac{N(N+1)(2N+1)}6=\frac{(N+1)(2N+1)}6 \\ &var(X) = \frac{(N+1)(2N+1)}6-\left(\frac{N+1}2\right)^2=\\ &\frac1{12} (N+1)\left[4N+2-3(N+1) \right] = \\ &\frac1{12} (N+1)(N-1) = \frac{N^2-1}{12}\\ \end{aligned} \]
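A quick numerical check of these formulas in R; the value of N is an arbitrary choice:

N <- 10  # arbitrary choice for illustration
k <- 1:N
c(sum(k)/N, (N+1)/2)  # E[X] directly and via the formula
## [1] 5.5 5.5
c(sum(k^2)/N - (sum(k)/N)^2, (N^2-1)/12)  # var(X) directly and via the formula
## [1] 8.25 8.25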

General Discrete RV on a Finite Set

Let \(N \ge 2\) be an integer and consider the rv X with

\[P(X=x_k)=p_k\text{ for }1 \le k \le N\]

Nothing more can be said until the \(x_k\) and \(p_k\) are specified.

Bernoulli Distribution

A r.v. \(X\) is said to have a Bernoulli distribution with success parameter \(p\) iff

\[P(X=1)=1-P(X=0)=p\]

Sometimes we call the outcomes “success”(=1) and “failure”(=0)

Note: often we use \(q = 1-p\)

Shorthand: \(X \sim\) Ber(p)

\[ \begin{aligned} &E[X] = 0*q+1*p = p \\ &E[X^2] = 0^2*q+1^2*p = p \\ &var(X) = E[X^2]-(E[X])^2 = p-p^2 = pq \\ &\psi (t) = E[\exp(tX)] = \exp(t0)q+\exp(t1)p = q+e^tp \end{aligned} \]

Binomial Distribution

Say \(Y_1, ... , Y_n\) are iid Ber(p) and let \(X=Y_1+...+Y_n\), then \(X\) is said to have a binomial distribution with parameters n and p. (\(X \sim Bin(n,p)\)).

We have

\[P(X=k)={n\choose k}p^k(1-p)^{n-k}\]

because “X=k” means k successes and n-k failures. Any specific sequence of k successes and n-k failures has probability \(p^k(1-p)^{n-k}\), and there are \({n\choose k}\) such sequences.

It is easy to see that this defines a proper pdf:

\[\sum_{k=0}^n {n\choose k}p^k(1-p)^{n-k}=\left(p+1-p\right)^n=1\]

which also explains the name binomial.

For the mean and variance we have

\[ \begin{aligned} &E[X] = E[\sum_{i=1}^n Y_i]=\sum_{i=1}^n E[Y_i]=np\\ &var(X) = var\left(\sum_{i=1}^n Y_i\right)=\sum_{i=1}^n var(Y_i)=np(1-p) \end{aligned} \] In (1.10.3) we saw that the moment generating function is given by

\[\psi(t)=\left(1-p+pe^{t}\right)^n\]
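As a quick sanity check of the mgf, we can compare the defining sum with the closed form for arbitrarily chosen values of n, p and t:

n <- 20; p <- 0.3; t <- 0.5  # arbitrary values for illustration
c(sum(exp(t*(0:n))*dbinom(0:n, n, p)), (1-p+p*exp(t))^n)  # E[exp(tX)] by direct summation and via the closed form; both agree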

Example (2.1.1)

A company wants to hire 5 new employees. From previous experience they know that about 1 in 10 applicants are suitable for the jobs. What is the probability that if they interview 20 applicants they will be able to fill those 5 positions?

Consider each interview a “trial” with the only two possible outcomes: “success” (can be hired) or “failure” (not suitable). Assumptions:

  1. “success probability” is the same for all applicants (as long as we know nothing else about them this is ok.)

  2. trials are independent (depends somewhat on the setup of the interviews but should be ok)

then if we let X = “number of suitable applicants in the group of 20” we have \(X \sim Bin(20,0.1)\) and we find

\[ \begin{aligned} &P(X\ge 5) =1- P(X\le 4) = \\ &1-\left[{20\choose 0}0.1^00.9^{20}+{20\choose 1}0.1^10.9^{19}+{20\choose 2}0.1^20.9^{18}\right.+\\ &\left.{20\choose 3}0.1^30.9^{17}+{20\choose 4}0.1^40.9^{16}\right]=\\ &1-0.9568 = 0.0432\\ \end{aligned} \] or

1-pbinom(4, 20, 0.1)
## [1] 0.0431745

Geometric Distribution

Say \(Y_1 , Y_2 , ...\) are iid Ber(p) and let \(X\) be the number of trials needed until the first success. Then X is said to have a geometric distribution with rate p (\(X \sim G(p)\) ), and we have

\[P(X=k)=p(1-p)^{k-1}\] \(k=1,2,...\)

Note: sometimes the geometric is defined as the number of failures before the first success. Clearly this is then \(Y=X-1\).

This defines a proper pdf:

\[\sum_{k=1}^\infty pq^{k-1} = p \sum_{l=0}^\infty q^{l}=\frac{p}{1-q}=1\]

For the mean and variance we have

\[ \begin{aligned} &E[X] = \sum_{k=1}^\infty kpq^{k-1}=\\ &p\sum_{k=1}^\infty kt^{k-1}|_{t=q}=\\ &p\sum_{k=1}^\infty \frac{d t^{k}}{dt}|_{t=q} = \\ &p\frac{d}{dt} \sum_{k=1}^\infty t^{k}|_{t=q}=\\ &p\frac{d}{dt} \frac{t}{1-t}|_{t=q}=\\ &p\frac{1}{(1-t)^2}|_{t=q}=\\ &\frac{p}{p^2}=\frac1p \end{aligned} \] the same trick works for the variance:

\[ \begin{aligned} &E[X(X-1)] = \sum_{k=1}^\infty k(k-1)pq^{k-1}=\\ &pq\sum_{k=2}^\infty \frac{d^2 t^{k}}{dt^2}|_{t=q} = \\ &pq\frac{d^2}{dt^2} \sum_{k=2}^\infty t^{k}|_{t=q}=\\ &pq\frac{d^2}{dt^2} \left(\sum_{k=1}^\infty t^{k}-t\right)|_{t=q}=\\ &pq\frac{d^2}{dt^2} \left(\frac{t}{1-t}-t\right)|_{t=q}=\\ &pq\frac2{(1-t)^3}|_{t=q}=\\ &\frac{2pq}{p^3}=\frac{2q}{p^2} \end{aligned} \] and so

\[ \begin{aligned} &var(X)=E[X^2]-E[X]^2=\\ &E[X(X-1)]+E[X]-E[X]^2=\\ &\frac{2q}{p^2} +\frac{1}{p}-(\frac{1}{p})^2 = \frac{q}{p^2} \end{aligned} \] for the moment generating function we find

\[ \begin{aligned} &\psi(t) =\sum_{k=1}^\infty e^{tk}pq^{k-1}= \\ &pe^{t}\sum_{k=1}^\infty (e^tq)^{k-1} = \\ &pe^{t}\sum_{l=0}^\infty (e^tq)^{l} = \\ &\frac{pe^{t}}{1-qe^{t}} \end{aligned} \] Here is a neat calculation of the mean:

\[ \begin{aligned} &\mu = \sum_{k=1}^\infty k pq^{k-1} = p+2pq+3pq^2+ ..\\ &q\mu = pq+2pq^2+3pq^3+ ..\\ &\mu-q\mu = p+(2pq-pq)+(3pq^2-2pq^2) + ... = \\ &p+pq+pq^2+ .. = p\sum_0^\infty q^k=\frac{p}{1-q}=\frac{p}{p}=1\\ &1=\mu(1-q)=\mu p\\ &\mu=1/p \end{aligned} \]
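We can verify the mean and variance numerically in R. Note that R’s dgeom uses the “number of failures” version \(Y=X-1\), so dgeom(k-1, p) \(=pq^{k-1}\); the value of p and the truncation point of the infinite sums are arbitrary:

p <- 0.1  # arbitrary choice
k <- 1:10000  # truncate the infinite sums; the tail is negligible here
c(sum(k*dgeom(k-1, p)), 1/p)  # E[X] directly and via the formula
## [1] 10 10
c(sum(k^2*dgeom(k-1, p)) - (1/p)^2, (1-p)/p^2)  # var(X) directly and via the formula
## [1] 90 90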

Example (2.1.2)

(same as above) How many applicants will the company need to interview to be 95% sure to be able to fill at least one of the five positions?

If we let Y be the number of trials until the first success (= an applicant is suitable) we have \(Y \sim G(0.1)\). Then

\[ \begin{aligned} &F_Y(n) = \sum_{k=1}^n pq^{k-1}=p\sum_{l=0}^{n-1}q^l=p\frac{1-q^n}{1-q}=1-q^n \\ &0.95=1-q^n \\ &q^n=0.05 \\ &n\log q=\log0.05\\ &n=\log0.05/\log q = \log0.05/\log 0.9 = 28.4\\ \end{aligned} \]
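Since we cannot interview a fraction of a person, the company needs to interview 29 applicants. In R (qgeom uses the “number of failures” version, hence the +1):

ceiling(log(0.05)/log(0.9))  # smallest n with 1 - 0.9^n >= 0.95
## [1] 29
qgeom(0.95, 0.1) + 1  # same answer via the quantile function
## [1] 29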

In general the geometric rv. is a model for “lifetimes” or “times until failure” of components, that is for the number of time periods until a component fails. But how do we know in real life whether the geometric might be a good model for a specific case? The next theorem helps:

Theorem (2.1.3)

Say \(X\) is a discrete rv. on \(\{1,2,3,..\}\). Then

\[P(X>k)=P(X>k+j|X>j)\] for all k and j iff \(X \sim G(p)\).

Note \(P(X>k)=P(X>k+j|X>j)\) for all k and j is called the memoryless property, and the theorem states that for discrete rv.s on the positive integers this property is unique to the geometric rv.

proof

Say \(X \sim G(p)\), then

\[ \begin{aligned} &P(X>k) = 1-P(X \le k) = q^k \\ &P(X>k+j|X>j) = \\ &\frac{P(X>k+j,X>j)}{P(X>j)}=\\ &\frac{P(X>k+j)}{P(X>j)}=\\ &\frac{q^{k+j}}{q^j}=q^{k}=P(X>k)\\ \end{aligned} \] now assume \(X \in \{1,2,..\}\) has the memoryless property. Let the event \(A=\{X>1\}\), then

\[ \begin{aligned} &P(X>k+1) =\\ &P(X>k+1|A)P(A)+P(X>k+1|A^c)P(A^c)= \\ &P(X>k+1|X>1)P(X>1)+P(X>k+1|X=1)P(X=1) = \\ &P(X>k+1|X>1)P(X>1)+0=\\ &P(X>k)P(X>1) \end{aligned} \] by the memoryless property with \(j=1\). Now set \(q=P(X>1)\) and then for \(k>1\)

\[P(X>k)=qP(X>k-1)=...=q^{k-1}P(X>1)=q^k\]

and so \(P(X=k)=P(X>k-1)-P(X>k)=q^{k-1}-q^k=(1-q)q^{k-1}\), that is \(X \sim G(1-q)\).

So the geometric is a reasonable model if it is reasonable to assume an experiment has the memoryless property.
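We can also illustrate the memoryless property numerically. Since pgeom counts failures we have \(P(X>k)=\) 1-pgeom(k-1, p); the values of p, k and j are arbitrary:

p <- 0.3; k <- 4; j <- 6  # arbitrary values
c((1-pgeom(k+j-1, p))/(1-pgeom(j-1, p)), 1-pgeom(k-1, p), (1-p)^k)  # P(X>k+j|X>j), P(X>k) and q^k
## [1] 0.2401 0.2401 0.2401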

Example (2.1.4)

Say we want to model the number of days until a light bulb burns out. Is the geometric a good model for this? The question is whether the number of days has the memoryless property.

Example (2.1.5)

Say we want to model the number of years until a person dies. Is the geometric a good model for this? The question is whether the number of years has the memoryless property.

Negative Binomial Distribution

Despite the different name this is actually a generalization of the geometric, namely where \(X\) is the number of trials needed until the \(r^{th}\) success. (\(X \sim NB(p,r)\)).

The pdf is given by

\[P(X=k)={{k-1}\choose{r-1}}p^rq^{k-r}; k=r,r+1,...\]

because we need r successes (each with probability p) and k-r failures (each with probability q). Moreover, the last trial has to be a success, and the other r-1 successes can occur at any point during the first k-1 trials, which gives the \({{k-1}\choose{r-1}}\) possible arrangements.

As an alternative definition one often uses the number of failures until the \(r^{th}\) success, \(Y=X-r\).

Note that

\[{{k+r-1}\choose{r-1}}={{k+r-1}\choose{k}}\]

and so

\[P(Y=k) = P(X=k+r) = {{k+r-1}\choose{r-1}}p^rq^{k}=\] \[{{k+r-1}\choose{k}}p^rq^{k};k=0,1,...\] This form of the negative binomial distribution is also sometimes called the Pascal distribution.
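Note that R’s dnbinom uses this second (Pascal) form, that is the pdf of Y. A quick check of the formula, with arbitrary values of r, p and k:

r <- 3; p <- 0.2; k <- 7  # arbitrary values
c(choose(k+r-1, k)*p^r*(1-p)^k, dnbinom(k, size=r, prob=p))  # formula above vs R's built-in pdf
## [1] 0.06039798 0.06039798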


Does this define a proper pdf? To show this takes a bit of work. First we need an extension of binomial coefficients:

Recall that if n and k are integers we have

\[{n \choose k}=\frac{n!}{(n-k)!k!}=\frac{n(n-1)..(n-k+1)}{k!}\]

Now let \(\alpha\) be any real number and define

\[{\alpha \choose k}=\frac{\alpha(\alpha-1)..(\alpha-k+1)}{k!}\]

for any integer \(k\ge 1\); for \(k=0\) we set \({\alpha \choose 0}=1\).

Next we need the Taylor series expansion of \((1+x)^\alpha\) at \(x=0\):

\[ \begin{aligned} &f(x) =(1+x)^\alpha \\ &\frac{df(x)}{dx}\vert_{x=0} =\alpha(1+x)^{\alpha-1}\vert_{x=0}=\alpha \\ &\frac{d^2f(x)}{dx^2}\vert_{x=0} =\alpha(\alpha-1)(1+x)^{\alpha-2}\vert_{x=0}=\alpha(\alpha-1) \\ & \\ &\frac{d^kf(x)}{dx^k}\vert_{x=0} =\prod_{r=0}^{k-1}(\alpha-r) \\ &\\ &(1+x)^\alpha=\sum_{k=0}^\infty\left[\prod_{r=0}^{k-1}(\alpha-r)\right]\frac{x^k}{k!}=\\ &\sum_{k=0}^\infty\left[\frac{\alpha(\alpha-1)..(\alpha-k+1)}{k!}\right]x^k=\\ &\sum_{k=0}^\infty{\alpha \choose k}x^k \end{aligned} \]

This is called the binomial series; it is valid for any (even complex) number \(\alpha\) as long as \(|x|<1\). Notice that it generalizes the binomial formula to non-integer powers:

\[(1+x)^n=\sum_{k=0}^n{n \choose k}x^k\]

Using this expansion we find

\[ \begin{aligned} &{{k+r-1}\choose{k}} = \frac{(k+r-1)(k+r-2)...r}{k!}=\\ &(-1)^k\frac{(-r)(-r-1)...(-r-k+1)}{k!} =(-1)^k{{-r}\choose k} \end{aligned} \]

\[ \begin{aligned} &p^{-r} =(1-q)^{-r} = \\ &\sum_{k=0}^\infty{-r \choose k}(-q)^k = \\ &\sum_{k=0}^\infty{{k+r-1} \choose k}q^k \\ &\\ &1=p^r\left[\sum_{k=0}^\infty{{k+r-1} \choose k}q^k\right]=\sum_{k=0}^\infty{{k+r-1} \choose k}p^rq^k \end{aligned} \] and here we also have an explanation for the name “negative binomial”!

Theorem (2.1.6)

Let \(Y_1 ,..,Y_r \sim G(p)\) and independent, then \(X_r=Y_1 +..+Y_r \sim NB(p,r)\).

proof (by induction)

if \(r=1\) \(X_1=Y_1\) , so

\[P(X_1=k)=P(Y_1=k)=pq^{k-1}={{k-1} \choose {1-1}}p^1q^{k-1}\]

say the assertion is true for r, then

\[ \begin{aligned} &P(X_{r+1}=k) = P(\sum_{j=1}^{r+1}Y_j=k)=\\ &P(\sum_{j=1}^{r}Y_j+Y_{r+1}=k) = \\ &\sum_{l=1}^{k-r}P(\sum_{j=1}^{r}Y_j+Y_{r+1}=k|Y_{r+1}=l)P(Y_{r+1}=l) = \\ &\sum_{l=1}^{k-r}P(X_r=k-l)P(Y_{r+1}=l) = \\ &\sum_{l=1}^{k-r}{{k-l-1} \choose {r-1}}p^rq^{k-l-r}pq^{l-1}=\\ &\left[\sum_{l=1}^{k-r}{{k-l-1} \choose {r-1}}\right]p^{r+1}q^{k-(r+1)} \end{aligned} \]

Note that we don’t need to work out the sum of binomial coefficients explicitly: it has to be whatever makes this a proper pdf, namely \({{k-1}\choose{r}}\), which is exactly the \(NB(p,r+1)\) coefficient.


For the mean and variance we find easily

\[ \begin{aligned} &E[X] = E\left[\sum_{j=1}^{r}Y_j\right]=\sum_{j=1}^{r}E[Y_j]=\frac{r}{p}\\ &var(X) = \sum_{j=1}^{r}var(Y_j)=\frac{rq}{p^2}\\ &\psi_X(t) = \left[\psi_{Y_1}(t)\right]^r =\left(\frac{pe^{t}}{1-qe^{t}}\right)^r\\ \end{aligned} \]
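A small simulation check of the mean and variance formulas, using the fact shown above that X is a sum of r independent geometrics; rgeom counts failures, hence the +1, and the values of r, p and the seed are arbitrary:

set.seed(111)  # arbitrary seed
r <- 5; p <- 0.25
x <- replicate(1e5, sum(rgeom(r, p) + 1))  # 100000 simulated NB(p,r) values
c(mean(x), r/p)  # both should be close to 20
c(var(x), r*(1-p)/p^2)  # both should be close to 60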

Next we have a theorem that connects the binomial and the negative binomial distributions:

Theorem:

Say \(X \sim Bin(n,p)\) and \(Y \sim NB(p,r)\). Then

\[F_X(r-1)=1-F_Y(n)\]

proof (probabilistic)

\(F_X(r-1) = P(X<r)\) is the probability of fewer than r successes in n trials

\(1-F_Y(n) = P(Y>n)\) is the probability that more than n trials are needed for the \(r^{th}\) success, that is, of fewer than r successes in the first n trials

same thing!
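A quick numerical check with arbitrary n, r and p; since pnbinom counts failures, \(F_Y(n)=\) pnbinom(n-r, r, p):

n <- 20; r <- 5; p <- 0.1  # arbitrary values
c(pbinom(r-1, n, p), 1 - pnbinom(n-r, size=r, prob=p))  # F_X(r-1) and 1-F_Y(n)
## [1] 0.9568255 0.9568255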

Hypergeometric Distribution

One of the problems with the use of the binomial distribution in real life is that most sampling is done in such a way that the same object can not be selected a second time.

Example (2.1.7)

In a survey of 100 likely voters, 45 said they would vote for party AA.

Obviously this “selecting” would not allow the same person to be chosen twice.

If the selection is done without repetition, we get to the hypergeometric distribution. Of course, if the sample size is small compared to the population size the probabilities are almost the same.

In general the hypergeometric can be described as follows:

Definition (2.1.8)

Consider an urn containing N+M balls, of which N are white and M are black. If a sample of size n is chosen at random and without replacement, and if X is the number of white balls chosen, then X has a hypergeometric distribution with parameters (n,N,M).

\(X \sim HG(n,N,M)\)

We have

\[P(X=k) = \frac{{{N}\choose{k}}{{M}\choose{n-k}}}{ {{N+M}\choose{n}} };\ \max(0,n-M)\le k\le\min(n,N)\]

because there are \({{N}\choose{k}}\) ways to select k objects from N without repetition and without order. Likewise there are \({{M}\choose{n-k}}\) selections of n-k out of M and \({{N+M}\choose{n}}\) selections of n objects from N+M.


Does this define a proper pdf? This is a consequence of

Theorem(2.1.8a)

Vandermonde

For any \(N,M,n \ge 0\) we have

\[{{N+M}\choose{n}} = \sum_{k=0}^n {{N}\choose{k}}{{M}\choose{n-k}}\]

We will give two very different proofs of this identity:

proof (Combinatorial)

Consider an urn with N white and M black balls. We randomly select n of these without order and without repetition. How many different arrangements are there? Clearly the answer is \({{N+M}\choose{n}}\)

How many arrangements are there if we want exactly k white balls (and therefore n-k black balls)? Clearly the answer is \({{N}\choose{k}}{{M}\choose{n-k}}\). But the first count is the same as the second summed over all k between 0 and n, and the identity follows.

proof (Algebraic)

First note that

\[ \begin{aligned} &\left(\sum_{i=0}^N a_ix^i\right)\left(\sum_{j=0}^M b_jx^j\right) = \\ &\left(\sum_{i,j} a_ib_jx^{i+j}\right) = \\ &a_0b_0+\left(a_0b_1+a_1b_0\right)x + \\ &\left(a_0b_2+a_1b_1+a_2b_0\right)x^2+ ... +\\ &\left(\sum_{k=0}^ra_kb_{r-k}\right)x^r+...+\\ &\left(\sum_{k=0}^{N+M}a_kb_{N+M-k}\right)x^{N+M} =\\ &\sum_{r=0}^{N+M}\left(\sum_{k=0}^ra_kb_{r-k}\right)x^r \end{aligned} \] where we define \(a_i=0\) for \(i=N+1,..,N+M\) and \(b_j =0\) for \(j=M+1,..,N+M\).

Now using the binomial formula we have

\[ \begin{aligned} &\sum_{r=0}^{N+M}{{N+M}\choose r}x^r = \\ &(1+x)^{N+M} = \\ &(1+x)^{N}(1+x)^{M} = \\ &\left(\sum_{i=0}^N {N\choose i}x^i\right)\left(\sum_{j=0}^M {M\choose j}x^j\right) =\\ &\sum_{r=0}^{N+M}\left(\sum_{k=0}^r{N\choose k}{M\choose {r-k}}\right)x^r \end{aligned} \]

and the identity follows from the fact that if two polynomials are equal for all x they have to have the same coefficients.

This identity is named after the French mathematician [Alexandre-Théophile Vandermonde](https://en.wikipedia.org/wiki/Alexandre-Th%C3%A9ophile_Vandermonde) (1772), famous mostly for his matrix. It really should be named after [Zhu Shijie](https://en.wikipedia.org/wiki/Zhu_Shijie), who stated it much earlier, in 1303.


To find the expected value of X we need the following identity:

\[{{n}\choose{k}} = \frac{n}{k} {{n-1}\choose{k-1}}\] and so

\[ \begin{aligned} &E[X] =\sum_{k=0}^N k \frac{{{N}\choose{k}}{{M}\choose{n-k}}}{{{N+M}\choose{n}}}= \\ &\sum_{k=1}^N k \frac{\frac{N}{k}{{N-1}\choose{k-1}}{{M}\choose{(n-1)-(k-1)}}}{\frac{N+M}{n}{{N+M-1}\choose{n-1}}} = \\ &\frac{nN}{N+M}\sum_{k=1}^N \frac{{{N-1}\choose{k-1}}{{M}\choose{(n-1)-(k-1)}}}{{{N+M-1}\choose{n-1}}} = \\ &\frac{nN}{N+M}\sum_{l=0}^{N-1} \frac{{{N-1}\choose{l}}{{M}\choose{(n-1)-l}}}{{{N+M-1}\choose{n-1}}} = \\ &\frac{nN}{N+M} \end{aligned} \]

similarly we find

\[var(X)=\frac{nNM}{(N+M)^2}\left(1-\frac{n-1}{N+M-1}\right)\]

Note that as the population size gets large, if \(\frac{N}{N+M}\rightarrow p\), then

\[ \begin{aligned} &E[X] = \frac{nN}{N+M}\rightarrow np\\ &var(X)=\frac{nNM}{(N+M)^2}\left(1-\frac{n-1}{N+M-1}\right) = \\ &n\frac{N}{N+M}\frac{M}{N+M}\frac{N+M-n}{N+M-1} = \\ &n\frac{N}{N+M}\left(1-\frac{N}{N+M}\right)\frac{N+M-n}{N+M-1} \rightarrow \\ &np(1-p)\cdot1=np(1-p) \end{aligned} \] so, as one would expect, the mean and the variance approach those of the binomial distribution as the population size gets large. As a ballpark one uses the hypergeometric if the sample size is more than \(10\%\) of the population size.

Example (2.1.9)

say our company has a pool of 100 candidates for the job, 10 of whom are suitable for hiring. If they interview 50 of the 100, what is the probability that they will fill the 5 positions?

Here \(X \sim HG(50,10,90)\) and so

\[P(X \ge 5) = 1- P(X \le 4) = 1 - 0.3703 = 0.6297\]

using the binomial distribution for our example we would have found \(P(X \ge 5) = 0.5688\), quite different from the hypergeometric. On the other hand if our candidate pool had 1000 applicants, 100 of whom are suitable we would have found \(P(X \ge 5) = 0.5731\).
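The R function for the hypergeometric cdf is phyper(x, N, M, n) (white balls, black balls, sample size), so the numbers above can be reproduced with

1 - phyper(4, 10, 90, 50)  # = 1 - 0.3703 = 0.6297
1 - pbinom(4, 50, 0.1)  # binomial instead: 0.5688
1 - phyper(4, 100, 900, 50)  # pool of 1000 with 100 suitable: 0.5731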

Poisson Distribution

A random variable X is said to have a Poisson distribution with rate \(\lambda\), (\(X \sim P( \lambda )\)) if

\[P(X=k)=\frac{\lambda^k}{k!}e^{-\lambda};k=0,1,...\]

this defines a proper pdf:

\[\sum_{k=0}^\infty \frac{\lambda^k}{k!}e^{-\lambda}=e^{\lambda}e^{-\lambda}=1\] for the mean and variance we have

\[ \begin{aligned} &E[X] = \sum_{k=0}^\infty k \frac{\lambda^k}{k!}e^{-\lambda} = \\ &e^{-\lambda}\sum_{k=1}^\infty \frac{\lambda^k}{(k-1)!} = \\ &e^{-\lambda}\lambda\sum_{k=1}^\infty \frac{\lambda^{k-1}}{(k-1)!} = \\ &e^{-\lambda}\lambda\sum_{l=0}^\infty \frac{\lambda^{l}}{l!} = \\ &e^{-\lambda}\lambda e^{\lambda}=\lambda\\ &\\ &E[X(X-1)] = \sum_{k=0}^\infty k(k-1) \frac{\lambda^k}{k!}e^{-\lambda} = \\ &e^{-\lambda}\sum_{k=2}^\infty \frac{\lambda^k}{(k-2)!} = \\ &e^{-\lambda}\lambda^2\sum_{l=0}^\infty \frac{\lambda^{l}}{l!} = \\ &e^{-\lambda}\lambda^2 e^{\lambda}=\lambda^2\\ &\\ &var(X)=E[X(X-1)]+E[X]-E[X]^2=\lambda^2+\lambda-\lambda^2=\lambda \end{aligned} \]

and for the mgf we find

\[\psi(t)=\sum_{k=0}^\infty e^{tk} \frac{\lambda^k}{k!}e^{-\lambda}=\sum_{k=0}^\infty \frac{(e^t\lambda)^k}{k!}e^{-\lambda}=\exp \left\{ \lambda (e^t-1)\right\} \]
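A quick numerical check of the mean, variance and mgf; \(\lambda\), t and the truncation point of the infinite sums are arbitrary choices:

lambda <- 4; t <- 0.3  # arbitrary values
k <- 0:200  # truncate the infinite sums; the tail is negligible here
c(sum(k*dpois(k, lambda)), lambda)  # E[X]
## [1] 4 4
c(sum(k^2*dpois(k, lambda)) - lambda^2, lambda)  # var(X)
## [1] 4 4
c(sum(exp(t*k)*dpois(k, lambda)), exp(lambda*(exp(t)-1)))  # mgf by summation and the closed form; both agree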

One way to motivate the Poisson distribution is as follows:

Theorem (2.1.10)

Say \(X\sim Bin(n,p)\) where n is large and p is small. That is, the number of trials is large but the success probability is small. Then \(X\) is approximately Poisson with rate \(\lambda = np\).

We will show two proofs, each of them illustrating a useful method:

proof 1

let \(B \sim Bin(n,p)\) and \(P \sim Pois( \lambda )\). A well-known result in calculus is that if \(a_n\rightarrow a\), then

\[\left(1+\frac{a_n}{n}\right)^n\rightarrow e^a\] and so

\[ \begin{aligned} &\psi_B(t) = \left(q+pe^t \right)^n = \\ &\left(1-\frac{np}{n}+\frac{npe^t}{n} \right)^n = \\ &\left(1+\frac{np(e^t-1)}{n} \right)^n \rightarrow \\ &\exp \left\{ \lambda (e^t-1)\right\}=\psi_P(t) \end{aligned} \]

and the result follows from theorem (1.10.9)

proof 2

There are recursion relationships for both the binomial and the Poisson distributions:

\[ \begin{aligned} &P(B=x)= {{n}\choose{x}}p^xq^{n-x}= \frac{n!}{(n-x)!x!}p^xq^{n-x}= \\ &\frac{n-x+1}{x}\frac{p}{q}\frac{n!}{(n-x+1)!(x-1)!}p^{x-1}q^{n-x+1} = \\ &\frac{n-x+1}{x}\frac{p}{q}P(B=x-1) \end{aligned} \] and

\[P(P=x) = \frac{\lambda^x}{x!}e^{-\lambda}=\frac{\lambda}{x}\frac{\lambda^{x-1}}{(x-1)!}e^{-\lambda}=\frac{\lambda}{x}P(P=x-1)\] Now when \(np\rightarrow \lambda\) we find

\[ \begin{aligned} &\frac{n-x+1}{x}\frac{p}{q}=\frac{np-p(x-1)}{x-px}\rightarrow \frac{\lambda}{x}\\ &P(B=x) \approx \frac{\lambda}{x}P(B=x-1)\\ &P(B=0) = (1-p)^n = (1- \frac{np}{n})^n\rightarrow e^{-\lambda}=P(P=0)\\ \end{aligned} \] so the approximation works for x=0, and then the recursion relationship assures that it works for all x as well.

Example (2.1.11)

say you drive from Mayaguez to San Juan. Assume that the probability that on one kilometer of highway there is a police car checking the speed is 0.04. What is the probability that you will encounter at least 3 police cars on your drive?

If we assume that the police cars appear independently (?) then X = # of police cars \(\sim Bin(180,0.04)\), so

\[P(X \ge 3) = 1 - \text{pbinom}(2,180,0.04) =1 - 0.0234 = 0.9766\]

On the other hand X is also approximately P(180*0.04) = P(7.2) and so

\[P(X \ge 3) = 1 - \text{ppois}(2,7.2) = 1 - 0.0254 = 0.9746\]
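In R:

1 - pbinom(2, 180, 0.04)  # exact binomial: 0.9766
1 - ppois(2, 180*0.04)  # Poisson approximation: 0.9746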

The main questions with approximations are always:

  1. how good is it?
  2. when does it work?

Here is another connection between the Poisson and the binomial distribution. To prove it we first need

Theorem

Let \(X \sim Pois( \lambda )\), \(Y \sim Pois( \mu )\) and \(X\) and \(Y\) independent. Then \(X+Y \sim Pois( \lambda + \mu )\).

proof

using moment generating functions we have

\[ \begin{aligned} &\psi_{X+Y}(t) = \psi_{X}(t)\psi_{Y}(t)=\\ &\exp \left\{ \lambda (e^t-1)\right\}\exp \left\{ \mu (e^t-1)\right\} = \\ &\exp \left\{ (\lambda+\mu) (e^t-1)\right\} \end{aligned} \]

Now:

Theorem (2.1.12)

Let \(X \sim Pois( \lambda ), Y \sim Pois( \mu )\) and X and Y independent. Then

\[X|X+Y=n \sim Bin\left(n, \frac{\lambda}{\lambda + \mu}\right)\]

proof:

\[ \begin{aligned} &P(X=k|X+Y=n) = \\ &\frac{P(X=k,X+Y=n)}{P(X+Y=n)} = \\ &\frac{P(X=k,Y=n-k)}{P(X+Y=n)} = \\ &\frac{P(X=k)P(Y=n-k)}{P(X+Y=n)} = \\ &\frac{\frac{\lambda^k}{k!}e^{-\lambda}\frac{\mu^{n-k}}{(n-k)!}e^{-\mu}}{\frac{(\lambda+\mu)^n}{n!}e^{-(\lambda+\mu)}} =\\ &\frac{n!}{(n-k)!k!}\frac{\lambda^k\mu^{n-k}}{(\lambda+\mu)^n} = \\ &{{n}\choose{k}}\left(\frac{\lambda}{\lambda+\mu}\right)^k\left(\frac{\mu}{\lambda+\mu}\right)^{n-k}=\\ &{{n}\choose{k}}\left(\frac{\lambda}{\lambda+\mu}\right)^k\left(1-\frac{\lambda}{\lambda+\mu}\right)^{n-k} \end{aligned} \]

Example (2.1.13)

say the number of men and women that come into a store during one day have Poisson distributions with rates 20 and 50, respectively. If a total of 100 people came to the store today, what is the probability that at most 25 were men?

If \(X\) is the number of men and \(Y\) the number of women, then

\[P(X\le 25|X+Y=100)=pbinom(25, 100, 20/(20+50))\]

pbinom(25, 100, 20/(20+50))
## [1] 0.2511095

Of course the assumption that men and women come to the store independently is probably questionable!

Multinomial Distribution

Let \(p_1 ,...,p_k\) be numbers with \(0 \le p_i \le 1\) and \(\sum p_i =1\). Then the random vector \((X_1 ,...,X_k)\) has a multinomial distribution with m trials if

\[P(X_1=x_1,...,X_k=x_k)=m!\prod_{i=1}^k\frac{p_i^{x_i}}{x_i!}\] if

\[x_i\in\{0,1,...\},\sum x_i=m\]

We write \((X_1 ,..,X_k) \sim M(m,p_1 ,..,p_k)\)

That this defines a proper pdf follows from the multinomial theorem:

\[\sum m!\prod_{i=1}^k\frac{p_i^{x_i}}{x_i!} = (p_1 +..+p_k)^m=1\]

Example (2.1.14)

we roll a fair die 100 times. Let \(X_1\) be the number of “1”s, \(X_2\) be the number of “2”s,.., \(X_6\) be the number of “6”s. Then

\[(X_1 ,..,X_6) \sim M(100,1/6,..,1/6)\]

Note: if k=2 we have \(x_1 +x_2 =m\), or \(x_2 =m-x_1\) and \(p_1 +p_2 =1\), so

\[ \begin{aligned} &P(X_1=x_1,X_2=x_2)=\\ &m!\frac{p_1^{x_1}}{x_1!}\frac{p_2^{x_2}}{x_2!} = \\ &m!\frac{p_1^{x_1}}{x_1!}\frac{(1-p_1)^{m-x_1}}{(m-x_1)!} = \\ &{{m}\choose{x_1}}p_1^{x_1}(1-p_1)^{m-x_1} \end{aligned} \] and so \(X_1 \sim Bin(m,p_1 )\). The multinomial distribution is therefore a generalization of the binomial distribution where each trial has k possible outcomes.
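We can check this against R’s dmultinom; the values of m, \(x_1\) and \(p_1\) are arbitrary:

m <- 10; x1 <- 3; p1 <- 0.4  # arbitrary values
c(dmultinom(c(x1, m-x1), prob=c(p1, 1-p1)), dbinom(x1, m, p1))  # multinomial with k=2 vs binomial
## [1] 0.2149908 0.2149908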

Theorem (2.1.15)

Let \((X_1 ,..,X_k) \sim M(m,p_1 ,..,p_k)\). Then the marginal distribution of \(X_i\) is \(Bin(m,p_i)\).

proof:

let’s denote by

\[B_x =\{(x_1 ,..,x_{i-1} ,x_{i+1} ,..,x_k) : x_1 +..+x_{i-1} +x_{i+1} +..+x_k =m-x\}\]

then:

\[ \begin{aligned} &f_{X_i}(x) =\sum_{B_x} f(x_1 ,..,x_{i-1} ,x,x_{i+1} ,..,x_k) =\\ &\sum_{B_x} m!\frac{p_i^{x}}{x!}\prod_{j=1,j\ne i}^k\frac{p_j^{x_j}}{x_j!} = \\ &\sum_{B_x} m!\frac{p_i^{x}}{x!}\prod_{j=1,j\ne i}^k\frac{p_j^{x_j}}{x_j!}\frac{(1-p_i)^{m-x}}{(1-p_i)^{m-x}} = \\ &\frac{m!}{(m-x)!x!}p_i^x(1-p_i)^{m-x}\sum_{B_x} (m-x)!\prod_{j=1,j\ne i}^k\frac{(p_j/(1-p_i))^{x_j}}{x_j!}=\\ & {{m}\choose{x}} p_i^x(1-p_i)^{m-x} \end{aligned} \] where the sum is 1 because we are summing over all possible values of a multinomial rv with \(m-x\) trials and probabilities \(p_j/(1-p_i)\), \(j\ne i\), or because we can use the multinomial theorem from calculus.

From this it follows that \(E[X_i]=mp_i\) and \(var(X_i)=mp_i(1-p_i)\)


Next we will find the moment generating function of a multinomial rv. Recall the definition of the mgf for random vectors:

\[\psi (t_1 ,..,t_k) = E[\exp(t_1 X_1 +..+t_k X_k )]\]

so

Theorem (2.1.17)

Say \((X_1 ,..,X_k) \sim M(m,p_1 ,..,p_k)\), then the mgf is given by

\[\left( e^{t_1}p_1+...+e^{t_k}p_k\right)^m\]

proof

\[ \begin{aligned} &\psi (t_1 ,..,t_k) = E[\exp(t_1 X_1 +..+t_k X_k )] = \\ &\sum \exp(t_1 x_1 +..+t_k x_k ) m!\prod_{i=1}^k\frac{p_i^{x_i}}{x_i!} = \\ &m!\sum \prod_{i=1}^k\frac{(e^{t_i}p_i)^{x_i}}{x_i!} = \\ &\left( e^{t_1}p_1+...+e^{t_k}p_k\right)^m \end{aligned} \]

Theorem (2.1.18)

Let \((X_1 ,..,X_k) \sim M(m,p_1 ,..,p_k)\). Then the conditional distribution of

\[(X_1 ,..,X_{i-1},X_{i+1},..,X_k)|X_i=x \sim M\left(m-x,\frac{p_1}{1-p_i},..,\frac{p_{i-1}}{1-p_i},\frac{p_{i+1}}{1-p_i},..,\frac{p_k}{1-p_i}\right)\]

proof

\[ \begin{aligned} &P(X_1=x_1 ,..,X_k=x_k|X_i=x_i) = \\ &\frac{P(X_1=x_1 ,..,X_k=x_k)}{P(X_i=x_i)} = \\ & \frac{m!\prod_j p_j^{x_j}/x_j!}{ {{m}\choose{x_i}}p_i^{x_i}(1-p_i)^{m-x_i} } = \\ &(m-x_i)!\prod_{j\ne i} \left(\frac{p_j}{1-p_i}\right)^{x_j}\frac{1}{x_j!} \end{aligned} \]

Theorem (2.1.19)

Let \((X_1 ,..,X_k) \sim M(m,p_1 ,..,p_k)\). Then

\[cov(X_i ,X_j )=-mp_i p_j\]

proof

we will use the mgf:

\[ \begin{aligned} &E[X_iX_j] = \frac{\partial^2\psi(t_1,..,t_k)}{\partial t_i\partial t_j}|_{\pmb{t}=0} = \\ &\frac{\partial^2}{\partial t_i\partial t_j}\left(\sum p_le^{t_l}\right)^m|_{\pmb{t}=0} = \\ &\frac{\partial}{\partial t_i}m\left(\sum p_le^{t_l}\right)^{m-1}p_je^{t_j}|_{\pmb{t}=0} = \\ &m(m-1)\left(\sum p_le^{t_l}\right)^{m-2}p_ie^{t_i}p_je^{t_j}|_{\pmb{t}=0} = \\ &m(m-1)p_ip_j\\ &\\ &cov(X_i,X_j) = m(m-1)p_ip_j-[mp_i][mp_j]=-mp_ip_j \end{aligned} \]

the fact that the covariance is always negative makes sense: if \(X_i\) is larger, \(X_j\) is more likely to be small because the sum of the \(X_i\)’s has to be m.

Calculating the correlation we get

\[ \begin{aligned} &cor(X_i,X_j) = \frac{cov(X_i,X_j)}{\sqrt{var(X_i)var(X_j)}}= \\ &\frac{-mp_ip_j}{\sqrt{mp_i(1-p_i)mp_j(1-p_j)}} = \\ &-\sqrt{\frac{p_ip_j}{(1-p_i)(1-p_j)}} \end{aligned} \]

and the somewhat surprising fact that the correlation does not depend on m!
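A small simulation check of the covariance and correlation formulas, using rmultinom; m, the \(p_i\) and the seed are arbitrary choices:

set.seed(2023)  # arbitrary seed
m <- 20; p <- c(0.2, 0.3, 0.5)
x <- rmultinom(1e5, size=m, prob=p)  # one simulated vector per column
c(cov(x[1, ], x[2, ]), -m*p[1]*p[2])  # both should be close to -1.2
c(cor(x[1, ], x[2, ]), -sqrt(p[1]*p[2]/((1-p[1])*(1-p[2]))))  # both should be close to -0.327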