Distributions Arising in Statistics

In this chapter we briefly discuss some distributions that often come up in Statistics.

Chi-square Distribution

Definition

A random variable X is said to have a chi-square distribution with n degrees of freedom, X~χ2(n), if it has density

f(x) = 1/(Γ(n/2)2^(n/2)) x^(n/2-1) e^(-x/2), x>0

Of course we have X~Γ(n/2,2), a Gamma with shape n/2 and scale 2
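As a quick numerical check of this identity, here is a minimal sketch (assuming numpy and scipy are available; the choice n=5 and the grid are arbitrary):

    import numpy as np
    from scipy import stats

    n = 5                                  # degrees of freedom (arbitrary)
    x = np.linspace(0.1, 20, 200)
    # the chi-square(n) density coincides with the Gamma(n/2, scale=2) density
    print(np.allclose(stats.chi2.pdf(x, n), stats.gamma.pdf(x, n/2, scale=2)))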

Say Z~N(0,1) and let X=Z2. Then if x>0

P(X≤x) = P(-√x ≤ Z ≤ √x) = 2P(Z≤√x) - 1

and differentiating with respect to x gives

f(x) = 2φ(√x)/(2√x) = 1/√(2πx) e^(-x/2)

where φ is the standard normal density. This is the χ2(1) density, so X~χ2(1)

We have the following properties of a χ2:

Theorem

Say X~χ2(n), Y~χ2(m) and X and Y are independent. Then

E[X] = n, Var[X] = 2n and X+Y ~ χ2(n+m)

From this theorem it follows that if Z1, .., Zn are iid N(0,1), then ∑Zi2 ~ χ2(n)
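A small simulation makes this plausible (a sketch assuming numpy and scipy; n=4 and the sample size are arbitrary):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 4                                   # arbitrary degrees of freedom
    z = rng.standard_normal((100_000, n))   # rows are samples (Z1, .., Zn)
    s = (z**2).sum(axis=1)                  # sum of squared standard normals
    # Kolmogorov-Smirnov test against chi-square(n); a large p-value is consistent
    print(stats.kstest(s, 'chi2', args=(n,)))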

Definition

Say X1, .., Xn are a sample, then the sample variance is defined by

S2 = 1/(n-1) ∑ (Xi - X̅)2

Theorem

Say X1, .., Xn are iid N(μ,σ). Then (n-1)S2/σ2 ~ χ2(n-1)

Note: we use "n-1" instead of "n" because then S2 is an unbiased estimator of σ2, that is E[S2]=σ2

Note: another important feature here is that X̅ and S2 are independent, X̅ ⊥ S2
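Both notes can be checked by simulation. The sketch below (assuming numpy; μ, σ and n are arbitrary) estimates E[S2] with the n-1 and the n divisor, and the correlation between X̅ and S2:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, n = 3.0, 2.0, 10
    x = rng.normal(mu, sigma, size=(50_000, n))  # 50000 samples of size n
    xbar = x.mean(axis=1)
    s2 = x.var(axis=1, ddof=1)                   # divide by n-1
    print(s2.mean(), sigma**2)                   # close: E[S2] = sigma^2
    print(x.var(axis=1, ddof=0).mean())          # divide by n: biased low
    print(np.corrcoef(xbar, s2)[0, 1])           # near 0, consistent with Xbar and S2 independent

Of course a near-zero correlation does not prove independence, but it is what the theorem leads us to expect.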

Student's t Distribution (by W.S. Gosset)

Definition

Say X~N(0,1), Y~χ2(n) and X ⊥ Y. Then

Tn = X/√(Y/n)

has a Student's t distribution with n degrees of freedom, Tn~t(n), that is

f(t) = Γ((n+1)/2)/(√(nπ)Γ(n/2)) (1+t^2/n)^(-(n+1)/2)

Note Tn → N(0,1) in distribution as n → ∞

We have E[Tn]=0 if n>1 (the mean does not exist if n=1) and Var[Tn]=n/(n-2) if n>2 (the variance is infinite if 1<n≤2)
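These facts are easy to check numerically (a sketch assuming scipy; the degrees of freedom are arbitrary):

    from scipy import stats

    for n in (3, 5, 30):
        # the variance of t(n) matches n/(n-2) when n > 2
        print(n, stats.t(n).var(), n / (n - 2))
    # for large n the t density is already close to the standard normal density
    print(stats.t(200).pdf(1.5), stats.norm.pdf(1.5))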

The importance of this distribution in Statistics comes from the following:

Theorem

Say X1, .., Xn are iid N(μ,σ). Then

√n(X̅-μ)/S ~ t(n-1)

Note: S is of course an estimate of the population standard deviation, so this formula tries to standardize the sample mean without knowing the exact standard deviation.
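A simulation illustrates the theorem (a sketch assuming numpy and scipy; μ, σ and n are arbitrary choices):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    mu, sigma, n = 5.0, 3.0, 8
    x = rng.normal(mu, sigma, size=(100_000, n))
    T = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)
    # compare the simulated statistics to t(n-1); a large p-value is consistent
    print(stats.kstest(T, 't', args=(n - 1,)))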

An important special case is X~t(1). This is also called the Cauchy distribution. Notice it has no finite mean (and of course then also no finite variance). It has density

f(x) = 1/(π(1+x^2))
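The sketch below (assuming numpy and scipy) verifies that t(1) has exactly this density and shows a consequence of the missing mean: running averages of Cauchy observations never settle down the way the law of large numbers would suggest.

    import numpy as np
    from scipy import stats

    x = np.linspace(-5, 5, 101)
    # t(1) has exactly the Cauchy density 1/(pi(1+x^2))
    print(np.allclose(stats.t.pdf(x, 1), 1 / (np.pi * (1 + x**2))))

    # with no finite mean, the running averages keep wandering
    c = stats.cauchy.rvs(size=100_000, random_state=np.random.default_rng(3))
    running = np.cumsum(c) / np.arange(1, c.size + 1)
    print(running[[99, 9_999, 99_999]])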

Snedecor's F Distribution

Definition

X is said to have an F distribution with n and m degrees of freedom, X~F(n,m), if it has density

f(x) = Γ((n+m)/2)/(Γ(n/2)Γ(m/2)) (n/m)^(n/2) x^(n/2-1) (1+nx/m)^(-(n+m)/2), x>0

Theorem

Say X~χ2(n), Y~χ2(m) and X and Y are independent. Then (X/n)/(Y/m)~F(n,m)


We have E[F] = m/(m-2) if m>2 (note that n does not appear!)
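Both the theorem and the formula for the mean can be checked by simulation (a sketch assuming numpy and scipy; n=4 and m=10 are arbitrary, so m/(m-2)=1.25):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n, m = 4, 10
    x = stats.chi2.rvs(n, size=200_000, random_state=rng)
    y = stats.chi2.rvs(m, size=200_000, random_state=rng)
    F = (x / n) / (y / m)
    # simulated mean vs m/(m-2) vs scipy's formula; all close to 1.25
    print(F.mean(), m / (m - 2), stats.f.mean(n, m))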

Theorem

Say X1, .., Xn are iid N(μx,σx) and Y1, .., Ym are iid N(μy,σy). Furthermore Xi ⊥ Yj for all i and j. Then

(Sx2/σx2)/(Sy2/σy2) ~ F(n-1,m-1)

where Sx2 and Sy2 are the two sample variances.
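Again a simulation illustrates the result (a sketch assuming numpy and scipy; all the means, standard deviations and sample sizes are arbitrary choices):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n, m, sx, sy = 6, 11, 2.0, 0.5
    x = rng.normal(0.0, sx, size=(100_000, n))
    y = rng.normal(1.0, sy, size=(100_000, m))
    ratio = (x.var(axis=1, ddof=1) / sx**2) / (y.var(axis=1, ddof=1) / sy**2)
    # compare to F(n-1, m-1); a large p-value is consistent
    print(stats.kstest(ratio, 'f', args=(n - 1, m - 1)))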

Order Statistics

Many statistical methods, for example the median and the range, are based on an ordered data set. In this section we study some of the common distributions of order statistics.

One of the difficulties when dealing with order statistics is ties, that is the same observation appearing more than once. Ties should only occur for discrete data, because for continuous data the probability of a tie is zero. They may happen anyway because of rounding, but we will ignore them in what follows.

Say X1, .., Xn are iid with density f. Then X(i), the ith order statistic, is the ith smallest observation, so that X(1) < ... < X(i) < ... < X(n)

Note X(1) = min {Xi} and X(n) = max {Xi}

Let's find the pdf of X(i). For this let Y be a r.v. that counts the number of Xj ≤ x for some fixed number x. We can think of Y as the number of "successes" in n independent Bernoulli trials with success probability p = P(Xj ≤ x) = F(x) for j=1,..,n. So Y~B(n,F(x)). Note also that the event {Y≥i} means that at least i observations are less than or equal to x, so the ith smallest is less than or equal to x. Therefore

P(X(i) ≤ x) = P(Y ≥ i) = ∑k=i..n (n choose k) F(x)^k (1-F(x))^(n-k)

Taking derivatives one can show that

fX(i)(x) = n!/((i-1)!(n-i)!) F(x)^(i-1) (1-F(x))^(n-i) f(x)


Example: Say X1, .., Xn are iid U[0,1]. Then for 0<x<1 we have f(x)=1 and F(x)=x. Therefore

fX(i)(x) = n!/((i-1)!(n-i)!) x^(i-1) (1-x)^(n-i)

that is X(i) ~ Beta(i, n-i+1)
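The Beta identification is easy to check by simulation (a sketch assuming numpy and scipy; the sample size n=10 and the index i=3 are arbitrary):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    n, i = 10, 3                             # arbitrary sample size and index
    u = np.sort(rng.uniform(size=(100_000, n)), axis=1)
    xi = u[:, i - 1]                         # the ith order statistic of each row
    # the density above is the Beta(i, n-i+1) density
    print(stats.kstest(xi, stats.beta(i, n - i + 1).cdf))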

Empirical Distribution Function

The empirical distribution function of a sample X1, .., Xn is defined as follows:

F̂(x) = #{i : Xi ≤ x}/n

so it is the sample equivalent of the regular distribution function:

• F(x)=P(X≤x) is the probability that the rv X is less than or equal to x

• F̂(x) is the proportion of the observations X1, .., Xn that are less than or equal to x

The empirical distribution function is very important in Statistics.

Example: Say we have data

0.36 0.37 0.37 0.46 0.47 0.52 0.54 0.67 0.96 0.98

then the edf is a step function that jumps by 1/10 at each observation (and by 2/10 at 0.37, which appears twice).
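To make this concrete, here is a minimal sketch (assuming numpy) that evaluates the edf of these data at a few points:

    import numpy as np

    data = np.array([0.36, 0.37, 0.37, 0.46, 0.47, 0.52,
                     0.54, 0.67, 0.96, 0.98])

    def edf(x):
        # proportion of observations less than or equal to x
        return np.mean(data <= x)

    for x in (0.3, 0.37, 0.5, 1.0):
        print(x, edf(x))   # e.g. edf(0.37) = 0.3, since three values are <= 0.37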

Here is the edf of a random sample of 100 from a N(0,1), together with the true cdf: