Experiments in Plant Hybridization (1865) by Gregor Mendel is one of the most famous papers in all of science. His theory of genetics predicted that the numbers of smooth yellow, wrinkled yellow, smooth green and wrinkled green peas would be in the proportions 9:3:3:1. In one of his experiments he observed 315, 101, 108 and 32. Does this agree with his theory?
Again we use the likelihood ratio test. As with the test for independence we have a multinomial distribution \(Z\) with parameters \((n, p_1,..,p_k)\), and we assume \(n\) is known. The MLEs are the same as before, \(\hat{p}_i=z_i/n\).
Under the null hypothesis the proportions should be 9:3:3:1, so
\[H_0: p=(9/16, 3/16, 3/16, 1/16)\]
In this application the null hypothesis fixes the probabilities completely. Again we can apply the Taylor approximation to \(-2\log(LRT)\) and we find the chisquare statistic
\[X^2=\sum_{i=1}^k \frac{(O_i-E_i)^2}{E_i}\]
which has a chisquare distribution with \(k-1=4-1=3\) degrees of freedom (the \(-1\) comes from the restriction \(\sum p_i=1\)).
For Mendel's data we find
O <- c(315, 101, 108, 32)                # observed counts
E <- sum(O)*c(9/16, 3/16, 3/16, 1/16)    # expected counts under H0
chi2 <- sum((O-E)^2/E)                   # chisquare statistic
round(c(chi2, 1-pchisq(chi2, 3)), 2)     # statistic and p value
## [1] 0.47 0.93
With a p value of 0.93 there is no evidence against the null hypothesis; the observed counts are in very good agreement with Mendel's theory.
The chisquare statistic was already known in the mid 19th century but its distribution was derived by Karl Pearson in 1900. His argument was as follows:
- \(O_i\) is a sum of indicator random variables (\(X_j\) is of type \(i\) or not), so \(O_i\) has a binomial distribution.
- If \(n\) is large enough, \((O_i-E_i)/\sqrt{E_i} \approx N(0,1)\).
- Therefore \((O_i-E_i)^2/E_i \approx \chi^2(1)\).
- Finally \(\sum_{i=1}^k(O_i-E_i)^2/E_i \approx \chi^2(k-1)\); we lose one degree of freedom because there is one restriction, namely \(\sum O_i=n\).
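Pearson's argument is easy to check by simulation. The following is a minimal sketch (not from the original text): we draw multinomial samples of Mendel's size under the 9:3:3:1 proportions and compare the quantiles of the resulting chisquare statistics with those of a \(\chi^2(3)\) distribution.
p0 <- c(9, 3, 3, 1)/16
n <- 556                              # Mendel's sample size
B <- 1e4
x2 <- rep(0, B)
for(i in 1:B) {
  O <- rmultinom(1, n, p0)[, 1]       # simulated observed counts
  E <- n*p0                           # expected counts under H0
  x2[i] <- sum((O-E)^2/E)
}
round(rbind(simulated=quantile(x2, c(0.5, 0.9, 0.95, 0.99)),
            theory=qchisq(c(0.5, 0.9, 0.95, 0.99), 3)), 2)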
We have also seen in the section on the large sample theory of LRTs that the chisquare statistic is asymptotically equivalent to the likelihood ratio test statistic.
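As a quick numerical illustration (a sketch reusing Mendel's counts from above): for the multinomial problem \(-2\log(LRT)=2\sum O_i\log(O_i/E_i)\), and for these data it is very close to \(X^2\).
O <- c(315, 101, 108, 32)
E <- sum(O)*c(9/16, 3/16, 3/16, 1/16)
lrt <- 2*sum(O*log(O/E))      # -2 log(LRT), also known as the G statistic
X2 <- sum((O-E)^2/E)          # Pearson chisquare statistic
round(c(lrt, X2), 3)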
Often in Statistics we assume that the data was generated by a specific distribution, for example the normal. If we are not sure that such an assumption is justified we would like to test for this.
Say we have X1, .., Xn iid F, and we wish to test
\[H_0: F=N(0,1)\]
First notice that here the alternative hypothesis is
\[H_a: F \ne N(0,1)\]
or even simply left out. Either way it is a HUGE set, made up of all possible distributions other than N(0,1). This makes assessing the power of a test very difficult.
Another famous data set in statistics is the number of deaths from horsekicks in the Prussian army from 1875-1894:
kable.nice(head(horsekicks))
  | Year | Deaths |
---|---|---|
1 | 1875 | 3 |
2 | 1876 | 5 |
3 | 1877 | 7 |
4 | 1878 | 9 |
5 | 1879 | 10 |
6 | 1880 | 18 |
It has been hypothesized that this data follows a Poisson distribution. Let’s carry out a hypothesis test for this.
First of all, a Poisson distribution has a parameter, \(\lambda\). Clearly, even if the assumption of a Poisson distribution is correct, it will be correct only for some value of \(\lambda\). We reject the null if the chi-square statistic is large, so if we reject it for the value of \(\lambda\) that minimizes the chi-square statistic, we would also reject it for any other value of \(\lambda\).
The chisquare goodness-of-fit test is a large-sample test; it requires that none of the expected counts be too small. We deal with this by combining some categories. We will consider the cases 0-6, 7-9, 10-12 and over 12. Then
cells <- c(0:6, 7:9, 10:12, 13:100)    # the combined categories
O <- c(6, 4, 5, 5)                     # observed number of years in each category
chi2 <- function(l) {
  y <- 0*l                             # one chisquare value per lambda
  p <- rep(0, 4)
  for(i in seq_along(l)) {
    p[1] <- sum(dpois(0:6, l[i]))      # P(0 <= X <= 6)
    p[2] <- sum(dpois(7:9, l[i]))      # P(7 <= X <= 9)
    p[3] <- sum(dpois(10:12, l[i]))    # P(10 <= X <= 12)
    p[4] <- sum(dpois(13:100, l[i]))   # P(X > 12), truncated at 100
    E <- sum(O)*p                      # expected counts
    y[i] <- sum((O-E)^2/E)
  }
  y
}
l <- seq(9, 10, length=1000)
y <- chi2(l)
lhat <- l[which.min(y)]
ggplot(data.frame(l=l, y=y), aes(l, y)) +
geom_line(color="blue") +
geom_vline(xintercept = lhat)
lhat
## [1] 9.37037
round(c(chi2(lhat), 1-pchisq(chi2(lhat), 2)), 3)
## [1] 4.713 0.095
Under the null hypothesis the \(\chi^2\) statistic has a \(\chi^2\) distribution with m-k-1 degrees of freedom, where m is the number of classes and k is the number of parameters estimated from the data. So here we have m-k-1 = 4-1-1 = 2 d.f.
In the binning we have used, some of the E are a bit small. We could of course bin even further, but then we also lose even more information.
Notice that here we used an unusual estimation method, called minimum chi-square. Often in practice people use maximum likelihood; this, however, is wrong!
The adjustment of the degrees of freedom for the number of estimated parameters has an interesting history. It does not appear in Pearson's original derivation. In fact, following Pearson's logic there should be no need for this adjustment, because if the sample size is large enough any parameter should be estimated with sufficiently high precision. The need for the adjustment was recognized only 20 years after Pearson's original publication by none other than Ronald A. Fisher, and the adjusted statistic is now sometimes called the Fisher-Pearson statistic.
Let’s study this question for a bit. Say we want to test
\[H_0: F=Bin(m,p)\]
The following graphs show the histograms of the p values of 10000 simulated experiments. Clearly the test without the adjustment is wrong, even for large sample size.
X2 <- function(B=1e4, n=100, m=5, p=0.4) {
  A <- rep(0, B)
  for(i in 1:B) {
    x <- rbinom(n, m, p)
    E <- n*dbinom(0:m, m, mean(x)/m)    # expected counts, p estimated by mle
    O <- table(factor(x, levels=0:m))   # observed counts, including empty cells
    A[i] <- sum((O-E)^2/E)
  }
  A
}
A <- X2()
df <- data.frame(p1=1-pchisq(A, 6-1),
                 p2=1-pchisq(A, 6-1-1))
A[1:5]
## [1] 2.145990 5.577191 5.435651 2.255159 19.464419
bw <- 1/50
ggplot(df, aes(p1)) +
geom_histogram(aes(y = ..density..),
color = "black",
fill = "white",
binwidth = bw) +
labs(x = "x", y = "Density") +
labs(title = "No Adjustment")
bw <- 1/50
ggplot(df, aes(p2)) +
geom_histogram(aes(y = ..density..),
color = "black",
fill = "white",
binwidth = bw) +
labs(x = "x", y = "Density") +
labs(title = "With Adjustment")
Say we have \(X_1,..,X_n\sim Pois(\lambda)\).
The following is the histogram of 5000 simulated data sets with n=10000, \(\lambda=1.0\), and bins x=0, .., x=5, x>5. In each run \(\lambda\) is estimated by minimum chi-square.
The red curve is dchisq(x, 7-1) and the blue one is dchisq(x, 7-1-1).
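The exact code behind that figure is not shown here, but the following is a sketch of how such a simulation could be run under the description above (n=10000, \(\lambda=1\), bins x=0,..,5 and x>5, \(\lambda\) estimated by minimum chi-square; the function name chi2.pois is ad hoc).
B <- 5000; n <- 1e4
x2 <- rep(0, B)
chi2.pois <- function(l, O, n) {
  p <- c(dpois(0:5, l), 1 - ppois(5, l))   # cell probabilities for the 7 bins
  E <- n*p
  sum((O-E)^2/E)
}
for(i in 1:B) {
  x <- rpois(n, 1)
  O <- c(sapply(0:5, function(k) sum(x == k)), sum(x > 5))        # observed bin counts
  x2[i] <- optimize(chi2.pois, c(0.5, 2), O = O, n = n)$objective # minimum chi-square value
}
ggplot(data.frame(x = x2), aes(x)) +
  geom_histogram(aes(y = ..density..), color = "black", fill = "white", bins = 50) +
  stat_function(fun = dchisq, args = list(df = 7-1), color = "red") +
  stat_function(fun = dchisq, args = list(df = 7-1-1), color = "blue")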
Say we have a data set and we want to test whether it comes from a normal distribution. In order to use the \(\chi^2\) test we first need to bin the data. There are two basic strategies (a small sketch follows the list):

- use equal-size bins (with the exception of the first and the last)
- use adaptive bins chosen so that each bin has roughly the same number of observations
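Here is a minimal sketch of how the two binning strategies could be implemented (the data, the number of bins k, and all variable names are made up for illustration).
x <- rnorm(100, 10, 2)                 # example data
k <- 8                                 # number of bins, a choice
b <- seq(min(x), max(x), length = k+1)
bins1 <- c(-Inf, b[2:k], Inf)          # equal-width interior bins, open-ended ends
bins2 <- c(-Inf, quantile(x, (1:(k-1))/k), Inf)   # adaptive bins via sample quantiles
O1 <- table(cut(x, bins1))             # observed counts, strategy 1
O2 <- table(cut(x, bins2))             # observed counts, strategy 2 (roughly equal)
O1; O2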
Testing for normality is a very important problem, although because of simulation not quite as important today as it used to be. There are a number of tests available for this problem, most of them much better (that is, with higher power) than the chisquare test. Look, for example, at the Shapiro-Wilk test and the Anderson-Darling test.
A very good way to assess the distribution of a sample (such as normality) is to draw a graph specifically designed for this purpose, the probability plot. It plots the sample quantiles vs. the quantiles of the hypothesized distribution. If the data follows that distribution the resulting plot should be linear.
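For example, with ggplot2 a normal probability plot can be drawn as follows (a sketch with made-up data):
x <- rnorm(50, 10, 2)                  # example data
ggplot(data.frame(x=x), aes(sample=x)) +
  geom_qq() +                          # sample quantiles vs N(0,1) quantiles
  geom_qq_line(color="blue")           # reference line; roughly linear points support normality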
Say we have \(X_1, .., X_n\), continuous and independent random variables, and we wish to test
\[H_0: X_i\sim F\]
for all i
Above we talked about graphs that give us some idea whether the data really comes from a certain distribution. Now let's use the empirical distribution function. If the null hypothesis is true, then the empirical cdf should be close to the true one, that is, the "distance" between the two curves should be small.
How can we define this "distance"? In mathematics there are a number of possible definitions:
\[\int_{-\infty}^\infty \vert F(x)-\hat{F}(x)\vert dx\]
\[\int_{-\infty}^\infty \left( F(x)-\hat{F}(x)\right)^2 dx\]
\[\max \left\{\vert F(x)-\hat{F}(x)\vert : x\in R\right\}\]
We are going to consider the \(L^\infty\) norm here, so we have the test statistic
\[ D = \max \left\{|F(x)-\hat{F}(x)|:x \in R \right\} \]
This is called the Kolmogorov-Smirnov statistic.
At first glance it appears that computing D is hard: it requires finding the maximum of a function which is not differentiable. But inspection of the graphs (and a little calculation) shows that the maximum has to occur at one of the jump points of \(\hat{F}\), which in turn happen at the observations. So all we need to do is compare \(F\) to \(\hat{F}\) just before and just after each jump, that is, compute \(\vert F(X_{(i)})-(i-1)/n\vert\) and \(\vert F(X_{(i)})-i/n\vert\) for all i, as the sketch below shows.
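A sketch of computing D directly, here for \(H_0: F=N(0,1)\); the function name ks.D is ad hoc.
ks.D <- function(x, cdf=pnorm, ...) {
  n <- length(x)
  Fx <- cdf(sort(x), ...)               # F evaluated at the order statistics
  max(abs(Fx - (0:(n-1))/n),            # just before each jump of Fhat
      abs(Fx - (1:n)/n))                # at each jump of Fhat
}
x <- rnorm(100)
ks.D(x)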
Next we need the null distribution, that is, the distribution of D if the null hypothesis is true. The full derivation is rather lengthy and won't be done here; see for example J.D. Gibbons, Nonparametric Statistical Inference. The main result is that if F is continuous and X ~ F, then F(X) ~ U[0,1], and therefore the distribution of D does not depend on F; D is called a distribution-free statistic. Its distribution can be found by simply assuming that F is U[0,1].
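This makes it easy to find the null distribution of D by simulation. Here is a minimal sketch for one sample size (all choices here, such as B and n, are illustrative):
B <- 1e4; n <- 100
D <- rep(0, B)
for(i in 1:B) {
  u <- sort(runif(n))                            # under H0, F(X) ~ U[0,1]
  D[i] <- max(abs(u - (0:(n-1))/n), abs(u - (1:n)/n))
}
quantile(D, c(0.9, 0.95, 0.99))                  # simulated critical values for n=100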
The method is implemented in R in the routine ks.test, where x is the data set and y specifies the null hypothesis. For example y="pnorm" tests for the normal distribution. Parameters can be given as well. For example ks.test(x, "pnorm", 5, 2) tests whether \(X\sim N(5,2)\), that is, a normal with mean 5 and standard deviation 2.
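A short usage example (with simulated data, so the output will vary):
x <- rnorm(100, 5, 2)
ks.test(x, "pnorm", 5, 2)     # H0: X ~ N(5, 2), mean 5 and standard deviation 2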
Note that this implementation does not allow us to estimate parameters from the data. Versions of this test which allow such estimation for some of the standard distributions are known, but not part of R. We can of course use simulation to implement such tests.
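Here is a sketch of such a simulation-based test for the normal distribution, with mean and standard deviation estimated from the data (a Lilliefors-type test; the function name ks.sim is ad hoc):
ks.sim <- function(x, B=1000) {
  n <- length(x)
  D0 <- unname(ks.test(x, "pnorm", mean(x), sd(x))$statistic)
  D <- rep(0, B)
  for(i in 1:B) {
    z <- rnorm(n, mean(x), sd(x))                           # data generated under the fitted null
    D[i] <- ks.test(z, "pnorm", mean(z), sd(z))$statistic   # parameters re-estimated each time
  }
  c(D=D0, p.value=mean(D >= D0))                            # simulated p value
}
ks.sim(rnorm(50, 10, 2))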
It is generally recognized that the Kolmogorov-Smirnov test is much better than the chisquare test.
For our general discussion this test is interesting because it does not derive from any specific principle such as the likelihood principle. It is simply an idea (let’s compare the cdf under \(H_0\) with the empirical cdf) and a lot of heavy probability theory. Such methods are quite common in Statistics.