Verifying the Simulation

Let’s say we want to do a simulation study. For this we need to generate \(X_1, .., X_n\) from some distribution \(F\). We have written a routine to do that, but how can we be sure that our routine is correct? In this section we will discuss a number of methods for verifying that our simulation is doing the right thing.

Graphical Checks

The first idea is to plot a number of graphs, comparing the desired distribution with the simulated data. The first of these is just the histogram:

Example

Say we want to generate data from the distribution with
\[ f(x)=3.75(1-x)\sqrt{x} \text{, } 0<x<1 \] (That is actually very easy once you realize that \(F\) is the Beta(3/2, 2) distribution.)

x <- rbeta(1000, 3/2, 2)
hist(x, n = 50, freq = F)
f <- function(x) 3.75*(1-x)*sqrt(x)
curve(f, 0, 1, 
      lwd=2, col="blue", add = TRUE)

The histogram is a nice graph to check but it is not always clear how good a match we have, especially in the “tails” of the distribution.

Another useful graph is a plot of the empirical cdf vs. the true cdf. Of course for this we need to know the cdf:
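A quick integration gives the cdf explicitly; this is the expression used in the code below: \[ F(x)=\int_0^x 3.75(1-t)\sqrt{t}\, dt = 3.75\left(\frac{2}{3}x^{3/2}-\frac{2}{5}x^{5/2}\right) \text{, } 0<x<1 \]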

sb4 <- function (which = 1, n = 1e4) {
    x <- rbeta(n, 3/2, 2)
    if (which == 1) {
        # histogram of the simulated data with the true density on top
        hist(x, n = 100, freq = F)
        z <- seq(0, 1, 0.01)
        lines(z, 3.75*(1-z)*sqrt(z), lwd = 2)
    }
    if (which == 2) {
        # true cdf (solid curve) together with the empirical cdf (step function)
        x <- sort(x)
        t <- c(0:100)/100
        y <- 3.75 * (2/3 * t^(3/2) - 2/5 * t^(5/2))
        plot(t, y, ylim = c(0, 1), xlab = "x", 
             ylab = "", type = "l")
        segments(0, 0, x[1], 0)
        segments(x, c(0:(n - 1))/n, x, c(1:n)/n)
        segments(x[-n], c(1:(n - 1))/n, x[-1], 
                 c(1:(n - 1))/n)
        segments(x[n], 1, 1, 1)
    }
    if (which == 3) {
        # compare sample mean, variance and P(X<0.25) with their true values
        print("Means")
        print(c(mean(x), 3/7), 3)
        print("Variance")
        print(c(var(x), 3 * 8/49/9), 3)
        print("P(X<0.25)")
        print(c(length(x[x < 0.25])/n, 
          3.75*(2/3*0.25^(3/2) - 2/5 * 0.25^(5/2))), 3)
    }
}

Now

sb4(2)
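Base R also has the routine ecdf, so a quicker version of the same comparison might look like this (a sketch, overlaying the cdf derived above; the sample size is arbitrary):

x <- rbeta(1e4, 3/2, 2)
plot(ecdf(x), do.points = FALSE, main = "")      # empirical cdf of the simulated data
curve(3.75*(2/3*x^(3/2) - 2/5*x^(5/2)),          # true cdf
      from = 0, to = 1, add = TRUE, col = "blue", lwd = 2)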

Numerical Checks

This is the most obvious thing to do for discrete data: just compare the required probabilities with the observed relative frequencies:

Example

Does the R command sample actually work? Let’s see:

p <- c(1, 2, 5, 1, 2)
x <- sample(1:5, size=1e4, replace=TRUE, prob = p)
print(rbind(table(x)/1e4, p/sum(p)), 3)
##           1     2     3      4     5
## [1,] 0.0898 0.181 0.458 0.0864 0.185
## [2,] 0.0909 0.182 0.455 0.0909 0.182

If the rv takes infinitely many values we might have to combine “low-probability” cases.
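For instance, here is a sketch of how one might check rpois this way, with \(\lambda=2\) chosen arbitrarily and all values of 6 or more combined into one case:

lambda <- 2
x <- rpois(1e4, lambda)
x[x >= 6] <- 6                                   # combine the "low-probability" cases 6, 7, ... into one
p <- c(dpois(0:5, lambda), 1 - ppois(5, lambda)) # probabilities of the combined cases
print(rbind(table(factor(x, levels = 0:6))/1e4, p), 3)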

If the data is continuous we can still compute some useful numbers:

Example

let’s again use the example above:

sb4(3)
## [1] "Means"
## [1] 0.430 0.429
## [1] "Variance"
## [1] 0.0541 0.0544
## [1] "P(X<0.25)"
## [1] 0.261 0.266

Formal Tests

Finally we can do actual hypothesis tests, often called goodness-of-fit tests. In Statistics we assume that the data was generated by a specific distribution, for example the normal. If we are not sure that such an assumption is justified we would like to test for this.

Example

Say we have \(X_1, .., X_n\) iid \(F\), and we wish to test \(H_0: F=N(0,1)\).

  • Chisquare Goodness-of-fit Test

Consider the following result of a simulation study, based on 100 observations:

df <- data.frame(x=1:4, 
            True=c(0.31, 0.17, 0.05, 0.47),
            Simulation=c(0.3, 0.1, 0.12, 0.48))
kable.nice(df)
x True Simulation
1 1 0.31 0.30
2 2 0.17 0.10
3 3 0.05 0.12
4 4 0.47 0.48

Now, is this a good fit? Good enough that we can say our simulation generates the correct data? Obviously if 0.31 is close to 0.30, and so on, we did OK. So what we need is a formula that combines all the information above into one number, a small one if the fit is good and a large one if it is not. There are many ways to do this; the most famous is \[ X^2=\sum \frac{(O-E)^2}{E} \]

Here “O” stands for observed and “E” for expected. The formula uses the actual counts, not the relative frequencies, so the observed counts from the simulation are O = 30, 10, 12 and 48. The expected counts are calculated under the null hypothesis, that is assuming the true probabilities hold, so E = np = 31, 17, 5, 47. Therefore the test statistic is \[ X^2 = \\ (30-31)^2/31+(10-17)^2/17+(12-5)^2/5+(48-47)^2/47 =\\ 12.74 \] So is 12.74 “large” or “small”? Well, a famous theorem (due to Karl Pearson) says that under some conditions \(X^2\) has a chisquare distribution with k-1 degrees of freedom, where k is the number of “categories”, here 4. So the p value of the test would be

round(1-pchisq(12.74, 3), 4)
## [1] 0.0052

and if we use the usual 5% level we would reject the null hypothesis: this simulation does not generate the right data.
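As a quick cross-check, the same computation can be done with R’s chisq.test (a minimal sketch; O and p are the simulation counts and true probabilities from the table above):

O <- c(30, 10, 12, 48)                        # counts produced by the simulation
chisq.test(O, p = c(0.31, 0.17, 0.05, 0.47))  # same X^2 statistic and p value as above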

Example

A famous data set in statistics is the number of deaths from horsekicks in the Prussian army from 1875-1894:

kable.nice(horsekicks)
Year Deaths
1 1875 3
2 1876 5
3 1877 7
4 1878 9
5 1879 10
6 1880 18
7 1881 6
8 1882 14
9 1883 11
10 1884 9
11 1885 5
12 1886 11
13 1887 15
14 1888 6
15 1889 11
16 1890 17
17 1891 12
18 1892 15
19 1893 8
20 1894 4

It has been hypothesized that this data follows a Poisson distribution. Let’s carry out a hypothesis test for this.

First off, a Poisson distribution has a parameter, \(\lambda\). Clearly even if the assumption of a Poisson distribution is correct, it will be correct only for some value of \(\lambda\). What we need to do is to find the value of \(\lambda\) that minimizes \(X^2\). If we reject the null for that \(\lambda\), we would also reject it for any other one.

This is called the method of minimum chisquare. Note that it depends on the binning used.

The chisquare goodness-of-fit test is a large-sample test; it requires a certain minimal sample size. The usual condition is \(E>5\), although it is known that this is very conservative.

Let’s see:

table(horsekicks$Deaths)
## 
##  3  4  5  6  7  8  9 10 11 12 14 15 17 18 
##  1  1  2  2  1  1  2  1  3  1  1  2  1  1

Now these are the observed counts, not the expected, but at least we can get started with this. We will combine cases as follows:

df <- data.frame(Bin=c("0-6", "7-9", "10-12", ">12"),
                 Counts=c(6, 4, 5, 5))
kable.nice(df)
Bin Counts
1 0-6 6
2 7-9 4
3 10-12 5
4 >12 5

Now \(X^2\) as a function of \(\lambda\) is given by

x2 <- function(l) {
  # chisquare statistic as a function of lambda, for the binning above
  bin <- c(0, 6, 9, 12, 50)
  E <- 20*diff(ppois(bin, l))   # expected counts in the four bins
  O <- c(6, 4, 5, 5)            # observed counts from the table above
  sum((O-E)^2/E)
}
# grid search for the minimizing lambda
l <- seq(8.5, 10, length=250)
y <- 0*l
for(i in seq_along(l)) 
  y[i] <- x2(l[i])
l0 <- l[which.min(y)]
plot(l, y, type="l")
abline(v=l0)

round(l0, 2)
## [1] 9.37

so 9.37 is the value we should use.
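Instead of a grid search we could also let R’s optimize do the one-dimensional minimization; a quick sketch using the x2 function from above (the search interval is an arbitrary choice):

optimize(x2, interval = c(5, 15))$minimum  # should give essentially the same value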

A generalization of the theorem above says that under the null hypothesis the \(X^2\) statistic has a \(\chi^2\) distribution with k-m-1 degrees of freedom, where k is the number of classes and m is the number of parameters estimated from the data. So here we have k-m-1 = 4-1-1 = 2 d.f., and we find a p-value

round(1-pchisq(x2(9.37), 2), 3)
## [1] 0.095

indicating that the data might well come from a Poisson distribution.

Does it matter what method of estimation is used for the parameter? The answer is yes, and it has to be minimum chisquare, that is, minimizing the chisquare statistic. Note that in general this is NOT the same as maximum likelihood.

In the binning we have used, some of the E are a bit small. We could of course combine bins even further, but then we would also lose even more information. Instead we can use simulation to find the p value.
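Here is a sketch of how such a simulation might look, reusing the x2 function and the minimum-chisquare estimate l0 from above (the number of runs B and the use of optimize are choices made here; the simulated p value will vary a bit from run to run):

B <- 1000
x2.null <- rep(0, B)
for (i in 1:B) {
  x.sim <- rpois(20, l0)                  # 20 years of data generated under the fitted null
  O <- c(sum(x.sim <= 6), sum(x.sim > 6 & x.sim <= 9),
         sum(x.sim > 9 & x.sim <= 12), sum(x.sim > 12))
  chi2.fun <- function(l) {
    E <- 20*diff(ppois(c(0, 6, 9, 12, 50), l))
    sum((O - E)^2/E)
  }
  # re-estimate lambda by minimum chisquare for each simulated data set
  x2.null[i] <- optimize(chi2.fun, c(5, 15))$objective
}
# simulated p value: proportion of null runs with a larger statistic than the observed one
length(x2.null[x2.null > x2(l0)])/B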

Example

Say we have a data set and we want to test whether it comes from a normal distribution. In order to use the \(\chi^2\) test we first need to bin the data. There are two basic strategies:

  1. Use equal size bins (with the exception of the bins with small expected counts, often the first and the last)

  2. Use adaptive bins chosen so that each bin has roughly the same number of observations.

There is also the question of how many bins to use. It turns out that often a fairly small number (5 or 10) is best.
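Here is a sketch of the second strategy for the normal case; for simplicity the mean and standard deviation (10 and 2) are treated as known, so the degrees of freedom are k-1 (with estimated parameters they would be k-m-1):

x <- rnorm(100, 10, 2)                          # data to be tested
k <- 10                                         # number of bins
bins <- qnorm(seq(0, 1, length = k+1), 10, 2)   # equal-probability bins under H0
O <- table(cut(x, bins))                        # observed counts
E <- rep(length(x)/k, k)                        # expected counts, all equal by construction
chi2 <- sum((O - E)^2/E)
round(1 - pchisq(chi2, k-1), 3)                 # p value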


Testing for normality is a very important problem, although because of simulation not quite as important today as it used to be. There are a number of tests available for this problem, most of them much better (that is, with higher power) than the chisquare test. Look up, for example, the Shapiro-Wilk test and the Anderson-Darling test.
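The Shapiro-Wilk test, for example, is part of base R (a minimal sketch; the Anderson-Darling test can be found in add-on packages such as nortest):

shapiro.test(rnorm(50))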

A very good way to assess the distribution of a sample (such as normality) is to draw a graph specifically designed for this purpose, the probability plot. It plots the sample quantiles vs. the quantiles of the hypothesized distribution. If the data follows that distribution the resulting plot should be linear. In R we have the routine qqplot, and for the normal distribution specifically we have qqnorm. qqline adds a line passing through the first and third quartiles to help with reading the graph.

x <- rnorm(50)
qqnorm(x)
qqline(x)

Or using ggplot2:

library(ggplot2)
df <- data.frame(x=x)
ggplot(data=df, aes(sample=x)) +
           geom_qq() + geom_qq_line()
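The same idea works for any hypothesized distribution with qqplot; for example, to check the Beta(3/2, 2) simulation from the beginning of this section (a quick sketch):

x <- rbeta(1000, 3/2, 2)
qqplot(qbeta(ppoints(1000), 3/2, 2), x,          # theoretical vs. sample quantiles
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
abline(0, 1)                                     # points should scatter around this line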

Kolmogorov-Smirnov Goodness-of-Fit Test

Say we have \(X_1, .., X_n\), continuous and independent random variables, and we wish to test \(H_0: X_i \sim F\) for all i. To check this we draw the graph of the empirical vs. the hypothesized cdf. But how close do these have to be before we decide we have a good fit?

One idea is to look for the largest difference between the two curves. This is exactly what the next test, the Kolmogorov-Smirnov test, does. It uses the test statistic \[ D = \max \left\{ | F_n (x)-F(x) | \right\} \] At first glance it appears that computing D is hard: it requires finding the maximum of a function which is not differentiable. But inspection of the graphs (and a little calculation) shows that the maximum has to occur at one of the jump points, which in turn happen at the observations. So all we need to do is compare \(F(X_{(i)})\) with the values of \(F_n\) just before and just after each jump, that is with \((i-1)/n\) and \(i/n\).
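For a small sample and \(F\) the standard normal cdf this might look as follows (a sketch; the sample is arbitrary, and the last line checks the result against ks.test, discussed next):

x <- sort(rnorm(20))
n <- length(x)
Fx <- pnorm(x)                                      # hypothesized cdf at the observations
D <- max(abs(Fx - (0:(n-1))/n), abs(Fx - (1:n)/n))  # both sides of every jump
c(D, ks.test(x, "pnorm")$statistic)                 # the two values agree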

The method is implemented in R in the routine ks.test, where x is the data set and y specifies the null hypothesis. For example y="pnorm" tests for the normal distribution. Parameters can be given as well: for example ks.test(x, "pnorm", 5, 2) tests whether \(X \sim N(5,2)\).

Note that this implementation does not allow us to estimate parameters from the data. Versions of this test which allow such estimation are known for some of the standard distributions, but are not part of basic R. We can of course use simulation to implement such tests.
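Here is a sketch of such a simulation-based version for testing normality with mean and standard deviation estimated from the data (essentially the idea behind the Lilliefors test; the sample used and the number of runs B are arbitrary choices):

x <- rnorm(100, 5, 2)                            # data to be tested
D0 <- ks.test(x, "pnorm", mean(x), sd(x))$statistic
B <- 1000
D <- rep(0, B)
for (i in 1:B) {
  y <- rnorm(length(x), mean(x), sd(x))          # data generated under the fitted null
  D[i] <- ks.test(y, "pnorm", mean(y), sd(y))$statistic
}
length(D[D > D0])/B                              # simulated p value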

It is generally recognized that the Kolmogorov-Smirnov test is much better than the Chisquare test.

Human Perception of Randomness

Example

Consider the following two scatterplots. One of them was done by randomly picking points, the other is not quite so random. Which is which?
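One way such a pair of graphs might be produced (a sketch; this is just one possible construction): on the left truly uniform points, on the right a slightly jittered regular grid, whose points are spread far too evenly to be random:

par(mfrow = c(1, 2))
plot(runif(100), runif(100), pch = 19, xlab = "", ylab = "")   # truly random points
g <- expand.grid(x = (1:10 - 0.5)/10, y = (1:10 - 0.5)/10)     # regular 10x10 grid
plot(g$x + runif(100, -0.03, 0.03),                            # jittered grid:
     g$y + runif(100, -0.03, 0.03),                            # "too even" to be random
     pch = 19, xlab = "", ylab = "")
par(mfrow = c(1, 1))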

Example

A famous statistician used to do the following exercise with his classes: he had a bag with slips of paper. On each paper was either the word “Real” or “Fake”. He let each student pick a paper without him seeing it. Then he told them that at home that night they should read it. If their paper said “Real” they should get a coin, flip it 100 times and write down the sequence of heads and tails. If it said “Fake” they should just make up a sequence, but try to make it look as real as they could. The next day in class he went around, looked at each student’s sequence and guessed whether it was “Real” or “Fake”. He got it right most of the time. How did he do that?

He looked at the longest “run”, that is, the longest stretch of either heads or tails in a row. In 100 flips there should be at least one run of length 6 or longer, with probability of about 80%. Most students who fake it don’t put in runs anywhere near that long, feeling that this does not look random.
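This is easy to check by simulation; a quick sketch using rle to find the runs (the number of runs B is an arbitrary choice):

B <- 10000
longest <- rep(0, B)
for (i in 1:B) {
  flips <- sample(c("H", "T"), 100, replace = TRUE)
  longest[i] <- max(rle(flips)$lengths)   # length of the longest run of heads or tails
}
length(longest[longest >= 6])/B           # proportion of sequences with a run of 6 or more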

Example

Say we randomly select two points in the unit square. Is their distance also uniformly distributed?

The answer is no:

n <- 10000
x1 <- runif(n)
y1 <- runif(n)
x2 <- runif(n)
y2 <- runif(n)
z <- sqrt((x1 - x2)^2 + (y1 - y2)^2)
hist(z, breaks = 100)

Higher Dimensional Distributions

As we said earlier, everything gets much more difficult in higher dimensional space. The main problem is that the marginal distributions do not determine the joint distribution, except in the case of independence. As a first test, we can apply the ideas discussed above to the marginals, but if these pass the test we still can’t be sure that our simulation works.

As for formal tests, most of them do not generalize to higher dimensions, and those that seem to generalize run into the curse of dimensionality.

First, recall the following:

Points in d-dimensional Euclidean space can be described by their coordinates, written as d-tuples \((x_1,.., x_d)\).

A point is in the d-dimensional hypercube of side length s if \(|x_i|<s/2\) for all i. If s=1 it is called the unit hypercube.

A point is in the d-dimensional hypersphere of radius r if \[ \sum x_i^2<r^2 \] If r=1 it is called the unit hypersphere.

Example

What is the volume of the d-dimensional hypercube? Clearly the answer is \(s^d\), but consider what that means: if \(s<1\) then \(s^d \rightarrow 0\), and if \(s>1\) then \(s^d \rightarrow \infty\) as \(d \rightarrow \infty\). But the side length is fixed; it is the dimension that goes up!
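A quick numerical illustration, just evaluating \(s^d\) for a few dimensions:

d <- c(1, 10, 50, 100)
v <- rbind("s=0.9" = 0.9^d, "s=1.1" = 1.1^d)  # volumes of hypercubes with sides 0.9 and 1.1
colnames(v) <- paste0("d=", d)
signif(v, 3)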

Example

Let’s consider the following question: What is the probability that a point generated randomly in the unit square is actually in the unit circle?

In two dimensions we can do this analytically: the part of the unit square that lies inside the unit circle is a quarter circle with area \(\pi/4\), so the probability is \(\pi/4 \approx 0.785\). In higher dimensions, however, we need simulation:

n <- 10000
Probability <- rep(0, 10)
names(Probability) <- 1:10
for (k in 1:10) {
  # n points uniformly distributed in the hypercube [-1,1]^k
  x <- matrix(runif(n * k, -1, 1), n, k)
  # squared distance of each point from the origin
  d <- apply(x^2, 1, sum)
  # proportion of points that fall inside the unit hypersphere
  Probability[k] <- round(length(d[d < 1])/n, 3)
}
Probability
Probability
##     1     2     3     4     5     6     7     8     9    10 
## 1.000 0.789 0.532 0.301 0.161 0.079 0.039 0.014 0.007 0.002

What we find is very strange: for d=10 the probability is just 0.002, which seems to say that in 10-dimensional space almost all randomly chosen points are out near the edges!

We can also calculate the ratio of the volumes:
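The volume of the d-dimensional unit hypersphere is \(\pi^{d/2}/\Gamma(d/2+1)\) and the hypercube \([-1,1]^d\) has volume \(2^d\), so the ratio is easy to compute (a quick check against the simulation above):

d <- 1:10
round(pi^(d/2)/gamma(d/2 + 1)/2^d, 3)   # matches the simulated probabilities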

so the ratio goes to 0 although the volume of the hypercube goes to infinity, and even though the hypersphere touches the hypercube in 2d places!

Example

Multivariate normal distribution.

Let’s investigate the probability that a randomly chosen observation from a multivariate normal distribution is “out in the tails”, that is, some given distance from the mean. Let \(X^d \sim N(0, I)\) be the “standard” normal in \(R^d\); then the density of \(X^d\) is given by \[ f(x) = (2\pi)^{-d/2} \exp(-x'x/2) \] Let \(S_{0.01}\) be the set of all points where the value of the density is at least 1% of its highest value, which is of course attained at the mean. In one dimension we find:

x <- seq(0, 5, length=1000)
y <- dnorm(x)
# find the point where the density has dropped to 1% of its maximum; z is about 3.033
z <- x[which.min(abs(0.01*y[1]-y))]
curve(dnorm(x),-4,4)
abline(h=dnorm(c(0, 3.033)))
abline(v=c(-3.033, 3.033))

and so our intuition suggests that very few points are far from the mean. However, in d dimensions we have \[ f(x) < 0.01 f(0) \Leftrightarrow \exp(-x'x/2) < 0.01 \Leftrightarrow x'x > 2\log(100) \] Now \(x'x\) is the sum of squares of d independent standard normals and so has a \(\chi^2_d\) distribution, so the probability of falling outside \(S_{0.01}\) is \(P(\chi^2_d > 2\log 100)\). In R we find

d <- c(1, 5*1:5)
p <- round(1 - pchisq(2*log(100), d), 4)
names(p) <- d
p
##      1      5     10     15     20     25 
## 0.0024 0.1010 0.5123 0.8663 0.9803 0.9983

so in higher dimensions almost all the observations are out in the tails, almost none are close to the mean!

One consequence of the curse of dimensionality is that the chisquare goodness-of-fit test does not extend past 2 or 3 dimensions:

Example

Say we have observations \(X_1, .., X_n\) in \(R^d\) and we want to test whether they come from a uniform distribution, that is \[ f(x)=1 \text{ if } 0<x_i<1 \] for i = 1,..,d. First we need to bin the data. Let’s say we use the following binning: in each dimension we divide the interval into 10 equal-sized bins, 0-1/10, 1/10-2/10, .., 9/10-1. What is the probability that a randomly chosen observation from a uniform falls into one of these bins? Because the distribution is uniform the probability is the same for all bins, and we have \[ P(0<X_i<1/10, i=1,..,d) = \\ \prod P(0<X_i<1/10) = 1/10^d \] Recall that the requirement for the chisquare test is \(E>5\), so we need at least \(n=5 \cdot 10^d\) observations. For d=3 that means n=5,000, for d=5 it means n=500,000!

The required sample size grows exponentially with the dimension.