Simulation

Basic Idea

We already used simulation in a couple of situations. In this section we will take a closer look.

Example: Binomial proportion

Let’s start with a simple example: say we flip a coin 100 times and get 62 heads. Is this a fair coin?

Of course we have discussed this problem before, and there is a standard solution: we want to test

\[ H_0: \pi=0.5 \text{ vs. } H_a:\pi \ne 0.5 \]

The test is done with

binom.test(62, 100)
## 
##  Exact binomial test
## 
## data:  62 and 100
## number of successes = 62, number of trials = 100, p-value =
## 0.02098
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.5174607 0.7152325
## sample estimates:
## probability of success 
##                   0.62

and with a p-value of 0.021 we would reject the null hypothesis at the 5% level.

But let’s assume for the moment we do not know this solution. What could we do?

We can simulate flipping a fair coin and counting the number of heads with

smpl <- sample(c("Heads", "Tails"), size=100, replace=TRUE)
x <- sum(smpl=="Heads")
x
## [1] 45

Now we can do this many times:

B <- 1e4
x <- rep(0, B)
for(i in 1:B) {
  smpl <- sample(c("Heads", "Tails"), 
                 size=100, replace=TRUE)
  x[i] <- sum(smpl=="Heads")  
}
table(x)
## x
##  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50 
##   3   5  10  17  29  39  80 101 166 231 319 410 509 548 669 720 802 797 
##  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  68  69 
## 797 735 665 535 468 407 310 211 149  97  67  49  27  11   6   7   2   2

and we can find the p-value by counting the runs at least as extreme as the observed 62 heads (62 is 12 above the expected 50, so \(x\le 38\) or \(x\ge 62\)):

(sum(x<=38)+sum(x>=62))/B
## [1] 0.0207

So here we did the test via simulation, without any theory!
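
Incidentally, the number of heads in 100 flips of a fair coin has a Binomial(100, 0.5) distribution, so the same simulation can be done in a single call to rbinom:

# number of heads in each of the B runs, drawn directly from Binomial(100, 0.5)
x <- rbinom(1e4, 100, 0.5)
(sum(x<=38)+sum(x>=62))/1e4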

Simulation Error

Every time we run a simulation, the computer generates different data and so the answer comes out a bit different as well. But how different?

At least in one case there is a simple way to estimate this simulation error. In the example above each simulation run is essentially a Bernoulli trial (either \(x\le 38\) or \(x\ge 62\), or not). An approximate \(95\%\) confidence interval for the true proportion is given by

\[ \hat{\pi} \pm 2\sqrt{\hat{\pi}(1-\hat{\pi})/n} \]

Now in a simulation we generally have a large \(n\), so this approximation should be quite good.

Example: Binomial proportion, continued

0.02 + c(-2, 2)*sqrt(0.02*0.98/1e4)
## [1] 0.0172 0.0228

so the true p-value is between 0.017 and 0.023; in either case it is below 0.05 and we would reject the null.
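
This calculation comes up after every simulation, so it is convenient to wrap it in a small helper function (a sketch; the name sim.error is just for illustration):

# approximate 95% interval for a proportion estimated from B simulation runs
sim.error <- function(phat, B) 
  phat + c(-2, 2)*sqrt(phat*(1-phat)/B)
sim.error(0.02, 1e4)
## [1] 0.0172 0.0228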


Example: Exponential rate

The sample below is assumed to come from an exponential distribution with rate \(\lambda\). We wish to test

\[ H_0: \lambda=1 \text{ vs. } H_a:\lambda > 1 \]

kable.nice(matrix(exp.data, ncol=10, byrow=TRUE))
0.02 0.02 0.04 0.04 0.04 0.06 0.11 0.12 0.12 0.13
0.13 0.15 0.17 0.18 0.19 0.21 0.30 0.30 0.38 0.39
0.41 0.41 0.49 0.51 0.51 0.54 0.59 0.60 0.61 0.62
0.63 0.67 0.68 0.75 0.78 0.79 0.80 0.83 0.91 1.09
1.44 1.64 1.68 1.88 2.05 2.06 2.20 3.02 3.11 4.70

Here is a solution via simulation. We know from theory that the MLE of \(\lambda\) is \(1/\bar{X}\), so

B <- 1e4
# each row of sim.data is a sample of size n from Exp(1), the null distribution
sim.data <- matrix(rexp(length(exp.data)*B, 1), nrow=B)
# proportion of runs where the simulated MLE exceeds the observed one
sum(1/apply(sim.data, 1, mean)>1/mean(exp.data))/B
## [1] 0.0764

and so there is some weak evidence that the null is false.
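
Incidentally, there is also an exact answer here: under the null hypothesis \(\sum X_i \sim \text{Gamma}(n, 1)\), and a large \(\hat{\lambda}=1/\bar{X}\) corresponds to a small sum, so the p-value is a lower tail probability. A quick sketch:

# under H0 the sum of n Exp(1) observations is Gamma(n, 1);
# a large lambda-hat means a small sum, so we need the lower tail
pgamma(sum(exp.data), shape=length(exp.data), rate=1)

This should agree with the simulation answer to within simulation error.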

Example: Normal mean

Below we have data from a normal distribution, and we want to test

\[ H_0: \mu=10 \text{ vs. } H_a:\mu > 10 \]

kable.nice(matrix(norm.data , ncol=10, byrow=TRUE))
8.24 8.87 8.93 9.05 9.28 9.28 9.49 9.69 9.74 9.82
9.85 10.15 10.45 10.47 10.65 11.15 11.31 11.44 11.87 12.40

Again we want to use simulation and we can use the sample mean as our test statistic, but here we have an additional problem: we will of course generate data from a normal distribution with mean 10, but what should we use as the standard deviation? It is not defined by the null hypothesis.

There is an obvious answer: use the sample standard deviation. It is not clear, however, whether that is indeed legitimate.

B <- 1e4
n <- length(norm.data)
sd.data <- sd(norm.data)
sim.data <- matrix(rnorm(n*B, 10, sd.data), nrow=B)
sum(apply(sim.data, 1, mean)>mean(norm.data))/B
## [1] 0.3293

How does this compare to the standard answer?

t.test(norm.data, mu=10, alternative = "greater")$p.value
## [1] 0.3348097

Pretty close, certainly within simulation error.
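
In fact, we can check this with the sim.error function from above:

sim.error(0.3293, 1e4)

which gives roughly \((0.320, 0.339)\), and the t test answer 0.3348 is indeed inside this interval.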


Sometimes one varies the standard deviation a bit in the simulation step. R does not have a built-in routine for finding confidence intervals for a variance, but since \((n-1)s^2/\sigma^2 \sim \chi^2(n-1)\), here with \(n-1=19\) degrees of freedom, they are easy to find:

v <- var(norm.data)
lower <- v*19/qchisq(0.05/2, 19, 
                             lower.tail = FALSE)
upper <- v*19/qchisq(1-0.05/2, 19, 
                             lower.tail = FALSE)
sqrt(c(lower = lower, upper = upper))
##    lower    upper 
## 0.835776 1.605162

and so we can also run the simulation this way:

B <- 1e4
n <- length(norm.data)
# draw one standard deviation at random from the confidence interval
sd.sim <- runif(1, 0.836, 1.605)
sim.data <- matrix(rnorm(n*B, 10, sd.sim), nrow=B)
sum(apply(sim.data, 1, mean)>mean(norm.data))/B
## [1] 0.3491

In essence this has a bit of a Bayesian flavor: we just introduced a prior for \(\sigma\)!
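
Taken one step further, we could draw a fresh \(\sigma\) for every simulation run, so that the p-value averages over the prior. Here is a sketch of this variant:

B <- 1e4
n <- length(norm.data)
# a fresh sigma for every run, drawn from the uniform prior on the interval
sd.sim <- runif(B, 0.836, 1.605)
sim.data <- matrix(rnorm(n*B, 10, rep(sd.sim, each=n)), 
                   nrow=B, byrow=TRUE)
sum(apply(sim.data, 1, mean)>mean(norm.data))/B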

Permutation Tests

There is a class of methods essentially built on simulation. Here is an example.

Example: Equal means

Here are two data sets:

kable.nice(matrix(norm1.data , ncol=10, byrow=TRUE))
16.3 16.5 16.6 17.3 17.6 17.9 19.5 19.5 19.9 20.2
20.6 20.9 21.1 22.1 22.4 22.8 22.8 23.3 24.1 25.7
kable.nice(matrix(norm2.data , ncol=10, byrow=TRUE))
15.2 16.2 17.1 17.2 17.4 17.7 18.4 19.0 19.1 19.2
19.2 19.3 19.8 20.7 20.7 20.9 21.0 21.1 21.5 21.5
21.8 21.9 22.7 22.9 23.2 24.8 24.9 25.1 25.3 25.8
df <- data.frame(
  x=c(norm1.data, norm2.data),
  y=rep(c("1", "2"), c(20, 30)))
ggplot(df, aes(y, x)) +
  geom_boxplot()

Do they come from the same type of distribution but with different means?

So there is a distribution \(F\), and we can assume without loss of generality that \(E[X_F]=0\). There are \(\mu_1\) and \(\mu_2\) such that

\[ \begin{aligned} &X_1-\mu_1, \dots, X_n -\mu_1 \sim F \\ &Y_1-\mu_2, \dots, Y_m -\mu_2 \sim F \\ \end{aligned} \]

and we have the hypotheses

\[ H_0: \mu_1=\mu_2 \text{ vs. } H_a:\mu_1 \ne \mu_2 \]

A reasonable test statistic would be

\[ T=\frac{\bar{X}-\bar{Y}}{\sqrt{[(n-1)s_X^2+(m-1)s_Y^2]/(n+m-2)}} \]

because under the null \(E[\bar{X}-\bar{Y}]=0\) and the denominator is the usual estimator of the standard deviation (called the pooled standard deviation).

x <- norm1.data
y <- norm2.data
# here n-1 = 19, m-1 = 29 and n+m-2 = 49
T0 <- (mean(x)-mean(y))/sqrt((19*var(x)+29*var(y))/49)
T0
## [1] -0.1198418

Now the idea is as follows: under the null hypothesis all the X’s and Y’s are an independent sample from the same distribution. In this case the order is of no consequence, any reordering should give an equally valid answer:

z <- sample(c(norm1.data, norm2.data)) #permutation
x <- z[1:20]
y <- z[21:50]
(mean(x)-mean(y))/sqrt((19*var(x)+29*var(y))/49)
## [1] -0.3628009

This is a perfectly legitimate value of T IF the null hypothesis is true.

Let’s repeat this many times. In fact let’s write a function that does it:

perm.test <- function(x, y,  B = 1e4, Show=FALSE) {
  n <- length(x)
  m <- length(y)
  # test statistic for the observed data
  T0 <- (mean(x) - mean(y))/
      sqrt(((n-1)*var(x)+(m-1)*var(y))/(n+m-2))  
  xy <- c(x, y)
  T <- rep(0, B)
  for(i in 1:B) {
    z <- sample(xy)  # a random permutation of the combined data
    x <- z[1:n]
    y <- z[(n+1):(n+m)]
    T[i] <- (mean(x) - mean(y))/
      sqrt(((n-1)*var(x)+(m-1)*var(y))/(n+m-2))  
  }
  if(Show) {
    bw <- diff(range(T))/50 
    print(ggplot(data.frame(x=T), aes(x)) +
          geom_histogram(aes(y = ..density..),
             color = "black", 
             fill = "white", 
             binwidth = bw) +
             geom_vline(xintercept=T0, size=2, col="blue"))  
  }
  sum(abs(T)>abs(T0))/B  # two-sided p-value
}
perm.test(norm1.data, norm2.data, Show=TRUE)

## [1] 0.6822

and we see that the value of T for the real data is in no way unusual.

As a quick check that this actually works let’s do this again for some data where the means are known to be different:

perm.test(x=rnorm(20, 10, 5),
          y=rnorm(30, 15, 5),
          Show = TRUE)

## [1] 0.0023

In our case we also know that \(F\) is a normal distribution, so there is of course a classic solution, the so-called two-sample t test:

t.test(norm1.data, norm2.data)$p.value
## [1] 0.6807857

and notice that its p-value is almost the same as the permutation test's!

How good a test is this? Let’s find out:

pwr.sim <- function(mu2, n=20, m=30, B = 2500) {
  pvals <- matrix(0, B, 2)
  colnames(pvals) <- c("Permutation", "t test")
  for(i in 1:B) {
    # generate data under the alternative: mu_1 = 0, mu_2 = mu2
    x <- rnorm(n)
    y <- rnorm(m, mu2)
    pvals[i, 1] <- perm.test(x, y, B=B)
    pvals[i, 2] <- t.test(x, y)$p.value
  }
  pvals
}

Let’s do a whole power curve! This takes a while to run, though, so the result is saved as pwr.tbl.
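
For reference, here is a sketch of how such a table might be computed; the grid of \(\mu_2\) values is an assumption (the plotting code below only requires 100 rows with \(\mu_2\) in the first column):

mu2 <- seq(0, 1.5, length=100)   # assumed grid of mu_2 values
pwr.tbl <- cbind(mu2, 0, 0)
for(i in 1:100) {
  pvals <- pwr.sim(mu2[i])
  # power = proportion of runs with a p-value below 0.05
  pwr.tbl[i, 2:3] <- apply(pvals < 0.05, 2, mean)
}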

df <- data.frame(x=rep(pwr.tbl[, 1], 2),
       y=c(pwr.tbl[, 2], pwr.tbl[, 3]),
       Method=rep(c("Permutation", "t test"), each=100))
ggplot(data=df, aes(x, y, color=Method)) +
  geom_line()

Can’t tell the difference? In fact the two methods have just about the same power, even though one depends strongly on the normality assumption, whereas the other works without it.

This test was first discussed by Fisher in the 1930s, but until fairly recently it was not computationally feasible. Nowadays it should be considered the go-to test for this kind of situation.


Let’s say we have \(\mu_1=0, \sigma_1=\sigma_2=1, n=m=50\). We will test at the \(5\%\) level and we want the test to have a power of \(95\%\). How large does \(\mu_2\) have to be?

In the case of the t test this can be done analytically. There is also an R routine:

power.t.test(n=50, power=0.95)
## 
##      Two-sample t test power calculation 
## 
##               n = 50
##           delta = 0.7281209
##              sd = 1
##       sig.level = 0.05
##           power = 0.95
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

and we see that \(\mu_2=0.728\).

How about the permutation test? Here we need to use a numerical method, say the bisection algorithm. But there is an issue: for a fixed \(\mu_2\) we need to use simulation to find the corresponding power, and when we get close to the correct value of \(\mu_2\) the simulation might yield a power different from 0.95 simply because of simulation error. Here is what we can do: according to the discussion above the error is

\[ \begin{aligned} &2\sqrt{\hat{\pi}(1-\hat{\pi})/B} = \\ &2\sqrt{0.95(1-0.95)/B} = \\ & 0.436/\sqrt{B} \\ \end{aligned} \]

Say we use \(B=500\). If we try a \(\mu_2\) and the simulation gives a power within \(0.95\pm 0.436/\sqrt{500}=(0.93, 0.97)\), we can stop:

low <- 0
high <- 1
repeat {
  mid <- (low+high)/2
  # estimated power of both tests at mu_2 = mid
  tmp <- apply(pwr.sim(mid, n=50, m=50, B=500), 2, 
        function(x) {sum(x<0.05)})/500
  print(round(c(mid, tmp), 3))
  # stop once the permutation test's power is within simulation error of 0.95
  if(tmp[1]>0.93 & tmp[1]<0.97) break
  if(tmp[1]<0.93) low <- mid
  else high <- mid
}
##             Permutation      t test 
##       0.500       0.656       0.660 
##             Permutation      t test 
##       0.750       0.976       0.972 
##             Permutation      t test 
##       0.625       0.880       0.880 
##             Permutation      t test 
##       0.688       0.930       0.926 
##             Permutation      t test 
##       0.656       0.902       0.906 
##             Permutation      t test 
##       0.672       0.924       0.936 
##             Permutation      t test 
##       0.680       0.928       0.922 
##             Permutation      t test 
##       0.684       0.946       0.942
mid
## [1] 0.6835938