Conditioning and Importance Sampling

Conditioning

We have previously seen a famous formula for conditional expectations: \[ E \left\{ E[X|Y] \right\}=E[X] \]

Example

Say (X,Y) is a discrete rv with joint pmf given by

       y=0   y=1
x=0    0.1   0.0
x=1    0.1   0.2
x=2    0.0   0.6

So the marginal of X is

x    P(X=x)
0    0.1
1    0.3
2    0.6

and so \[ E[X]=0\cdot 0.1+1 \cdot 0.3+2 \cdot 0.6=1.5 \] Also the marginal of Y is

y    P(Y=y)
0    0.2
1    0.8

the conditional of \(X|Y=0\) is

x    P(X=x|Y=0)
0    0.5
1    0.5
2    0.0

so \[ E[ X|Y=0 ] =0 \cdot 0.5+1 \cdot 0.5+2 \cdot 0=0.5 \]

and the conditional of \(X|Y=1\) is

x    P(X=x|Y=1)
0    0.00
1    0.25
2    0.75

\[ E[X|Y=1]=0 \cdot 0+1 \cdot 0.25+2 \cdot 0.75=1.75. \]

Now let the rv \(Z=E[X|Y]\), then
z      P(Z=z)
0.50   0.2
1.75   0.8

and finally \[ E\left\{ E[X|Y] \right\} = E[Z]=0.5 \cdot 0.2+1.75 \cdot 0.8=1.5=E[X] \]

There is also a corresponding formula for the variance in terms of conditional quantities:

\[ Var[X]=E[Var(X|Y)]+Var[E(X|Y)] \] Let’s see: \[ \begin{aligned} &Var[X]=E[X^2]-E[X]^2 = \\ &0^2 \cdot 0.1+1^2 \cdot 0.3+2^2 \cdot 0.6 - 1.5^2 = 0.45 \end{aligned} \]

Now

\[ Var[E(X|Y)] = Var[Z] = 0.5^2 \cdot 0.2+1.75^2 \cdot 0.8 - 1.5^2 = 2.5 - 2.25 = 0.25 \]

Also Var[X|Y] is a rv (just like E[X|Y]) with

\[ \begin{aligned} &Var[X|Y=0] = E[X^2|Y=0] -E[X|Y=0]^2\\ &Var[X|Y=1] = E[X^2|Y=1] -E[X|Y=1]^2 \end{aligned} \]

so if we set \(Z_1 = Var[X|Y]\) we have

z       P(Z1=z)
0.2500  0.2
0.1875  0.8

and \[ E[Var(X|Y)] = E[Z_1] = 0.25 \cdot 0.2+0.1875 \cdot 0.8 = 0.2 \]

and so

\[ Var(E[X|Y])+E[Var(X|Y)] = 0.25+0.2 = 0.45 = Var[X] \]
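These calculations are easy to verify in R; here is a quick check based on the joint table above (x values 0, 1, 2 as rows, y values 0, 1 as columns):

p <- rbind(c(0.1, 0.0), c(0.1, 0.2), c(0.0, 0.6))  # joint pmf
x <- 0:2
pX <- rowSums(p); pY <- colSums(p)             # marginals of X and Y
EX <- sum(x*pX); VX <- sum(x^2*pX) - EX^2      # E[X] and Var[X]
EXgY <- colSums(x*p)/pY                        # E[X|Y=y] for y=0,1
VXgY <- colSums(x^2*p)/pY - EXgY^2             # Var[X|Y=y]
c(VX, sum(VXgY*pY) + sum(EXgY^2*pY) - sum(EXgY*pY)^2)  # both are 0.45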

So, how can we use this formula to reduce the variance of our simulation estimators? Because \(Var(X|Y)\ge 0\) always, we have \[ Var(X) \ge Var[E(X|Y)] \] for any rv Y. So say we run a simulation yielding a rv X with \(E[X]=\theta\), and the simulation also yields a second rv Y such that \(E[X|Y]\) is known. Since \(E \left\{ E[X|Y] \right\} =E[X]=\theta\), it follows that \(E[X|Y]\) is also an unbiased estimator of \(\theta\), with a variance no larger than that of X itself.

Example

Say we would like to use simulation to estimate the value of \(\pi\) (=3.14…). A straightforward simulation is as follows:

generate \(V_1, V_2\) iid \(U[-1,1]\). If \(V_1^2+V_2^2 \le 1\) set \(Z_i=1\), otherwise \(Z_i=0\). Run the simulation n times; then \(4\sum Z_i/n\) is an estimator of \(\pi\)

B <- 1e5
u1 <- 2*runif(B)-1
u2 <- 2*runif(B)-1
z <- ifelse(u1^2 + u2^2 <= 1, 1, 0)  # inside the unit circle?
plot(u1, u2, xlim = c(-1, 1), type = "n", 
     ylim = c(-1, 1), pch = ".")
points(u1[z==1], u2[z==1], col = "red", pch = ".")
points(u1[z==0], u2[z==0], col = "blue", pch = ".")

z <- 4*z
out1 <- round(c(mean(z), var(z)) ,4)
cat("Standard Simulation :", out1,  "\n")
## Standard Simulation : 3.139 2.703

Now let’s use the estimator \(E[Z|V_1]\) instead of \(Z\). Note

\[ \begin{aligned} &E[Z|V_1=v]=P(V_1^2+V_2^2 \le 1|V_1=v)=\\ &P(v^2+V_2^2 \le 1|V_1=v)=\\ &P(V_2^2 \le 1-v^2|V_1=v)=\\ &P(V_2^2 \le 1-v^2)=\\ &P(-\sqrt{1-v^2} \le V_2 \le \sqrt{1-v^2})=\\ &\int_{-\sqrt{1-v^2}}^{\sqrt{1-v^2}} \frac12 dx=\sqrt{1-v^2}\\ &\text{so}\\ &E[Z|V_1]=\sqrt{1-V_1^2} \end{aligned} \] and so \(\sqrt{1-V_1^2}\) is a better estimator than Z alone.

u <- 2 * runif(B) - 1
z <- 4 * sqrt(1 - u^2)
out2 <- round(c(mean(z), var(z)) ,4)
cat("Conditional Simulation :", out2, "\n")
## Conditional Simulation : 3.142 0.7974
cat("Variance reduction: ", out1[2]/out2[2])
## Variance reduction:  3.39

Note that this new estimator has another advantage: it needs only one \(U[0,1]\) per simulation run.

How much better is it? Let’s see: \[ \begin{aligned} &Z \sim \text{Ber}(\frac\pi4) \\ &Var(Z)=\frac\pi4(1-\frac\pi4)=0.1686\\ &Var(\sqrt{1-V_1^2})=E[(\sqrt{1-V_1^2})^2]-E[\sqrt{1-V_1^2}]^2=\\ &E[1-V_1^2]-(\frac\pi4)^2=\\ &1-E[V_1^2]-(\frac\pi4)^2=\\ &1-(\frac\pi4)^2 - \left( Var(V_1) + (E[V_1])^2 \right)=\\ &1-(\frac\pi4)^2 - \left( \frac{(1-(-1))^2}{12} + 0^2 \right)=0.0498 \end{aligned} \]

and so the variance of the final estimator is

\[ Var( 4\sqrt {1-V_1^2} ) = 16 \cdot 0.0498 = 0.797 \]

which matches the simulated variance above (and similarly \(Var(4Z) = 16 \cdot 0.1686 = 2.70\)).
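In R, a one-line check of both theoretical values:

c(standard = 16*pi/4*(1 - pi/4), conditioning = 16*(1 - 1/3 - (pi/4)^2))
# about 2.70 and 0.797, matching the simulated variances above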

Example

Say \(X \sim \text{Exp}(1)\) and \(Z \sim \text{Exp}(1/2)\) (mean 1/2, that is rate 2), independent, and we want to find \(p=P(X+Z\ge 4)\)

\[ \begin{aligned} &P(X+Z\ge 4)=E[I_{[4, \infty)}(X+Z)]=E \left\{ E[I_{[4, \infty)}(X+Z)|Z] \right\}\\ &E[I_{[4, \infty)}(X+Z)|Z =z]=E[I_{[4, \infty)}(X+z)]=\\ &P(X\ge 4-z)=1-P(X<4-z)=\\ &\exp(-(4-z))=\exp(z-4) \end{aligned} \] if \(z<4\), and 1 otherwise (for \(z \ge 4\) we have \(X+z \ge 4\) always). So

B <- 1e5
x <- rexp(B, 1)
z <- rexp(B, 2)
v1 <- ifelse(x+z>4, 1, 0)
cat("Standard Simulation :", mean(v1), "  Variance :", var(v1), "\n")
## Standard Simulation : 0.03649   Variance : 0.03516
v2 <- ifelse(z<4, exp(z-4), 1)
cat("Conditioning Simulation :", mean(v2), "  Variance :", var(v2), "\n")
## Conditioning Simulation : 0.03645   Variance : 0.001809
cat("Variance reduction: ", var(v1)/var(v2))
## Variance reduction:  19.44

Example

say we want to find

\[ I=\int_0^\infty \int_0^1 \sqrt{x+y} \; e^{-x}\,dy\,dx \]

Now \(I=E[ \sqrt{X+U} ]\) where \(X \sim \text{Exp}(1)\) and \(U \sim U[0,1]\).

Let \(V=E[ \sqrt{X+U}|X ]\), then

\[ \begin{aligned} &E[ \sqrt{X+U}|X=x ]=\\ &E[ \sqrt{x+U} ]=\int_0^1 \sqrt{x+u} du=\\ &\frac23\sqrt{(x+u)^3}|_0^1=\\ &\frac23\left( \sqrt{(x+1)^3} - \sqrt{x^3}\right) \end{aligned} \]

B <- 1e5
x <- rexp(B, 1)
u <- runif(B)
v1 <- sqrt(x+u)
cat("Standard Simulation :", mean(v1), "  Variance :", var(v1), "\n")
## Standard Simulation : 1.159   Variance : 0.1577
x <-  rexp(B, 1)
v2 <- 2/3*(sqrt((x+1)^3) - sqrt((x)^3))
cat("Conditioning Simulation :", mean(v2), "  Variance :", var(v2), "\n")
## Conditioning Simulation : 1.159   Variance : 0.1359
cat("Variance reduction: ", var(v1)/var(v2))
## Variance reduction:  1.161
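As a sanity check we can also compare with numerical integration of the closed-form inner integral (a verification, not part of the simulation):

integrate(function(x) 2/3*((x+1)^1.5 - x^1.5)*exp(-x), 0, Inf)$value
# about 1.159, in agreement with both simulations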

Importance Sampling

Example

Say (X, Y, Z) has a multivariate normal distribution with mean vector (0,0,0) and variance-covariance matrix

\[ \begin{bmatrix} 1 & 0.8 &0.9\\ 0.8 & 1 & 0.5 \\ 0.9 & 0.5 & 1 \end{bmatrix} \]

we want to know \(P(\min\{X,Y,Z\}>5)\).

Note: \(P(\min\{X,Y,Z\}>5)= P(X>5, Y>5, Z>5)\)

Let’s see:

library(mvtnorm)
B <- 1e5
VC <- cbind(c(1, 0.8, 0.9), c(0.8, 1, 0.5), c(0.9, 0.5, 1))
x <- rmvnorm(B, c(0,0,0), VC)
z <- apply(x, 1, min)
sum(z>5)/B
## [1] 0

So this is a very small probability, and even \(10^5\) runs is not enough. What can we do?

Here is a very strange idea: instead of generating data from the distribution under which \(\min\{X,Y,Z\}>5\) is rare, let’s generate it from a distribution where it is not, and then adjust things somehow:

y <- matrix(rnorm(3*B, 6), ncol=3)
z <- apply(y, 1, min)
sum(z>5)/B
## [1] 0.5944

But how does this help us? Here is the main calculation. Say we have a random variable X with density f, a random variable Y with density g and some function h, then

\[ \begin{aligned} &E[h(X)] =\int h(x)f(x)dx = \\ &\int h(x)\frac{g(x)}{g(x)}f(x)dx = \\ &\int \left[h(x)\frac{f(x)}{g(x)}\right]g(x)dx = \\ &E\left[h(Y)\frac{f(Y)}{g(Y)}\right] \end{aligned} \]

The term \(\frac{f(x)}{g(x)}\) is called the weight.

so in our case \(h(x,y,z)=I_{(5, \infty)}(\min\{x,y,z\})\) and so

I <- c(1:B)[z>5]
w <- dmvnorm(y, c(0, 0, 0), VC)/apply(dnorm(y, 6), 1, prod)
sum(w[I])/B
## [1] 8.723e-10

How should we choose Y? Obviously we need a Y such that it can’t happen that \(P(X=x)>0\) while \(P(Y=x)=0\). In general we should choose a Y with the same support as X, that is, \(P(X=x)>0\) iff \(P(Y=x)>0\).

It is a good idea to choose Y such that the event of interest happens about 50% of the time.
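Since the recipe is always the same, it can be wrapped in a small generic function. The following is only an illustrative sketch (the function is_estimate and its argument names are ours, not from the text): h is the function of interest, df the density of X, and rg/dg generate from and evaluate the density of the proposal Y.

is_estimate <- function(h, df, rg, dg, B = 1e5) {
  y <- rg(B)           # sample from the proposal Y
  w <- df(y)/dg(y)     # importance weights f(y)/g(y)
  mean(h(y) * w)       # estimate of E[h(X)]
}
# for instance, P(X>5) for X ~ N(0,1) with a N(5,1) proposal,
# as in the next example:
is_estimate(function(x) as.numeric(x > 5), dnorm,
            function(B) rnorm(B, 5), function(x) dnorm(x, 5))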

Example

Let \(X\sim N(0,1)\). We want to find \(P(X>5)\).

Of course

1-pnorm(5)
## [1] 2.867e-07

so again standard simulation won’t work. Instead let’s use importance sampling with \(Y\sim N(5, 1)\), then

B <- 1e5
y <- rnorm(B, 5)
I <- c(1:B)[y > 5]
print(length(I)/B)
## [1] 0.4991
w <- dnorm(y[I])/dnorm(y[I], 5)
sum(w)/B
## [1] 2.859e-07

It is not necessary to have a Y that “looks like” X. In the example above we could also have used

y <- rgamma(B, 5, 1)
I <- c(1:B)[y > 5]
print(length(I)/B)
## [1] 0.4396
w <- dnorm(y[I])/dgamma(y[I], 5, 1)
sum(w)/B
## [1] 2.891e-07

Example

Let \(X\sim \text{Geom}(0.02)\) and \(Y\sim \text{Geom}(0.01)\), independent, both with support \(1, 2, \dots\), and let \(Z=X+Y\). Find \(P(Z>1000)\).

B <- 1e5
x <- (rgeom(B, 0.02)+1)+(rgeom(B, 0.01)+1)  # R's rgeom starts at 0, so add 1
sum(x>1000)/B
## [1] 9e-05

Now we want to use importance sampling; however, we need the density of the sum of two geometric rv’s with different success probabilities:

\[ \begin{aligned} &P(X+Y=k) =\sum_{i=1}^{k-1} P(X=k-i)P(Y=i) = \\ &\sum_{i=1}^{k-1} 0.02\cdot 0.98^{k-i-1}\times 0.01\cdot 0.99^{i-1} = \\ &0.0002\sum_{i=1}^{k-1} 0.98^{k-i-1}\, 0.99^{i-1} \end{aligned} \]

df <- function(x) {
  # density of Z = X+Y at the points in x
  y <- rep(0, length(x))
  for(i in seq_along(x)) {
    k <- 1:(x[i]-1)
    y[i] <- sum(0.98^(x[i]-k-1)*0.99^(k-1))
  }
  0.0002*y
}
B <- 1e3
y <- rgeom(B, 0.001)+1  # proposal: geometric with p=0.001, shifted to start at 1
I <- c(1:B)[y>1000]
length(I)/B
## [1] 0.369
w <- df(y)/dgeom(y-1, 0.001)
sum(w[I])/B
## [1] 8.942e-05

Example

say X, Y and Z are independent standard normal rv’s. Find \(P(|XYZ|>K)\), for example for \(K=10\)

Now there is no way to do this analytically, and again the probability is very small. So we will use IS with X’, Y’ and Z’ generated from normal distributions with mean 0 and standard deviation s. For our case of \(K=10\), \(s=3.5\) works well. In general, for a given K one has to play around a bit to find a good s; a small sketch of this search follows the example.

B <- 1e5
K <- 10
s <- 3.5
x <- rnorm(B, 0, s)
y <- rnorm(B, 0, s)
z <- rnorm(B, 0, s)
T <- abs(x*y*z)
I <- c(1:B)[T > K]
print(length(I)/B)
## [1] 0.4602
w <- dnorm(x[I])/dnorm(x[I], 0, s)*dnorm(y[I])/dnorm(y[I],  0, s)*dnorm(z[I])/dnorm(z[I], 0, s)
sum(w)/B
## [1] 0.000359
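As mentioned above, here is a minimal sketch of how one might search for a good s (the grid of values is arbitrary); we want the event to happen roughly half the time under the proposal:

for(s_try in c(2, 3, 3.5, 4)) {
  Ts <- abs(rnorm(B, 0, s_try)*rnorm(B, 0, s_try)*rnorm(B, 0, s_try))
  cat("s =", s_try, " proportion with |XYZ| > K :", mean(Ts > K), "\n")
}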

Example

say we have a rv X geometric with \(p=0.5\). We want to find \(P( \log (X!)>50)\).

Let’s try to solve this problem analytically. First, \(\log(x!)\) is an increasing function of x, so there exists \(x_{50}\) such that \(\log(x!)>50\) iff \(x \ge x_{50}\), and therefore \[ P(\log(X!)>50)=P(X\ge x_{50}) \]

Finding \(x_{50}\) analytically is hopeless, though. We can do it with R by trial and error, using log(factorial(n)) for different values of n:

log(factorial(10))
## [1] 15.1
log(factorial(20))
## [1] 42.34
log(factorial(30))
## [1] 74.66
log(factorial(25))
## [1] 58
log(factorial(22.5))
## [1] 50.03

So the crossover happens between 22 and 23, giving \(x_{50}=23\), and since X is geometric with \(p=0.5\) and support \(1, 2, \dots\),

\[ P(\log(X!)>50)=P(X\ge 23)=0.5^{22} \]

or about \(2.38 \times 10^{-7}\)
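A quick R check of this value (recall that R’s geometric counts failures, so \(X=G+1\) with G ~ rgeom and \(P(X\ge 23)=P(G\ge 22)\)):

c(0.5^22, pgeom(21, 0.5, lower.tail = FALSE))
# both give 2.384e-07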

How about an R check? The problem is that the probability p we want to find is very small, so in a simple simulation we can expect the outcome of interest only about once every 4.2 million runs. In order to get a reasonably good estimate we would probably need to run the simulation with \(n=10^9\).

Here is an idea: the problem is that our event of interest, \(\log(X!)>50\), is very rare; it almost never happens. Let’s instead sample from a distribution Y which takes large values much more often, so that \(\log(Y!)>50\) happens more often. For example, let’s try Y geometric with \(p=0.05\):

B <- 1e5
y <- rgeom(B, 0.05)+1
logfac_y <- 0.918938533205 + (y+0.5)*log(y)-y
sum(logfac_y>50)/B
## [1] 0.3235

Note the calculation of \(\log(y!)\): it is based on Stirling’s formula:

\[n! \approx \sqrt{2\pi}n^{n+\frac12}e^{-n}\] so \[ \log(n! ) \approx \log(\sqrt{2\pi}) +(n+\frac12)\log(n)-n \]

This is to avoid problems with numbers bigger than R can handle!
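As a check of the approximation: base R also has lfactorial, which computes \(\log(n!)\) on the log scale and likewise avoids overflow:

m <- c(10, 100, 1000)
cbind(stirling = 0.918938533205 + (m + 0.5)*log(m) - m,
      exact = lfactorial(m))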

So \(P(\log(Y!)>50)\approx 0.32\).

y <- y[logfac_y > 50]
w <- dgeom(y - 1, 0.5)/dgeom(y - 1, 0.05)
sum(w)/B
## [1] 2.355e-07

Example (From a book by Robert and Casella)

let \(X\sim\) Cauchy and we want to use simulation to estimate \(\tau =P(X>2)\)

Method 1: (Direct Simulation) generate \(X_1, .., X_n\) iid Cauchy, estimate \(\tau =1/n \sum I_{[2, \infty )}(X_i)\)

x <- rcauchy(B)
z <- ifelse(x > 2, 1, 0)
c(mean(z), sd(z))
## [1] 0.1464 0.3535

Method 2: (Direct Simulation, using a special feature of the problem) Make use of the fact that a Cauchy is symmetric around 0, so

\[ P(X>2) = \frac12 P(|X|>2) \]

so generate \(X_1, .., X_n\) iid Cauchy, estimate

\[ \tau = \frac1{2n} \sum I_{[2, \infty)}(|X_i|) \]

z <- ifelse(abs(x) > 2, 1, 0)/2
c(mean(z), sd(z))
## [1] 0.1461 0.2274

Method 3: (Direct Simulation, using a special feature of the problem) Make use of the fact that \[ P(X>2) = \frac12 - \int_0^2 \frac{dx}{\pi(1+x^2)} \] and estimate the integral by simulation: with \(U\sim U[0,2]\) we have \(E\left[\frac{2}{\pi(1+U^2)}\right]=\int_0^2 \frac{dx}{\pi(1+x^2)}\), so \(z=\frac12-\frac{2}{\pi(1+U^2)}\) has mean \(\tau\).

x <- runif(B, 0, 2)
z <- 1/2-2/pi/(1 + x^2)
c(mean(z), sd(z))
## [1] 0.1478 0.1686

Method 4: (Direct Simulation, using a special feature of the problem) The substitution \(y=1/x\) gives \[ P(X>2)=\int_2^\infty \frac{dx}{\pi(1+x^2)} = \int_0^{1/2} \frac{dy}{\pi(1+y^2)} \] so with \(U\sim U[0,1/2]\), \(z=\frac{1}{2\pi(1+U^2)}\) has mean \(\tau\).

x <- runif(B, 0, 0.5)
z <- 1/2/pi/(1 + x^2)
c(mean(z), sd(z))
## [1] 0.147589 0.009781

Method 5: (Importance sampling) Let’s use the rv Y with density \(g(y)=2/y^2\), \(y>2\). Note that \(Y=2/U\) with \(U\sim U[0,1]\) has this density, and since \(h(y)=I_{(2,\infty)}(y)\equiv 1\) on the support of Y, the importance sampling estimator is \[ h(Y)\frac{f(Y)}{g(Y)}=\frac{Y^2}{2\pi(1+Y^2)}=\frac{(2/U)^2}{2\pi(1+(2/U)^2)}=\frac{2}{\pi(4+U^2)} \]

so this is actually the same as Method 4, with the same variance.

x <- runif(B)
z <- 2/(4 + x^2)/pi
c(mean(z), sd(z))
## [1] 0.147549 0.009787

Example

Say we have the following problem: we have \(X_1,.., X_n\) iid Pois(\(\lambda\)) and we want to test

\(H_0: \lambda = \lambda_0\) vs. \(H_a: \lambda \ne \lambda_0\)

we decide to use a Wald-type test, that is a test based on the CLT. Of course by the CLT

\[ T_n = \frac{ \sum X_i - n\lambda}{\sqrt{n\lambda}} \sim N(0,1) \]

and so we have a test of the form

reject \(H_0\) if \(|T_n|>z_{\alpha/2}\)

Now this is based on the CLT, and so we need to worry whether it works for our \(n\) and \(\lambda_0\), say \(n=100\) and \(\lambda_0=2.0\). Easy enough, we do a simulation:

  • generate rpois(100, 2.0)

  • calculate \(T_n\)

  • check whether \(|T_n|>z_{\alpha/2}\)

  • repeat B times

alpha <- 0.05
lambda <- 2
n <- 100
B <- 10^5
x <- rpois(B, n * lambda)
T_n <- (x - n * lambda)/sqrt(n * lambda)
sum(abs(T_n) > qnorm(1 - alpha/2))/B
## [1] 0.05123

and so the test works as it should.

Note that in the code above we used the fact that \(\sum X_i \sim \text{Pois}(n \lambda)\), so we can generate the sum directly instead of n individual values.
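For comparison, here is a minimal sketch of the same check without this shortcut, generating the full samples of size n (slower, but equivalent):

x_full <- matrix(rpois(B*n, lambda), ncol = n)  # B samples of size n
T_full <- (rowSums(x_full) - n*lambda)/sqrt(n*lambda)
sum(abs(T_full) > qnorm(1 - alpha/2))/B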

Now in most fields hypothesis tests are done with \(\alpha =0.05\) or so. In High Energy Physics, though, they use \(\alpha =2.87 \times 10^{-7}\)! (this strange number happens to be pnorm(-5), so they are looking for a “5-sigma effect”) . The reason for such a small \(\alpha\) is that in HEP we have a very serious simultaneous inference problem.

So now we should check whether this test still works if we use \(\alpha =2.87 \times 10^{-7}\). But even if it does, \(|T_n|>5\) will only happen about once every 3.5 million runs (\(1/\alpha\)), so to get a reasonable estimate we would need \(B=10^9\) or so.

Let’s use IS instead. Again we need to generate data from an rv where \(|T_n|>5\) happens more often. Say we use \(Y \sim \text{Pois}(n \tau)\). Now

\[ \begin{aligned} &w(y) = \text{dpois} (y,n\lambda)/\text{dpois}(y,n\tau)\\ &T_n = \frac{y-n\lambda}{\sqrt{n\lambda}}\\ &I_n(y)=1 \text{ if } |T_n|>5 \text{, } 0 \text{ otherwise}\\ &P(|T_n|>5) = \text{mean}(I_n w) \end{aligned} \]

For example, if \(n=100\) and \(\lambda=2.0\), use \(\tau=2.7\).

tau <- 2.7
alpha <- pnorm(-5)
y <- rpois(B, n * tau)
T_n <- ifelse(abs((y - n * lambda)/sqrt(n * lambda)) > qnorm(1 - alpha/2), 1, 0)
w <- dpois(y, n * lambda)/dpois(y, n * tau)
alphahat <- mean(w * T_n)
c(truealpha = alphahat, sigmas = qnorm(1 - 2 * alphahat), percentage = sum(T_n)/B)
##  truealpha     sigmas percentage 
##  5.693e-07  4.727e+00  4.369e-01

We find that the actual type I error probability is about twice what it is supposed to be.