Everything starts with generating \(X_1, X_2, ...\) iid \(U[0,1]\). These are simply called random numbers. There are several ways to get these:
The R package random has the routine randomNumbers, which retrieves random numbers from a web site that generates them based on (truly random) atmospheric phenomena.
require(random)
randomNumbers(20, 0, 100)
## V1 V2 V3 V4 V5
## [1,] 88 31 8 86 21
## [2,] 4 89 24 60 93
## [3,] 97 87 63 34 47
## [4,] 68 37 59 48 40
These are numbers that look random, smell random …
Of course a computer cannot do anything truly random, so all we can do is generate \(X_1, X_2, ...\) that appear to be iid \(U[0,1]\), so-called pseudo-random numbers. In R we have the function runif:
runif(5)
## [1] 0.7252698 0.2748395 0.2826826 0.3520783 0.2993844
or
round(runif(5, min=0, max=100), 1)
## [1] 49.4 33.1 87.2 14.5 33.8
If we want to choose from a finite set we have the function sample:
sample(letters, 5)
## [1] "l" "c" "x" "s" "h"
If no size argument is given, it yields a random permutation:
sample(1:10)
## [1] 3 5 2 4 6 8 10 9 1 7
If we want to allow repetitions we can do this as well, using the argument replace=TRUE. Also, we can give the (relative) probabilities:
table(sample(1:5, size=1000,
replace=TRUE, prob=c(1, 2, 3, 2, 1)))
##
## 1 2 3 4 5
## 117 215 334 223 111
Notice that the probabilities need not be normalized (that is, add up to 1); sample rescales them internally.
Exercise
How can we randomly select 20 rows of the upr data set?
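One possible solution (a sketch; it assumes upr is a data frame already loaded):
upr[sample(1:nrow(upr), 20), ]  # draw 20 row indices without replacement, then subset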
Exercise
A very useful technique in Statistics is called the Bootstrap. To use it one needs to find (many) Bootstrap samples. These are observations drawn from the original data set, at random and with replacement, as many as the original data set had. For example if the data is
## [1] 5 18 19 28 29 32 32 34 36 41
Bootstrap samples might be
## [1] 19 28 32 32 32 32 34 41 41 41
## [1] 5 18 18 28 32 32 36 36 41 41
## [1] 5 5 18 19 28 32 32 36 41 41
How can we find Bootstrap samples of the upr data set?
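Again a sketch: we draw as many row indices as upr has rows, this time with replacement:
# one Bootstrap sample of the upr data set
upr[sample(1:nrow(upr), nrow(upr), replace=TRUE), ]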
Not surprisingly, many standard distributions are part of base R. For each the format is the same: dname for the density, pname for the cumulative distribution function, qname for the quantile function and rname for random generation.
Note we will use the term density for both discrete and continuous random variables.
Example: Poisson distribution
We have \(X \sim \text{Pois}(\lambda)\) if
\[ P(X=x)=\frac{\lambda^x}{x!}e^{-\lambda}\text{; }x=0,1, ... \]
options(digits=4)
x <- c(0, 8, 12, 20)
# density
dpois(x, lambda=10)
## [1] 0.0000454 0.1125990 0.0947803 0.0018661
10^x/factorial(x)*exp(-10)
## [1] 0.0000454 0.1125990 0.0947803 0.0018661
# cumulative distribution function
ppois(x, 10)
## [1] 0.0000454 0.3328197 0.7915565 0.9984117
# random generation
rpois(5, 10)
## [1] 18 17 7 11 13
# quantiles
qpois(1:4/5, 10)
## [1] 7 9 11 13
Here is a list of the distributions included with base R:
beta distribution: dbeta.
binomial (including Bernoulli) distribution: dbinom.
Cauchy distribution: dcauchy.
chi-squared distribution: dchisq.
exponential distribution: dexp.
F distribution: df.
gamma distribution: dgamma.
geometric distribution: dgeom.
hypergeometric distribution: dhyper.
log-normal distribution: dlnorm.
multinomial distribution: dmultinom.
negative binomial distribution: dnbinom.
normal distribution: dnorm.
Poisson distribution: dpois.
Student’s t distribution: dt.
uniform distribution: dunif.
Weibull distribution: dweibull.
Exercise
Generate 10000 variates from a Binomial distribution with n=10, p=0.25 and compare the relative frequencies with the theoretical probabilities.
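One possible solution sketch, using table for the relative frequencies and dbinom for the theoretical probabilities:
x <- rbinom(10000, size=10, prob=0.25)
# observed relative frequencies vs. theoretical probabilities
round(rbind(Frequency = table(factor(x, levels=0:10))/10000,
            Theory = dbinom(0:10, 10, 0.25)), 4)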
With some of these a bit of caution is needed. For example, the usual textbook definition of the geometric random variable is the number of tries in a sequence of independent Bernoulli trials until the first success. This means that the density is defined as
\[ P(X=x)=p(1-p)^{x-1}\text{; }x=1,2,... \] R however defines it as the number of failures until the first success, and so it uses
\[ P(X^*=x)=\text{dgeom}(x, p)=p(1-p)^x\text{; }x=0,1,2,... \] Of course this is easy to fix. If you want to generate the “usual” geometric do
x <- rgeom(10, 0.4) + 1
x
## [1] 2 5 2 4 1 4 1 1 1 5
If you want to find the probabilities or the cdf:
round(dgeom(x-1, 0.4), 4)
## [1] 0.2400 0.0518 0.2400 0.0864 0.4000 0.0864 0.4000 0.4000 0.4000 0.0518
round(0.4*(1-0.4)^(x-1), 4)
## [1] 0.2400 0.0518 0.2400 0.0864 0.4000 0.0864 0.4000 0.4000 0.4000 0.0518
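The same shift works for the cumulative distribution function: \(P(X \le x) = P(X^* \le x-1)\), so for the “usual” geometric we can use pgeom(x-1, p):
round(pgeom(x-1, 0.4), 4)
## [1] 0.6400 0.9222 0.6400 0.8704 0.4000 0.8704 0.4000 0.4000 0.4000 0.9222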
Another example is the Gamma random variable. Here most textbooks use the definition
\[ f(x; \alpha, \beta)= \frac1{\Gamma{(\alpha)}\beta^\alpha}x^{\alpha-1}e^{-x/\beta}\text{; }x>0 \] but R uses
\[ f^*(x; \alpha, \beta)= \frac{\beta^\alpha}{\Gamma{(\alpha)}}x^{\alpha-1}e^{-\beta x}\text{; }x>0 \]
dgamma(1.2, 0.5, 2)
## [1] 0.06608
2^0.5/gamma(0.5)*1.2^(0.5-1)*exp(-2*1.2)
## [1] 0.06608
Again, it is easy to re-parametrize: if the textbook scale is \(\beta\), use rate \(1/\beta\):
dgamma(1.2, 0.5, 1/(1/2))
## [1] 0.06608
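Alternatively, dgamma (like the other gamma routines) also accepts a scale argument directly, so we can pass the textbook \(\beta\) as-is:
dgamma(1.2, 0.5, scale=1/2)
## [1] 0.06608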
Exercise
Consider a normal mixture model, say
\[f(x,\alpha)=\alpha \phi(x, 0, 1)+(1-\alpha)\phi(x, 4, 1)\] where \(\phi\) is the normal density
\[\phi(x,\mu,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\{-\frac{(x-\mu)^2}{2\sigma^2}\} \]
How could we generate 1000 variates from f if (say) \(\alpha=0.3\)? This is what it should look like:
hist(x, breaks=100, main="")
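One possible approach (a sketch): for each observation first pick a component, the \(N(0,1)\) with probability \(\alpha\) and the \(N(4,1)\) otherwise, then draw from the chosen normal. rnorm is vectorized over its mean argument, so this takes two lines:
alpha <- 0.3
# pick mean 0 with probability alpha, mean 4 otherwise
mu <- sample(c(0, 4), size=1000, replace=TRUE, prob=c(alpha, 1-alpha))
x <- rnorm(1000, mean=mu, sd=1)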
If you need to generate random variates from a distribution that is not part of base R, you should first try to find a package that includes it.
Example: multivariate normal
There are actually several packages; the most commonly used one is mvtnorm:
library(mvtnorm)
x <- rmvnorm(1000,
mean = c(0, 1),
sigma = matrix(c(1, 0.8, 0.8, 2), 2, 2))
plot(x,
pch=20,
xlab = expression(x[1]),
ylab = expression(x[2]))
sigma is the variance-covariance matrix, so in the above we have
\[ \rho = \text{Cor}(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} = \frac{0.8}{\sqrt{1 \cdot 2}} = 0.566 \]
round(c(var(x[, 1]),
var(x[, 2]),
cor(x[, 1], x[, 2])), 3)
## [1] 0.985 2.095 0.580
If you can’t find a package you have to write your own! Here is a routine that will generate random variates from any function fun (given as a character string) in one dimension on a finite interval [A, B]:
rpit <- function(n, fun, A, B) {
  # turn the character string into an R function
  f <- function(x) eval(parse(text=fun))
  # number of grid points (capped at 1000)
  m <- min(2 * n, 1000)
  # evaluate the (possibly unnormalized) density on a grid over [A, B]
  x <- seq(A, B, length = m)
  y <- f(x)
  # running numerical approximation of the integral of f up to each
  # grid point, i.e. the unnormalized cdf (Simpson-type rule)
  z <- (x[2] - x[1])/6 * cumsum(y[-1] + 4 * y[-2] + y[-3])
  # rescale so the cdf ends at 1
  z <- z/max(z)
  y <- c(0, z)
  xyTmp <- cbind(x, y)
  # invert the cdf at n uniform variates (probability integral transform)
  approx(xyTmp[, 2], xyTmp[, 1], runif(n))$y
}
pit stands for probability integral transform, a theorem in probability theory which explains why this works: if \(X\) has continuous cdf \(F\), then \(F(X) \sim U[0,1]\); conversely, if \(U \sim U[0,1]\), then \(F^{-1}(U)\) is a random variable with cdf \(F\).
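A quick sanity check of the first statement, using the normal distribution (a sketch, not part of rpit itself):
# if X ~ N(0,1), then pnorm(X) should look like U[0,1]
u <- pnorm(rnorm(10000))
hist(u, 50, main="")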
Let’s try rpit out:
y <- rpit(1000, "x^2", 0, 1)
hist(y, 50, freq=FALSE, main="")
curve(3*x^2, 0, 1,
col = "blue",
lwd = 2,
add = TRUE)
Notice that for rpit the function doesn’t even have to be normalized (that is, it need not integrate to 1).
Or a bit more complicated:
y <- rpit(1000, "x*sin(2*pi*x)^2", 0, 1)
hist(y, 50, freq=FALSE, main="")
How about adding the density curve? For that we do need to normalize the function, that is, we need to divide it by
\[ I = \int_0^1 x \sin(2 \pi x)^2\, dx \]
so that the resulting curve integrates to 1. This is not a trivial integral, so we need to use a numerical method:
x <- seq(0, 1, length=250)
f <- x*sin(2*pi*x)^2
# numerical integration via the trapezoid rule
I <- sum( (f[-1]+f[-250])/2 * (x[-1]-x[-250]) )
hist(y, 50, freq=FALSE, main="")
curve(x*sin(2*pi*x)^2/I, 0, 1,
col = "blue",
lwd = 2,
add = TRUE)
Exercise
Generate 1000 variates from a Beta(1.5, 3) distribution, draw the histogram with 50 bins and add the density curve.
Hint: here you can use base R routines.
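A possible solution sketch:
y <- rbeta(1000, 1.5, 3)
hist(y, 50, freq=FALSE, main="")
curve(dbeta(x, 1.5, 3), 0, 1,
      col="blue", lwd=2, add=TRUE)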
Want to learn how to generate data from any random vector? Come to my course ESMA5015 Simulation!