Everything starts with generating \(X_1, X_2, ...\) iid \(U[0,1]\). These are simply called random numbers. There are several ways to get these:
The R package random has the routine randomNumbers, which retrieves random numbers from a web site that generates them based on (truly random) atmospheric phenomena.
require(random)
randomNumbers(20, 0, 100)
## V1 V2 V3 V4 V5
## [1,] 88 31 8 86 21
## [2,] 4 89 24 60 93
## [3,] 97 87 63 34 47
## [4,] 68 37 59 48 40
These are numbers that look random, smell random …
Of course a computer cannot do anything truly random, so all we can do is generate \(X_1, X_2, ...\) that appear to be iid \(U[0,1]\), so-called pseudo-random numbers. In R we have the function runif:
runif(5)
## [1] 0.7252698 0.2748395 0.2826826 0.3520783 0.2993844
or
round(runif(5, min=0, max=100), 1)
## [1] 49.4 33.1 87.2 14.5 33.8
If we want to choose from a finite set we have the function sample:
sample(letters, 5)
## [1] "l" "c" "x" "s" "h"
If no size argument is given, it yields a random permutation:
sample(1:10)
## [1] 3 5 2 4 6 8 10 9 1 7
If we want to allow repetitions we can do this as well, using the argument replace=TRUE. Also, we can give the (relative) probabilities:
table(sample(1:5, size=1000,
replace=TRUE, prob=c(1, 2, 3, 2, 1)))
##
## 1 2 3 4 5
## 117 215 334 223 111
Notice that the probabilities need not be normalized (that is, add up to 1); sample rescales them internally.
Exercise
How can we randomly select 20 rows of the upr data set?
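One possible solution (a sketch; it assumes upr is a data frame already loaded):
upr[sample(1:nrow(upr), 20), ]  # draw 20 row indices without replacement, then subset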
Exercise
A very useful technique in Statistics is called the Bootstrap. To use it one needs to find (many) Bootstrap samples. These are observations drawn from the original data set, at random and with replacement, as many as the original data set had. For example if the data is
## [1] 5 18 19 28 29 32 32 34 36 41
Bootstrap samples might be
## [1] 19 28 32 32 32 32 34 41 41 41
## [1] 5 18 18 28 32 32 36 36 41 41
## [1] 5 5 18 19 28 32 32 36 41 41
How can we find Bootstrap samples of the upr data set?
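Again a sketch: we draw as many row indices as upr has rows, this time with replacement:
# one Bootstrap sample of the upr data set
upr[sample(1:nrow(upr), nrow(upr), replace=TRUE), ]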
Not surprisingly, many standard distributions are part of base R. For each the format is the same: dname for the density, pname for the cumulative distribution function, qname for the quantile function and rname for random generation.
Note we will use the term density for both discrete and continuous random variables.
Example: Poisson distribution
We have \(X \sim \text{Pois}(\lambda)\) if
\[ P(X=x)=\frac{\lambda^x}{x!}e^{-\lambda}\text{; }x=0,1, ... \]
options(digits=4)
x <- c(0, 8, 12, 20)
# density
dpois(x, lambda=10)
## [1] 0.0000454 0.1125990 0.0947803 0.0018661
10^x/factorial(x)*exp(-10)
## [1] 0.0000454 0.1125990 0.0947803 0.0018661
# cumulative distribution function
ppois(x, 10)
## [1] 0.0000454 0.3328197 0.7915565 0.9984117
# random generation
rpois(5, 10)
## [1] 18 17 7 11 13
# quantiles
qpois(1:4/5, 10)
## [1] 7 9 11 13
Here is a list of the distributions included with base R:
beta distribution: dbeta.
binomial (including Bernoulli) distribution: dbinom.
Cauchy distribution: dcauchy.
chi-squared distribution: dchisq.
exponential distribution: dexp.
F distribution: df.
gamma distribution: dgamma.
geometric distribution: dgeom.
hypergeometric distribution: dhyper.
log-normal distribution: dlnorm.
multinomial distribution: dmultinom.
negative binomial distribution: dnbinom.
normal distribution: dnorm.
Poisson distribution: dpois.
Student’s t distribution: dt.
uniform distribution: dunif.
Weibull distribution: dweibull.
Exercise
Generate 10000 variates from a Binomial distribution with n=10, p=0.25 and compare the relative frequencies with the theoretical probabilities.
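One possible solution sketch, using table for the relative frequencies and dbinom for the theoretical probabilities:
x <- rbinom(10000, size=10, prob=0.25)
# observed relative frequencies vs. theoretical probabilities
round(rbind(Frequency = table(factor(x, levels=0:10))/10000,
            Theory = dbinom(0:10, 10, 0.25)), 4)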
With some of these a bit of caution is needed. For example, the usual textbook definition of the geometric random variable is the number of tries in a sequence of independent Bernoulli trials until the first success. This means that the density is defined as
\[ P(X=x)=p(1-p)^{x-1}\text{; }x=1,2,... \] R however defines it as the number of failures until the first success, and so it uses
\[ P(X^*=x)=\text{dgeom}(x, p)=p(1-p)^x\text{; }x=0,1,2,... \] Of course this is easy to fix. If you want to generate the “usual” geometric do
x <- rgeom(10, 0.4) + 1
x
## [1] 2 5 2 4 1 4 1 1 1 5
If you want to find the probabilities or the cdf:
round(dgeom(x-1, 0.4), 4)
## [1] 0.2400 0.0518 0.2400 0.0864 0.4000 0.0864 0.4000 0.4000 0.4000 0.0518
round(0.4*(1-0.4)^(x-1), 4)
## [1] 0.2400 0.0518 0.2400 0.0864 0.4000 0.0864 0.4000 0.4000 0.4000 0.0518
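The same shift works for the cumulative distribution function: \(P(X \le x) = P(X^* \le x-1)\), so for the “usual” geometric we can use pgeom(x-1, p):
round(pgeom(x-1, 0.4), 4)
## [1] 0.6400 0.9222 0.6400 0.8704 0.4000 0.8704 0.4000 0.4000 0.4000 0.9222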
Another example is the Gamma random variable. Here most textbooks use the definition
\[ f(x; \alpha, \beta)= \frac1{\Gamma{(\alpha)}\beta^\alpha}x^{\alpha-1}e^{-x/\beta}\text{; }x>0 \] but R uses
\[ f^*(x; \alpha, \beta)= \frac{\beta^\alpha}{\Gamma{(\alpha)}}x^{\alpha-1}e^{-\beta x}\text{; }x>0 \]
dgamma(1.2, 0.5, 2)
## [1] 0.06608
2^0.5/gamma(0.5)*1.2^(0.5-1)*exp(-2*1.2)
## [1] 0.06608
Again, it is easy to re-parametrize: if the textbook scale is \(\beta\), use rate \(1/\beta\):
dgamma(1.2, 0.5, 1/(1/2))
## [1] 0.06608
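Alternatively, dgamma (like the other gamma routines) also accepts a scale argument directly, so we can pass the textbook \(\beta\) as-is:
dgamma(1.2, 0.5, scale=1/2)
## [1] 0.06608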
Exercise
Consider a normal mixture model, say
\[f(x,\alpha)=\alpha \phi(x, 0, 1)+(1-\alpha)\phi(x, 4, 1)\] where \(\phi\) is the normal density
\[\phi(x,\mu,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\{-\frac{(x-\mu)^2}{2\sigma^2}\} \]
How could we generate 1000 variates from f if (say) \(\alpha=0.3\)? This is what it should look like:
hist(x, breaks=100, main="")
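One possible approach (a sketch): for each observation first pick a component, the \(N(0,1)\) with probability \(\alpha\) and the \(N(4,1)\) otherwise, then draw from the chosen normal. rnorm is vectorized over its mean argument, so this takes two lines:
alpha <- 0.3
# pick mean 0 with probability alpha, mean 4 otherwise
mu <- sample(c(0, 4), size=1000, replace=TRUE, prob=c(alpha, 1-alpha))
x <- rnorm(1000, mean=mu, sd=1)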
If you need to generate random variates from a distribution that is not part of base R, you should first try to find a package that includes it.
Example: multivariate normal
There are actually several packages; the most commonly used one is mvtnorm:
library(mvtnorm)
x <- rmvnorm(1000,
mean = c(0, 1),
sigma = matrix(c(1, 0.8, 0.8, 2), 2, 2))
plot(x,
pch=20,
xlab = expression(x[1]),
ylab = expression(x[2]))
sigma is the variance-covariance matrix, so in the above we have
\[ \rho = \text{Cor}(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} = \frac{0.8}{\sqrt{1 \cdot 2}} = 0.566 \]
round(c(var(x[, 1]),
var(x[, 2]),
cor(x[, 1], x[, 2])), 3)
## [1] 0.985 2.095 0.580
If you can’t find a package you have to write your own! Here is a routine that will generate random variates from any function fun (given as a character string) in one dimension on a finite interval [A, B]:
rpit <- function(n, fun, A, B) {
  # turn the character string into an R function
  f <- function(x) eval(parse(text=fun))
  # number of grid points (capped at 1000)
  m <- min(2 * n, 1000)
  # evaluate the (possibly unnormalized) density on a grid over [A, B]
  x <- seq(A, B, length = m)
  y <- f(x)
  # running numerical approximation of the integral of f up to each
  # grid point, i.e. the unnormalized cdf (Simpson-type rule)
  z <- (x[2] - x[1])/6 * cumsum(y[-1] + 4 * y[-2] + y[-3])
  # rescale so the cdf ends at 1
  z <- z/max(z)
  y <- c(0, z)
  xyTmp <- cbind(x, y)
  # invert the cdf at n uniform variates (probability integral transform)
  approx(xyTmp[, 2], xyTmp[, 1], runif(n))$y
}
pit stands for probability integral transform, a theorem in probability theory which explains why this works: if \(X\) has continuous cdf \(F\), then \(F(X) \sim U[0,1]\); conversely, if \(U \sim U[0,1]\), then \(F^{-1}(U)\) is a random variable with cdf \(F\).
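A quick sanity check of the first statement, using the normal distribution (a sketch, not part of rpit itself):
# if X ~ N(0,1), then pnorm(X) should look like U[0,1]
u <- pnorm(rnorm(10000))
hist(u, 50, main="")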
Let’s try rpit out:
y <- rpit(1000, "x^2", 0, 1)
hist(y, 50, freq=FALSE, main="")
curve(3*x^2, 0, 1,
col = "blue",
lwd = 2,
add = TRUE)
Notice that for rpit the function doesn’t even have to be normalized (that is, it need not integrate to 1).
Or a bit more complicated:
y <- rpit(1000, "x*sin(2*pi*x)^2", 0, 1)
hist(y, 50, freq=FALSE, main="")
How about adding the density curve? For that we do need to normalize the function, that is, we need to divide it by
\[ I = \int_0^1 x \sin(2 \pi x)^2\, dx \]
so that the resulting curve integrates to 1. This is not a trivial integral, so we need to use a numerical method:
x <- seq(0, 1, length=250)
f <- x*sin(2*pi*x)^2
# numerical integration via the trapezoid rule
I <- sum( (f[-1]+f[-250])/2 * (x[-1]-x[-250]) )
hist(y, 50, freq=FALSE, main="")
curve(x*sin(2*pi*x)^2/I, 0, 1,
col = "blue",
lwd = 2,
add = TRUE)
Exercise
Generate 1000 variates from a Beta(1.5, 3) distribution, draw the histogram with 50 bins and add the density curve.
Hint: here you can use base R routines.
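A possible solution sketch:
y <- rbeta(1000, 1.5, 3)
hist(y, 50, freq=FALSE, main="")
curve(dbeta(x, 1.5, 3), 0, 1,
      col="blue", lwd=2, add=TRUE)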
Want to learn how to generate data from any random vector? Come to my course ESMA5015 Simulation!