Extending R

Packages

R is the standard analysis tool in Statistics today. Pretty much any tool developed by anyone is availabel in R, but not necessarily as part of the basic distribution. Many "add-on's" are available as libraries, which are distributed as packages. The easiest way to get them is to use the install.packages command. Say you want the library mvtnorm, which has many methods useful for the multivariate normal distribution. At the R prompt type

install.packages("mvtnorm",lib="c"\\R\\library")

(assuming R is installed in c:\R)

it will ask you for a download site, use an US site on the bottom of the list.

You only need to do this once, from now on just type

library(mvtnorm)

or even better put this line into .First

To check that the library is installed type

search()

to see what is inside type

ls(2)

if the library is at position 2

If you are on one of the department computer everything is already installed.

C

R is a very high-level language. Because of the vectorization and the very large number of built-in routines it is a very good language for developing new analysis routines. This same high-level, though, comes at a price of a lot of overhead, and so for actual calculations R can be very slow. One way around this is to rewrite some parts in C (or C++). This used to be very complicated and difficult, but in recent years the add-on package Rcpp has been developed that makes it very easy.

First you need to install Rcpp. On Linux machines that is very simple but on Windows machines a few step need to be taken. Follow the instructions here

Here is an example for the use of Rcpp:

Let's say we want to write a simulation to answer the following question. Let Z₁,..,Z_n be iid N(0,σ) and let

X_n,σ = max{Z₁,..,Z_n}

What is the mean of X_n,σ?

We can find out using a simulation: generate n N(0,σ) find the maximum. Repeat this many times (say 100 times) and find the mean of the maximums. Here is an R routine that does this for 100 sigma's equally spaced from 0.1 to 1, and the plots sigma vs max:

maxsim <-
function (which=1,n,M=1000)
{
sigma=seq(0.1,1,length=100)
if(which==1) {
musigma=rep(0,100)
zmax=rep(0,M)
for(j in 1:100) {
for(i in 1:M) {
z=rnorm(n,0,sigma[j])
zmax[i]=max(z)
}
musigma[j]=mean(zmax)
}
}
if(which==2) musigma=maxsimC(n,sigma)
plot(sigma,musigma,type="l")
}

Now this is not very fast because of the double loop, so let's replace this part with a C routine:

_________________________________________________

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
Rcpp::NumericVector maxsimC(int n, Rcpp::NumericVector sigma) {

int Nsig=sigma.size();
int M=1000;

Rcpp::NumericVector zmax(M);
Rcpp::NumericVector musigma(Nsig);
Rcpp::NumericVector z(n);

for(int j=0;j<Nsig;++j) {
for(int i=0;i<M;++i) {
z=rnorm(n,0.0,sigma[j]);
zmax[i]=max(z);
}
musigma[j]=mean(zmax);
}
return musigma;
}

________________________________________________________

Obviously there are some differences between the R and the C routine, but otice first the similarities: the "main" part

of the two routines is essentially the same:

for(j in 1:100) {
   for(i in 1:M) {
         z=rnorm(n,0,sigma[j])
         zmax[i]=max(z)
   }
musigma[j]=mean(zmax)
}

_________

for(int j=0;j<Nsig;++j) {
   for(int i=0;i<M;++i) {
         z=rnorm(n,0.0,sigma[j]);
         zmax[i]=max(z);
   }
musigma[j]=mean(zmax);
}

______________________________

In fact one can often get started by opening a "template" such as

#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
funname () {

return ;
}

and then copy-paste from the r routine.

here are some differences between R and C:

• in C every line ends with a ;

• vectors start at 0, not 1

There is another remarkable feature of this C routine. If we wanted to write this as a pure stand-alone C routine it would be much longer. For example, rnorm is not a X command. In fact basic C does not even have a random number generator included, let alone a routine to generate normals. So we would need to first implement (of find on the internet) a routine that does this. Similarly max and mean are not C, but R. Many of the basic R commands have been ported to Rcpp and are now available in exactly the same way as in R. This includes the vectorization!

Finally we want to "link" the C routine to R. this is as done as follows. Say we save the C routine in a file called maxsimC.cpp (and in the folder from which we started R). Now go to R and type

sourceCpp("maxsimC.cpp")

If ther are syntax errors you will be told, otherwise you are good. Now maxsim with which=2 runs it.