C++ with Rcpp

Basics

For this section you will need the Rcpp and the microbenchmark packages.

library(Rcpp)
library(microbenchmark)

You might also have to install Rtools from https://cran.r-project.org/bin/windows/Rtools/ for Windows.

C++ is probably the most widely used general programming language today. Actually, about half the code of base R is written in C++! (the rest is about half and half R and Fortran)

Sometimes when you have some code that takes a while to run it is worthwhile to spend some time speeding it up. One way to do this is to rewrite part of the code in C++.

Say we have the following problem: we have a data set with points (x, y) and for each point we want to find the Euclidean distance to the origin \(d=\sqrt{x^2+y^2}\). Here is a simple routine to do this:

dist1 <- function(x, y) {
  n <- length(x)
  d <- rep(0, n)
  for(i in 1:n) d[i] <- sqrt(x[i]^2+y[i]^2)
  d
}

Let’s see how long this takes:

x <- rnorm(1e6)
y <- rnorm(1e6)
microbenchmark(dist1(x, y))
## Unit: milliseconds
##         expr min    lq  mean median    uq   max neval
##  dist1(x, y) 218 221.9 229.4  225.9 234.4 341.4   100

Now of course we can immediately speed things up by vectorizing the routine:

dist2 <- function(x, y) {
  sqrt(x^2+y^2)
}
mb <- microbenchmark(dist1(x, y), dist2(x,y))
round(as.numeric(summary(mb)[, "median"]), 2)
## [1] 232.38  38.37

Can we do even better?

# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector dist3(NumericVector x, NumericVector y) {
 int n=x.length();
 NumericVector dist(n);
 for (int i=0; i<n; ++i) dist[i]=sqrt(x[i]*x[i]+y[i]*y[i]);
 return dist;
}
mb <- microbenchmark(dist1(x, y), dist2(x,y), dist3(x,y))
round(as.numeric(summary(mb)[, "median"]), 2)
## [1] 221.61  35.67  13.01

Note: in the R markdown document the above chunk starts like this:

```{r engine=‘Rcpp’}

this tells R markdown to treat this chunk as Rcpp code and compile it accordingly.

Actually, we can even do better than that:

# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector dist4(NumericVector x, NumericVector y) {
  NumericVector dist;
  dist=sqrt(x*x+y*y);
  return dist;
}
mb <- microbenchmark(dist3(x, y), dist4(x,y))
round(as.numeric(summary(mb)[, "median"]), 2)
## [1] 12.51  9.78

Those of you who already know a bit of C++ are going to be quite amazed, because this is not even C++, it is sort of “vectorized” C++!


To start let’s discuss a few differences between R and C++ syntax:

  • in C++ every variable has to be explicitly defined.
    the most common data types are int, double, char and bool.

  • in C++ (almost) every line ends with a ;
  • vectors start with index 0, not 1
  • the for loop is for(int i=0; i<n; ++i)
  • the repeat loop is called a do loop and has the format

do {
# do stuff  
} while( condition );
  • whereas R does a lot of type conversion, C++ does none. This can lead to occasional strange behavior:
# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double mydiv(int n, int m) {
  return n/m;
}
mydiv(5, 2)
## [1] 2

so the result is 2, not 2.5. The reason is that when we divide two integers, in C++ the result is always an integer.

To get the correct result we need to do the type conversion ourselves:

# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double mydiv(int n, int m) {
  return n/double(m);
}
mydiv(5, 2)
## [1] 2.5

To be linked to R the C++ routine has to start with with these lines:

# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]

}

The easiest way to get started is to just use RStudio - File - New File - C++ File.

If the C++ routine takes as its argument a single value it can be defined in the usual C++ way, as in the mydiv routine. If we want to use a vector as an argument or the return object we need to use a special variable type called NumericVector. It is just what it says it is.

Debugging

Generally you should only turn fairly short code into C++, so debugging is not to big a problem. However, on occasion you might want to add a print statement to your code, so you can find out where it fails. here is how:

# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector dist5(NumericVector x, NumericVector y) {
 int n=x.length();
 NumericVector dist(n);
 for(int i=0; i<n; ++i) {
   dist[i]=sqrt(x[i]*x[i]+y[i]*y[i]);
   Rcout<<i<<" "<<dist(i)<<"\n";
 }
 return dist;
}
dist5(x[1:5], y[1:5])
## 0 1.2107
## 1 0.610054
## 2 2.07795
## 3 0.532629
## 4 2.13605
## [1] 1.2107 0.6101 2.0779 0.5326 2.1360

Sugar

Not only is Rcpp vectorized, many standard R functions have been ported to Rcpp.

Example

We want to write a routine that simulates Brownian motion in \(R^2\). That is, a stochastic process that moves as follows: if at time \(t\) it is at \((x_0, y_0)\), then at time \(t+\delta\) it is at \((x_0+\delta X, y_0+\delta Y)\) where \(X,Y\sim N(0, 1)\).

Note: generating stochastic process can be quite slow in R because they are difficult to vectorize, with one step of a loop depending on the previous one.

The output of our function is going to be a nx2 matrix, so we will use the data type NumericMatrix.

# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix bw(int n, double delta) {
 NumericMatrix xy(n, 2);
  xy(0, 0) = 0;
  xy(0, 1) = 0;
  for(int i=2; i<n; ++i) {
    xy(i, 0) = xy(i-1, 0) + delta*rnorm(1)[0];
    xy(i, 1) = xy(i-1, 1) + delta*rnorm(1)[0];
 }
 return xy;
}
plot(bw(10000, 1), 
     type = "l", 
     xlab = "x", 
     ylab = "y", 
     col = "blue")

Notice the term rnorm(1)[0], a little different from the standard R usage.

Anyone (like me!) who ever had to write a routine in C++ and needed a simple routine like rnorm which does not exist in C++ will find this very sweet! And that is why it is called sugar!

**Example (Fibonacci Numbers and the Golden Ratio)

let’s write a routine that calculates the golden ratio via the Fibonacci numbers. First, these are defined by

\[ \begin{aligned} &n_0 = 1\\ &n_1 = 1\\ &n_k = n_{k-1}+n_{k-2}\\ \end{aligned} \] and the golden ratio is the limit

\[ \lim_{k \rightarrow \infty} \frac{n_k}{n_{k-1}} \] because of its definition the Fibonacci numbers are most easily calculated using recursion:

fibR <- function(n) {
 if(n==0) return(0)
 if(n==1) return(1)
 return (fibR(n-1)+fibR(n-2))
}
fibR(10)
## [1] 55

Let’s write this with Rcpp:

# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int fib_cpp(const int n) {
 if(n==0) return(0);
 if(n==1) return(1);
 return fib_cpp(n-1)+fib_cpp(n-2);
}
fib_cpp(10)
## [1] 55

and now for the golden ratio:

golden_ratio <- function(n, fun) {
 fun(n)/fun(n-1)
}
mb <- microbenchmark(golden_ratio(10, fun=fibR), 
                     golden_ratio(10, fun=fib_cpp))
round(as.numeric(summary(mb)[, "median"]), 2)
## [1] 163.9   6.5

not only is the cpp version much faster, in this example R is actually quite useless: while it does recursion, doing it to often quickly becomes a problem (because of memory issues).

Now

golden_ratio(30, fun=fib_cpp)
## [1] 1.618

The actual value of the golden ratio is of course \(\frac{1+\sqrt{5}}2 =1.618..\).

Using existing C++ routines.

So far we used C++ to speed up calculations, and we could use Sugar to call standard R functions in the C++ routine. There is another use, though. C++ has been the main computing language in many fields for a long time, and so there exist a large number of excellent routines already written. Say you found one of these and want to use it in your R program. Here is how:

What we need to do is to write a Rcpp wrapper routine, that eventually calls the C++ subroutine. Here is an example:

# include <Rcpp.h>
using namespace Rcpp;

//function declaration
double sum_of_squares(double x[], int n);

// [[Rcpp::export]]
double sub_routine(NumericVector x) {
  int n=x.length();
  double y[n];
  double ssq;
  for(int i=0;i<n;++i) y[i]=x[i];
  ssq=sum_of_squares(y, n);
  return ssq;
}

double sum_of_squares (double x[], int n)
{
  double r=0.0;
  for(int i=0;i<n;++i) r+=x[i]*x[i];
  return r;
}
sub_routine(1:10)
## [1] 385

The sum_of_squares routine is pure C++, like any you might find on the web!

Exercise

Say we have a vector of numbers \((x_1,.,,x_n)\) and we want to find the k-means, that is \(\frac1{k}\sum_{i=j}^{i=j+k-1} x_i\). Here is a routine in R:

kmeans <- function(x, k){
  n <- length(x)
  y <- rep(0, n-k+1)
  for(i in 1:(n-k+1))
    y[i] <- mean(x[i:(i+k-1)])
  y
}

Rewrite the routine as C++ and compare the speeds on

set.seed(111)
x <- rnorm(1e5)

and k=5

Let’s write this with Rcpp:

Let’s check first that the two routine do the same job:

rbind(kmeans(x[1:10], 5),
kmeans_cpp(x[1:10], 5))
##         [,1]    [,2]    [,3]    [,4]    [,5]   [,6]
## [1,] -0.5761 -0.5951 -0.8284 -0.9681 -0.6973 -0.762
## [2,] -0.5761 -0.5951 -0.8284 -0.9681 -0.6973 -0.762

and now

mb <- microbenchmark(kmeans(x, 5), kmeans_cpp(x, 5))
round(as.numeric(summary(mb)[, "median"]), 2)
## [1] 794106   1202