For this section you will need the Rcpp and the microbenchmark packages.
library(Rcpp)
library(microbenchmark)
You might also have to install Rtools from https://cran.r-project.org/bin/windows/Rtools/ for Windows.
C++ is probably the most widely used general programming language today. Actually, about half the code of base R is written in C++! (the rest is about half and half R and Fortran)
Sometimes when you have some code that takes a while to run it is worthwhile to spend some time speeding it up. One way to do this is to rewrite part of the code in C++.
Say we have the following problem: we have a data set with points (x, y) and for each point we want to find the Euclidean distance to the origin \(d=\sqrt{x^2+y^2}\). Here is a simple routine to do this:
dist1 <- function(x, y) {
n <- length(x)
d <- rep(0, n)
for(i in 1:n) d[i] <- sqrt(x[i]^2+y[i]^2)
d
}
Let’s see how long this takes:
x <- rnorm(1e6)
y <- rnorm(1e6)
microbenchmark(dist1(x, y))
## Unit: milliseconds
## expr min lq mean median uq max neval
## dist1(x, y) 218 221.9 229.4 225.9 234.4 341.4 100
Now of course we can immediately speed things up by vectorizing the routine:
dist2 <- function(x, y) {
sqrt(x^2+y^2)
}
mb <- microbenchmark(dist1(x, y), dist2(x,y))
round(as.numeric(summary(mb)[, "median"]), 2)
## [1] 232.38 38.37
Can we do even better?
# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector dist3(NumericVector x, NumericVector y) {
int n=x.length();
NumericVector dist(n);
for (int i=0; i<n; ++i) dist[i]=sqrt(x[i]*x[i]+y[i]*y[i]);
return dist;
}
mb <- microbenchmark(dist1(x, y), dist2(x,y), dist3(x,y))
round(as.numeric(summary(mb)[, "median"]), 2)
## [1] 221.61 35.67 13.01
Note: in the R markdown document the above chunk starts like this:
```{r engine=‘Rcpp’}
this tells R markdown to treat this chunk as Rcpp code and compile it accordingly.
Actually, we can even do better than that:
# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector dist4(NumericVector x, NumericVector y) {
NumericVector dist;
dist=sqrt(x*x+y*y);
return dist;
}
mb <- microbenchmark(dist3(x, y), dist4(x,y))
round(as.numeric(summary(mb)[, "median"]), 2)
## [1] 12.51 9.78
Those of you who already know a bit of C++ are going to be quite amazed, because this is not even C++, it is sort of “vectorized” C++!
To start let’s discuss a few differences between R and C++ syntax:
in C++ every variable has to be explicitly defined.
the most common data types are int, double, char and bool.
the repeat loop is called a do loop and has the format
do {
# do stuff
} while( condition );
# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double mydiv(int n, int m) {
return n/m;
}
mydiv(5, 2)
## [1] 2
so the result is 2, not 2.5. The reason is that when we divide two integers, in C++ the result is always an integer.
To get the correct result we need to do the type conversion ourselves:
# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
double mydiv(int n, int m) {
return n/double(m);
}
mydiv(5, 2)
## [1] 2.5
To be linked to R the C++ routine has to start with with these lines:
# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
}
The easiest way to get started is to just use RStudio - File - New File - C++ File.
If the C++ routine takes as its argument a single value it can be defined in the usual C++ way, as in the mydiv routine. If we want to use a vector as an argument or the return object we need to use a special variable type called NumericVector. It is just what it says it is.
Generally you should only turn fairly short code into C++, so debugging is not to big a problem. However, on occasion you might want to add a print statement to your code, so you can find out where it fails. here is how:
# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector dist5(NumericVector x, NumericVector y) {
int n=x.length();
NumericVector dist(n);
for(int i=0; i<n; ++i) {
dist[i]=sqrt(x[i]*x[i]+y[i]*y[i]);
Rcout<<i<<" "<<dist(i)<<"\n";
}
return dist;
}
dist5(x[1:5], y[1:5])
## 0 1.2107
## 1 0.610054
## 2 2.07795
## 3 0.532629
## 4 2.13605
## [1] 1.2107 0.6101 2.0779 0.5326 2.1360
Not only is Rcpp vectorized, many standard R functions have been ported to Rcpp.
We want to write a routine that simulates Brownian motion in \(R^2\). That is, a stochastic process that moves as follows: if at time \(t\) it is at \((x_0, y_0)\), then at time \(t+\delta\) it is at \((x_0+\delta X, y_0+\delta Y)\) where \(X,Y\sim N(0, 1)\).
Note: generating stochastic process can be quite slow in R because they are difficult to vectorize, with one step of a loop depending on the previous one.
The output of our function is going to be a nx2 matrix, so we will use the data type NumericMatrix.
# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix bw(int n, double delta) {
NumericMatrix xy(n, 2);
xy(0, 0) = 0;
xy(0, 1) = 0;
for(int i=2; i<n; ++i) {
xy(i, 0) = xy(i-1, 0) + delta*rnorm(1)[0];
xy(i, 1) = xy(i-1, 1) + delta*rnorm(1)[0];
}
return xy;
}
plot(bw(10000, 1),
type = "l",
xlab = "x",
ylab = "y",
col = "blue")
Notice the term rnorm(1)[0], a little different from the standard R usage.
Anyone (like me!) who ever had to write a routine in C++ and needed a simple routine like rnorm which does not exist in C++ will find this very sweet! And that is why it is called sugar!
let’s write a routine that calculates the golden ratio via the Fibonacci numbers. First, these are defined by
\[ \begin{aligned} &n_0 = 1\\ &n_1 = 1\\ &n_k = n_{k-1}+n_{k-2}\\ \end{aligned} \] and the golden ratio is the limit
\[ \lim_{k \rightarrow \infty} \frac{n_k}{n_{k-1}} \] because of its definition the Fibonacci numbers are most easily calculated using recursion:
fibR <- function(n) {
if(n==0) return(0)
if(n==1) return(1)
return (fibR(n-1)+fibR(n-2))
}
fibR(10)
## [1] 55
Let’s write this with Rcpp:
# include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int fib_cpp(const int n) {
if(n==0) return(0);
if(n==1) return(1);
return fib_cpp(n-1)+fib_cpp(n-2);
}
fib_cpp(10)
## [1] 55
and now for the golden ratio:
golden_ratio <- function(n, fun) {
fun(n)/fun(n-1)
}
mb <- microbenchmark(golden_ratio(10, fun=fibR),
golden_ratio(10, fun=fib_cpp))
round(as.numeric(summary(mb)[, "median"]), 2)
## [1] 163.9 6.5
not only is the cpp version much faster, in this example R is actually quite useless: while it does recursion, doing it to often quickly becomes a problem (because of memory issues).
Now
golden_ratio(30, fun=fib_cpp)
## [1] 1.618
The actual value of the golden ratio is of course \(\frac{1+\sqrt{5}}2 =1.618..\).
So far we used C++ to speed up calculations, and we could use Sugar to call standard R functions in the C++ routine. There is another use, though. C++ has been the main computing language in many fields for a long time, and so there exist a large number of excellent routines already written. Say you found one of these and want to use it in your R program. Here is how:
What we need to do is to write a Rcpp wrapper routine, that eventually calls the C++ subroutine. Here is an example:
# include <Rcpp.h>
using namespace Rcpp;
//function declaration
double sum_of_squares(double x[], int n);
// [[Rcpp::export]]
double sub_routine(NumericVector x) {
int n=x.length();
double y[n];
double ssq;
for(int i=0;i<n;++i) y[i]=x[i];
ssq=sum_of_squares(y, n);
return ssq;
}
double sum_of_squares (double x[], int n)
{
double r=0.0;
for(int i=0;i<n;++i) r+=x[i]*x[i];
return r;
}
sub_routine(1:10)
## [1] 385
The sum_of_squares routine is pure C++, like any you might find on the web!
Exercise
Say we have a vector of numbers \((x_1,.,,x_n)\) and we want to find the k-means, that is \(\frac1{k}\sum_{i=j}^{i=j+k-1} x_i\). Here is a routine in R:
kmeans <- function(x, k){
n <- length(x)
y <- rep(0, n-k+1)
for(i in 1:(n-k+1))
y[i] <- mean(x[i:(i+k-1)])
y
}
Rewrite the routine as C++ and compare the speeds on
set.seed(111)
x <- rnorm(1e5)
and k=5
Let’s write this with Rcpp:
Let’s check first that the two routine do the same job:
rbind(kmeans(x[1:10], 5),
kmeans_cpp(x[1:10], 5))
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.5761 -0.5951 -0.8284 -0.9681 -0.6973 -0.762
## [2,] -0.5761 -0.5951 -0.8284 -0.9681 -0.6973 -0.762
and now
mb <- microbenchmark(kmeans(x, 5), kmeans_cpp(x, 5))
round(as.numeric(summary(mb)[, "median"]), 2)
## [1] 794106 1202