Parallel and GPU Computing

library(microbenchmark)

Many modern computers have several processor cores, which makes it easy to do parallel computing with R. Also, many simulation problems are embarrassingly parallel, that is, they consist of independent pieces that can be run in parallel with essentially no extra work.

Using multiple processors

There are a number of packages that help here. I will discuss

library(parallel)

If you don’t know how many processors (called cores) your computer has, you can check:

detectCores()
## [1] 6

It is usually a good idea to leave one core for other tasks, so let’s use

num_cores <- detectCores()-1

of them.

Let’s consider a simple simulation problem. We wish to study the standard estimators of the least squares regression line. That is, we have a data set of pairs \((x_i, y_i)\) and we want to fit an equation of the form \(y=\beta_0 + \beta_1 x + \epsilon\), where \(\epsilon \sim N(0, \sigma)\). The parameters \(\beta_0\), \(\beta_1\) and \(\sigma\) are estimated by minimizing the least squares criterion

\[ L(\beta_0, \beta_1) = \sum_{i=1}^n \left( y_i-\beta_0-\beta_1x_i \right)^2 \]

We can use the R function lm to do this:

x <- 1:20
y <- 5 + 2*x + rnorm(20, 0, 3)
fit <- lm(y~x)
plot(x, y)
abline(fit)

coef(fit)
## (Intercept)           x 
##       4.500       2.037

Now a simulation will fix some values for \(n\), \(\beta_0\), \(\beta_1\) and \(\sigma\), generate \(B\) data sets and find the coefficients for each. Finally it will study the estimates.

sim_lm <- function(param) {
  beta0 <- param[1]  # intercept
  beta1 <- param[2]  # slope
  sigma <- param[3]  # error standard deviation
  n <- param[4]      # sample size
  B <- param[5]      # number of simulation runs
  coefs <- matrix(0, B, 2)
  x <- 1:n
  for(i in 1:B) {
    y <- beta0 + beta1*x + rnorm(n, 0, sigma)
    coefs[i, ] <- coef(lm(y~x))
  }
  coefs
}
tm <- proc.time()
z1 <- sim_lm(c(5, 2, 3, 20, 50000))
tm <- round(proc.time()-tm)[3]
tm
## elapsed 
##      24

So this takes about 24 seconds. In real life we would repeat this for many different values of the parameters, so the total time can add up quickly. Instead let’s parallelize the task. clusterCall runs the same call on every worker, so with our 5 workers we only need \(B=10000\) runs per worker to again get 50000 in total:

cl <- makeCluster(num_cores) 
params <- c(5, 2, 3, 20, 10000)
tm <- proc.time()
z2<-clusterCall(cl, sim_lm, params)
tm <- round(proc.time()-tm)[3]
tm
## elapsed 
##       6

and so this took only about 6 seconds!

Did it really calculate the same thing? Let’s see. Note that clusterCall returns a list with one element per worker in the cluster:

par(mfrow=c(2, 2))
hist(z1[, 1], 100, 
     main = expression(paste(beta[0], ", Single")),
     xlab = "")
hist(z1[, 2], 100, 
     main=expression(paste(beta[1], ", Single")),
     xlab = "")
a <- rbind(z2[[1]], z2[[2]], z2[[3]], z2[[4]], z2[[5]])
hist(a[, 1], 100, 
     main = expression(paste(beta[0], ", Parallel")),
     xlab = "")
hist(a[, 2], 100, 
     main = expression(paste(beta[1], ", Parallel")),
     xlab = "")

Certainly looks like it!
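In practice we would want to vary the parameters. One way to do this is to give each worker its own parameter vector with clusterApply; here is a minimal sketch (the parameter values are just for illustration):

# one parameter vector per worker; here we vary sigma
param_list <- list(c(5, 2, 1, 20, 10000),
                   c(5, 2, 2, 20, 10000),
                   c(5, 2, 3, 20, 10000),
                   c(5, 2, 4, 20, 10000),
                   c(5, 2, 5, 20, 10000))
z3 <- clusterApply(cl, param_list, sim_lm)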

We previously discussed the apply family of functions. If one of these is what you want to use, they have parallel equivalents in the parallel package.

So say we have a large matrix and want to find the maximum in each row:

B <- 1e5
A <- matrix(runif(10*B), B, 10)
tm <- proc.time()
a <- apply(A, 1, max)
proc.time()-tm
##    user  system elapsed 
##    0.11    0.00    0.11
tm <- proc.time()
a <- parRapply(cl, A, max)
proc.time()-tm
##    user  system elapsed 
##    0.03    0.00    0.07
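There are analogous functions for columns and for general margins; a quick sketch, using the same matrix A and cluster cl:

a_col <- parCapply(cl, A, max)   # column maxima, like apply(A, 2, max)
a_gen <- parApply(cl, A, 2, max) # general parallel version of apply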

In general the easiest calculation to parallelize is one you have already written with the lapply command. Let’s say we have the scores of students on a number of exams. Because each exam had a different number of students, the data is organized as a list:

cat("Exam 1: ", grades$Exam_1, "\n")
## Exam 1:  68 73 67 72 85 74 67 66 75 71
cat("Exam 2: ", grades$Exam_2, "\n")
## Exam 2:  76 72 64 72 70 75 72 65 73 79 71 71 72 79 68
#...
cat("Exam 50: ", grades$Exam_50, "\n")
## Exam 50:  61 69 76 66 68
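The grades list itself is not printed in full. For readers who want to follow along, a list with the same structure could be generated along these lines; the class sizes and scores here are made up and not the actual data:

set.seed(1111)
grades <- lapply(1:50, function(i)
  round(rnorm(sample(5:15, 1), mean = 72, sd = 5)))  # one vector of scores per exam
names(grades) <- paste0("Exam_", 1:50)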

Now we want to find the minimum, mean, standard deviation and maximum of each exam. We can do that with

grade.summary <- function(x) round(c(min(x), 
                             mean(x), 
                             sd(x), 
                             max(x)), 1)
z <- lapply(grades, grade.summary)
z[1:2]
## $Exam_1
## [1] 66.0 71.8  5.6 85.0
## 
## $Exam_2
## [1] 64.0 71.9  4.3 79.0

and to run this in parallel:

z <- parLapply(cl, grades, grade.summary)
z[1:2]
## $Exam_1
## [1] 66.0 71.8  5.6 85.0
## 
## $Exam_2
## [1] 64.0 71.9  4.3 79.0
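If a matrix with one column per exam is more convenient, sapply also has a parallel equivalent; a quick sketch:

z_mat <- parSapply(cl, grades, grade.summary)
z_mat[, 1:2]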

When you are done with the parallel calculations, shut the cluster down:

stopCluster(cl)

Example (Umbrella problem)

You go every day (during the week) from your house to school in the morning and back in the evening. You own n umbrellas, which at any time are split between home and school. If it rains and you have an umbrella where you are, you take it along; otherwise you get wet. The probability of rain in the morning is p and in the evening is q. What is the probability of getting wet?

Let’s write a simulation to answer this question:

umbrella <- function(params) {
  n <- params[1] # total number of umbrellas
  B <- params[2] # number of days to simulate
  p <- params[3] # probability of rain in the morning
  q <- params[4] # probability of rain in the evening
  x <- sample(1:n, 1) # umbrellas at home on the first morning
  num.got.wet <- 0
  for(i in 1:B) {
    if(runif(1)<p) {# it's raining in the morning
       if(x==0) num.got.wet <- num.got.wet+1
       else x <- x-1 # take an umbrella to school: one less at home
    }
    if(runif(1)<q) {# it's raining in the evening
       if(n-x==0) num.got.wet <- num.got.wet+1
       else x <- x+1 # take an umbrella home: one more at home
    }
  }
  num.got.wet/B
}
a <- umbrella(c(4, 1e4, 0.2, 0.5))
params <- c(4, 2*1e3, 0.2, 0.5) # 2000 days on each of the 5 workers, 10000 in total as before
cl <- makeCluster(num_cores) 
b <- clusterCall(cl, umbrella, params)
a
## [1] 0.3005
unlist(b)
## [1] 0.2880 0.3180 0.2985 0.2940 0.3035
mean(unlist(b))
## [1] 0.3004
mb <- microbenchmark(
      umbrella(c(4, 1e4, 0.2, 0.5)),
      clusterCall(cl, umbrella, params))  
round(as.numeric(summary(mb)[, "median"]), 2)
## [1] 27.02  7.12

GPU Programming

Twenty years ago or so there was a lot of talk about massively parallel computing, the idea of using hundreds or thousands of CPUs (central processing units). It never went very far because such computers were way too expensive.

However, there was one area where such chips were in fact developed, namely graphics cards. The difference is that these processors are extremely simple; they don’t have to do much, just determine the colors of a few pixels. Eventually it occurred to people that, as long as the computations were simple as well, such GPUs (graphics processing units) could also be used for other purposes. So if your computer has a dedicated graphics card you can do this.

Not all cards, however, will work. The most widely available ones are NVIDIA cards, but some others work as well.

gpuR

To do GPU computing in R you need to install the gpuR library. Unlike most packages this one is distributed as source code, so it needs to be compiled before you can use it. This happens automatically during installation but takes a bit of time.

To make sure you have everything that is needed, install the package and then run

library(gpuR)
## Number of platforms: 1
## - platform: NVIDIA Corporation: OpenCL 1.2 CUDA 9.1.75
##   - context device index: 0
##     - GeForce GTX 1050 Ti
## checked all devices
## completed initialization
detectGPUs()
## [1] 1

The gpuR package mostly has routines for matrix algebra. So let’s say we have a matrix which we want to invert.

A <- matrix(rnorm(100), 10, 10)
summary(microbenchmark(solve(A)))["median"]
##   median
## 1   18.9

To use gpuR we have to turn the matrix into a gpuR object, and then we can run solve again:

A_gpuR <- vclMatrix(A, type="float")
summary(microbenchmark(solve(A_gpuR)))["median"]
##   median
## 1  4.835

Most linear algebra methods have been created to be executed for the gpuR matrix and vector objects. These methods include the basic arithmetic functions %*%, +, -, *, /, t, crossprod, tcrossprod, colMeans, colSums, rowMeans, and rowSums.

Math functions include sin, asin, sinh, cos, acos, cosh, tan, atan, tanh, log, log10, exp, abs, max, and min. Additional operations include some linear algebra routines such as cov (Pearson covariance) and eigen.

A few ’distance’ routines have also been added with the dist and distance (for pairwise) functions. These currently include ’Euclidean’ and ’SqEuclidean’ methods.
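As a minimal sketch of how these methods are used (this assumes the vclMatrix A_gpuR from above and a working OpenCL device; the second matrix is made up for illustration):

B_gpuR <- vclMatrix(matrix(rnorm(100), 10, 10), type = "float")
C_gpuR <- A_gpuR %*% B_gpuR   # matrix product, computed on the GPU
D_gpuR <- A_gpuR + B_gpuR     # element-wise addition
colMeans(C_gpuR)              # column means of a GPU matrix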