Vector Arithmetic

One of the most useful features of R is its ability to do math on vectors. In fact, we have already used this feature many times, but now we will study it explicitly.

options(digits=4)
x <- 1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x^2
##  [1]   1   4   9  16  25  36  49  64  81 100
sqrt(x)
##  [1] 1.000 1.414 1.732 2.000 2.236 2.449 2.646 2.828 3.000 3.162
x^2*exp(-(x/10)^2)
##  [1]  0.990  3.843  8.225 13.634 19.470 25.116 30.019 33.747 36.034 36.788
y <- rep(1:2, 5)
y
##  [1] 1 2 1 2 1 2 1 2 1 2
x+y
##  [1]  2  4  4  6  6  8  8 10 10 12
x^2+y^2
##  [1]   2   8  10  20  26  40  50  68  82 104
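Vector operations follow R's recycling rule: the shorter vector is repeated until it matches the length of the longer one, with a warning when the longer length is not a multiple of the shorter. A small sketch:

```r
x <- 1:10
# 1:2 is recycled five times, so this is the same as x + rep(1:2, 5)
x + 1:2
# 1:5 is recycled twice; no warning, because the lengths divide evenly
x + 1:5
# x + 1:3 would still compute, but with a warning:
# "longer object length is not a multiple of shorter object length"
```

This is convenient but also a common source of silent bugs, so it pays to keep the lengths in mind.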

To some degree that also works for matrices:

x <- matrix(1:10, nrow=2)
x
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
x^2
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    9   25   49   81
## [2,]    4   16   36   64  100

but this can also fail, for example when the dimensions do not match!
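For instance, element-wise operations on two matrices require identical dimensions, and combining a matrix with a vector recycles the vector down the columns, which may not be what you intended. A sketch:

```r
x <- matrix(1:10, nrow = 2)
y <- matrix(1:10, nrow = 5)
# same ten numbers, different shapes: this is an error
try(x + y)               # Error: non-conformable arrays
# a vector is recycled column by column, i.e. along the rows:
x + c(10, 20)            # adds 10 to row 1 and 20 to row 2
```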

Matrix Algebra

R can also handle basic matrix calculations:

x <- matrix(1:4, 2, 2, byrow=TRUE)
x
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
y <- matrix(2, 2, 2)
y
##      [,1] [,2]
## [1,]    2    2
## [2,]    2    2
x*y
##      [,1] [,2]
## [1,]    2    4
## [2,]    6    8

so * does element-wise multiplication. If we want actual matrix multiplication we use the %*% operator:

x %*% y # an infix operator!
##      [,1] [,2]
## [1,]    6    6
## [2,]   14   14
rbind(1:3) %*% cbind(1:3)
##      [,1]
## [1,]   14
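For products that involve a transpose, the crossprod and tcrossprod functions compute t(x) %*% y and x %*% t(y) directly, which is both clearer and slightly more efficient than forming the transpose explicitly. A small sketch:

```r
x <- matrix(1:4, 2, 2, byrow = TRUE)
y <- matrix(2, 2, 2)
crossprod(x, y)     # same as t(x) %*% y
tcrossprod(x, y)    # same as x %*% t(y)
crossprod(1:3)      # t(1:3) %*% 1:3, a 1x1 matrix holding the sum of squares
```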

We can use the solve command to solve a system of linear equations. Say we have the system

\[ \begin{aligned} &2x+3y-z = 1\\ &x-y = 0\\ &y+3z = 2\\ \end{aligned} \]

A <- rbind(c(2, 3, -1), c(1, -1, 0), c(0, 1, 3))
A
##      [,1] [,2] [,3]
## [1,]    2    3   -1
## [2,]    1   -1    0
## [3,]    0    1    3
solve(A, c(1, 0, 2))
## [1] 0.3125 0.3125 0.5625

or to find the inverse matrix:

solve(A)
##         [,1]   [,2]   [,3]
## [1,]  0.1875  0.625 0.0625
## [2,]  0.1875 -0.375 0.0625
## [3,] -0.0625  0.125 0.3125
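We can check both results: multiplying A by its inverse should give the identity matrix (up to rounding), and plugging the solution back into the system should return the right-hand side:

```r
A <- rbind(c(2, 3, -1), c(1, -1, 0), c(0, 1, 3))
round(solve(A) %*% A, 10)    # the 3x3 identity matrix
A %*% solve(A, c(1, 0, 2))   # returns the right-hand side (1, 0, 2)
```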

Transposition is done with

t(A)
##      [,1] [,2] [,3]
## [1,]    2    1    0
## [2,]    3   -1    1
## [3,]   -1    0    3

Note: be careful when using t as the name of an object, because it masks the transpose function!

As you know (I hope?) from linear algebra, any diagonalizable square matrix A can be written in the form

\[A=UDU^{-1}\] where D is a diagonal matrix holding the eigenvalues of A and the columns of U are the corresponding eigenvectors. One use of this is that we can then easily find

\[ \begin{aligned} &A^2 = AA = \\ &UDU^{-1}UDU^{-1} = \\ &UDDU^{-1} =\\ &UD^2U^{-1} \\ \end{aligned} \] and with induction we get

\[A^n=UD^nU^{-1}\]

Example: Say

A <- matrix(1:9/10, 3, 3)
z <- eigen(A)
z
## eigen() decomposition
## $values
## [1]  1.612e+00 -1.117e-01  5.239e-17
## 
## $vectors
##         [,1]    [,2]    [,3]
## [1,] -0.4645 -0.8829  0.4082
## [2,] -0.5708 -0.2395 -0.8165
## [3,] -0.6770  0.4039  0.4082
z$vectors %*% diag(z$values^2) %*% solve(z$vectors)
##      [,1] [,2] [,3]
## [1,] 0.30 0.66 1.02
## [2,] 0.36 0.81 1.26
## [3,] 0.42 0.96 1.50
A %*% A
##      [,1] [,2] [,3]
## [1,] 0.30 0.66 1.02
## [2,] 0.36 0.81 1.26
## [3,] 0.42 0.96 1.50
z$vectors %*% diag(z$values^10) %*% solve(z$vectors)
##       [,1]  [,2]  [,3]
## [1,] 13.25 30.00 46.75
## [2,] 16.28 36.86 57.45
## [3,] 19.31 43.72 68.14

Other functions for matrices are qr for decomposition and svd for singular value decomposition. There are also packages for dealing with things like sparse matrices etc.
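As a quick illustration of svd: the three components it returns (d, u, v, following the documented return value) reconstruct the original matrix.

```r
A <- matrix(1:9/10, 3, 3)
s <- svd(A)
# A = U diag(d) t(V); the largest reconstruction error is essentially 0
max(abs(s$u %*% diag(s$d) %*% t(s$v) - A))
```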

Vectorization

When you write your own functions you should write them in such a way that they in turn are vectorized, that is, they can be applied to vectors. Here is one way to do this. Consider the function integrate, which does numerical integration:

f <- function(x) {x^2}
I <- integrate(f, 0, 1)
is.list(I)
## [1] TRUE
names(I)
## [1] "value"        "abs.error"    "subdivisions" "message"     
## [5] "call"

As we see, the result of a call to integrate is a list. The important part is the value element, so we can write

integrate(f, 0, 1)$value
## [1] 0.3333

But let’s say we want to calculate this integral over intervals of the form [0, A], not just [0, 1], for many possible values of A. We can do this:

fA <- function (A) integrate(f, 0, A)$value
fA(1)
## [1] 0.3333
fA(2)
## [1] 2.667

but not

fA(1:2)
## Error in integrate(f, 0, A): length(upper) == 1 is not TRUE

so we need to “vectorize”:

fAvec <- Vectorize(fA)
fAvec(c(1, 2))
## [1] 0.3333 2.6667
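Vectorize can also handle functions with several arguments; by default it vectorizes over all of them, and the vectorize.args argument restricts which ones. A sketch, assuming we now want integrals over general intervals [a, b] (fab is a name made up for this example):

```r
f <- function(x) x^2
fab <- function(a, b) integrate(f, a, b)$value
fab.vec <- Vectorize(fab)          # vectorized over both a and b
fab.vec(a = c(0, 1), b = c(1, 2))  # integrals over [0, 1] and [1, 2]
```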

This works fine, but does have some drawbacks. General functions like Vectorize have to work in a great many different cases, so they need to do a lot of checking, which takes time. In practice it is often better to vectorize your routine yourself:

fA.vec <- function (A) {
  y <- 0*A  # output vector of the same length as A
  for(i in seq_along(A))
    y[i] <- integrate(f, 0, A[i])$value
  y
}
fA.vec(c(1, 2))
## [1] 0.3333 2.6667
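The same loop can also be written as a one-liner with sapply, one of the apply-family functions discussed in the next section:

```r
f <- function(x) x^2
fA.sap <- function(A) sapply(A, function(a) integrate(f, 0, a)$value)
fA.sap(c(1, 2))   # same answers as the loop version
```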

Once you have a function that does something for one value, vectorizing it is (usually) very quick and almost always worthwhile!

apply family of functions

There is a set of routines, the apply family, that can be used to vectorize. Say we want to do a simulation to study the variance of the mean and the median in a sample from the normal distribution.

sim1 <- function(n, B=1e4) {
  y <- matrix(0, B, 2)
  for(i in 1:B) {
    x <- rnorm(n)
    y[i, 1] <- mean(x)
    y[i, 2] <- median(x)  
  }
  c(sd(y[, 1]), sd(y[, 2]))  
}
sim1(50)
## [1] 0.1427 0.1746

Here is an alternative:

  • apply for matrices:
sim2 <- function(n, B=1e4) {
  x <- matrix(rnorm(n*B), B, n)
  c(sd(apply(x, 1, mean)), sd(apply(x, 1, median)))
}
sim2(50)
## [1] 0.1410 0.1737

Now this obviously has the advantage of being shorter.

If you read books on R written more than a few years ago you will find many comments warning against the use of loops. They used to be very slow, much slower than using apply. Let’s check the speed of the calculation with the microbenchmark package:

library(microbenchmark)
microbenchmark(sim1(50), times = 10)
## Unit: milliseconds
##      expr   min    lq  mean median    uq   max neval
##  sim1(50) 358.2 362.8 372.8  371.8 377.9 401.8    10
microbenchmark(sim2(50), times = 10)
## Unit: milliseconds
##      expr min    lq  mean median    uq   max neval
##  sim2(50) 361 374.4 399.7  387.2 410.6 502.5    10

so the loop is actually faster! A few versions ago the implementation of loops in R was rewritten, and these days they are quite fast. That still leaves apply the advantage of shorter code.
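Another routine in the same spirit that is worth knowing for simulations is replicate, which evaluates an expression repeatedly and collects the results. A sketch of the same study:

```r
sim3 <- function(n, B=1e4) {
  y <- replicate(B, {          # y becomes a 2 x B matrix
    x <- rnorm(n)
    c(mean(x), median(x))
  })
  c(sd(y[1, ]), sd(y[2, ]))
}
sim3(50)
```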

There are variants of apply for other data structures:

  • lapply for lists:
x <- list(A=rnorm(10), B=rnorm(20), C=rnorm(30))
lapply(x, mean)
## $A
## [1] 0.1512
## 
## $B
## [1] 0.09162
## 
## $C
## [1] -0.1193

As we see, the result is again a list. Notice that this goes against the way R usually behaves: the result could be simplified to a plain vector, but is not.

Often we want it to be a vector. We could use unlist, or

sapply(x, mean)
##        A        B        C 
##  0.15121  0.09162 -0.11930
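When the type and length of each result are known in advance, vapply is a safer alternative to sapply: you state the expected shape, and it throws an error if any result does not match.

```r
x <- list(A = rnorm(10), B = rnorm(20), C = rnorm(30))
vapply(x, mean, numeric(1))   # named numeric vector, as with sapply here
# vapply(x, range, numeric(1)) would fail, because range returns 2 values
```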
  • tapply for vectors

It applies a function to the numbers in one vector, grouped by the values in another (categorical) vector:

GPA <- round(runif(100, 2, 4), 1)
Gender <- sample(c("Male", "Female"), 
      size=100, replace=TRUE)
tapply(GPA, Gender, mean)
## Female   Male 
##  3.027  2.991

Here is another, less obvious example, drawing a histogram of GPA for each gender:

par(mfrow=c(1,2))
tapply(GPA, Gender, hist, breaks=50,
       main="", xlab="", ylab="")
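An alternative to tapply for this kind of group-wise work is split, which breaks a vector into a list by group; combined with sapply or lapply it covers the same ground:

```r
GPA <- round(runif(100, 2, 4), 1)
Gender <- sample(c("Male", "Female"), size=100, replace=TRUE)
sapply(split(GPA, Gender), mean)   # same numbers as tapply(GPA, Gender, mean)
```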