One of the most useful features of R is it’s ability to do math on vectors. In fact we have already used this feature many times, but now we will study it explicitly.
options(digits=4)
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
x^2
## [1] 1 4 9 16 25 36 49 64 81 100
sqrt(x)
## [1] 1.000 1.414 1.732 2.000 2.236 2.449 2.646 2.828 3.000 3.162
x^2*exp(-(x/10)^2)
## [1] 0.990 3.843 8.225 13.634 19.470 25.116 30.019 33.747 36.034 36.788
y <- rep(1:2, 5)
y
## [1] 1 2 1 2 1 2 1 2 1 2
x+y
## [1] 2 4 4 6 6 8 8 10 10 12
x^2+y^2
## [1] 2 8 10 20 26 40 50 68 82 104
To some degree that also works for matrices:
x <- matrix(1:10, nrow=2)
x
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
x^2
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 9 25 49 81
## [2,] 4 16 36 64 100
but this can also fail!
R can also handle basic matrix calculations:
x
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
y
## [,1] [,2]
## [1,] 2 2
## [2,] 2 2
x*y
## [,1] [,2]
## [1,] 2 4
## [2,] 6 8
so this does element wise multiplication. If we want to do actual matrix multiplication we have
x %*% y # an infix operator!
## [,1] [,2]
## [1,] 6 6
## [2,] 14 14
rbind(1:3) %*% cbind(1:3)
## [,1]
## [1,] 14
we can use the solve command to solve a system of linear equations. Say we have the system
\[ \begin{aligned} &2x+3y-z = 1\\ &x-y = 0\\ &y+3z = 2\\ \end{aligned} \]
A <- rbind(c(2, 3, -1), c(1, -1, 0), c(0, 1, 3))
A
## [,1] [,2] [,3]
## [1,] 2 3 -1
## [2,] 1 -1 0
## [3,] 0 1 3
solve(A, c(1, 0, 2))
## [1] 0.3125 0.3125 0.5625
or to find the inverse matrix:
solve(A)
## [,1] [,2] [,3]
## [1,] 0.1875 0.625 0.0625
## [2,] 0.1875 -0.375 0.0625
## [3,] -0.0625 0.125 0.3125
transposition is done with
t(A)
## [,1] [,2] [,3]
## [1,] 2 1 0
## [2,] 3 -1 1
## [3,] -1 0 3
Note so careful when using t as the name of an object!
As you know (I hope?) from linear algebra any (non-singular square) matrix A can be written in the form
\[A=UDU^{-1}\] where D is a diagonal matrix. One use of this is we can then easily find
\[ \begin{aligned} &A^2 = AA = \\ &UDU^{-1}UDU^{-1} = \\ &UDDU^{-1} =\\ &UD^2U^{-1} \\ \end{aligned} \] and with induction we get
\[A^n=UD^nU^{-1}\]
Example Say
A <- matrix(1:9/10, 3, 3)
z <- eigen(A)
z
## eigen() decomposition
## $values
## [1] 1.612e+00 -1.117e-01 5.239e-17
##
## $vectors
## [,1] [,2] [,3]
## [1,] -0.4645 -0.8829 0.4082
## [2,] -0.5708 -0.2395 -0.8165
## [3,] -0.6770 0.4039 0.4082
z$vectors %*% diag(z$values^2) %*% solve(z$vectors)
## [,1] [,2] [,3]
## [1,] 0.30 0.66 1.02
## [2,] 0.36 0.81 1.26
## [3,] 0.42 0.96 1.50
A %*% A
## [,1] [,2] [,3]
## [1,] 0.30 0.66 1.02
## [2,] 0.36 0.81 1.26
## [3,] 0.42 0.96 1.50
z$vectors %*% diag(z$values^10) %*% solve(z$vectors)
## [,1] [,2] [,3]
## [1,] 13.25 30.00 46.75
## [2,] 16.28 36.86 57.45
## [3,] 19.31 43.72 68.14
Other functions for matrices are qr for decomposition and svd for singular value decomposition. There are also packages for dealing with things like sparse matrices etc.
When you write your own functions you should write them in such a way that they in turn are vectorized, that is can be applied to vectors. Here is one way to this. Consider the function integrate, which does numerical integration:
f <- function(x) {x^2}
I <- integrate(f, 0, 1)
is.list(I)
## [1] TRUE
names(I)
## [1] "value" "abs.error" "subdivisions" "message"
## [5] "call"
as we see the result of a call to the integrate function is a list. The important part is the value, so we can write
integrate(f, 0, 1)$value
## [1] 0.3333
but let’s say we want to calculate this integral for an interval of the form [0, A], not just [0, 1]. Here A might be many possible values. We can do this:
fA <- function (A) integrate(f, 0, A)$value
fA(1)
## [1] 0.3333
fA(2)
## [1] 2.667
but not
fA(1:2)
## Error in integrate(f, 0, A): length(upper) == 1 is not TRUE
so we need to “vectorize”:
fAvec <- Vectorize(fA)
fAvec(c(1, 2))
## [1] 0.3333 2.6667
This works fine, but does have some drawbacks. General functions like Vectorize have to work in a great many different cases, so they need to do a lot of checking, which takes time to do. In practice it is often better to vectorize your routine yourself:
fA.vec <- function (A) {
y <- 0*A
for(i in seq_along(A))
y[i] <- integrate(f, 0, A[i])$value
y
}
fA.vec(c(1, 2))
## [1] 0.3333 2.6667
Once you have a function that does something for one value, vectorizing it is (usually) very quick and almost always worthwhile!
There is a set of routines that can be used to vectorize. Say we want to do a simulation to study the variance of the mean and the median in a sample from the normal distribution.
sim1 <- function(n, B=1e4) {
y <- matrix(0, B, 2)
for(i in 1:B) {
x <- rnorm(n)
y[i, 1] <- mean(x)
y[i, 2] <- median(x)
}
c(sd(y[, 1]), sd(y[, 2]))
}
sim1(50)
## [1] 0.1427 0.1746
Here is an alternative:
sim2 <- function(n, B=1e4) {
x <- matrix(rnorm(n*B), B, 50)
c(sd(apply(x, 1, mean)), sd(apply(x, 1, median)))
}
sim2(50)
## [1] 0.1410 0.1737
Now this obviously has the advantage of being shorter.
If you read books on R written more than a few year ago you find many comments warning against the use of loops. They used to be very slow, much slower than using apply. Let’s check the speed of the calculation with the microbenchmark package:
library(microbenchmark)
microbenchmark(sim1(50), times = 10)
## Unit: milliseconds
## expr min lq mean median uq max neval
## sim1(50) 358.2 362.8 372.8 371.8 377.9 401.8 10
microbenchmark(sim2(50), times = 10)
## Unit: milliseconds
## expr min lq mean median uq max neval
## sim2(50) 361 374.4 399.7 387.2 410.6 502.5 10
so the loop is actually faster! A few versions ago the whole implementation of loops in R was rewritten, and these days they are actually quite fast! That still leaves the advantage of short code.
There are variants of apply for other data structures:
x <- list(A=rnorm(10), B=rnorm(20), C=rnorm(30))
lapply(x, mean)
## $A
## [1] 0.1512
##
## $B
## [1] 0.09162
##
## $C
## [1] -0.1193
as we see the result is again a list. Notice that this goes against the way R is generally written: the resulting object could be type-converted to the simpler vector but is not.
Often we want it to be a vector. We could use unlist, or
sapply(x, mean)
## A B C
## 0.15121 0.09162 -0.11930
apply a function to the numbers in one vector, grouped by the values in another (categorical) vector:
GPA <- round(runif(100, 2, 4), 1)
Gender <- sample(c("Male", "Female"),
size=100, replace=TRUE)
tapply(GPA, Gender, mean)
## Female Male
## 3.027 2.991
Here is another less obvious example:
par(mfrow=c(1,2))
tapply(GPA, Gender, hist, breaks=50,
main="", xlab="", ylab="")