Here are some useful commands for vectors:
x <- c(1, 2, 3, 4, 5, 6)
x
## [1] 1 2 3 4 5 6
length(x)
## [1] 6
names(x) <- LETTERS[1:6]
x
## A B C D E F
## 1 2 3 4 5 6
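Once a vector has names, they can be used to pick out elements. A small sketch, reusing the vector from above:

```r
x <- c(1, 2, 3, 4, 5, 6)
names(x) <- LETTERS[1:6]
x["C"]               # a single element, selected by name
## C
## 3
x[c("A", "F")]       # several elements at once
## A F
## 1 6
```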
If we have several vectors of the same type and length that belong together, we can put them in a matrix:
x <- c(1, 2, 3, 4, 5, 6)
y <- c(4, 2, 3, 5, 3, 4)
z <- c(3, 4, 2, 3, 4, 2)
cbind(x, y, z)
## x y z
## [1,] 1 4 3
## [2,] 2 2 4
## [3,] 3 3 2
## [4,] 4 5 3
## [5,] 5 3 4
## [6,] 6 4 2
rbind(x, y, z)
## [,1] [,2] [,3] [,4] [,5] [,6]
## x 1 2 3 4 5 6
## y 4 2 3 5 3 4
## z 3 4 2 3 4 2
Here are several other ways to make a matrix:
matrix(x, 2, 3)
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
matrix(x, ncol=2)
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
matrix(0, 3, 3)
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
## [3,] 0 0 0
diag(3)
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
We can control how the matrix is filled in:
matrix(1:8, nrow=2)
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
matrix(1:8, nrow=2, byrow = TRUE)
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
Just like a vector, a matrix holds values of a single type, with R doing type conversion if necessary:
matrix(c(1, 2, "A", 3), 2, 2)
## [,1] [,2]
## [1,] "1" "A"
## [2,] "2" "3"
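The conversion always goes toward the more general type: logical becomes numeric, and numeric becomes character. A minimal sketch (the names m1 and m2 are used only for illustration):

```r
# Numbers mixed with a string: everything becomes character
m1 <- matrix(c(1, 2, "A", 3), 2, 2)
typeof(m1)
## [1] "character"

# Logicals mixed with numbers: TRUE/FALSE become 1/0
m2 <- matrix(c(TRUE, 2, FALSE, 4), 2, 2)
typeof(m2)
## [1] "double"
m2
##      [,1] [,2]
## [1,]    1    0
## [2,]    2    4
```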
Useful commands for R matrices are:
x <- matrix(c(1, 2, 3, 4, 5, 6), 2, 3)
x
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
dim(x)
## [1] 2 3
nrow(x)
## [1] 2
ncol(x)
## [1] 3
dimnames(x) <- list(c("A", "B"), c("a", "b", "c"))
x
## a b c
## A 1 3 5
## B 2 4 6
colnames(x)
## [1] "a" "b" "c"
rownames(x)
## [1] "A" "B"
colnames(x) <- c("Height", "Age 1", "Age 2")
x
## Height Age 1 Age 2
## A 1 3 5
## B 2 4 6
Actually, I would recommend
colnames(x) <- c("Height", "Age.1", "Age.2")
x
## Height Age.1 Age.2
## A 1 3 5
## B 2 4 6
Empty spaces in names are allowed but can on occasion lead to problems, so avoiding them is a good idea. Of course names without spaces are not as nice as labels in graphs or as titles in tables, but it is usually easy to change those.
Sometimes a matrix is created by a function and given row and column names automatically, but we don’t want it to have them. They can be removed with
rownames(x) <- NULL
x
## Height Age.1 Age.2
## [1,] 1 3 5
## [2,] 2 4 6
Here is a strange way to make a matrix:
x <- 1:6
x
## [1] 1 2 3 4 5 6
dim(x) <- c(2, 3)
x
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Why does this work? dim is an attribute of an object, and by changing this attribute we change the object.
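The attribute can be inspected directly with attributes(), and removing it turns the matrix back into a plain vector. A short sketch:

```r
x <- 1:6
attributes(x)        # a plain vector has no attributes
## NULL
dim(x) <- c(2, 3)
attributes(x)        # now there is a dim attribute
## $dim
## [1] 2 3
is.matrix(x)
## [1] TRUE
dim(x) <- NULL       # removing the attribute gives back the vector
x
## [1] 1 2 3 4 5 6
```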
What happens if we try to make a matrix out of a vector that doesn’t have the right number of entries?
matrix(1:5, 2, 2)
## Warning in matrix(1:5, 2, 2): data length [5] is not a sub-multiple or
## multiple of the number of rows [2]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
Not surprisingly, it just uses the elements it needs.
matrix(1:3, 2, 2)
## Warning in matrix(1:3, 2, 2): data length [3] is not a sub-multiple or
## multiple of the number of rows [2]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 1
In this case R just starts all over again from the first element. This behavior is called recycling.
In either case R gives a warning. Except in rare cases you should try to avoid these situations; they are usually the consequence of bad programming.
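When the length of the data divides the number of cells evenly, R recycles without any warning. The same rule is used in vector arithmetic. A small sketch (m and s are names chosen for illustration):

```r
# 2 values for 8 cells: the data is reused four times, no warning
m <- matrix(1:2, 2, 4)
m
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    2    2    2    2

# The shorter vector c(10, 20) is reused three times
s <- 1:6 + c(10, 20)
s
## [1] 11 22 13 24 15 26
```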
An array is a k-dimensional matrix. For example:
array(1:8, dim=c(2, 2, 2))
## , , 1
##
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## , , 2
##
## [,1] [,2]
## [1,] 5 7
## [2,] 6 8
Often these come about as a 3-way table:
x <- sample(1:3, size = 20, replace = TRUE)
y <- sample(c("F", "M"), size = 20, replace = TRUE)
z <- sample(c("a", "b", "c"), size = 20, replace = TRUE)
xyz <- table(x,y,z)
xyz
## , , z = a
##
## y
## x F M
## 1 1 2
## 2 1 0
## 3 0 1
##
## , , z = b
##
## y
## x F M
## 1 1 0
## 2 4 2
## 3 1 0
##
## , , z = c
##
## y
## x F M
## 1 4 0
## 2 0 0
## 3 1 2
The commands for arrays are similar to those for matrices:
dim(xyz)
## [1] 3 2 3
dimnames(xyz)
## $x
## [1] "1" "2" "3"
##
## $y
## [1] "F" "M"
##
## $z
## [1] "a" "b" "c"
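Indexing and apply() also carry over from matrices, with one index per dimension. A sketch using a small array with fixed values, so that the output does not depend on random sampling:

```r
a <- array(1:8, dim = c(2, 2, 2))
a[1, 2, 2]           # row 1, column 2, layer 2
## [1] 7
a[, , 1]             # one layer is an ordinary matrix
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
layer.sums <- apply(a, 3, sum)   # sum over each layer (third dimension)
layer.sums
## [1] 10 26
```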
Sometimes we have several vectors of the same length but of different types. We can then put them together as a data frame:
x <- c(1, 2, 3, 4, 5, 6)
y <- c("a", "a", "b", "c", "a", "c")
z <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
xyz <- data.frame(x, y, z)
xyz
## x y z
## 1 1 a TRUE
## 2 2 a TRUE
## 3 3 b FALSE
## 4 4 c TRUE
## 5 5 a FALSE
## 6 6 c TRUE
This type of data is very common in Statistics, and data frames have been a standard data type in R from the beginning. In general, when you get data from a source like the internet it will usually come as a data frame.
The same commands as for matrices work for data frames as well:
dim(xyz)
## [1] 6 3
nrow(xyz)
## [1] 6
ncol(xyz)
## [1] 3
dimnames(xyz) <- list(letters[1:6], c("a", "b", "c"))
xyz
## a b c
## a 1 a TRUE
## b 2 a TRUE
## c 3 b FALSE
## d 4 c TRUE
## e 5 a FALSE
## f 6 c TRUE
colnames(xyz)
## [1] "a" "b" "c"
rownames(xyz)
## [1] "a" "b" "c" "d" "e" "f"
Say we want to add another column (variable) to the data frame:
xyz[[4]] <- (1:6)^2
colnames(xyz)[4] <- "squares"
xyz
## a b c squares
## a 1 a TRUE 1
## b 2 a TRUE 4
## c 3 b FALSE 9
## d 4 c TRUE 16
## e 5 a FALSE 25
## f 6 c TRUE 36
If we want to get rid of a column:
xyz[[4]] <- NULL
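Columns can also be added and removed by name with $, which avoids having to count positions. A minimal sketch with a small data frame d made up for illustration:

```r
d <- data.frame(id = 1:3, grp = c("a", "b", "a"))
d$squares <- d$id^2      # add a column by name
colnames(d)
## [1] "id"      "grp"     "squares"
d$squares <- NULL        # remove it again
colnames(d)
## [1] "id"  "grp"
```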
Before version 4.0.0, R had a strange default behavior for data frames: it turned strings into factors (since R 4.0.0 strings are left as they are by default):
df <- data.frame(x=1:5, y=letters[1:5])
str(df)
## 'data.frame': 5 obs. of 2 variables:
## $ x: int 1 2 3 4 5
## $ y: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
You can prevent this from happening, though:
df <- data.frame(x = 1:5,
y = letters[1:5],
stringsAsFactors = FALSE)
str(df)
## 'data.frame': 5 obs. of 2 variables:
## $ x: int 1 2 3 4 5
## $ y: chr "a" "b" "c" "d" ...
In fact, this is what I want to happen almost all the time. So I change that option globally whenever I run R or RStudio. I will show you how to do that at some point.
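If a data frame already contains a factor column, it can be converted afterwards. A sketch, with stringsAsFactors = TRUE set explicitly so that the result does not depend on the version of R:

```r
df <- data.frame(x = 1:5, y = letters[1:5], stringsAsFactors = TRUE)
class(df$y)
## [1] "factor"
df$y <- as.character(df$y)   # convert the factor back to strings
class(df$y)
## [1] "character"
```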
If a data frame is all of the same type we can use as.matrix to turn it into a matrix.
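For example, with a small all-numeric data frame (d and m are names chosen for illustration):

```r
d <- data.frame(a = 1:3, b = 4:6)
m <- as.matrix(d)
m
##      a b
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
is.matrix(m)
## [1] TRUE
```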
Exercise
What does as.matrix() do when it is applied to a data frame with columns of different types?
Finally, if the vectors aren’t even of the same length, we can put them in a list:
x <- c(1, 2, 3, 4, 5, 6)
y <- c("a", "a", "b")
z <- c(TRUE, TRUE)
xyz <- list(x, y, z)
xyz
## [[1]]
## [1] 1 2 3 4 5 6
##
## [[2]]
## [1] "a" "a" "b"
##
## [[3]]
## [1] TRUE TRUE
Lists are displayed quite differently from the other formats. Here are a number of useful commands:
length(xyz)
## [1] 3
names(xyz) <- c("Count", "Letter", "Married?")
xyz
## $Count
## [1] 1 2 3 4 5 6
##
## $Letter
## [1] "a" "a" "b"
##
## $`Married?`
## [1] TRUE TRUE
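The names can then be used to pull out individual elements. A short sketch of the usual ways to access list elements, rebuilding the list from above so the example runs on its own:

```r
xyz <- list(Count = 1:6, Letter = c("a", "a", "b"), `Married?` = c(TRUE, TRUE))
xyz$Letter           # by name
## [1] "a" "a" "b"
xyz[[1]]             # by position; returns the element itself
## [1] 1 2 3 4 5 6
xyz[1]               # single brackets return a list of length 1
## $Count
## [1] 1 2 3 4 5 6
xyz[["Married?"]]    # a name like this needs [[ ]] or backticks
## [1] TRUE TRUE
```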
Often we want to use a list inside a function to record various values. So we need to create an “empty” list of a certain length:
x <- as.list(1:3)
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
If we run out of space, we can simply add another element:
x[[4]] <- "a"
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] "a"
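An alternative that creates a list whose elements really are empty is vector("list", n). This is a sketch of one common idiom, not the only way to do it; res is a name chosen for illustration:

```r
res <- vector("list", 3)     # a list with three NULL elements
length(res)
## [1] 3
for (i in 1:3) res[[i]] <- i^2   # fill it in, for example inside a loop
unlist(res)
## [1] 1 4 9
```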
Lists are the most general data structure in R. In fact anything can be an element of a list:
In Statistics we often have data of the form \((x, y)\) and we want to find a linear model. For example, consider the data on wine consumption and deaths from heart disease:
kable.nice(wine)
| | Country | Wine.Consumption | Heart.Disease.Deaths |
|---|---|---|---|
| 1 | Australia | 2.5 | 211 |
| 2 | Austria | 3.9 | 167 |
| 3 | Belgium | 2.9 | 131 |
| 4 | Canada | 2.4 | 191 |
| 5 | Denmark | 2.9 | 220 |
| 6 | Finland | 0.8 | 297 |
| 7 | France | 9.1 | 71 |
| 8 | Iceland | 0.8 | 211 |
| 9 | Ireland | 0.7 | 300 |
| 10 | Italy | 7.9 | 107 |
| 11 | Netherlands | 1.8 | 167 |
| 12 | New Zealand | 1.9 | 266 |
| 13 | Norway | 0.8 | 227 |
| 14 | Spain | 6.5 | 86 |
| 15 | Sweden | 1.6 | 207 |
| 16 | Switzerland | 5.8 | 115 |
| 17 | United Kingdom | 1.3 | 285 |
| 18 | United States | 1.2 | 199 |
| 19 | Germany | 2.7 | 172 |
attach(wine)
plot(Wine.Consumption, Heart.Disease.Deaths,
xlab = "Wine Consumption",
ylab = "Heart Disease per 100000")
Note: The data set wine and the routine splot are part of a large collection of data sets and routines that I use in many of my courses. You can download the file at http://academic.uprm.edu/wrolke/Resma3/Resma3.RData.
Let’s say we want to fit a linear model:
fit <- lm(Heart.Disease.Deaths~Wine.Consumption)
plot(Wine.Consumption, Heart.Disease.Deaths,
xlab = "Wine Consumption",
ylab = "Heart Disease per 100000")
abline(fit)
Let’s say we want to save the data and the fit for future use:
wine.all <- list(data=wine, fit=fit)
length(wine.all)
## [1] 2
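The stored pieces can later be retrieved by name and used like the originals. The following sketch uses the data set cars, which comes with R, purely as a stand-in for the wine data; the names fit.cars and cars.all are made up for illustration:

```r
# cars (built into R) is only a stand-in for the wine data
fit.cars <- lm(dist ~ speed, data = cars)   # data = ... means attach() is not needed
cars.all <- list(data = cars, fit = fit.cars)
names(cars.all)
## [1] "data" "fit"
class(cars.all$fit)
## [1] "lm"
coef(cars.all$fit)       # the stored fit behaves like the original
head(cars.all$data, 3)   # and so does the stored data
```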
Data frames are lists with the additional requirement that all columns have the same length. But a “column” need not be what you think:
z <- list(fit, fit, fit)
length(z)
## [1] 3
df <- data.frame(x=1:3, y=letters[1:3])
df$z <- z
dim(df)
## [1] 3 3
This works, but I don’t recommend it. A list is likely a better option here.
Example
Say we have the following data set: in each of 10 experiments, five measurements were taken, one at each of five locations coded A-E. The results were stored as a list, with each set of measurements as one element:
results
## $`Experiment 1`
## A B C D E
## 102.5 110.9 96.3 96.5 88.4
##
## $`Experiment 2`
## A B C D E
## 64.6 96.0 94.2 64.3 79.6
##
## $`Experiment 3`
## A B C D E
## 120.2 116.4 138.7 100.8 104.1
##
## $`Experiment 4`
## A B C D E
## 67.7 113.4 93.3 89.6 90.2
##
## $`Experiment 5`
## A B C D E
## 135.5 88.8 105.1 57.1 89.0
##
## $`Experiment 6`
## A B C D E
## 98.5 101.0 117.3 83.2 116.4
##
## $`Experiment 7`
## A B C D E
## 56.9 98.3 118.9 104.1 86.5
##
## $`Experiment 8`
## A B C D E
## 95.3 81.3 94.3 93.4 108.0
##
## $`Experiment 9`
## A B C D E
## 127.1 95.1 125.0 98.5 108.5
##
## $`Experiment 10`
## A B C D E
## 87.7 103.2 85.9 120.7 84.5
Now we want to find the mean for each location. One way would be to loop over the list elements, but this is much easier:
Reduce(`+`, results)/length(results)
## A B C D E
## 95.60 100.44 106.90 90.82 95.52
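For comparison, here is a sketch of the explicit loop and of one more alternative, stacking the vectors into a matrix and averaging its columns. To keep it self-contained, it uses only the first two experiments from above, typed in by hand (res2, total and the other names are chosen for illustration):

```r
# The first two experiments from the list above
res2 <- list(
  `Experiment 1` = c(A = 102.5, B = 110.9, C = 96.3, D = 96.5, E = 88.4),
  `Experiment 2` = c(A = 64.6, B = 96.0, C = 94.2, D = 64.3, E = 79.6)
)

# Explicit loop: add up the vectors, then divide by their number
total <- 0
for (r in res2) total <- total + r
by.loop <- total / length(res2)

# Stack the vectors as rows of a matrix and average each column
by.matrix <- colMeans(do.call(rbind, res2))

# The Reduce version from the text
by.reduce <- Reduce(`+`, res2) / length(res2)

# All three give the same means by location
all.equal(by.loop, by.reduce)
all.equal(by.matrix, by.reduce)
```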