Data Formats

Vectors

Here are some useful commands for vectors:

x <- c(1, 2, 3, 4, 5, 6)
x
## [1] 1 2 3 4 5 6
length(x)
## [1] 6
names(x) <- LETTERS[1:6]
x
## A B C D E F 
## 1 2 3 4 5 6

If we have several vectors of the same type and length which belong together we can put them in a

Matrix

x <- c(1, 2, 3, 4, 5, 6)
y <- c(4, 2, 3, 5, 3, 4)
z <- c(3, 4, 2, 3, 4, 2)
cbind(x, y, z)
##      x y z
## [1,] 1 4 3
## [2,] 2 2 4
## [3,] 3 3 2
## [4,] 4 5 3
## [5,] 5 3 4
## [6,] 6 4 2
rbind(x, y, z)
##   [,1] [,2] [,3] [,4] [,5] [,6]
## x    1    2    3    4    5    6
## y    4    2    3    5    3    4
## z    3    4    2    3    4    2

Here are several other ways to make a matrix:

matrix(x, 2, 3)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
matrix(x, ncol=2)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
matrix(0, 3, 3)
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0
## [3,]    0    0    0
diag(3)
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

We can control how the matrix is filled in:

matrix(1:8, nrow=2)
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
matrix(1:8, nrow=2, byrow = TRUE)
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8

Just like a vector, a matrix holds elements of a single type, with R converting types if necessary:

matrix(c(1, 2, "A", 3), 2, 2)
##      [,1] [,2]
## [1,] "1"  "A" 
## [2,] "2"  "3"
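
The conversion follows a fixed order: logical values can become numbers, and numbers can become character strings, but not the other way around. The following is a small sketch of this; the object names a, b, d and m are made up for illustration.

```r
# Hypothetical vectors, used only to illustrate type conversion
a <- c(TRUE, 2L)     # logical and integer -> integer
b <- c(1L, 2.5)      # integer and double  -> double ("numeric")
d <- c(1, "A")       # number and string   -> character
class(a)
class(b)
class(d)

# A matrix built from a mixed vector inherits the converted type
m <- matrix(c(1, 2, "A", 3), 2, 2)
is.character(m)
```

Because every number in m is now a string, arithmetic such as m + 1 fails until the entries are converted back, for example with as.numeric().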

Useful commands for R matrices are:

x <- matrix(c(1, 2, 3, 4, 5, 6), 2, 3)
x
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
dim(x)
## [1] 2 3
nrow(x)
## [1] 2
ncol(x)
## [1] 3
dimnames(x) <- list(c("A", "B"), c("a", "b", "c"))
x
##   a b c
## A 1 3 5
## B 2 4 6
colnames(x)
## [1] "a" "b" "c"
rownames(x)
## [1] "A" "B"
colnames(x) <- c("Height", "Age 1", "Age 2")
x
##   Height Age 1 Age 2
## A      1     3     5
## B      2     4     6

Actually I would recommend

colnames(x) <- c("Height", "Age.1", "Age.2")
x
##   Height Age.1 Age.2
## A      1     3     5
## B      2     4     6

Empty spaces in names are allowed but can occasionally lead to problems, so avoiding them is a good idea. Of course names like Age.1 are not nice as labels in graphs or as titles in tables, but it is usually easy to change those.
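
If there are many names to clean up, base R's make.names() can do the replacement automatically. This is only a sketch; the vectors old.names and new.names are names chosen here for illustration.

```r
# make.names() turns arbitrary strings into syntactically valid R names
old.names <- c("Height", "Age 1", "Age 2")
new.names <- make.names(old.names)
new.names

# For a graph label or table title the dots can be turned back into spaces
gsub(".", " ", new.names, fixed = TRUE)
```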

Sometimes a matrix is created by a function and given row and column names automatically, but we don’t want it to have them. They can be removed with

rownames(x) <- NULL
x
##      Height Age.1 Age.2
## [1,]      1     3     5
## [2,]      2     4     6

Here is a strange way to make a matrix:

x <- 1:6
x
## [1] 1 2 3 4 5 6
dim(x) <- c(2, 3)
x
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Why does this work? dim is an attribute of an object. By changing this attribute we change the object: a vector that has a dim attribute of length two is a matrix.
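
A quick way to see this is to inspect the attributes before and after setting dim. This is a minimal sketch using the same vector as above.

```r
x <- 1:6
attributes(x)       # NULL: a plain vector carries no attributes
dim(x) <- c(2, 3)
attributes(x)       # now a list with a single element, dim
is.matrix(x)        # TRUE

dim(x) <- NULL      # removing the attribute gives back a plain vector
is.matrix(x)        # FALSE
```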



What happens if we try to make a matrix out of a vector that doesn’t have the right number of entries?

matrix(1:5, 2, 2)
## Warning in matrix(1:5, 2, 2): data length [5] is not a sub-multiple or
## multiple of the number of rows [2]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

Not surprisingly, R just uses the elements it needs and drops the rest.

matrix(1:3, 2, 2)
## Warning in matrix(1:3, 2, 2): data length [3] is not a sub-multiple or
## multiple of the number of rows [2]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    1

In this case R simply starts over from the beginning of the vector. This behavior is called recycling (the term is used in computer science generally, not just in R).

In either case R gives a warning. Except in rare cases you should try to avoid these situations, as they are usually the consequence of bad programming.
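
Recycling is not specific to matrix(); it also happens in ordinary vector arithmetic. The sketch below shows that R stays silent when the shorter length divides the longer one, and warns otherwise. The names short and res are made up for this example, and suppressWarnings() is used only to keep the output clean.

```r
# No warning: length 2 divides length 6, so c(10, 20) is reused three times
1:6 + c(10, 20)
## [1] 11 22 13 24 15 26

# Warning in normal use: 6 is not a multiple of 4
short <- c(10, 20, 30, 40)
res <- suppressWarnings(1:6 + short)
res
## [1] 11 22 33 44 15 26

# matrix() recycles silently too when the lengths fit
matrix(1:2, 2, 2)
```

The silent case is the more dangerous one, because a vector of the wrong length can go unnoticed.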

Arrays

An array is the k-dimensional generalization of a matrix. For example:

array(1:8, dim=c(2, 2, 2))
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8

Often these come about as a 3-way table:

x <- sample(1:3, size = 20, replace = TRUE)
y <- sample(c("F", "M"), size = 20, replace = TRUE)
z <- sample(c("a", "b", "c"), size = 20, replace = TRUE)
xyz <- table(x,y,z)
xyz
## , , z = a
## 
##    y
## x   F M
##   1 1 2
##   2 1 0
##   3 0 1
## 
## , , z = b
## 
##    y
## x   F M
##   1 1 0
##   2 4 2
##   3 1 0
## 
## , , z = c
## 
##    y
## x   F M
##   1 4 0
##   2 0 0
##   3 1 2

The commands for arrays are similar to those for matrices:

dim(xyz)
## [1] 3 2 3
dimnames(xyz)
## $x
## [1] "1" "2" "3"
## 
## $y
## [1] "F" "M"
## 
## $z
## [1] "a" "b" "c"
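
Indexing also carries over from matrices, with one index per dimension. The following sketch uses the small array from the start of this section rather than the random table, so the results are reproducible; the name a is chosen here for illustration.

```r
a <- array(1:8, dim = c(2, 2, 2))
a[1, 2, 2]          # row 1, column 2 of the second layer
## [1] 7
a[, , 1]            # the whole first layer, a 2 x 2 matrix
apply(a, 3, sum)    # sum over each layer (third dimension)
## [1] 10 26
```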

Data Frames

Sometimes we have several vectors of the same length but of different types. Then we can put them together as a data frame:

x <- c(1, 2, 3, 4, 5, 6)
y <- c("a", "a", "b", "c", "a", "c")
z <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
xyz <- data.frame(x, y, z)
xyz
##   x y     z
## 1 1 a  TRUE
## 2 2 a  TRUE
## 3 3 b FALSE
## 4 4 c  TRUE
## 5 5 a FALSE
## 6 6 c  TRUE

This type of data is very common in Statistics, and data frames have been a standard data structure in R from the beginning. Data imported from an outside source, such as a file downloaded from the internet, is usually read into R as a data frame.



The same commands as for matrices work for data frames as well:

dim(xyz)
## [1] 6 3
nrow(xyz)
## [1] 6
ncol(xyz)
## [1] 3
dimnames(xyz) <- list(letters[1:6], c("a", "b", "c"))
xyz
##   a b     c
## a 1 a  TRUE
## b 2 a  TRUE
## c 3 b FALSE
## d 4 c  TRUE
## e 5 a FALSE
## f 6 c  TRUE
colnames(xyz)
## [1] "a" "b" "c"
rownames(xyz)
## [1] "a" "b" "c" "d" "e" "f"

Say we want to add another column (variable) to the data frame:

xyz[[4]] <- (1:6)^2
colnames(xyz)[4] <- "squares"
xyz
##   a b     c squares
## a 1 a  TRUE       1
## b 2 a  TRUE       4
## c 3 b FALSE       9
## d 4 c  TRUE      16
## e 5 a FALSE      25
## f 6 c  TRUE      36

If we want to get rid of a column:

xyz[[4]] <- NULL
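
Columns can also be added and removed by name with the $ operator, which avoids the separate call to colnames(). This is a minimal sketch on a small made-up data frame, df.demo.

```r
# Hypothetical data frame, used only for illustration
df.demo <- data.frame(x = 1:3, y = c("a", "b", "c"))

df.demo$squares <- df.demo$x^2   # add and name the column in one step
colnames(df.demo)
## [1] "x"       "y"       "squares"

df.demo$squares <- NULL          # remove it again
ncol(df.demo)
## [1] 2
```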

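In versions of R before 4.0.0 data frames have a strange default behavior: they turn strings into factors (since R 4.0.0 the default is stringsAsFactors = FALSE, so strings stay character vectors):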

df <- data.frame(x=1:5, y=letters[1:5])
str(df)
## 'data.frame':    5 obs. of  2 variables:
##  $ x: int  1 2 3 4 5
##  $ y: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5

You can prevent this from happening, though:

df <- data.frame(x = 1:5, 
                 y = letters[1:5],
                 stringsAsFactors = FALSE)
str(df)
## 'data.frame':    5 obs. of  2 variables:
##  $ x: int  1 2 3 4 5
##  $ y: chr  "a" "b" "c" "d" ...

In fact, this is what I want to happen almost all the time, and since R 4.0.0 it is the default. In older versions I change that option globally whenever I run R or RStudio. I will show you how to do that at some point.

If a data frame is all of the same type we can use as.matrix to turn it into a matrix.

Exercise

What does as.matrix() do when it is applied to a data frame with columns of different types?


Finally, if the vectors aren’t even of the same lengths we have

Lists

x <- c(1, 2, 3, 4, 5, 6)
y <- c("a", "a", "b")
z <- c(TRUE, TRUE)
xyz <- list(x, y, z)
xyz
## [[1]]
## [1] 1 2 3 4 5 6
## 
## [[2]]
## [1] "a" "a" "b"
## 
## [[3]]
## [1] TRUE TRUE

Lists are displayed quite differently from the other formats. Here are a number of commands:

length(xyz)
## [1] 3
names(xyz) <- c("Count", "Letter", "Married?")
xyz
## $Count
## [1] 1 2 3 4 5 6
## 
## $Letter
## [1] "a" "a" "b"
## 
## $`Married?`
## [1] TRUE TRUE
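
Once a list has names, its elements can be pulled out with $ or with double brackets. The sketch below rebuilds the same list under the made-up name xyz.demo so that it can be run on its own.

```r
xyz.demo <- list(Count = c(1, 2, 3, 4, 5, 6),
                 Letter = c("a", "a", "b"),
                 `Married?` = c(TRUE, TRUE))

xyz.demo$Count          # the element itself, a numeric vector
xyz.demo[["Letter"]]    # same idea, with the name as a string
xyz.demo[2]             # single brackets return a list of length 1
xyz.demo$`Married?`     # unusual names need backticks
```

The difference between [[ ]] and [ ] matters: the first returns the element, the second returns a shorter list that still contains it.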

Often we want to use a list inside a function to record various values. So we need to create an “empty” list of a certain length:

x <- as.list(1:3)
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3

and if we run out of space:

x[[4]] <- "a"
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] "a"
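
An alternative that gives a list whose elements really are empty is vector("list", n) from base R. The following is a sketch of how such a list might be filled inside a loop; the name results.list is made up for this example.

```r
results.list <- vector("list", 3)   # three elements, each NULL
length(results.list)
## [1] 3

# Fill the slots one at a time, as a function might do
for (i in seq_along(results.list)) {
  results.list[[i]] <- i^2
}

results.list[[4]] <- "a"            # the list still grows when needed
length(results.list)
## [1] 4
```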

Internally R stores many objects as lists, among them data frames and the results of model fits. In fact anything can be an element of a list:

Example

In Statistics we often have data of the form \((x, y)\) and we want to find a linear model. For example consider the data on wine consumption and heart disease deaths:

kable.nice(wine)
    Country         Wine.Consumption  Heart.Disease.Deaths
 1  Australia                    2.5                   211
 2  Austria                      3.9                   167
 3  Belgium                      2.9                   131
 4  Canada                       2.4                   191
 5  Denmark                      2.9                   220
 6  Finland                      0.8                   297
 7  France                       9.1                    71
 8  Iceland                      0.8                   211
 9  Ireland                      0.7                   300
10  Italy                        7.9                   107
11  Netherlands                  1.8                   167
12  New Zealand                  1.9                   266
13  Norway                       0.8                   227
14  Spain                        6.5                    86
15  Sweden                       1.6                   207
16  Switzerland                  5.8                   115
17  United Kingdom               1.3                   285
18  United States                1.2                   199
19  Germany                      2.7                   172

attach(wine)
plot(Wine.Consumption, Heart.Disease.Deaths,
      xlab = "Wine Consumption",
      ylab = "Heart Disease per 100000")

Note: The data set wine and the routine splot are part of a large collection of data sets and routines that I use in many of my courses. You can download the file at http://academic.uprm.edu/wrolke/Resma3/Resma3.RData.

Let’s say we want to fit a linear model:

fit <- lm(Heart.Disease.Deaths~Wine.Consumption)
plot(Wine.Consumption, Heart.Disease.Deaths,
      xlab = "Wine Consumption",
      ylab = "Heart Disease per 100000")
abline(fit)

Let’s say we want to save the data and the fit for future use:

wine.all <- list(data=wine, fit=fit)
length(wine.all)
## [1] 2
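
Getting the pieces back out of such a list works by name. Since the wine data may not be at hand, this sketch uses cars, a data set that ships with R, as a stand-in; the names fit.cars and cars.all are made up for illustration.

```r
# cars (speed and stopping distance) stands in for the wine data here
fit.cars <- lm(dist ~ speed, data = cars)
cars.all <- list(data = cars, fit = fit.cars)

names(cars.all)
## [1] "data" "fit"
head(cars.all$data, 3)       # the stored data frame
coef(cars.all[["fit"]])      # the stored model can be used as usual
is.list(fit.cars)            # the fitted model is itself a list
## [1] TRUE
```

Passing the data frame through the data argument of lm() also makes the call to attach() unnecessary.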

Data frames are lists, with the additional requirement that each column has the same length. But a “column” need not be what you think:

z <- list(fit, fit, fit)
length(z)
## [1] 3
df <- data.frame(x=1:3, y=letters[1:3])
df$z <- z
dim(df)
## [1] 3 3

This works, but I don’t recommend it. A list is likely a better option here.

Example

Say we have the following data set: in each of 10 experiments, five measurements were taken, one at each of five locations coded A-E. The results were stored as a list, with each set of measurements as one element:

results
## $`Experiment 1`
##     A     B     C     D     E 
## 102.5 110.9  96.3  96.5  88.4 
## 
## $`Experiment 2`
##    A    B    C    D    E 
## 64.6 96.0 94.2 64.3 79.6 
## 
## $`Experiment 3`
##     A     B     C     D     E 
## 120.2 116.4 138.7 100.8 104.1 
## 
## $`Experiment 4`
##     A     B     C     D     E 
##  67.7 113.4  93.3  89.6  90.2 
## 
## $`Experiment 5`
##     A     B     C     D     E 
## 135.5  88.8 105.1  57.1  89.0 
## 
## $`Experiment 6`
##     A     B     C     D     E 
##  98.5 101.0 117.3  83.2 116.4 
## 
## $`Experiment 7`
##     A     B     C     D     E 
##  56.9  98.3 118.9 104.1  86.5 
## 
## $`Experiment 8`
##     A     B     C     D     E 
##  95.3  81.3  94.3  93.4 108.0 
## 
## $`Experiment 9`
##     A     B     C     D     E 
## 127.1  95.1 125.0  98.5 108.5 
## 
## $`Experiment 10`
##     A     B     C     D     E 
##  87.7 103.2  85.9 120.7  84.5

Now we want to find the means by location. One way would be to loop over the list elements, but much easier is:

Reduce(`+`, results)/length(results)
##      A      B      C      D      E 
##  95.60 100.44 106.90  90.82  95.52
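
To see what Reduce() is doing, it helps to compare it with the explicit loop and with a matrix-based version. This sketch uses a small made-up list, res.demo, with three experiments and three locations, so the means are easy to check by hand.

```r
# Hypothetical data: 3 experiments, locations A-C
res.demo <- list(
  `Experiment 1` = c(A = 1, B = 2, C = 3),
  `Experiment 2` = c(A = 3, B = 4, C = 5),
  `Experiment 3` = c(A = 5, B = 6, C = 7)
)

# Reduce() adds the vectors pairwise: (e1 + e2) + e3
m.reduce <- Reduce(`+`, res.demo) / length(res.demo)

# The same computation written as a loop
total <- rep(0, 3)
for (r in res.demo) total <- total + r
m.loop <- total / length(res.demo)

# Or bind the vectors into a matrix and take row means
m.matrix <- rowMeans(do.call(cbind, res.demo))

m.reduce
## A B C 
## 3 4 5
```

All three versions give the same means. The Reduce() form is the shortest, while the matrix form makes other summaries, such as standard deviations per location, easy to add.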