Sub-setting / Data Wrangling

Vectors

Consider the following vector:

x
##   A   B   C   D   E   F   G   H   I   J 
## 7.3 2.7 2.8 3.5 3.0 4.9 3.3 8.7 1.5 3.4

The elements of a vector are accessed with the bracket [ ] notation:

x[3]
##   C 
## 2.8
x[1:3]
##   A   B   C 
## 7.3 2.7 2.8
x[c(1, 3, 8)]
##   A   C   H 
## 7.3 2.8 8.7
x[-3]
##   A   B   D   E   F   G   H   I   J 
## 7.3 2.7 3.5 3.0 4.9 3.3 8.7 1.5 3.4
x[-c(1, 2, 5)]
##   C   D   F   G   H   I   J 
## 2.8 3.5 4.9 3.3 8.7 1.5 3.4

If a vector has names, they can be used as well:

x["C"]
##   C 
## 2.8
x[c("A","D")]
##   A   D 
## 7.3 3.5

There are also strange things one can do and sometimes get away with:

x <- 1:10
names(x) <- letters[1:10]
x
##  a  b  c  d  e  f  g  h  i  j 
##  1  2  3  4  5  6  7  8  9 10
x[0]
## named integer(0)
x[3.1]
## c 
## 3
x[3.6]
## c 
## 3
x[a]
## Error in x[a]: invalid subscript type 'closure'
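
To make the behavior above a bit more explicit, here is a small sketch (it redefines x exactly as in the text). It suggests that fractional indices are truncated toward zero rather than rounded, that index 0 selects nothing, and, as an extra case not shown in the text, that an index past the end gives NA instead of an error:

```r
# Same vector as in the text above
x <- 1:10
names(x) <- letters[1:10]

# Fractional positive indices are truncated toward zero, not rounded
identical(x[3.6], x[3])                  # TRUE
identical(x[3.6], x[as.integer(3.6)])    # TRUE

# Index 0 selects nothing, so the result has length zero
length(x[0])                             # 0

# An index past the end returns NA rather than an error
is.na(x[11])
```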

Another way to subset a vector is with logical conditions:

x[x > 4]
##  e  f  g  h  i  j 
##  5  6  7  8  9 10
x[x>4 & x<7]
## e f 
## 5 6
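
It may help to see what happens inside the brackets. In this sketch the variable name keep is our own choice; the condition is stored as a logical vector first and then used for sub-setting:

```r
x <- 1:10
names(x) <- letters[1:10]

# The condition is itself a logical vector, one entry per element of x
keep <- x > 4 & x < 7
keep

# which() turns the logical vector into integer positions
which(keep)

# Indexing by the logical vector or by the positions gives the same result
identical(x[keep], x[which(keep)])   # TRUE

# sum() counts how many elements satisfy the condition
sum(keep)                            # 2
```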

It is also possible to replace values in a vector this way:

x[x<2] <- 0
x
##  a  b  c  d  e  f  g  h  i  j 
##  0  2  3  4  5  6  7  8  9 10

This can be useful, for example to code a variable:

Gender <- sample(c("Male", "Female"), 
                 size = 10, 
                 replace = TRUE)
Gender
##  [1] "Male"   "Male"   "Female" "Female" "Male"   "Male"   "Male"  
##  [8] "Male"   "Female" "Female"
GenderCode <- rep(0, length(Gender))
GenderCode[Gender=="Male"] <- 1
GenderCode
##  [1] 1 1 0 0 1 1 1 1 0 0
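
The same coding can be done in one step. The following sketch uses a short fixed Gender vector (an assumption, so that the result does not depend on the random draw) and compares the approach from the text with two alternatives:

```r
# A fixed example vector (the text draws Gender at random with sample())
Gender <- c("Male", "Male", "Female", "Female", "Male")

# Approach from the text: start with zeros, then overwrite the matches
GenderCode <- rep(0, length(Gender))
GenderCode[Gender == "Male"] <- 1

# Alternative 1: ifelse() picks 1 or 0 element by element
GenderCode2 <- ifelse(Gender == "Male", 1, 0)

# Alternative 2: convert the logical vector directly (TRUE -> 1, FALSE -> 0)
GenderCode3 <- as.numeric(Gender == "Male")

identical(GenderCode, GenderCode2)   # TRUE
identical(GenderCode, GenderCode3)   # TRUE
```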

Exercise

Say we have vectors x and y with the coordinates of points:

x <- runif(10000)
y <- runif(10000)
plot(x, y, pch=".")

Subset x and y in such a way that only points in the circle are left:

plot(x1, y1, pch=".")

Matrices and Data Frames

Consider the following data frame:

students
##    Age GPA Gender
## 1   22 3.1   Male
## 2   23 3.2   Male
## 3   20 2.1   Male
## 4   22 2.1   Male
## 5   21 2.3 Female
## 6   21 2.9   Male
## 7   18 2.3 Female
## 8   22 3.9   Male
## 9   21 2.6 Female
## 10  18 3.2 Female

Because a data frame has rows and columns we now need to specify both:

students[2, 3]
## [1] "Male"

There are a variety of ways to do sub-setting:

students[, 1]
##  [1] 22 23 20 22 21 21 18 22 21 18
students[[1]]
##  [1] 22 23 20 22 21 21 18 22 21 18
students$Age
##  [1] 22 23 20 22 21 21 18 22 21 18

And yet another way to do this:

attach(students)
Age
##  [1] 22 23 20 22 21 21 18 22 21 18
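
A possible alternative to attach is the with command, which makes the columns available only for one expression. The sketch below rebuilds the students data frame from the printout above. The later examples in the text, such as students[Age>20, ], still assume that attach(students) has been called.

```r
# The students data frame, typed in from the printout above
students <- data.frame(
  Age = c(22, 23, 20, 22, 21, 21, 18, 22, 21, 18),
  GPA = c(3.1, 3.2, 2.1, 2.1, 2.3, 2.9, 2.3, 3.9, 2.6, 3.2),
  Gender = c("Male", "Male", "Male", "Male", "Female",
             "Male", "Female", "Male", "Female", "Female"),
  stringsAsFactors = FALSE
)

# with() evaluates an expression using the columns as variables,
# without changing the search path the way attach() does
with(students, Age)
with(students, mean(GPA[Age > 20]))

# If attach() is used, detach() undoes it when we are done
attach(students)
mean(Age)
detach(students)
```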

Exercise

What does this do?

x <- 1:10
x[]

Although these seem to do the same thing, there are actually subtle differences. Consider this:

students[, 1]
##  [1] 22 23 20 22 21 21 18 22 21 18
students[1]
##    Age
## 1   22
## 2   23
## 3   20
## 4   22
## 5   21
## 6   21
## 7   18
## 8   22
## 9   21
## 10  18
students[[1]]
##  [1] 22 23 20 22 21 21 18 22 21 18

In the first and last case R returns a vector, in the second case a data frame with one column.

It is possible to tell R not to do this type conversion in the first case:

students[, 1, drop=FALSE]
##    Age
## 1   22
## 2   23
## 3   20
## 4   22
## 5   21
## 6   21
## 7   18
## 8   22
## 9   21
## 10  18

but this does not work for the [[1]] or $Age versions.
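
One way to check which version returns what is to ask for the class of each result. This is a small sketch; the students data frame is rebuilt from the printout above:

```r
# The students data frame, typed in from the printout above
students <- data.frame(
  Age = c(22, 23, 20, 22, 21, 21, 18, 22, 21, 18),
  GPA = c(3.1, 3.2, 2.1, 2.1, 2.3, 2.9, 2.3, 3.9, 2.6, 3.2),
  Gender = c("Male", "Male", "Male", "Male", "Female",
             "Male", "Female", "Male", "Female", "Female"),
  stringsAsFactors = FALSE
)

class(students[, 1])                 # "numeric"
class(students[, 1, drop = FALSE])   # "data.frame"
class(students[1])                   # "data.frame"
class(students[[1]])                 # "numeric"
class(students$Age)                  # "numeric"

# Selecting two or more columns with [ , ] keeps a data frame
class(students[, 1:2])               # "data.frame"
```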

students[1:3, 1]
## [1] 22 23 20
students[-2, ]
##    Age GPA Gender
## 1   22 3.1   Male
## 3   20 2.1   Male
## 4   22 2.1   Male
## 5   21 2.3 Female
## 6   21 2.9   Male
## 7   18 2.3 Female
## 8   22 3.9   Male
## 9   21 2.6 Female
## 10  18 3.2 Female
students[1:4, -1]
##   GPA Gender
## 1 3.1   Male
## 2 3.2   Male
## 3 2.1   Male
## 4 2.1   Male
students[Age>20, ]
##   Age GPA Gender
## 1  22 3.1   Male
## 2  23 3.2   Male
## 4  22 2.1   Male
## 5  21 2.3 Female
## 6  21 2.9   Male
## 8  22 3.9   Male
## 9  21 2.6 Female

You can have several conditions, put together with & (AND), | (OR) and ! (NOT), but some care is needed:

students[Age>=20 & Age<=22, 1]
## [1] 22 20 22 21 21 22 21

is fine but

students[20 <= Age <= 22, 1]

does not work, because R does not allow comparisons to be chained in this way.
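
Here is a sketch of what goes wrong and of two ways to write the condition. It rebuilds students from the printout and creates Age directly instead of using attach; the %in% shortcut is our own addition and only makes sense because the ages are whole numbers:

```r
# The students data frame, typed in from the printout above
students <- data.frame(
  Age = c(22, 23, 20, 22, 21, 21, 18, 22, 21, 18),
  GPA = c(3.1, 3.2, 2.1, 2.1, 2.3, 2.9, 2.3, 3.9, 2.6, 3.2),
  Gender = c("Male", "Male", "Male", "Male", "Female",
             "Male", "Female", "Male", "Female", "Female"),
  stringsAsFactors = FALSE
)
Age <- students$Age

# Chained comparisons are not valid R syntax; parsing the expression fails
res <- tryCatch(parse(text = "20 <= Age <= 22"),
                error = function(e) "parse error")
res

# Write the two conditions separately and join them with &
students[Age >= 20 & Age <= 22, 1]

# For whole-number ages, %in% with a range is an equivalent shortcut
students[Age %in% 20:22, 1]
```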

Exercise

Subset students so that only males over 21 with a GPA of at least 3.0 are left.

##   Age GPA Gender
## 1  22 3.1   Male
## 2  23 3.2   Male
## 8  22 3.9   Male

Lists

Sub-setting of lists is very similar to sub-setting of data frames:

mylist <- list(First=1:5, 
               Second=LETTERS[1:8], 
               Third=20:22)
mylist
## $First
## [1] 1 2 3 4 5
## 
## $Second
## [1] "A" "B" "C" "D" "E" "F" "G" "H"
## 
## $Third
## [1] 20 21 22
mylist[1]
## $First
## [1] 1 2 3 4 5
mylist[[1]]
## [1] 1 2 3 4 5
mylist$Second
## [1] "A" "B" "C" "D" "E" "F" "G" "H"
mylist[1:2]
## $First
## [1] 1 2 3 4 5
## 
## $Second
## [1] "A" "B" "C" "D" "E" "F" "G" "H"
mylist[[1:2]]
## [1] 2

So [1] returns a list with just one element, whereas [[1]] and $ return the element itself (here a vector). [1:2] yields a list with the first two elements.

The last one is strange: why is the result 2? Actually, it does this:

mylist[[1]][2]
## [1] 2

This can be quite confusing. Here is a useful memory device:

If list x is a train, x[[5]] is the content of car 5, whereas x[4:5] is the train consisting of cars 4 and 5
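
The train picture can be checked directly by looking at the class of each result. This is a short sketch using the same mylist as above:

```r
mylist <- list(First = 1:5, Second = LETTERS[1:8], Third = 20:22)

# Single brackets keep the "train": the result is still a list
class(mylist[1])        # "list"
class(mylist[2:3])      # "list"

# Double brackets open one "car" and return its content
class(mylist[[1]])      # "integer"
class(mylist[[2]])      # "character"

# A vector inside [[ ]] indexes recursively: element 2 of element 1
identical(mylist[[1:2]], mylist[[1]][2])   # TRUE
identical(mylist[[c(2, 3)]], "C")          # TRUE
```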

Partial Matching

There is one important difference between $ and [[ ]]: the first one allows partial matching, the second one does not:

x <- list(inside=1:3, outside=5:10)
x$o
## [1]  5  6  7  8  9 10
x[["o"]]
## NULL

However, I don’t recommend using this feature unless necessary.
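
For completeness, here is a sketch of how the two operators treat an abbreviated name. The exact argument of [[ ]] is part of base R; the list y with the names alpha and alpine is a made-up example showing that an ambiguous prefix matches nothing:

```r
x <- list(inside = 1:3, outside = 5:10)

x$o                       # partial match on "outside"
x[["o"]]                  # NULL: [[ ]] matches names exactly by default
x[["o", exact = FALSE]]   # partial matching switched on explicitly
x[["outside"]]            # the full name always works

# A prefix that fits more than one name matches nothing
y <- list(alpha = 1, alpine = 2)
is.null(y$al)             # TRUE
```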

Useful logic commands

x<-1; y<-2; z<-3
c(x, y, z)>2.5
## [1] FALSE FALSE  TRUE
any(c(x, y, z)>2.5)
## [1] TRUE
all(c(x, y, z)>2.5)
## [1] FALSE
x <- 1:3; y <- 1:3 
x == y
## [1] TRUE TRUE TRUE
identical(x, y)
## [1] TRUE
all.equal(x, y)
## [1] TRUE

identical compares the internal representation of the data and returns TRUE if the objects are strictly identical, and FALSE otherwise.
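
As a small illustration of what "internal representation" means, consider two vectors with the same values but different storage types. The names a and b are our own:

```r
a <- 1:3          # stored as integers
b <- c(1, 2, 3)   # stored as doubles

a == b                        # element-wise comparison of values: all TRUE
identical(a, b)               # FALSE, because the storage types differ
typeof(a)                     # "integer"
typeof(b)                     # "double"
identical(a, as.integer(b))   # TRUE once the types agree
```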

all.equal compares the “near equality” of two objects, and returns TRUE or displays a summary of the differences. The latter function takes the approximation of the computing process into account when comparing numeric values. The comparison of numeric values on a computer is sometimes surprising!

0.9 == (1 - 0.1)
## [1] TRUE
identical(0.9, 1 - 0.1)
## [1] TRUE
all.equal(0.9, 1 - 0.1)
## [1] TRUE

but

0.9 == (1.1 - 0.2)
## [1] FALSE
identical(0.9, 1.1 - 0.2)
## [1] FALSE
all.equal(0.9, 1.1 - 0.2)
## [1] TRUE

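How come \(1.1-0.2 \ne 0.9\)? This is because of machine precision issues. Most decimal fractions, including 1.1, 0.2 and 0.9, cannot be stored exactly in binary floating point, so the computed difference is off from the stored 0.9 by a tiny rounding error: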

all.equal(0.9, 1.1 - 0.2, tolerance = 1e-16) 
## [1] "Mean relative difference: 1.233581e-16"
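
In practice this means that == should be used with care for numeric values. Below is a sketch of two common ways around it; the tolerance 1e-8 is an arbitrary choice for illustration:

```r
a <- 0.9
b <- 1.1 - 0.2

# Printing more digits than the default shows that the two values differ
sprintf("%.20f", a)
sprintf("%.20f", b)

# all.equal() returns TRUE or a character description, so wrap it in
# isTRUE() before using it as a condition
if (isTRUE(all.equal(a, b))) {
  message("a and b are equal up to numerical tolerance")
}

# A hand-written tolerance check; 1e-8 is an arbitrary illustrative choice
abs(a - b) < 1e-8    # TRUE
```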

subset command

Finally there is a command that was written for sub-setting:

subset(students, Age>20)
##   Age GPA Gender
## 1  22 3.1   Male
## 2  23 3.2   Male
## 4  22 2.1   Male
## 5  21 2.3 Female
## 6  21 2.9   Male
## 8  22 3.9   Male
## 9  21 2.6 Female
subset(students, Age>20 & Gender=="Male")
##   Age GPA Gender
## 1  22 3.1   Male
## 2  23 3.2   Male
## 4  22 2.1   Male
## 6  21 2.9   Male
## 8  22 3.9   Male
subset(students, Age>20, select = Gender)
##   Gender
## 1   Male
## 2   Male
## 4   Male
## 5 Female
## 6   Male
## 8   Male
## 9 Female
subset(students, Age>20, select = Gender, drop=TRUE)
## [1] "Male"   "Male"   "Male"   "Female" "Male"   "Male"   "Female"

Notice that this last one results in a vector.
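
The subset command is mostly a more readable way of writing the bracket notation from before. The sketch below rebuilds students from the printout and compares the two. One difference worth knowing is that subset treats a missing value (NA) in the condition as FALSE, whereas the bracket version would return a row of NAs; there are no missing values here, so the results agree.

```r
# The students data frame, typed in from the printout above
students <- data.frame(
  Age = c(22, 23, 20, 22, 21, 21, 18, 22, 21, 18),
  GPA = c(3.1, 3.2, 2.1, 2.1, 2.3, 2.9, 2.3, 3.9, 2.6, 3.2),
  Gender = c("Male", "Male", "Male", "Male", "Female",
             "Male", "Female", "Male", "Female", "Female"),
  stringsAsFactors = FALSE
)

# Row selection: subset() versus brackets
s1 <- subset(students, Age > 20 & Gender == "Male")
s2 <- students[students$Age > 20 & students$Gender == "Male", ]
isTRUE(all.equal(s1, s2))   # TRUE

# select = with drop = TRUE corresponds to picking the column as a vector
v1 <- subset(students, Age > 20, select = Gender, drop = TRUE)
v2 <- students$Gender[students$Age > 20]
identical(v1, v2)           # TRUE
```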

Exercise

The data set upr (part of Resma3.RData) has the application information provided to the University by all students who were eventually accepted between 2003 and 2013. Here are the first three students:

head(upr, 3)
##      ID.Code Year Gender Program.Code Highschool.GPA Aptitud.Verbal
## 1 00C2B4EF77 2005      M          502           3.97            647
## 2 00D66CF1BF 2003      M          502           3.80            597
## 3 00AB6118EB 2004      M         1203           4.00            567
##   Aptitud.Matem Aprov.Ingles Aprov.Matem Aprov.Espanol IGS Freshmen.GPA
## 1           621          626         672           551 342         3.67
## 2           726          618         718           575 343         2.75
## 3           691          424         616           609 342         3.62
##   Graduated Year.Grad. Grad..GPA Class.Facultad
## 1        Si       2012      3.33           INGE
## 2        No         NA        NA           INGE
## 3        No         NA        NA       CIENCIAS

How many female students applied in either 2010 or 2011, had a high school GPA of at least 3.0 and a freshman GPA between 3.0 and 3.5?

order command

If we need to sort a vector we have the sort command:

sort(Age)
##  [1] 18 18 20 21 21 21 22 22 22 23

but sometimes we want to sort one vector (or the rows of a data frame) by the order of another:

students[order(Age), ]
##    Age GPA Gender
## 7   18 2.3 Female
## 10  18 3.2 Female
## 3   20 2.1   Male
## 5   21 2.3 Female
## 6   21 2.9   Male
## 9   21 2.6 Female
## 1   22 3.1   Male
## 4   22 2.1   Male
## 8   22 3.9   Male
## 2   23 3.2   Male
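
The order command returns the positions that would sort its argument, which is why it can be used as a row index. Here is a short sketch, again with students rebuilt from the printout; breaking ties with a second key and sorting in decreasing order are additions not used in the text:

```r
# The students data frame, typed in from the printout above
students <- data.frame(
  Age = c(22, 23, 20, 22, 21, 21, 18, 22, 21, 18),
  GPA = c(3.1, 3.2, 2.1, 2.1, 2.3, 2.9, 2.3, 3.9, 2.6, 3.2),
  Gender = c("Male", "Male", "Male", "Male", "Female",
             "Male", "Female", "Male", "Female", "Female"),
  stringsAsFactors = FALSE
)

# sort(x) is the same as indexing x by its own ordering
identical(sort(students$Age), students$Age[order(students$Age)])   # TRUE

# order() accepts several keys: ties in Age are broken by GPA
students[order(students$Age, students$GPA), ]

# decreasing = TRUE reverses the direction
students[order(students$GPA, decreasing = TRUE), ]
```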

Example Look-up Tables

Say we want to write a function that needs to calculate \(\log(n!)\) many times. We soon run into the following problem:

log(factorial(175))
## [1] Inf

despite the fact that this is not a really big number (it is 732.3394). The problem is that internally R uses the gamma function to calculate factorials, and \(175! = \Gamma(176)\) is larger than the largest number R can represent.
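
The overflow and the value quoted above can be checked with a few lines. This sketch uses the base R functions lgamma and lfactorial, which work on the log scale; they are shown here only to confirm the number, not as a replacement for the look-up table idea developed below:

```r
factorial(170)          # still finite
factorial(171)          # Inf (R may also emit a warning)
.Machine$double.xmax    # largest finite double, about 1.8e308

# lgamma() and lfactorial() work on the log scale and avoid the overflow
lgamma(176)             # log(Gamma(176)) = log(175!)
lfactorial(175)         # the same value, about 732.3394
```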

There is of course a simple solution:

\[ \log(n!) = \log\left(\prod_{i=1}^n i\right) = \sum_{i=1}^n \log i \]

but this turns out to be quite slow. Here is a better solution: we will make use of Stirling’s formula:

\[ \begin{aligned} &n! \sim n^ne^{-n}\sqrt{2\pi n} \\ &\log(n!) \sim n\log(n) - n+\frac12\log(2 \pi n) \\ \end{aligned} \]
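
How good is the approximation? The following sketch compares it with the exact sum of logs for a few values of n. The function names stirling and exact are our own, and the comparison with \(1/(12n)\), the next term in Stirling's series, is an addition that is not part of the text:

```r
# Stirling's approximation to log(n!), as in the formula above
stirling <- function(n) n * log(n) - n + 0.5 * log(2 * pi * n)

# Exact value via the sum of logs
exact <- function(n) sum(log(seq_len(n)))

n <- c(5, 51, 500)
err <- sapply(n, exact) - stirling(n)

err             # positive and shrinking as n grows
1 / (12 * n)    # the leading correction term, for comparison
```

The error is already small for moderate n, which is consistent with using the exact formula only for small values.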

Say we need to find \(\log(n!)\) for \(n\) ranging from 0 to 500; then we can create a look-up table. For small values of \(n\) we use the exact formula, and for larger values the approximation:

logfac <- 0:500
names(logfac) <- logfac
for(n in 0:50) 
  logfac[n+1] <- log(factorial(n))
for(n in 51:500) 
  logfac[n+1] <- n*log(n)-n+0.5*log(2*pi*n) 

and now we can find various values very easily:

logfac[c("2", 5, 30, 301, 30)]
##            2            5           30          301           30 
##    0.6931472    4.7874917   74.6582363 1420.6126834   74.6582363

Exercise

Why did I use

logfac[c("2", 5, 30, 301, 30)]

and not

logfac[c(2, 5, 30, 301, 30)]