The pipe, dplyr, tibbles, tidyverse

The pipe

The traditional workflow of R comes in large part from other computer languages. So a typical sequence would be like this:

library(ggplot2)
df <- mtcars
df1 <- subset(df, hp>70)
ggplot(data=df1, aes(disp, mpg)) + 
  geom_point()

This is not how we think, though. That would go something like this:

  • take the mtcars data set
  • then pick only cars with hp over 70
  • then do the scatterplot of mpg vs. disp

In addition, we had to create an intermediate data set (df1).

The pipe was invented to fix all of these problems.

The basic package to use piping is

library(magrittr)

written by Stefan Milton Bache. The basic operator is %>%. The same as above can be done with

mtcars %>%
  subset(x=., hp>70) %>% #R knows what hp is!
  ggplot(data=., aes(disp, mpg)) + 
     geom_point()

In principle the pipe can always be used in this way:

x <- rnorm(10)
mean(x)
## [1] -0.03854
x %>% 
  mean()
## [1] -0.03854

Notice that here we called mean without its required argument. The pipe always uses the data on the left of %>% as the first argument of the command on the right, unless the dot (.) is used to place it somewhere else.
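As a minimal sketch of this rule (the numbers are made up for illustration and are not taken from the examples above), the following compares a nested call with its piped form. It also shows the dot placeholder for a case where the left-hand side should not become the first argument:

```r
library(magrittr)

# illustrative values, not from the text above
y <- c(1, 4, 9)

# the nested call and the piped call do the same thing
nested <- sum(sqrt(y))
piped <- y %>%
  sqrt() %>%
  sum()

# the dot marks where the left-hand side goes when it is
# not meant to be the first argument; this is round(3.14159, digits = 2)
rounded <- 2 %>%
  round(3.14159, digits = .)

print(c(nested, piped, rounded))
```

Both nested and piped hold the same value, so the pipe changes only how the code reads, not what it computes.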

Exercise

Consider the following operation:

x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)
round(exp(diff(log(x))), 1)
## [1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1

Write the same using the pipe.

## [1]  3.3  1.8  1.6  0.5  0.3  0.1 48.8  1.1

At first it may not seem like writing x %>% f() is any easier than writing f(x), but this style of coding becomes very useful when applying several functions in a row. Piping lets us read the code from left to right, in the logical order of the functions, rather than from the inside out of a large nested statement.


The pipe is only a few years old, but there are already many packages that take advantage of it. The most important one, and a package useful in and of itself, is

library(dplyr)

written by Hadley Wickham. In essence it provides a consistent set of data manipulation commands that replace many uses of base R routines such as subset and the apply family. We can also write the above with

mtcars %>%
  filter(hp>70) %>%
  ggplot(aes(disp, mpg)) + 
     geom_point()

Notice that filter works directly with the pipe: it takes the data as its first argument, so it doesn’t need to be told that it is supposed to work with mtcars. The layers of a ggplot, on the other hand, are added with + rather than %>%, which is why we cannot write %>% geom_point().

tibbles

Dataframes have been the main data format of R since its beginnings, and are likely to stay that way for a long time. They do, however, have some shortcomings. Among other things, when you type the name of a dataframe and hit enter, all of it is shown, even if the data set is huge. On the other hand, interesting information such as the data types of the columns is not shown. To help with these (and some other) issues the data format tibble was invented. We can turn a dataframe into a tibble with

tmtcars <- as_tibble(mtcars)
tmtcars
## # A tibble: 32 x 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ... with 22 more rows

so we have all relevant information about the data set: its size (32x11), the variables and their formats, and the beginning of the data set.

Tibbles are also designed to work well with piping and with the package dplyr.

If you want to create a tibble from scratch use:

tibble(x=1:5, y=x^2)
## # A tibble: 5 x 2
##       x     y
##   <int> <dbl>
## 1     1     1
## 2     2     4
## 3     3     9
## 4     4    16
## 5     5    25

Also, tibbles never use row.names, and they only recycle vectors of length 1. This is because recycling vectors of greater lengths is a frequent source of bugs.
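The recycling rule can be checked with a small sketch. The column values here are arbitrary, and tryCatch is used only so that the expected error does not stop the script:

```r
library(tibble)

# a vector of length 1 is recycled to match the other column
ok <- tibble(x = 1:4, y = 1)

# a vector of length 2 is not recycled, even though 4 is a multiple of 2;
# tryCatch turns the resulting error into a simple marker value
bad <- tryCatch(
  tibble(x = 1:4, y = 1:2),
  error = function(e) "error"
)

# for comparison, a base R data frame recycles the shorter vector silently
df <- data.frame(x = 1:4, y = 1:2)

# row names are dropped when a data frame is turned into a tibble
rn_before <- has_rownames(mtcars)
rn_after <- has_rownames(as_tibble(mtcars))
```

Here ok has four rows with y equal to 1 throughout, bad holds the marker "error", and df has y equal to 1, 2, 1, 2.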

dplyr library

We have already seen the filter command, the dplyr version of subset. Here are the most important dplyr commands:

  • filter selects part of a data set by conditions (base R command: subset)
  • select selects columns (base R command: [ ])
  • arrange re-orders or arranges rows (base R commands: sort, order)
  • mutate creates new columns (base R commands: any math function)
  • summarise summarises values (base R commands: mean, median etc)
  • group_by allows for group operations in the “split-apply-combine” concept (base R command: none)
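As a rough sketch of how these commands fit together, here is one pipeline on mtcars that uses each of them once. The column names hp_per_wt, cars and mean_mpg are our own choices for illustration:

```r
library(dplyr)

result <- mtcars %>%
  filter(hp > 70) %>%               # keep the rows that meet a condition
  select(mpg, cyl, hp, wt) %>%      # keep only some of the columns
  mutate(hp_per_wt = hp / wt) %>%   # add a new column
  arrange(desc(mpg)) %>%            # sort the rows, largest mpg first
  group_by(cyl) %>%                 # split by number of cylinders
  summarise(cars = n(),             # one row per group: number of cars
            mean_mpg = mean(mpg))   # and their mean mpg
result
```

The result has one row for each value of cyl, which is the "combine" step of split-apply-combine.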

Example: babynames

The library babynames (also by Hadley Wickham) has the number of children of each sex given each name for each year from 1880 to 2015 according to the US Social Security Administration. All names with at least 5 uses are given.

We want to do the following:

  • take the names
  • pick out all of those that start with “W”
  • separate the genders
  • find the total for each year
  • do the line graph

library(babynames)
babynames %>%
    filter(name %>% substr(1, 1) %>% equals("W")) %>%
    group_by(year, sex) %>%
    summarise(total = sum(n)) %>%
    ggplot(data = ., aes(year, total, color = sex)) +
      geom_line() + 
      labs(color="Gender") +
      ggtitle('Names starting with W') 

How often is my name used for a baby in the US?

babynames %>%
    filter(name == "Wolfgang") %>%
    ggplot(data = ., aes(year, n)) +
      geom_line()

Looks like my name is getting more popular (even if it is still rare)!

What were the most popular girls’ names each year?

babynames %>%              # take babynames
    filter(sex=="F") %>%   # then pick girls only
    group_by(year) %>%     # then separate the years
    mutate(M=max(n)) %>%   # then find most often used name
    filter(n==M) %>%       # then pick only those rows
    ungroup() %>%          # then join data back together
    select(name) %>%       # then select names only
    table() %>%            # then count how often each happened  
    sort(decreasing = TRUE) %>% # then organize data
    cbind()               #  then turn data around for easier reading  
##           .
## Mary     76
## Jennifer 15
## Emily    12
## Jessica   9
## Lisa      8
## Linda     6
## Emma      5
## Sophia    3
## Ashley    2
## Isabella  2
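The group_by, mutate and filter steps above pick the row with the largest n within each year. A minimal sketch of the same idea on a small made-up data set (the names and counts below are invented) uses slice_max and count instead. This assumes dplyr version 1.0.0 or later, and the column name years_on_top is our own choice:

```r
library(dplyr)

# invented data: two names, three years
toy <- tibble(
  year = rep(2000:2002, each = 2),
  name = rep(c("Ann", "Eve"), times = 3),
  n    = c(10, 7, 4, 9, 8, 5)
)

top <- toy %>%
  group_by(year) %>%                  # separate the years
  slice_max(n, n = 1) %>%             # keep the most used name(s) per year
  ungroup() %>%                       # join the data back together
  count(name, sort = TRUE,            # count how often each name was on top
        name = "years_on_top")
top
```

Like filter(n==M), slice_max keeps all tied rows by default, so a year with two equally popular names would contribute both.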

Let’s say we want to save a data set made with the pipe. Logically we should be able to do this

babynames %>%              # take babynames
  filter(name=="Wolfgang") %>%   # then pick me
  wolfgangs                # then give new data set a name
## Error in wolfgangs(.): could not find function "wolfgangs"

but that results in an error, because only functions can be used in a pipe. So it is done like this:

wolfgangs <- babynames %>%              # take babynames
  filter(name=="Wolfgang")    # then pick me
print(wolfgangs, n=3)
## # A tibble: 71 x 5
##    year sex   name         n       prop
##   <dbl> <chr> <chr>    <int>      <dbl>
## 1  1929 M     Wolfgang     8 0.00000722
## 2  1930 M     Wolfgang     6 0.00000531
## 3  1932 M     Wolfgang     7 0.00000652
## # ... with 68 more rows

This unfortunately breaks the logic of piping. There is a better way, though. Just remember the logic of the assignment character <-: it’s an arrow, and R also accepts it pointing to the right (->)!

babynames %>%              # take babynames
  filter(name=="Wolfgang") ->   # then pick me
  wolfgangs               # then assign it a name
print(wolfgangs, n=3)
## # A tibble: 71 x 5
##    year sex   name         n       prop
##   <dbl> <chr> <chr>    <int>      <dbl>
## 1  1929 M     Wolfgang     8 0.00000722
## 2  1930 M     Wolfgang     6 0.00000531
## 3  1932 M     Wolfgang     7 0.00000652
## # ... with 68 more rows

Here is a common problem: say we have these two data sets:

students1
## # A tibble: 3 x 2
##   name  exam1
##   <chr> <dbl>
## 1 Alex     78
## 2 Ann      85
## 3 Marie    93
students2
## # A tibble: 3 x 2
##   name  exam2
##   <chr> <dbl>
## 1 Alex     75
## 2 Ann      89
## 3 Marie    97

and we want to join them into one data set:

students1 %>%
  left_join(students2)
## # A tibble: 3 x 3
##   name  exam1 exam2
##   <chr> <dbl> <dbl>
## 1 Alex     78    75
## 2 Ann      85    89
## 3 Marie    93    97
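The two data sets are not created anywhere in the text, so here is a sketch of how they could be built, using the values from the printouts above. Without further instructions left_join matches on the columns the two tables have in common; in this sketch the column is spelled out with by, which makes the code clearer and avoids the message about the join column:

```r
library(dplyr)
library(tibble)

# values copied from the printouts above
students1 <- tibble(name = c("Alex", "Ann", "Marie"),
                    exam1 = c(78, 85, 93))
students2 <- tibble(name = c("Alex", "Ann", "Marie"),
                    exam2 = c(75, 89, 97))

# match the rows of the two tables on the name column
both <- students1 %>%
  left_join(students2, by = "name")
both
```

A left join keeps every row of students1; a student missing from students2 would get NA for exam2.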

Let’s say we want to find how many times out of 100000 the most popular names occurred in 2015:

babynames %>%
  filter(year==2015) %>%
  mutate(freq=round(n/sum(n)*100000)) %>%
  select(name, freq) %>%
  arrange(desc(freq)) %>%
  print(n=5)
## # A tibble: 33,098 x 2
##   name    freq
##   <chr>  <dbl>
## 1 Emma     554
## 2 Olivia   533
## 3 Noah     532
## 4 Liam     498
## 5 Sophia   472
## # ... with 3.309e+04 more rows

so the mutate command lets us calculate new variables and the arrange command lets us change the order of the rows.

The tidyverse

ggplot2 and dplyr are two of a number of packages that together form the tidyverse. They are centered around what is called tidy data. For a detailed discussion go to https://www.tidyverse.org/.

The core packages are

  • ggplot2
  • dplyr
  • tidyr
  • readr
  • purrr
  • tibble
  • stringr
  • forcats

but you can get all of them in one step with

install.packages("tidyverse")

Tidy data is defined as data where

  1. Each variable you measure should be in one column.
  2. Each different observation of that variable should be in a different row.
  3. There should be one table for each “kind” of variable.
  4. If you have multiple tables, they should include a column in the table that allows them to be linked.

This is essentially the definition of a data frame, but it is enforced even more so by the tibble format. The theory behind tidy data was described by Hadley Wickham in the article Tidy Data, Journal of Statistical Software. The packages in the tidyverse are all written to have a consistent look and feel and work naturally with tidy data.
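Whether a table is tidy depends on what we regard as a variable and what as an observation. As a sketch, suppose each exam score is one observation. A table with one column per exam then spreads the variable "score" over two columns, and it can be brought into the tidy form with pivot_longer from tidyr. This assumes tidyr version 1.0.0 or later; the data and the column names exam and score are made up for illustration:

```r
library(magrittr)
library(tibble)
library(tidyr)

# one row per student, one column per exam
wide <- tibble(
  name  = c("Alex", "Ann", "Marie"),
  exam1 = c(78, 85, 93),
  exam2 = c(75, 89, 97)
)

# one row per student and exam: each variable now has its own column
long <- wide %>%
  pivot_longer(cols = c(exam1, exam2),
               names_to = "exam",
               values_to = "score")
long
```

The long version has six rows and three columns (name, exam, score).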

One big difference between dataframes and tibbles is that tibbles automatically ignore row names:

head(mtcars, 3)
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
tbl.mtcars <- as_tibble(mtcars)
print(tbl.mtcars, n=3)
## # A tibble: 32 x 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## # ... with 29 more rows

This of course is no good here, because the names of the cars are important. One way to fix this is to use the rownames_to_column routine in the tibble package. It has to be applied before the conversion, while the row names still exist:

library("tibble")
mtcars %>%
   rownames_to_column() %>%
   as_tibble() %>%
   print(n=3)
## # A tibble: 32 x 12
##   rowname         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <chr>         <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4      21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2 Mazda RX4 Wag  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3 Datsun 710     22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
## # ... with 29 more rows
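Another option, sketched here under the assumption that a reasonably recent version of the tibble package is installed, is the rownames argument of as_tibble. It turns the row names into a column during the conversion itself. The column name car is our own choice:

```r
library(tibble)

# keep the car names as a regular column called "car"
cars_tbl <- as_tibble(mtcars, rownames = "car")
print(cars_tbl, n = 3)
```

This gives the same 32 rows with 12 columns, the first of which holds the names of the cars.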

One difficulty is to remember which routine is in what package. The best way is to simply load them all with

library(tidyverse)