The traditional workflow of R comes in large part from other computer languages. So a typical sequence would be like this:
library(ggplot2)
df <- mtcars
df1 <- subset(df, hp>70)
ggplot(data=df1, aes(disp, mpg)) +
  geom_point()
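This is not how we think, though. We would rather say: take the data set mtcars, then keep only the cars with hp above 70, then draw the scatterplot of mpg versus disp.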
In addition there is also the issue that we had to create an intermediate data set (df1).
The pipe was invented to fix all of these problems.
The basic package to use piping is
library(magrittr)
written by Stefan Milton Bache. The basic operator is %>%. The same as above can be done with
mtcars %>%
subset(x=., hp>70) %>% #R knows what hp is!
ggplot(data=., aes(disp, mpg)) +
geom_point()
In principle the pipe can always be used in this way:
x <- rnorm(10)
mean(x)
## [1] -0.03854
x %>%
mean()
## [1] -0.03854
Notice that here we called mean without its argument. In principle the pipe will always use the data on the left of %>% as the first argument of the command on the right.
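The first-argument rule can be sketched as follows, assuming magrittr is loaded. The vector y and the number of digits are made up for illustration. The last line uses the dot placeholder, which puts the left-hand side somewhere other than the first position.

```r
library(magrittr)

y <- c(1.234, 5.678)   # illustrative data, not from the text

# The left-hand side becomes the first argument of the call on the right,
# so these two lines do the same thing.
round(y, 1)
y %>% round(1)

# A dot puts the left-hand side in a different position:
# this is round(y, digits = 1).
1 %>% round(y, digits = .)
```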
Exercise
Consider the following operation:
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)
round(exp(diff(log(x))), 1)
## [1] 3.3 1.8 1.6 0.5 0.3 0.1 48.8 1.1
write the same using the pipe
## [1] 3.3 1.8 1.6 0.5 0.3 0.1 48.8 1.1
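One possible solution to the exercise, assuming magrittr is loaded, reads the nested call from the inside out and turns each function into one step of the pipe:

```r
library(magrittr)

x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

x %>%
  log() %>%      # innermost function first
  diff() %>%     # then the differences of the logs
  exp() %>%      # then back to the original scale
  round(1)       # outermost function last
```

Other arrangements are possible. This one simply keeps one function per line.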
At first it may not seem like writing x %>% f() is any easier than writing f(x), but this style of coding becomes very useful when applying multiple functions; in this case piping will allow one to think from left to right in the logical order of functions rather than from inside to outside in an ugly large nested statement of functions.
The pipe is only a few years old, but there are already many packages that take advantage of it. The most important one, and a package useful in and of itself, is
library(dplyr)
written by Hadley Wickham. In essence it is a set of routines for manipulating data frames, and it can replace many uses of subset and of the apply family of R routines. We can also write the above with
mtcars %>%
filter(hp>70) %>%
ggplot(aes(disp, mpg)) +
geom_point()
Notice how filter is aware of the pipe: it doesn’t need to be told that it is supposed to work with mtcars. ggplot accepts the piped data as well, but its layers are still added with +, not with %>% (we cannot write %>% geom_point()).
Data frames have been the main data format of R since its beginnings, and are likely to stay that way for a long time. They do, however, have some shortcomings. Among other things, when you type the name of a data frame and hit enter, all of it is shown, even if the data set is huge. On the other hand, interesting information such as the data types of the columns is not shown. To help with these (and some other) issues the data format tibble was invented. We can turn a data frame into a tibble with
tmtcars <- as_tibble(mtcars)
tmtcars
## # A tibble: 32 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # ... with 22 more rows
so we have all relevant information about the data set: its size (32x11), the variables and their formats, and the beginning of the data set.
Tibbles are also designed to work well with piping and with the package dplyr.
If you want to create a tibble from scratch use:
tibble(x=1:5, y=x^2)
## # A tibble: 5 x 2
## x y
## <int> <dbl>
## 1 1 1
## 2 2 4
## 3 3 9
## 4 4 16
## 5 5 25
Also, tibbles never use row.names, and they only recycle vectors of length 1. This is because recycling vectors of greater lengths is a frequent source of bugs.
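The recycling rule can be illustrated with a small sketch, assuming the tibble package is loaded. The column values are made up, and the error is caught with tryCatch only so that the script keeps running.

```r
library(tibble)

# A vector of length 1 is recycled to the length of the other column.
tibble(x = 1:4, y = 1)

# A base data frame silently recycles a vector of length 2 ...
data.frame(x = 1:4, y = 1:2)

# ... whereas tibble() refuses and signals an error.
result <- tryCatch(
  tibble(x = 1:4, y = 1:2),
  error = function(e) "tibble() signalled an error"
)
result
```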
We have already seen the filter command, the dplyr version of subset. Here are the most important dplyr commands:
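As a rough sketch, the chain below uses the dplyr verbs that appear in the rest of this section. Which commands count as the most important is an assumption here, and the names hp_per_cyl, mean_mpg and by_cyl are made up for illustration.

```r
library(dplyr)

by_cyl <- mtcars %>%
  filter(hp > 70) %>%                  # keep rows that satisfy a condition
  select(mpg, cyl, hp) %>%             # keep only some columns
  mutate(hp_per_cyl = hp / cyl) %>%    # add a new column
  arrange(desc(mpg)) %>%               # reorder the rows
  group_by(cyl) %>%                    # split the rows into groups
  summarise(mean_mpg = mean(mpg))      # one summary row per group

by_cyl
```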
The library babynames (also by Hadley Wickham) has the number of children of each sex given each name for each year from 1880 to 2015, according to the US Social Security Administration. All names with at least 5 uses are given.
We want to find, for each year and sex, the total number of babies whose names start with W, and plot it:
library(babynames)
babynames %>%
filter(name %>% substr(1, 1) %>% equals("W")) %>%
group_by(year, sex) %>%
summarise(total = sum(n)) %>%
ggplot(data = ., aes(year, total, color = sex)) +
geom_line() +
labs(color="Gender") +
ggtitle('Names starting with W')
How often is my name used for a baby in the US?
babynames %>%
filter(name == "Wolfgang") %>%
ggplot(data = ., aes(year, n)) +
geom_line()
Looks like my name is getting more popular (even if it is still rare)!
What were the most popular girls' names each year?
babynames %>% # take babynames
filter(sex=="F") %>% # then pick girls only
group_by(year) %>% # then separate the years
mutate(M=max(n)) %>% # then find most often used name
filter(n==M) %>% # then pick only those rows
ungroup() %>% # then join data back together
select(name) %>% # then select names only
table() %>% # then count how often each happened
sort(decreasing = TRUE) %>% # then organize data
cbind() # then turn data around for easier reading
## .
## Mary 76
## Jennifer 15
## Emily 12
## Jessica 9
## Lisa 8
## Linda 6
## Emma 5
## Sophia 3
## Ashley 2
## Isabella 2
Let’s say we want to save a data set made with the pipe. Logically we should be able to do this
babynames %>% # take babynames
filter(name=="Wolfgang") %>% # then pick me
wolfgangs # then give new data set a name
## Error in wolfgangs(.): could not find function "wolfgangs"
but that results in an error, because only functions can be used in a pipe. So it is done like this:
wolfgangs <- babynames %>% # take babynames
filter(name=="Wolfgang") # then pick me
print(wolfgangs, n=3)
## # A tibble: 71 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1929 M Wolfgang 8 0.00000722
## 2 1930 M Wolfgang 6 0.00000531
## 3 1932 M Wolfgang 7 0.00000652
## # ... with 68 more rows
This unfortunately breaks the logic of piping. There is a better way, though. Just remember the logic of the assignment character <-: it’s an arrow, and it can point in either direction!
babynames %>% # take babynames
filter(name=="Wolfgang") -> # then pick me
wolfgangs # then assign it a name
print(wolfgangs, n=3)
## # A tibble: 71 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1929 M Wolfgang 8 0.00000722
## 2 1930 M Wolfgang 6 0.00000531
## 3 1932 M Wolfgang 7 0.00000652
## # ... with 68 more rows
Here is a common problem: say you have these two data sets:
students1
## # A tibble: 3 x 2
## name exam1
## <chr> <dbl>
## 1 Alex 78
## 2 Ann 85
## 3 Marie 93
students2
## # A tibble: 3 x 2
## name exam2
## <chr> <dbl>
## 1 Alex 75
## 2 Ann 89
## 3 Marie 97
and we want to join them into one data set:
students1 %>%
left_join(students2)
## # A tibble: 3 x 3
## name exam1 exam2
## <chr> <dbl> <dbl>
## 1 Alex 78 75
## 2 Ann 85 89
## 3 Marie 93 97
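For a self-contained version of the join above, the two data sets can be built by hand from the values shown. Spelling out by = "name" is an addition here. Without it left_join picks the common column itself and prints a message. The name joined is made up for illustration.

```r
library(dplyr)

# The two data sets, typed in from the printouts above
students1 <- tibble(name = c("Alex", "Ann", "Marie"), exam1 = c(78, 85, 93))
students2 <- tibble(name = c("Alex", "Ann", "Marie"), exam2 = c(75, 89, 97))

# Keep every row of students1 and attach the matching exam2 score
joined <- students1 %>%
  left_join(students2, by = "name")

joined
```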
Let’s say we want to find how many times out of 100000 the most popular names occurred in 2015:
babynames %>%
filter(year==2015) %>%
mutate(freq=round(n/sum(n)*100000)) %>%
select(name, freq) %>%
arrange(desc(freq)) %>%
print(n=5)
## # A tibble: 33,098 x 2
## name freq
## <chr> <dbl>
## 1 Emma 554
## 2 Olivia 533
## 3 Noah 532
## 4 Liam 498
## 5 Sophia 472
## # ... with 3.309e+04 more rows
so the mutate command lets us calculate new variables and the arrange command lets us change the order of the rows.
ggplot2 and dplyr are two of a number of packages that together form the tidyverse. They are centered around what is called tidy data. For a detailed discussion go to https://www.tidyverse.org/.
The core packages include ggplot2, dplyr, tidyr, readr, purrr and tibble, but you can get all of them in one step with
install.packages("tidyverse")
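Tidy data is defined as data where each variable forms a column, each observation forms a row, and each type of observational unit forms a table.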
This is essentially the definition of a data frame, but it is enforced even more strictly by the tibble format. The theory behind tidy data was described by Hadley Wickham in the article Tidy Data, Journal of Statistical Software. The packages in the tidyverse are all written to have a consistent look and feel and to work naturally with tidy data.
One big difference between dataframes and tibbles is that tibbles automatically ignore row names:
head(mtcars, 3)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
tbl.mtcars <- as_tibble(mtcars)
print(tbl.mtcars, n=3)
## # A tibble: 32 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## # ... with 29 more rows
This of course is no good here, because the names of the cars are important. One way to fix this is to use the rownames_to_column routine in the tibble package. It has to be applied before the conversion to a tibble, while the row names still exist:
library("tibble")
mtcars %>%
  rownames_to_column() %>%
  as_tibble() %>%
  print(n=3)
The result is a tibble with 12 columns instead of 11, where the new first column rowname holds the names of the cars (Mazda RX4, Mazda RX4 Wag, Datsun 710, ...).
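As a possible alternative, as_tibble can keep the row names itself through its rownames argument, which names the column they are stored in. The column name car and the object name cars_tbl are made up for illustration.

```r
library(tibble)

# Convert and keep the row names in a new first column called "car"
cars_tbl <- as_tibble(mtcars, rownames = "car")

print(cars_tbl, n = 3)
```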
One difficulty is to remember which routine is in what package. The best way is to simply load them all with
library(tidyverse)