Factors with forcats

We have previously discussed factors, that is, categorical data with a fixed set of values and a defined ordering. Now we turn to the package forcats, which provides a number of useful functions for working with factors.

library(tidyverse)
library(forcats)

Let’s remind ourselves of the base R commands first. Consider the data set draft, with the results of the 1970s military draft:

draft %>%
  ggplot(aes(Day.of.Year, Draft.Number)) + 
     geom_point()

Let’s say instead we want to do a box plot of the draft numbers by month:

draft %>%
  ggplot(aes(Month, Draft.Number)) + 
     geom_boxplot()

This is no good: the boxes are ordered alphabetically. To fix it we need to change the variable Month to a factor whose levels are in calendar order:

lvls <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
Month.fac <- factor(draft$Month, 
                    levels = lvls, 
                    ordered = TRUE)
df <- data.frame(Month=Month.fac,
                 Draft_Number=draft$Draft.Number)
df %>%
  ggplot(aes(Month, Draft_Number)) + 
     geom_boxplot()
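As a side note, base R already ships the month abbreviations as the constant month.abb, so the levels need not be typed by hand. Here is a minimal sketch on a made-up character vector (months_chr is a toy example, not the draft data):

```r
# Toy data: month abbreviations stored as plain character strings
months_chr <- c("Mar", "Jan", "Dec", "Feb", "Jan")

# Sorting a character vector is alphabetical
sort(months_chr)

# month.abb is a constant built into base R: "Jan", "Feb", ..., "Dec"
months_fac <- factor(months_chr, levels = month.abb)

# Sorting a factor follows the level order, here the calendar
sorted_months <- sort(months_fac)
as.character(sorted_months)
```

This only works if the values in the data are spelled exactly like the entries of month.abb; any value that does not match a level becomes NA.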

Quite often the order we want is the order in which the values first appear in the data set. In that case we can use

lvls <- unique(draft$Month)
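The forcats package has a function for the same job, fct_inorder, which orders the levels by first appearance. A small sketch comparing the two approaches on a made-up vector x (a toy example, not the draft data):

```r
library(forcats)

# Toy data: values appear in the order Oct, Jan, Mar
x <- c("Oct", "Jan", "Oct", "Mar", "Jan")

# Base R: take the levels from the order of first appearance
f_base <- factor(x, levels = unique(x))

# forcats: start from an ordinary (alphabetical) factor and reorder its levels
f_forcats <- fct_inorder(factor(x))

levels(f_base)
levels(f_forcats)
```

Both calls should give the levels in the order Oct, Jan, Mar.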

The forcats package includes a data set called gss_cat:

gss_cat

## # A tibble: 21,483 x 9
##     year marital     age race  rincome   partyid    relig   denom   tvhours
##    <int> <fct>     <int> <fct> <fct>     <fct>      <fct>   <fct>     <int>
##  1  2000 Never ma~    26 White $8000 to~ Ind,near ~ Protes~ Southe~      12
##  2  2000 Divorced     48 White $8000 to~ Not str r~ Protes~ Baptis~      NA
##  3  2000 Widowed      67 White Not appl~ Independe~ Protes~ No den~       2
##  4  2000 Never ma~    39 White Not appl~ Ind,near ~ Orthod~ Not ap~       4
##  5  2000 Divorced     25 White Not appl~ Not str d~ None    Not ap~       1
##  6  2000 Married      25 White $20000 -~ Strong de~ Protes~ Southe~      NA
##  7  2000 Never ma~    36 White $25000 o~ Not str r~ Christ~ Not ap~       3
##  8  2000 Divorced     44 White $7000 to~ Ind,near ~ Protes~ Luther~      NA
##  9  2000 Married      44 White $25000 o~ Not str d~ Protes~ Other         0
## 10  2000 Married      47 White $25000 o~ Strong re~ Protes~ Southe~       3
## # ... with 21,473 more rows

which has the results of the General Social Survey (http://gss.norc.org). This is a survey in the US conducted by NORC at the University of Chicago. We will use it to illustrate forcats.

Let’s begin by considering the variable race:

gss_cat$race %>% 
  table()

## .
##          Other          Black          White Not applicable 
##           1959           3129          16395              0

We can do the same thing with tidyverse routines:

gss_cat %>%
  count(race)

## # A tibble: 3 x 2
##   race      n
##   <fct> <int>
## 1 Other  1959
## 2 Black  3129
## 3 White 16395

Notice a difference: the first output includes the Not applicable group, but the second does not. This is because race is a factor and Not applicable is one of its levels. The table command includes all levels, even those with a count of 0, whereas count drops them. Dropping them is usually what we want, but not always.
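If we do want the empty levels in the tidyverse version, count has an argument .drop that can be set to FALSE; going the other way, fct_drop removes unused levels from a factor. A short sketch on a made-up tibble (toy and grp are hypothetical names, not part of gss_cat):

```r
library(dplyr)
library(forcats)

# Toy factor with three levels, of which "c" never occurs
toy <- tibble(grp = factor(c("a", "b", "a"), levels = c("a", "b", "c")))

# Base R: table() reports every level, including the empty one
table(toy$grp)

# dplyr: count() reports only the levels that occur ...
seen <- count(toy, grp)

# ... unless we ask it to keep the empty groups
kept <- count(toy, grp, .drop = FALSE)
kept

# forcats: remove the unused level from the factor itself
levels(fct_drop(toy$grp))
```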

By the way, we can always find out what the levels are:

levels(gss_cat$race)

## [1] "Other"          "Black"          "White"          "Not applicable"

Let’s consider the average number of hours that a person spends watching TV per day, depending on their religion:

gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  ) ->
  tv.relig
tv.relig %>%
  ggplot(aes(tvhours, relig)) +
  geom_point()

This graph is hard to read, mainly because the religions appear in no meaningful order. Unlike Month, the variable relig has no natural ordering of its own, so we can order it by the value of tvhours instead:

tv.relig %>%
  ggplot(aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point() +
  labs(x="TV Hours",
       y="Religion")
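To see what fct_reorder does, it helps to look at the levels directly. It sorts the levels of a factor by a summary of a second variable, computed within each level; the default summary is the median. A minimal sketch on made-up data (group and value are toy vectors):

```r
library(forcats)

# Toy data: two observations for each of three groups
group <- factor(c("a", "a", "b", "b", "c", "c"))
value <- c(5, 7, 1, 3, 2, 4)

# Default: order the levels by the median of value within each level
# (medians are a = 6, b = 2, c = 3)
by_median <- fct_reorder(group, value)
levels(by_median)

# Largest first
by_median_desc <- fct_reorder(group, value, .desc = TRUE)
levels(by_median_desc)

# A different summary function, here the maximum
# (maxima are a = 7, b = 3, c = 4)
by_max <- fct_reorder(group, value, .fun = max)
levels(by_max)
```

In the TV hours plot there is only one row per religion, so the median is simply that religion's tvhours value.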

Let’s see how the average age varies across income groups:

gss_cat %>%
  group_by(rincome) %>% 
  summarise(
    age = mean(age, na.rm = TRUE),
    n = n()
  ) -> 
  rincome
rincome %>%
  ggplot(aes(age, rincome)) +
  geom_point() +
  labs(x="Age", y="Income")

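What ordering makes the most sense here? There are two types of levels: those with actual dollar amounts, and those like “Not applicable”. The income levels already come in a sensible order, so we leave them alone and use fct_relevel to move the special levels to the front of the level order: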

rincome %>%
  ggplot(aes(age, fct_relevel(rincome, 
          c("Not applicable", "Refused", "No answer")))) +
  geom_point()+
  labs(x="Age", y="Income")
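Again it may help to look at what fct_relevel does to the levels themselves. It moves the named levels to the front by default and leaves the rest in their existing order; the argument after controls where they go. A small sketch with a made-up factor f:

```r
library(forcats)

# Toy factor with levels a, b, c, d
f <- factor(c("a", "b", "c", "d"))

# Move "c" and "d" to the front; "a" and "b" keep their relative order
front <- fct_relevel(f, "c", "d")
levels(front)

# Move "a" to the end instead
back <- fct_relevel(f, "a", after = Inf)
levels(back)
```

A level name that does not exist in the factor is not moved; forcats reports it with a warning, so the spelling has to match the output of levels() exactly.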

In a bar graph the most common ordering is by size:

gss_cat %>% 
  mutate(marital = marital %>% 
         fct_infreq() %>% fct_rev()
  ) %>% 
  ggplot(aes(marital)) +
     geom_bar() +
     labs(x="Marital Status",
          y="Counts")
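Here fct_infreq orders the levels from most to least frequent, and fct_rev reverses that order, so the bars grow from left to right. The two steps can be checked on a made-up factor (f is a toy example, not from gss_cat):

```r
library(forcats)

# Toy factor: "z" occurs three times, "y" twice, "x" once
f <- factor(c("x", "y", "y", "z", "z", "z"))

# Most frequent level first
most_first <- fct_infreq(f)
levels(most_first)

# Reverse the order: least frequent level first
least_first <- fct_rev(most_first)
levels(least_first)
```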