We have previously discussed factors, that is, categorical data with a fixed set of values (the levels) and possibly an ordering. Now we will discuss the package forcats, which has a number of useful functions for working with factors.
library(tidyverse)
library(forcats)
Let’s remind ourselves of the base R commands first. Consider the data set draft, with the results of the 1970s military draft:
draft %>%
  ggplot(aes(Day.of.Year, Draft.Number)) +
  geom_point()
Let’s say instead we want to do a box plot of the draft numbers by month:
draft %>%
  ggplot(aes(Month, Draft.Number)) +
  geom_boxplot()
Now this is no good: the boxes are ordered alphabetically. So we need to change the variable Month to a factor with the levels in calendar order:
lvls <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
Month.fac <- factor(draft$Month,
                    levels = lvls,
                    ordered = TRUE)
df <- data.frame(Month=Month.fac,
                 Draft_Number=draft$Draft.Number)
df %>%
  ggplot(aes(Month, Draft_Number)) +
  geom_boxplot()
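One thing to watch out for: factor() silently turns any value that is not listed in levels into NA. A quick check, sketched here on a made-up vector (x is hypothetical and not part of draft), is to count the missing values after the conversion:

```r
# Hypothetical toy vector with one misspelled month ("feb")
x <- c("Jan", "Feb", "feb", "Mar")
f <- factor(x, levels = c("Jan", "Feb", "Mar"))

# Number of values that did not match any level
n_lost <- sum(is.na(f) & !is.na(x))
n_lost

# Which original values were lost
lost <- x[is.na(f) & !is.na(x)]
lost
```

If the count is larger than 0, the spelling of the levels should be compared with the values in the data.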
Quite often the order we want is the order in which the values first appear in the data set. Then we can use
lvls <- unique(draft$Month)
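The forcats function fct_inorder should give the same result in one step. Here is a small sketch with a made-up vector (month is hypothetical, not the draft data):

```r
library(forcats)

# Hypothetical toy vector, months in order of first appearance
month <- c("Jan", "Jan", "Feb", "Mar", "Feb", "Apr")

# Base R: take the levels from the order of first appearance
f_base <- factor(month, levels = unique(month))

# forcats: reorder the levels of an existing factor by first appearance
f_forcats <- fct_inorder(factor(month))

levels(f_base)
levels(f_forcats)
```

Both calls should give the levels Jan, Feb, Mar, Apr, whereas factor(month) alone would sort them alphabetically.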
The forcats package includes a data set called gss_cat:
gss_cat
## # A tibble: 21,483 x 9
##     year marital     age race  rincome   partyid    relig   denom   tvhours
##    <int> <fct>     <int> <fct> <fct>     <fct>      <fct>   <fct>     <int>
##  1  2000 Never ma~    26 White $8000 to~ Ind,near ~ Protes~ Southe~      12
##  2  2000 Divorced     48 White $8000 to~ Not str r~ Protes~ Baptis~      NA
##  3  2000 Widowed      67 White Not appl~ Independe~ Protes~ No den~       2
##  4  2000 Never ma~    39 White Not appl~ Ind,near ~ Orthod~ Not ap~       4
##  5  2000 Divorced     25 White Not appl~ Not str d~ None    Not ap~       1
##  6  2000 Married      25 White $20000 -~ Strong de~ Protes~ Southe~      NA
##  7  2000 Never ma~    36 White $25000 o~ Not str r~ Christ~ Not ap~       3
##  8  2000 Divorced     44 White $7000 to~ Ind,near ~ Protes~ Luther~      NA
##  9  2000 Married      44 White $25000 o~ Not str d~ Protes~ Other         0
## 10  2000 Married      47 White $25000 o~ Strong re~ Protes~ Southe~       3
## # ... with 21,473 more rows
which has results from the General Social Survey (http://gss.norc.org). This is a survey in the US run by the research organization NORC at the University of Chicago. We will use it to illustrate forcats.
Let’s begin by considering the variable race:
gss_cat$race %>%
  table()
## .
##          Other          Black          White Not applicable
##           1959           3129          16395              0
We can do the same thing with tidyverse routines:
gss_cat %>%
  count(race)
## # A tibble: 3 x 2
##   race      n
##   <fct> <int>
## 1 Other  1959
## 2 Black  3129
## 3 White 16395
Notice a difference: the first output has the group Not applicable but the second does not. This is because race is a factor and Not applicable is among its levels, although no observation has that value. The table command includes all levels, even those with a count of 0, whereas count drops them. This is likely what we want most of the time, but not always.
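If we do want the empty levels in the tidyverse version, count has an argument .drop that can be set to FALSE. A minimal sketch on a made-up data frame (toy and its column g are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: the level "c" is declared but never observed
toy <- tibble(g = factor(c("a", "b", "a"), levels = c("a", "b", "c")))

table(toy$g)                               # all levels, "c" with count 0
count(toy, g)                              # only the observed levels
with_zero <- count(toy, g, .drop = FALSE)  # keeps "c" with n = 0
with_zero
```

The same argument should work for gss_cat, that is count(race, .drop = FALSE).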
By the way, we can always find out what the levels are:
levels(gss_cat$race)
## [1] "Other" "Black" "White" "Not applicable"
Let’s consider the average number of hours that a person spends watching TV per day, depending on their religion:
gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  ) ->
  tv.relig
tv.relig %>%
  ggplot(aes(tvhours, relig)) +
  geom_point()
This graph is hard to read, mainly because the religions are in no particular order. Unlike Month, the variable relig has no natural ordering of its own. So maybe we should order the levels by the value of tvhours:
tv.relig %>%
  ggplot(aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point() +
  labs(x="TV Hours",
       y="Religion")
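To see what fct_reorder does to the levels, here is a small sketch with made-up values (grp and val are hypothetical). The first argument is the factor and the second is the numeric variable to sort by. If a level has several values, they are summarised with the median by default:

```r
library(forcats)

# Hypothetical toy summary: one value per group
grp <- factor(c("a", "b", "c"))
val <- c(3, 1, 2)

# Levels sorted by increasing val
asc <- levels(fct_reorder(grp, val))
asc

# Levels sorted by decreasing val
desc <- levels(fct_reorder(grp, val, .desc = TRUE))
desc
```

Only the order of the levels changes; the values of the factor stay the same.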
Let’s see how the average age varies with income:
gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    n = n()
  ) ->
  rincome
rincome %>%
  ggplot(aes(age, rincome)) +
  geom_point() +
  labs(x="Age", y="Income")
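What ordering makes the most sense here? There are two types of levels: income brackets with actual numbers, which already have a natural order, and special levels like “Not applicable”. We should probably keep the brackets as they are and separate out the special levels. The command fct_relevel does this by moving the named levels to the front: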
rincome %>%
  ggplot(aes(age, fct_relevel(rincome,
    # names must match the levels exactly: the level is "Refused", not "Refuse"
    c("Not applicable", "Refused", "No answer")))) +
  geom_point() +
  labs(x="Age", y="Income")
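By default fct_relevel moves the named levels to the front, but its argument after lets us put them somewhere else. A minimal sketch with a made-up factor (inc is hypothetical, not the gss_cat variable):

```r
library(forcats)

# Hypothetical toy factor with one special level
inc <- factor(c("Low", "Refused", "High"),
              levels = c("Low", "Refused", "High"))

# Default: the named level goes to the front
front <- levels(fct_relevel(inc, "Refused"))
front

# after = Inf: the named level goes to the end
back <- levels(fct_relevel(inc, "Refused", after = Inf))
back
```

It is a good idea to compare the names we type with levels(inc) first, because a name that does not match a level exactly cannot be moved.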
In a bar graph the most common ordering is by size:
gss_cat %>%
  mutate(marital = marital %>%
           fct_infreq() %>% fct_rev()
  ) %>%
  ggplot(aes(marital)) +
  geom_bar() +
  labs(x="Marital Status",
       y="Counts")
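Here fct_infreq orders the levels from the most to the least frequent, and fct_rev reverses that order, so the bars grow from left to right. A small sketch with a made-up factor (status is hypothetical) shows the two steps:

```r
library(forcats)

# Hypothetical toy factor: Married 3 times, Single 2 times, Widowed once
status <- factor(c("Married", "Single", "Married",
                   "Widowed", "Married", "Single"))

# Most frequent level first
by_freq <- levels(fct_infreq(status))
by_freq

# Reversed: least frequent level first
by_freq_rev <- levels(fct_rev(fct_infreq(status)))
by_freq_rev
```

Leaving out fct_rev would give the bars in decreasing order instead.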