library(stringr)
library(magrittr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(wordcloud)
library(tidytext)
library(gutenbergr)
Thanks to Hadley Wickham, we have the package stringr, which adds more functionality to the base functions for handling strings in R. According to the package description at http://cran.r-project.org/web/packages/stringr/index.html, stringr is a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use. It does this by ensuring that function and argument names (and positions) are consistent, that all functions deal with NAs and zero-length character vectors appropriately, and that the output data structure of each function matches the input data structures of other functions.
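As a small sketch of the NA handling, str_length() simply returns NA for a missing value:
str_length(c("stringr", NA))
## [1]  7 NA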
We previously looked at the states in the US:
head(states)
## [1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
## [6] "Colorado"
We can find out how many letters each state has with
states %>%
str_length()
## [1] 7 6 7 8 10 8 11 8 20 7 7 6 5 8 7 4 6 8 9 5 8 13 8
## [24] 9 11 8 7 8 6 13 10 10 8 14 12 4 8 6 12 12 14 12 9 5 4 7
## [47] 8 10 13 9 7 11
Also, we found out how many vowels the names of each state had. Here is how we can do that with stringr:
states %>%
str_count("a")
## [1] 3 2 1 2 2 1 0 2 1 1 1 2 1 0 2 1 2 0 2 1 2 2 1 1 0 0 2 2 2 1 0 0 0 2 2
## [36] 0 2 0 2 1 2 2 0 1 1 0 1 1 1 0 0 0
Notice that we are only getting the number of a’s in lower case. Since str_count() does not contain the argument ignore.case, we need to transform all letters to lower case, and then count the number of a’s like this:
states %>%
tolower() %>%
str_count("a")
## [1] 4 3 2 3 2 1 0 2 1 1 1 2 1 0 2 1 2 0 2 1 2 2 1 1 0 0 2 2 2 1 0 0 0 2 2
## [36] 0 2 0 2 1 2 2 0 1 1 0 1 1 1 0 0 0
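If we prefer not to change the text itself, stringr's regex() modifier can make the match case-insensitive; this sketch gives the same counts as above:
states %>%
str_count(regex("a", ignore_case = TRUE))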
Now let’s do this for all the vowels:
vowels <- c("a", "e", "i", "o", "u")
states %>%
tolower() %>%
str_split("") %>%
unlist() %>%
table() ->
x
x[vowels]
## .
## a e i o u
## 62 29 48 40 10
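The same totals can also be obtained directly with str_count(), one vowel at a time (a sketch):
sapply(vowels, function(v) sum(str_count(tolower(states), v)))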
stringr provides functions both for basic string manipulation and for pattern matching with regular expressions.
The following table contains the stringr functions for basic string operations:
Function | Description | Similar to |
---|---|---|
str_c() | string concatenation | paste() |
str_length() | number of characters | nchar() |
str_sub() | extracts substrings | substring() |
str_dup() | duplicates characters | none |
str_trim() | removes leading and trailing whitespace | none |
str_pad() | pads a string | none |
str_wrap() | wraps a string paragraph | strwrap() |
Here are some examples:
paste("It", "is", "a", "nice", "day", "today")
## [1] "It is a nice day today"
str_c("It", "is", "a", "nice", "day", "today")
## [1] "Itisanicedaytoday"
str_c("It", "is", "a", "nice", "day", "today",
sep=" ")
## [1] "It is a nice day today"
str_c("It", "is", "a", "nice", "day", "today",
sep="-")
## [1] "It-is-a-nice-day-today"
Next, str_length(). Compared to nchar(), it can handle more data types, for example factors:
some_factor <- factor(c(1, 1, 1, 2, 2, 2),
labels = c("good", "bad"))
some_factor
## [1] good good good bad bad bad
## Levels: good bad
str_length(some_factor)
## [1] 4 4 4 3 3 3
whereas nchar(some_factor) results in an error.
A routine that has no direct equivalent in basic R is str_dup. It is sort of a rep for strings:
str_dup("ab", 2)
## [1] "abab"
str_dup("ab", 1:3)
## [1] "ab" "abab" "ababab"
Another handy function that we can find in stringr is str_pad() for padding a string. This is useful if we want to have a nice alignment when printing some text.
Its default usage has the following form:
str_pad(string, width, side = "left", pad = " ")
The idea of str_pad() is to take a string and pad it with leading or trailing characters to a specified total width. The default padding character is a space (pad = " "), and the padding can be added on the left (side = "left", the default, so the text ends up right-aligned), on the right (side = "right"), or on both sides (side = "both").
Let’s see some examples:
str_pad("Great!", width = 7)
## [1] " Great!"
str_pad("Great", width = 8, side = "both")
## [1] " Great "
str_pad(str_pad("Great!", width = 7), width = 8, pad="#")
## [1] "# Great!"
Often when dealing with character vectors we end up with white spaces. These are easily taken care of with str_trim:
txt <- c("some", " text", "with ", " white ", "space")
str_trim(txt)
## [1] "some" "text" "with" "white" "space"
An operation that one needs to do quite often is to extract the last (few) letters from words. Using substring is tricky if the words don’t have the same lengths:
substring(txt, nchar(txt)-1, nchar(txt))
## [1] "me" "xt" "h " "e " "ce"
Much easier with
str_sub(txt, -2, -1)
## [1] "me" "xt" "h " "e " "ce"
You can use str_wrap() to modify existing whitespace in order to wrap a paragraph of text, such that the length of each line is as similar as possible.
declaration.of.independence <- "We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness."
cat(str_wrap(declaration.of.independence, width=40))
## We hold these truths to be self-evident,
## that all men are created equal, that
## they are endowed by their Creator with
## certain unalienable Rights, that among
## these are Life, Liberty and the pursuit
## of Happiness.
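str_wrap() also has indent and exdent arguments, to indent the first line of each paragraph or all the following lines, respectively:
cat(str_wrap(declaration.of.independence, width = 40, indent = 4))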
Let’s do a textual analysis of Bram Stoker’s Dracula. We can get an electronic copy of the book from Project Gutenberg at http://www.gutenberg.org/. Getting a book into R is very easy, because there is a package for that as well:
dracula <- gutenberg_download(345)
dracula
Why 345? This is the id number used by the Gutenberg web site to identify this book. Go to their website and check out what other books they have (there are over 57000 as of 2018).
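If you do not know the id of a book, gutenbergr also has gutenberg_works(), which filters the catalog metadata; a sketch (the exact spelling of the author field, "Stoker, Bram", is an assumption about the catalog):
# look up works by Bram Stoker in the catalog
gutenberg_works(author == "Stoker, Bram")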
The first column is the gutenberg_id, so we can get rid of that:
dracula <- dracula[, 2]
Let’s see the beginning of the book:
dracula[1:100, ] %>%
str_wrap(width=40) %>%
cat()
## c(" DRACULA", "", "", "", "", "",
## " DRACULA", "", " _by_", "", "
## Bram Stoker", "", " [Illustration:
## colophon]", "", " NEW YORK", "", "
## GROSSET & DUNLAP", "", " _Publishers_",
## "", " Copyright, 1897, in the United
## States of America, according", " to
## Act of Congress, by Bram Stoker", "",
## " [_All rights reserved._]", "", "
## PRINTED IN THE UNITED STATES", " AT",
## " THE COUNTRY LIFE PRESS, GARDEN CITY,
## N.Y.", "", "", "", "", " TO", "", " MY
## DEAR FRIEND", "", " HOMMY-BEG", "", "",
## "", "", "CONTENTS", "", "", "CHAPTER I",
## " Page", "", "Jonathan Harker's Journal
## 1", "", "CHAPTER II", "", "Jonathan
## Harker's Journal 14", "", "CHAPTER
## III", "", "Jonathan Harker's Journal
## 26", "", "CHAPTER IV", "", "Jonathan
## Harker's Journal 38", "", "CHAPTER V",
## "", "Letters--Lucy and Mina 51", "",
## "CHAPTER VI", "", "Mina Murray's Journal
## 59", "", "CHAPTER VII", "", "Cutting
## from \"The Dailygraph,\" 8 August 71",
## "", "CHAPTER VIII", "", "Mina Murray's
## Journal 84", "", "CHAPTER IX", "",
## "Mina Murray's Journal 98", "", "CHAPTER
## X", "", "Mina Murray's Journal 111",
## "", "CHAPTER XI", "", "Lucy Westenra's
## Diary 124", "", "CHAPTER XII", "",
## "Dr. Seward's Diary 136", "", "CHAPTER
## XIII", "", "Dr. Seward's Diary 152",
## "", "CHAPTER XIV", "", "Mina Harker's
## Journal 167")
What are the most commonly used words in the book? Well, it will be something like “a”, “and”, etc. Those kinds of words are called stop words; they are not very interesting, and it might be better to just take them out. There are lists of such words. One of them, the stop_words dataset, is in the library tidytext:
stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ... with 1,139 more rows
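The dataset actually combines several stop-word lexicons; a quick way to see how many words each lexicon contributes:
stop_words %>%
count(lexicon)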
So we now want to go through Dracula and remove all appearances of any of the words in stop_words. This can be done with the dplyr command anti_join(). However, the two tables need to share a column name, and in Dracula the column is called text whereas in stop_words it is word. Again, the library tidytext has a command, unnest_tokens(), that splits the text into one word per row and puts the words in a column named word:
dracula %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) ->
dracula
dracula
## # A tibble: 48,552 x 1
## word
## <chr>
## 1 dracula
## 2 dracula
## 3 _by_
## 4 bram
## 5 stoker
## 6 illustration
## 7 colophon
## 8 york
## 9 grosset
## 10 dunlap
## # ... with 48,542 more rows
So now for the most common words:
dracula %>%
count(word, sort=TRUE)
## # A tibble: 9,072 x 2
## word n
## <chr> <int>
## 1 time 390
## 2 van 323
## 3 night 310
## 4 helsing 301
## 5 dear 224
## 6 lucy 223
## 7 day 220
## 8 hand 210
## 9 mina 210
## 10 door 200
## # ... with 9,062 more rows
So time is the most common word; it appears 390 times in the book.
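A quick bar chart of the ten most common words (a sketch using ggplot2; the cutoff of ten is arbitrary):
dracula %>%
count(word, sort=TRUE) %>%
head(10) %>%
ggplot(aes(x=reorder(word, n), y=n)) +
geom_col() +
coord_flip() +
labs(x="", y="Number of appearances")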
Here is a nice graph called a wordcloud to illustrate word frequencies:
dracula %>%
count(word) %>%
with(wordcloud(word, n, min.freq=100, random.order=FALSE))
How do the words in Dracula compare to another famous fiction book of the era, The Time Machine, by H. G. Wells? This is book # 35 in the Gutenberg catalog:
time.machine <- gutenberg_download(35)[, 2]
time.machine %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) ->
time.machine
time.machine %>%
count(word, sort=TRUE)
## # A tibble: 4,135 x 2
## word n
## <chr> <int>
## 1 time 200
## 2 machine 85
## 3 white 59
## 4 traveller 55
## 5 world 52
## 6 hand 49
## 7 morlocks 46
## 8 people 46
## 9 weena 46
## 10 found 44
## # ... with 4,125 more rows
Actually, time is the most common word in both books (not a surprise in a book called The Time Machine!).
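We can draw a wordcloud for this book as well (a sketch; the cutoff min.freq=40 is an arbitrary choice, since this book is much shorter than Dracula):
time.machine %>%
count(word) %>%
with(wordcloud(word, n, min.freq=40, random.order=FALSE))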
Can we do a graphical display of the word frequencies? We will need some routines from yet another package called tidyr.
We begin by joining the two books together, with a new column identifying the book:
freqs <- bind_rows(mutate(dracula, book="Dracula"),
mutate(time.machine, book="Time.Machine"))
freqs
## # A tibble: 59,663 x 2
## word book
## <chr> <chr>
## 1 dracula Dracula
## 2 dracula Dracula
## 3 _by_ Dracula
## 4 bram Dracula
## 5 stoker Dracula
## 6 illustration Dracula
## 7 colophon Dracula
## 8 york Dracula
## 9 grosset Dracula
## 10 dunlap Dracula
## # ... with 59,653 more rows
Next we add some useful columns:
freqs %>%
mutate(word=str_extract(word, "[a-z']+")) %>%
# keep only letters, so tokens like _by_ become by
count(book, word) %>%
group_by(book) %>%
mutate(prop=n/sum(n)) %>% # take into account
# different lengths of the books
ungroup() %>%
filter(n>10) %>% # consider only words used frequently
select(-n) %>% # not needed anymore
arrange(desc(prop)) ->
freqs
Next we find all the words that appear in both books, and look at their relative proportions:
freqs %>%
spread(key=book, value = prop) %>%
# use one column for Dracula's
# proportions and another for Time Machine
na.omit() -> # words that appear in only one book are NA,
# eliminate them
common.words
print(common.words, n=4)
## # A tibble: 103 x 3
## word Dracula Time.Machine
## <chr> <dbl> <dbl>
## 1 absolutely 0.000247 0.000990
## 2 age 0.000288 0.00126
## 3 air 0.00111 0.00207
## 4 altogether 0.000330 0.000990
## # ... with 99 more rows
common.words %>%
ggplot(aes(x=Dracula, y=Time.Machine)) +
labs(x = "Dracula",
y = "Time Machine") +
geom_text(aes(label = word),
check_overlap = TRUE,
vjust = 1.5)
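Most of these proportions are small and bunched together near zero, so one possible refinement (a sketch) is to put both axes on a log scale and add a reference line where the two proportions are equal:
common.words %>%
ggplot(aes(x=Dracula, y=Time.Machine)) +
geom_abline(color="grey") +
geom_text(aes(label = word),
check_overlap = TRUE,
vjust = 1.5) +
scale_x_log10() +
scale_y_log10() +
labs(x = "Dracula", y = "Time Machine")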