library(stringr)
library(magrittr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(wordcloud)
library(tidytext)
library(gutenbergr)
Thanks to Hadley Wickham, we have the package stringr, which adds more functionality to the base functions for handling strings in R. According to the package description at http://cran.r-project.org/web/packages/stringr/index.html, stringr is a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use. It does this by ensuring that function and argument names (and positions) are consistent, that all functions deal with NAs and zero-length character vectors appropriately, and that the output data structure of each function matches the input data structures of other functions.
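As a small sketch of the NA handling, str_length() simply returns NA for a missing value:
str_length(c("stringr", NA))
## [1]  7 NA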
We previously looked at the states in the US:
head(states)
## [1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
## [6] "Colorado"
We can find out how many letters each state has with
states %>%
str_length()
## [1] 7 6 7 8 10 8 11 8 20 7 7 6 5 8 7 4 6 8 9 5 8 13 8
## [24] 9 11 8 7 8 6 13 10 10 8 14 12 4 8 6 12 12 14 12 9 5 4 7
## [47] 8 10 13 9 7 11
Also, we found out how many vowels the names of each state had. Here is how we can do that with stringr:
states %>%
str_count("a")
## [1] 3 2 1 2 2 1 0 2 1 1 1 2 1 0 2 1 2 0 2 1 2 2 1 1 0 0 2 2 2 1 0 0 0 2 2
## [36] 0 2 0 2 1 2 2 0 1 1 0 1 1 1 0 0 0
Notice that we are only getting the number of a’s in lower case. Since str_count() does not contain the argument ignore.case, we need to transform all letters to lower case, and then count the number of a’s like this:
states %>%
tolower() %>%
str_count("a")
## [1] 4 3 2 3 2 1 0 2 1 1 1 2 1 0 2 1 2 0 2 1 2 2 1 1 0 0 2 2 2 1 0 0 0 2 2
## [36] 0 2 0 2 1 2 2 0 1 1 0 1 1 1 0 0 0
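If we prefer not to change the text itself, stringr's regex() modifier can make the match case-insensitive; this sketch gives the same counts as above:
states %>%
str_count(regex("a", ignore_case = TRUE))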
Now let’s do this for all the vowels:
vowels <- c("a", "e", "i", "o", "u")
states %>%
tolower() %>%
str_split("") %>%
unlist() %>%
table() ->
x
x[vowels]
## .
## a e i o u
## 62 29 48 40 10
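The same totals can also be obtained directly with str_count(), one vowel at a time (a sketch):
sapply(vowels, function(v) sum(str_count(tolower(states), v)))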
stringr provides functions both for basic string manipulation and for pattern matching with regular expressions.
The following table contains the stringr functions for basic string operations:
Function | Description | Similar to |
---|---|---|
str_c() | string concatenation | paste() |
str_length() | number of characters | nchar() |
str_sub() | extracts substrings | substring() |
str_dup() | duplicates characters | none |
str_trim() | removes leading and trailing whitespace | none |
str_pad() | pads a string | none |
str_wrap() | wraps a string paragraph | strwrap() |
Here are some examples:
paste("It", "is", "a", "nice", "day", "today")
## [1] "It is a nice day today"
str_c("It", "is", "a", "nice", "day", "today")
## [1] "Itisanicedaytoday"
str_c("It", "is", "a", "nice", "day", "today",
sep=" ")
## [1] "It is a nice day today"
str_c("It", "is", "a", "nice", "day", "today",
sep="-")
## [1] "It-is-a-nice-day-today"
Next, str_length(). Compared to nchar(), it can handle more data types, for example factors:
some_factor <- factor(c(1, 1, 1, 2, 2, 2),
labels = c("good", "bad"))
some_factor
## [1] good good good bad bad bad
## Levels: good bad
str_length(some_factor)
## [1] 4 4 4 3 3 3
whereas nchar(some_factor) results in an error.
A routine that has no direct equivalent in basic R is str_dup. It is sort of a rep for strings:
str_dup("ab", 2)
## [1] "abab"
str_dup("ab", 1:3)
## [1] "ab" "abab" "ababab"
Another handy function that we can find in stringr is str_pad() for padding a string. This is useful if we want to have a nice alignment when printing some text.
Its default usage has the following form:
str_pad(string, width, side = "left", pad = " ")
The idea of str_pad() is to take a string and pad it with leading or trailing characters to a specified total width. The default padding character is a space (pad = " "), and the padding can be added on the left (side = "left", the default, so the text ends up right-aligned), on the right (side = "right"), or on both sides (side = "both").
Let’s see some examples:
str_pad("Great!", width = 7)
## [1] " Great!"
str_pad("Great", width = 8, side = "both")
## [1] " Great "
str_pad(str_pad("Great!", width = 7), width = 8, pad="#")
## [1] "# Great!"
Often when dealing with character vectors we end up with white spaces. These are easily taken care of with str_trim:
txt <- c("some", " text", "with ", " white ", "space")
str_trim(txt)
## [1] "some" "text" "with" "white" "space"
An operation that one needs to do quite often is to extract the last (few) letters from words. Using substring is tricky if the words don’t have the same lengths:
substring(txt, nchar(txt)-1, nchar(txt))
## [1] "me" "xt" "h " "e " "ce"
Much easier with
str_sub(txt, -2, -1)
## [1] "me" "xt" "h " "e " "ce"
You can use str_wrap() to modify existing whitespace in order to wrap a paragraph of text, such that the length of each line is as similar as possible.
declaration.of.independence <- "We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness."
cat(str_wrap(declaration.of.independence, width=40))
## We hold these truths to be self-evident,
## that all men are created equal, that
## they are endowed by their Creator with
## certain unalienable Rights, that among
## these are Life, Liberty and the pursuit
## of Happiness.
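str_wrap() also has indent and exdent arguments, to indent the first line of each paragraph or all the following lines, respectively:
cat(str_wrap(declaration.of.independence, width = 40, indent = 4))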
Let’s do a textual analysis of Bram Stoker’s Dracula. We can get an electronic copy of the book from Project Gutenberg at http://www.gutenberg.org/. Getting a book into R is very easy, because there is a package for that as well:
dracula <- gutenberg_download(345)
dracula
Why 345? This is the id number used by the Gutenberg web site to identify this book. Go to their website and check out what other books they have (there are over 57000 as of 2018).
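If you do not know the id of a book, gutenbergr also has gutenberg_works(), which filters the catalog metadata; a sketch (the exact spelling of the author field, "Stoker, Bram", is an assumption about the catalog):
# look up works by Bram Stoker in the catalog
gutenberg_works(author == "Stoker, Bram")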
The first column is the gutenberg_id, so we can get rid of that:
dracula <- dracula[, 2]
Let’s see the beginning of the book:
dracula[1:100, ] %>%
str_wrap(width=40) %>%
cat()
## c(" DRACULA", "", "", "", "", "",
## " DRACULA", "", " _by_", "", "
## Bram Stoker", "", " [Illustration:
## colophon]", "", " NEW YORK", "", "
## GROSSET & DUNLAP", "", " _Publishers_",
## "", " Copyright, 1897, in the United
## States of America, according", " to
## Act of Congress, by Bram Stoker", "",
## " [_All rights reserved._]", "", "
## PRINTED IN THE UNITED STATES", " AT",
## " THE COUNTRY LIFE PRESS, GARDEN CITY,
## N.Y.", "", "", "", "", " TO", "", " MY
## DEAR FRIEND", "", " HOMMY-BEG", "", "",
## "", "", "CONTENTS", "", "", "CHAPTER I",
## " Page", "", "Jonathan Harker's Journal
## 1", "", "CHAPTER II", "", "Jonathan
## Harker's Journal 14", "", "CHAPTER
## III", "", "Jonathan Harker's Journal
## 26", "", "CHAPTER IV", "", "Jonathan
## Harker's Journal 38", "", "CHAPTER V",
## "", "Letters--Lucy and Mina 51", "",
## "CHAPTER VI", "", "Mina Murray's Journal
## 59", "", "CHAPTER VII", "", "Cutting
## from \"The Dailygraph,\" 8 August 71",
## "", "CHAPTER VIII", "", "Mina Murray's
## Journal 84", "", "CHAPTER IX", "",
## "Mina Murray's Journal 98", "", "CHAPTER
## X", "", "Mina Murray's Journal 111",
## "", "CHAPTER XI", "", "Lucy Westenra's
## Diary 124", "", "CHAPTER XII", "",
## "Dr. Seward's Diary 136", "", "CHAPTER
## XIII", "", "Dr. Seward's Diary 152",
## "", "CHAPTER XIV", "", "Mina Harker's
## Journal 167")
What are the most commonly used words in the book? Well, it will be something like “a”, “and”, etc. Those kinds of words are called stop words; they are not very interesting, and it might be better to just take them out. There are lists of such words. One of them, the stop_words dataset, is in the library tidytext:
stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ... with 1,139 more rows
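The dataset actually combines several stop-word lexicons; a quick way to see how many words each lexicon contributes:
stop_words %>%
count(lexicon)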
So we now want to go through Dracula and remove all appearances of any of the words in stop_words. This can be done with the dplyr command anti_join(). However, the two tables need to share a column name, and in Dracula the column is called text whereas in stop_words it is word. Again, the library tidytext has a command, unnest_tokens(), that splits the text into one word per row and puts the words in a column named word:
dracula %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) ->
dracula
dracula
## # A tibble: 48,552 x 1
## word
## <chr>
## 1 dracula
## 2 dracula
## 3 _by_
## 4 bram
## 5 stoker
## 6 illustration
## 7 colophon
## 8 york
## 9 grosset
## 10 dunlap
## # ... with 48,542 more rows
So now for the most common words:
dracula %>%
count(word, sort=TRUE)
## # A tibble: 9,072 x 2
## word n
## <chr> <int>
## 1 time 390
## 2 van 323
## 3 night 310
## 4 helsing 301
## 5 dear 224
## 6 lucy 223
## 7 day 220
## 8 hand 210
## 9 mina 210
## 10 door 200
## # ... with 9,062 more rows
So time is the most common word; it appears 390 times in the book.
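A quick bar chart of the ten most common words (a sketch using ggplot2; the cutoff of ten is arbitrary):
dracula %>%
count(word, sort=TRUE) %>%
head(10) %>%
ggplot(aes(x=reorder(word, n), y=n)) +
geom_col() +
coord_flip() +
labs(x="", y="Number of appearances")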
Here is a nice graph called a wordcloud to illustrate word frequencies:
dracula %>%
count(word) %>%
with(wordcloud(word, n, min.freq=100, random.order=FALSE))
How do the words in Dracula compare to another famous fiction book of the era, The Time Machine, by H. G. Wells? This is book # 35 in the Gutenberg catalog:
time.machine <- gutenberg_download(35)[, 2]
time.machine %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) ->
time.machine
time.machine %>%
count(word, sort=TRUE)
## # A tibble: 4,135 x 2
## word n
## <chr> <int>
## 1 time 200
## 2 machine 85
## 3 white 59
## 4 traveller 55
## 5 world 52
## 6 hand 49
## 7 morlocks 46
## 8 people 46
## 9 weena 46
## 10 found 44
## # ... with 4,125 more rows
Actually, time is the most common word in both books (not a surprise in a book called The Time Machine!).
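We can draw a wordcloud for this book as well (a sketch; the cutoff min.freq=40 is an arbitrary choice, since this book is much shorter than Dracula):
time.machine %>%
count(word) %>%
with(wordcloud(word, n, min.freq=40, random.order=FALSE))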
Can we do a graphical display of the word frequencies? We will need some routines from yet another package called tidyr.
We begin by joining the two books together, with a new column identifying the book:
freqs <- bind_rows(mutate(dracula, book="Dracula"),
mutate(time.machine, book="Time.Machine"))
freqs
## # A tibble: 59,663 x 2
## word book
## <chr> <chr>
## 1 dracula Dracula
## 2 dracula Dracula
## 3 _by_ Dracula
## 4 bram Dracula
## 5 stoker Dracula
## 6 illustration Dracula
## 7 colophon Dracula
## 8 york Dracula
## 9 grosset Dracula
## 10 dunlap Dracula
## # ... with 59,653 more rows
Next we add some useful columns:
freqs %>%
mutate(word=str_extract(word, "[a-z']+")) %>%
# keep only letters, so tokens like _by_ become by
count(book, word) %>%
group_by(book) %>%
mutate(prop=n/sum(n)) %>% # take into account
# different lengths of the books
ungroup() %>%
filter(n>10) %>% # consider only words used frequently
select(-n) %>% # not needed anymore
arrange(desc(prop)) ->
freqs
Next we find all the words that appear in both books, and look at their relative proportions:
freqs %>%
spread(key=book, value = prop) %>%
# use one column for Dracula's
# proportions and another for Time Machine
na.omit() -> # words that appear in only one book are NA,
# eliminate them
common.words
print(common.words, n=4)
## # A tibble: 103 x 3
## word Dracula Time.Machine
## <chr> <dbl> <dbl>
## 1 absolutely 0.000247 0.000990
## 2 age 0.000288 0.00126
## 3 air 0.00111 0.00207
## 4 altogether 0.000330 0.000990
## # ... with 99 more rows
common.words %>%
ggplot(aes(x=Dracula, y=Time.Machine)) +
labs(x = "Dracula",
y = "Time Machine") +
geom_text(aes(label = word),
check_overlap = TRUE,
vjust = 1.5)
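Most of these proportions are small and bunched together near zero, so one possible refinement (a sketch) is to put both axes on a log scale and add a reference line where the two proportions are equal:
common.words %>%
ggplot(aes(x=Dracula, y=Time.Machine)) +
geom_abline(color="grey") +
geom_text(aes(label = word),
check_overlap = TRUE,
vjust = 1.5) +
scale_x_log10() +
scale_y_log10() +
labs(x = "Dracula", y = "Time Machine")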