
Working with Characters

Working with character strings is one of the most common tasks in R. In this section we discuss some of the routines R provides for it.

Character strings can use single or double quotes:

'this is a string'

## [1] "this is a string"

"this is a string"

## [1] "this is a string"

Say you want to type in a vector of names. Having to type all those quotes is a bit of work, but there is a nice routine in the Hmisc package that helps:

library(Hmisc)
Cs(Joe, Jack, Ann, Laura)

## [1] "Joe"   "Jack"  "Ann"   "Laura"

Resma3 has the data set agesexUS, which has the breakdown of the US population by gender and age according to the 2000 US Census. We are going to work with the names of the states (plus DC and PR) for a bit:

states <- agesexUS$State
head(states)

## [1] "Alabama"    "Alaska"     "Arizona"    "Arkansas"   "California"
## [6] "Colorado"

To find out how long a string is, use

nchar(states)

##  [1]  7  6  7  8 10  8 11  8 20  7  7  6  5  8  7  4  6  8  9  5  8 13  8
## [24]  9 11  8  7  8  6 13 10 10  8 14 12  4  8  6 12 12 14 12  9  5  4  7
## [47]  8 10 13  9  7 11

What state has the longest name?

states[which.max(nchar(states))]

## [1] "District of Columbia"

Say we want to shorten the strings to just the first three letters:

substring(states, first=1, last=3)

##  [1] "Ala" "Ala" "Ari" "Ark" "Cal" "Col" "Con" "Del" "Dis" "Flo" "Geo"
## [12] "Haw" "Ida" "Ill" "Ind" "Iow" "Kan" "Ken" "Lou" "Mai" "Mar" "Mas"
## [23] "Mic" "Min" "Mis" "Mis" "Mon" "Neb" "Nev" "New" "New" "New" "New"
## [34] "Nor" "Nor" "Ohi" "Okl" "Ore" "Pen" "Rho" "Sou" "Sou" "Ten" "Tex"
## [45] "Uta" "Ver" "Vir" "Was" "Wes" "Wis" "Wyo" "Pue"

Now, though, several of the strings are the same (“Ala”). If that is a problem use

abbreviate(states)[1:6]

##    Alabama     Alaska    Arizona   Arkansas California   Colorado 
##     "Albm"     "Alsk"     "Arzn"     "Arkn"     "Clfr"     "Clrd"

This routine shortens each string to at least minlength characters (4 by default) while making sure that the abbreviations remain unique. We can make them a little longer if we want:

abbreviate(states, minlength = 6)[1:6]

##    Alabama     Alaska    Arizona   Arkansas California   Colorado 
##   "Alabam"   "Alaska"   "Arizon"   "Arknss"   "Calfrn"   "Colord"

Notice that this keeps the full strings as names.

Say we want the last 3 letters of the state names:

substring(states, 
          first = nchar(states)-2,
          last = nchar(states))[1:6]

## [1] "ama" "ska" "ona" "sas" "nia" "ado"

Let’s say we want all the states whose name starts with P. For this data it is enough to search for a capital P, because no state name has one anywhere else:

grep("P", states)

## [1] 39 52

This tells us those are the states at positions 39 and 52, so now

states[grep("P", states)]

## [1] "Pennsylvania" "Puerto Rico"

or directly:

grep("P", states, value = TRUE)

## [1] "Pennsylvania" "Puerto Rico"

Here is another way to do this:

states[startsWith(states, "P")]

## [1] "Pennsylvania" "Puerto Rico"

Very useful is its partner:

states[endsWith(states, "o")]

## [1] "Colorado"    "Idaho"       "New Mexico"  "Ohio"        "Puerto Rico"

This can be used for example to find the files of a certain type in a folder:

dir()[endsWith(dir(), ".Rmd")][1:5]

## [1] "_main.Rmd"       "assign.Rmd"      "basic.stats.Rmd" "bayes.Rmd"      
## [5] "blank.Rmd"

Notice that above we only got the states whose names have a capital P. What if we want all states with either p or P?

grep(pattern = "[pP]", x = states, value = TRUE)

## [1] "Mississippi"   "New Hampshire" "Pennsylvania"  "Puerto Rico"

The syntax [pP] matches either p or P. This is an example of a regular expression, which we will discuss shortly.
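As a small preview, here is a sketch of how the anchors ^ and $ restrict where a pattern may match. The vector st is a made-up stand-in for states, so that the example runs on its own:

```r
# a small stand-in for the states vector (hypothetical sample)
st <- c("Pennsylvania", "Mississippi", "New Hampshire", "Ohio", "Puerto Rico")

# ^ anchors the pattern at the start of the string
starts_P <- grep("^P", st, value = TRUE)
starts_P

# $ anchors the pattern at the end of the string
ends_o <- grep("o$", st, value = TRUE)
ends_o

# without an anchor a match anywhere in the string counts
has_p <- grep("p", st, value = TRUE)
has_p
```

With an anchor we no longer depend on the capital letter happening to appear only at the start of a name.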

Exercise

Find all the states whose names consist of two or more words (like Puerto Rico).

##  [1] "District of Columbia" "New Hampshire"        "New Jersey"          
##  [4] "New Mexico"           "New York"             "North Carolina"      
##  [7] "North Dakota"         "Rhode Island"         "South Carolina"      
## [10] "South Dakota"         "West Virginia"        "Puerto Rico"

We can also use the function tolower, which turns all the letters into lower case:

tolower(states)[1:4]

## [1] "alabama"  "alaska"   "arizona"  "arkansas"

grep("p", tolower(states), value = TRUE)

## [1] "mississippi"   "new hampshire" "pennsylvania"  "puerto rico"

but now all the letters are lower case.

Of course there is also a toupper function.

We already used grep. We also have the grepl function, which does the same but instead of the locations it returns, for each string, TRUE if the string contains the pattern and FALSE otherwise:

states[1:6]

## [1] "Alabama"    "Alaska"     "Arizona"    "Arkansas"   "California"
## [6] "Colorado"

grepl(pattern = "s", states)[1:6]

## [1] FALSE  TRUE FALSE  TRUE FALSE FALSE
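Because grepl returns a logical vector, it combines nicely with sum and with logical indexing. A minimal sketch, using a made-up four-element vector st in place of states:

```r
# hypothetical sample, standing in for the states vector
st <- c("Alabama", "Alaska", "Arizona", "Arkansas")

has_s <- grepl("s", st)
has_s

# TRUE counts as 1, so sum() gives the number of matching strings
sum(has_s)

# negating the logical vector selects the strings WITHOUT a match
st[!has_s]
```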

Suppose we want to replace all the A’s with *’s:

gsub("A", "*", states)[1:6]

## [1] "*labama"    "*laska"     "*rizona"    "*rkansas"   "California"
## [6] "Colorado"

There is also the sub function, which does the same as gsub but only to the first occurrence:

sub("a", "A", c("abba"))

## [1] "Abba"

gsub("a", "A", c("abba"))

## [1] "AbbA"

Exercise

cat(y, "\n")

s h h j h m i w d i l c z v j h m f n y j n i l j r l v n e j e n p j b h l y i z c o k s j s u b o m u a z m u f e q j o a s t l e l t h d a j k c k k a w e p v e u i z v n z p a b x t z j g h k i j z x h q p a a p h h m s n m u t f t u m a k m y v q p b g n k e e w k p j p r k x w k a x u m n r n

How many a’s are in this vector?

You can get the vector into R as follows: copy it and in R run

# Windows
x <- scan("clipboard", what="char")
# Mac
x <- scan(pipe("pbpaste"), what="char")

Let’s ask the following question: what is the distribution of the vowels in the names of the states? For instance, let’s start with the number of a’s in each name. There’s a very useful function for this purpose: gregexpr.

For each string in a character vector it returns the positions of all the matches of a pattern, and from these we can get the number of times the pattern is found. When there is no match, we get the value -1.

positions_a <- gregexpr(pattern = "a", 
                        text = states, 
                        ignore.case = TRUE)
positions_a[[1]]

## [1] 1 3 5 7
## attr(,"match.length")
## [1] 1 1 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

This tells us that in “Alabama” there are a’s in positions 1, 3, 5 and 7.

Now we need to go through all the state names and find out how many a’s there are in each. Here is a fast way to do this, using one of the apply functions:

f <- function(x) {
  ifelse(x[1] > 0, length(x), 0)
# if there is no a, x is -1, so we get 0  
}
num_a <- sapply(positions_a, f)
num_a

##  [1] 4 3 2 3 2 1 0 2 1 1 1 2 1 0 2 1 2 0 2 1 2 2 1 1 0 0 2 2 2 1 0 0 0 2 2
## [36] 0 2 0 2 1 2 2 0 1 1 0 1 1 1 0 0 0
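Another way to get these counts, sketched here on a made-up vector st, is to let regmatches extract the matches themselves and then count them with lengths. This avoids the special treatment of -1, because a string without a match yields an empty vector:

```r
# hypothetical sample, standing in for the states vector
st <- c("Alabama", "Alaska", "Connecticut", "Ohio")

# gregexpr finds all the matches; regmatches extracts them as a list
# of character vectors (character(0) when there is no match)
m <- regmatches(st, gregexpr("a", st, ignore.case = TRUE))

# lengths() counts the elements of each list entry,
# so a string without a match gets 0
num_a <- lengths(m)
num_a
```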

Now let’s do this for all the vowels:

vowels <- c("a", "e", "i", "o", "u")
num_vowels <- rep(0, 5)
names(num_vowels) <- vowels
for(i in seq_along(vowels)) {
  positions <- gregexpr(pattern = vowels[i], 
                        text = states, 
                        ignore.case = TRUE)
  num_vowels[i] <- sum(sapply(positions, f))
}
num_vowels

##  a  e  i  o  u 
## 62 29 48 40 10

paste, paste0 commands

One of the most useful commands in R is paste. It lets us put together various parts as a string:

paste(1:3)

## [1] "1" "2" "3"

paste("a", 1:3)

## [1] "a 1" "a 2" "a 3"

paste("a", 1:3, sep=":")

## [1] "a:1" "a:2" "a:3"

paste("a", 1:3, sep="")

## [1] "a1" "a2" "a3"

This last one (no space between) is needed often enough it has its own command:

paste0("a", 1:3)

## [1] "a1" "a2" "a3"

If we want to make a single string use

paste0("a", 1:3, collapse="")

## [1] "a1a2a3"

paste0("a", 1:3, collapse="-")

## [1] "a1-a2-a3"
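The arguments sep and collapse can also be used together. In the following sketch (the names are made up) sep first joins the vectors element by element, and collapse then joins the results into one string:

```r
first <- c("Ann", "Joe")
last <- c("Lee", "Kim")

# sep joins the arguments element by element ...
full <- paste(last, first, sep = ", ")
full

# ... and collapse then joins the resulting elements into a single string
one <- paste(last, first, sep = ", ", collapse = "; ")
one
```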

Exercise

Write a routine that generates a license plate in Puerto Rico at random. For example

license.plate()

## [1] "HDO223"

paste “combines” stuff into a string. Sometimes we want to do the opposite:

txt <- "This is a short sentence"
strsplit(txt, " ")

## [[1]]
## [1] "This"     "is"       "a"        "short"    "sentence"

Notice that the result is a list, so often we use

unlist(strsplit(txt, " "))

## [1] "This"     "is"       "a"        "short"    "sentence"
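The reason strsplit returns a list is that it is vectorized: it gives one list entry per input string. A short sketch with two made-up sentences, where it pays to keep the list instead of calling unlist:

```r
sentences <- c("This is a short sentence", "So is this")

# a list with one entry (a vector of words) per sentence
words <- strsplit(sentences, " ")

# number of words in each sentence
n_words <- lengths(words)
n_words

# first word of each sentence
first_words <- sapply(words, function(w) w[1])
first_words
```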

Exercise

Here is Abraham Lincoln’s famous Gettysburg Address:

cat(gettysburg)

 Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth.

How many times did Lincoln use the word “people”?

## [1] 3

Regular Expressions

A regular expression (a.k.a. regex) is a special text string for describing a certain amount of text. This “certain amount of text” receives the formal name of pattern. Hence we say that a regular expression is a pattern that describes a set of strings.

Tools for working with regular expressions can be found in virtually all scripting languages (e.g. Perl, Python, Java, Ruby, etc). R has some functions for working with regular expressions although it does not provide the wide range of capabilities that other scripting languages do. Nevertheless, they can take us very far with some workarounds (and a bit of patience).

To know more about regular expressions in general, you can find some useful information in the following resources:

Regex wikipedia http://en.wikipedia.org/wiki/Regular_expression
Regular-Expressions.info website (by Jan Goyvaerts) http://www.regular-expressions.info

The main purpose of working with regular expressions is to describe patterns that are used to match against text strings. Simply put, working with regular expressions is nothing more than pattern matching.

The result of a match is either successful or not. The simplest version of pattern matching is to search for one occurrence (or all occurrences) of some specific characters in a string. For example, we might want to search for the word “programming” in a large text document, or we might want to search for all occurrences of the string “apply” in a series of files containing R scripts.

An important use of regular expressions is the replacement of a pattern, say using gsub. Regular expressions allow us to use not just specific characters as patterns but much more general things.

Let’s take the vector

cat(txt)

## In 2017 there where 17 hurricanes

Let’s say I want to pick out those elements of the vector that are (or at least could be) numeric. Here is one way to do it:

as.numeric(txt)

## [1]   NA 2017   NA   NA   17   NA

txt[!is.na(as.numeric(txt))]

## [1] "2017" "17"

Or we can use a regular expression:

txt[grepl("\\d", txt)]

## [1] "2017" "17"

Here d stands for digit. The single backslash in front of it is standard regex syntax, but inside an R string the backslash itself has a special meaning, so we need a second one in front. This extra backslash is called an escape character: it tells R to treat the following backslash literally, and not as a special character.
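To see what the doubled backslash does, it helps to look at the string itself. The sketch below shows that "\\d" holds just two characters, and uses the same trick to match a literal dot (the vector x is made up for illustration):

```r
pattern <- "\\d"
nchar(pattern)        # the string holds two characters: \ and d
writeLines(pattern)   # prints the string as the regex engine sees it

# a dot matches any character in a regex, so a literal dot needs escaping
x <- c("3.14", "314")
has_dot <- grepl("\\.", x)
has_dot

# fixed = TRUE switches off the regex interpretation altogether
has_dot_fixed <- grepl(".", x, fixed = TRUE)
has_dot_fixed
```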

Say we want to replace the spaces in a sentence with underscores. The regex symbol for whitespace is \s:

gsub("\\s", "_", "Not a very interesting sentence")

## [1] "Not_a_very_interesting_sentence"

We already used [pP] before to match both lower-case and upper-case p’s. This is in fact a regular expression. It matches any one of the characters between the brackets, and a range such as 0-9 stands for all the characters from 0 to 9:

gsub("[0-9]", "%", txt)

## [1] "In"         "%%%%"       "there"      "where"      "%%"        
## [6] "hurricanes"

A caret at the start of the brackets means NOT, so this matches every character that is not a digit:

gsub("[^0-9]", "%", txt)

## [1] "%%"         "2017"       "%%%%%"      "%%%%%"      "17"        
## [6] "%%%%%%%%%%"
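Replacing with an empty string is a common way to strip unwanted characters. A minimal sketch on a made-up vector x, keeping only the letters or only the digits:

```r
x <- c("Room 101", "B-52", "R2D2")

# remove everything that is not a letter
letters_only <- gsub("[^A-Za-z]", "", x)
letters_only

# remove everything that is not a digit
digits_only <- gsub("[^0-9]", "", x)
digits_only
```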

Exercise

What command would replace all commas, periods and semicolons in the Gettysburg Address with an at sign (@)?

 Four score and seven years ago our fathers brought forth on this continent@ a new nation@ conceived in Liberty@ and dedicated to the proposition that all men are created equal@

Now we are engaged in a great civil war@ testing whether that nation@ or any nation so conceived and so dedicated@ can long endure@ We are met on a great battle-field of that war@ We have come to dedicate a portion of that field@ as a final resting place for those who here gave their lives that that nation might live@ It is altogether fitting and proper that we should do this@

But@ in a larger sense@ we can not dedicate -- we can not consecrate -- we can not hallow -- this ground@ The brave men@ living and dead@ who struggled here@ have consecrated it@ far above our poor power to add or detract@ The world will little note@ nor long remember what we say here@ but it can never forget what they did here@ It is for us the living@ rather@ to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced@ It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation@ under God@ shall have a new birth of freedom -- and that government of the people@ by the people@ for the people@ shall not perish from the earth@

POSIX

Closely related to the regex character classes are the so-called POSIX character classes. In R, POSIX character classes are represented with expressions inside double brackets [[ ]].

[[:lower:]] Lower-case letters
[[:upper:]] Upper-case letters
[[:alpha:]] Alphabetic characters ([[:lower:]] and [[:upper:]])
[[:digit:]] Digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
[[:alnum:]] Alphanumeric characters ([[:alpha:]] and [[:digit:]])
[[:blank:]] Blank characters: space and tab
[[:cntrl:]] Control characters
[[:punct:]] Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[[:space:]] Space characters: tab, newline, vertical tab, form feed, carriage return, and space
[[:xdigit:]] Hexadecimal digits: 0-9 A B C D E F a b c d e f
[[:print:]] Printable characters ([[:alnum:]], [[:punct:]] and space)
[[:graph:]] Graphical characters ([[:alnum:]] and [[:punct:]])

So we could also do this:

as.numeric(txt[grepl("[[:digit:]]", txt)])

## [1] 2017   17
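The POSIX classes are handy for cleaning up text. Here is a small sketch on a made-up sentence x. Note that several classes can be combined inside one pair of outer brackets, and that a caret negates the whole set:

```r
x <- "Hello, World! It's 2017."

# drop all punctuation characters
no_punct <- gsub("[[:punct:]]", "", x)
no_punct

# keep only letters and white space: remove everything that is
# neither in [:alpha:] nor in [:space:]
only_letters_spaces <- gsub("[^[:alpha:][:space:]]", "", x)
only_letters_spaces
```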

Some Examples

Palindrome

A palindrome is a word that is the same when read forwards or backwards. Some examples are noon, civic, radar, level, rotor, kayak, reviver, racecar, redder, madam, and refer. Let’s write a sequence of commands that takes a sentence and returns any palindromes. As an example, consider

txt <- "At Noon the Meteorologist is checking the Radar"

which should result in the vector (“noon”, “radar”).

First we need to split the sentence into words:

wrds <- unlist(strsplit(txt, " "))
wrds

## [1] "At"            "Noon"          "the"           "Meteorologist"
## [5] "is"            "checking"      "the"           "Radar"

Next we need to turn each word around. To do that we split each word into individual letters, reverse them and paste them back together:

n <- length(wrds)
rev.wrds <- rep("", n)
for(i in 1:n) 
  rev.wrds[i] <- paste(unlist(strsplit(wrds[i], ""))[nchar(wrds[i]):1], collapse = "")
rev.wrds

## [1] "tA"            "nooN"          "eht"           "tsigoloroeteM"
## [5] "si"            "gnikcehc"      "eht"           "radaR"

Finally, let’s check whether the words are the same, but taking into account that Noon is still a palindrome!

wrds[tolower(wrds) == tolower(rev.wrds)]

## [1] "Noon"  "Radar"
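The steps above can be collected into a routine. The following is one possible sketch, not the only way to do it: the function names is_palindrome and find_palindromes are our own choice, and the loop is replaced by sapply together with rev:

```r
# TRUE for each word that reads the same forwards and backwards,
# ignoring upper and lower case
is_palindrome <- function(w) {
  w <- tolower(w)
  # reverse each word: split into letters, reverse, paste back together
  reversed <- sapply(strsplit(w, ""),
                     function(ch) paste(rev(ch), collapse = ""))
  w == reversed
}

# split a sentence into words and keep the palindromes
find_palindromes <- function(sentence) {
  w <- unlist(strsplit(sentence, " "))
  w[is_palindrome(w)]
}

res <- find_palindromes("At Noon the Meteorologist is checking the Radar")
res
```

As written the routine treats a single letter such as "a" as a palindrome, and it does not remove punctuation attached to a word.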

Email Addresses

Consider the web site of the Math department at http://math.uprm.edu/academic/people.php

Let’s say we want to write a routine that picks out all the email addresses.

First we need to download the web page. This can be done with the scan command because, as is explained in its help file, the argument can be a connection, which includes URLs:

txt <- scan("http://math.uprm.edu/academic/people.php", 
      what="char", sep="\n")

\n is the newline character, so each line of the webpage will be an element of the vector.

Next we need to figure out what defines an email address. Obviously it needs to have the @ symbol, so let’s go through the text and pick out those lines that have the @ symbol:

sum(unlist(gregexpr("@", txt))>0)

## [1] 236

txt <- grep("@", txt, value = TRUE)
length(txt)

## [1] 5

So the @ symbol appears 236 times, but strangely there are only 5 lines with @ symbols! That is because

substring(txt[1], 1, 500)

## [1] "          <table id=\"people\" cellspacing='0' border='0' cellpadding='2px'><tr><th bgcolor='#024253' width='150px'>Full Name</th> <th bgcolor='#024253' width='150px'>Email</th> <th bgcolor='#024253' width='60px'>Office</th> <th bgcolor='#024253' width='60px'>Phone</th> <th bgcolor='#024253' width='130px'>Position</th></tr><tr bgcolor='#E3EAA6'><td><a href='../people/peoplefind.php?person=Acar, Robert'>Acar, Robert</a></td><td><a href=\"mailto:robert.acar@upr.edu\"> robert.acar@upr.edu</a></td><td>L"

So on the website the addresses are in an html table, which was read in as a single string. We can see that immediately after each email address is the text </a>, which is the html tag to end a link. Let’s split up the text according to the </a> tag:

txt <- unlist(strsplit(paste(txt, collapse=""), "</a>"))
txt[1:2]

## [1] "          <table id=\"people\" cellspacing='0' border='0' cellpadding='2px'><tr><th bgcolor='#024253' width='150px'>Full Name</th> <th bgcolor='#024253' width='150px'>Email</th> <th bgcolor='#024253' width='60px'>Office</th> <th bgcolor='#024253' width='60px'>Phone</th> <th bgcolor='#024253' width='130px'>Position</th></tr><tr bgcolor='#E3EAA6'><td><a href='../people/peoplefind.php?person=Acar, Robert'>Acar, Robert"
## [2] "</td><td><a href=\"mailto:robert.acar@upr.edu\"> robert.acar@upr.edu"

OK, next we have to eliminate all the elements that don’t have an @ in them:

txt <- grep("@", txt, value = TRUE)
length(txt)

## [1] 118

which is good because 2*118=236, and each address appeared twice in the table. So we got all of them.

Finally we need to extract the email address from each line. Checking them we see that just before each address is an empty space, so maybe this will work:

txt <- unlist(strsplit(txt, " "))
txt <- grep("@", txt, value = TRUE)
txt[1:4]

## [1] "href=\"mailto:robert.acar@upr.edu\">"
## [2] "robert.acar@upr.edu"                 
## [3] "href=\"mailto:edgar.acuna@upr.edu\">"
## [4] "edgar.acuna@upr.edu"

Almost; we just need to get rid of those elements starting with href:

txt <- txt[!grepl("href", txt)]
txt

##   [1] "robert.acar@upr.edu"           "edgar.acuna@upr.edu"          
##   [3] "dorothy.bollman@gmail.com"     "luis.caceres1@upr.edu"        
##   [5] "gabriele.castellini@upr.edu"   "paul.castillo@upr.edu"        
##   [7] "omar.colon4@upr.edu"           "silvestre.colon@upr.edu"      
##   [9] "angel.cruz14@upr.edu"          "stan.dziobiak@upr.edu"        
##  [11] "wieslaw.dziobiak@upr.edu"      "anacarmen.gonzalez@upr.edu"   
##  [13] "marggie.gonzalez@upr.edu"      "darrell.hajek@upr.edu"        
##  [15] "edgardo.lorenzo1@upr.edu"      "flor.narciso@upr.edu"         
##  [17] "victor.ocasio1@upr.edu"        "juan.ortiz35@upr.edu"         
##  [19] "reyes.ortiz@upr.edu"           "arturo.portnoy1@upr.edu"      
##  [21] "wilfredo.quinones2@upr.edu"    "karen.rios3@upr.edu"          
##  [23] "olgamary.rivera@upr.edu"       "yuri.rojas@upr.edu"           
##  [25] "wolfgang.rolke@upr.edu"        "juan.romero4@upr.edu"         
##  [27] "samuel.rosario1@upr.edu"       "krzysztof.rozga@upr.edu"      
##  [29] "hector.salas@upr.edu"          "damaris.santana2@upr.edu"     
##  [31] "freddie.santiago1@upr.edu"     "marko.schutz@upr.edu"         
##  [33] "alexander.shramchenko@upr.edu" "lev.steinberg@upr.edu"        
##  [35] "nilsa.toro@upr.edu"            "pedro.torres14@upr.edu"       
##  [37] "pedro.vasquez@upr.edu"         "alejandro.velez2@upr.edu"     
##  [39] "julio.vidaurraza@upr.edu"      "uroyoan.walker@upr.edu"       
##  [41] "keith.wayland@upr.edu"         "xuerong.yong@upr.edu"         
##  [43] "zoraida.arroyo@upr.edu"        "carmen.gonzalez23@upr.edu"    
##  [45] "tania.lopez@upr.edu"           "javier.mercado3@upr.edu"      
##  [47] "madeline.ramos3@upr.edu"       "robert.trabal@upr.edu"        
##  [49] "luisa.andino@upr.edu"          "julio.barety@upr.edu"         
##  [51] "eliseo.cruz1@upr.edu"          "gladys.dicristina@upr.edu"    
##  [53] "enriquejose.gallo@upr.edu"     "cesar.herrera@upr.edu"        
##  [55] "rafael.martinez13@upr.edu"     "julioc.quintana@upr.edu"      
##  [57] "tokuji.saito@upr.edu"          "arlin.alvarado@upr.edu"       
##  [59] "alcibiades.bustillo@upr.edu"   "andres.chamorro@upr.edu"      
##  [61] "edwin.florez@upr.edu"          "einstein.morales@upr.edu"     
##  [63] "velcy.palomino@upr.edu"        "walter.quispe@upr.edu"        
##  [65] "carlos.theran@upr.edu"         "roberto.trespalacio@upr.edu"  
##  [67] "andrea.angarita@upr.edu"       "hillary.bermudez@upr.edu"     
##  [69] "sergioi.betancourt@upr.edu"    "cesar.bolanos@upr.edu"        
##  [71] "hilda.calderon@upr.edu"        "joseemilio.calderon@upr.edu"  
##  [73] "victor.cardenas@upr.edu"       "alexis.carrillo@upr.edu"      
##  [75] "carlos.carvajal@upr.edu"       "jose.cordoba@upr.edu"         
##  [77] "henrry.cortez@upr.edu"         "saed.cruz@upr.edu"            
##  [79] "victor.diaz16@upr.edu"         "francisco.dejesus3@upr.edu"   
##  [81] "alix.enriquez@upr.edu"         "angel.figueroa1@upr.edu"      
##  [83] "jean.galan@upr.edu"            "cristian.gomez1@upr.edu"      
##  [85] "sergio.gomez@upr.edu"          "cristian.gutierrez1@upr.edu"  
##  [87] "hassam.hayek@upr.edu"          "javier.henriquez@upr.edu"     
##  [89] "sahily.hilerio@upr.edu"        "ricardo.lopez9@upr.edu"       
##  [91] "ruth.lopez5@upr.edu"           "lesbia.lopez@upr.edu"         
##  [93] "christian.lopez29@upr.edu"     "rodrigo.leon@upr.edu"         
##  [95] "alibeth.luna@upr.edu"          "eddie.mendez@upr.edu"         
##  [97] "robert.medina@upr.edu"         "luis.mestre2@upr.edu"         
##  [99] "kevin.molina1@upr.edu"         "bayron.morales@upr.edu"       
## [101] "ana.moreno@upr.edu"            "didier.murillo@upr.edu"       
## [103] "bernnie.murillo@upr.edu"       "daniel.ovalle@upr.edu"        
## [105] "felix.pabon@upr.edu"           "cristian.perdomo@upr.edu"     
## [107] "jessenia.quintero@upr.edu"     "eric.rivera8@upr.edu"         
## [109] "daniel.rocha@upr.edu"          "william.rueda@upr.edu"        
## [111] "jose.santos7@upr.edu"          "deiver.suarez@upr.edu"        
## [113] "maria.torres65@upr.edu"        "ana.trujillo2@upr.edu"        
## [115] "juan.valera@upr.edu"           "raul.valerio@upr.edu"         
## [117] "diana.vargas1@upr.edu"         "cesaraugusto.vega@upr.edu"
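A more direct route is to describe what an address looks like with a regular expression and pull out the matches with regmatches. The sketch below runs on a made-up piece of html with invented addresses, and the pattern is a deliberately simple assumption (letters, digits, dots, underscores and hyphens around an @). Real email addresses can be more complicated than this:

```r
# hypothetical html snippet with invented addresses
html <- paste0("<td><a href=\"mailto:jane.doe@example.edu\"> ",
               "jane.doe@example.edu</a></td>",
               "<td><a href=\"mailto:john.roe1@example.edu\"> ",
               "john.roe1@example.edu</a></td>")

# simplified pattern: one or more allowed characters, an @,
# then one or more allowed characters
pattern <- "[[:alnum:]._-]+@[[:alnum:].-]+"

# all matches in the string; each address appears twice
found <- regmatches(html, gregexpr(pattern, html))[[1]]
found

# keep each address once
emails <- unique(found)
emails
```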

If we wanted to send an email to all the people in the department we could now use the write command to copy them to the clipboard, switch to a mail program and copy them into the address box.

Programs like these are routinely used to go through millions of websites and search for email addresses, which are then sent spam emails. This is why I only write mine like this:

wolfgang[dot]rolke[at]upr[dot]edu

Binary Arithmetic

We have previously discussed binary arithmetic. There we used a simple vector of 0’s and 1’s. The main problem with that is that it is hard to vectorize the routines. Instead we will now use character sequences like “1001”.

We have previously written several functions for this. We will want to reuse them but also adapt them to this new format. To do so we need to turn a character string into a vector of numbers and vice versa. Also we want our routines to be vectorized:

Decimal to Binary

decimal.2.binary <- function(x) {
  n <- length(x)
  y <- rep("0", n)
  for(k in 1:n) {
    if(x[k]==0 | x[k]==1) {  # simple cases 
        y[k] <- x[k]
        next
    }    
    i <- floor(log(x[k], base=2)) # exponent of the largest power of 2 not exceeding x
    bin.x <- rep(1, i+1) # we will need i+1 0's and 1's, the leading one is 1
    x[k] <- x[k]-2^i 
    for(j in (i-1):0) {
       if(2^j>x[k]) 
         bin.x[j+1] <- 0
       else {
         bin.x[j+1] <- 1
         x[k] <- x[k]-2^j
       }
    }
    y[k] <- paste(bin.x[length(bin.x):1], collapse="")
  }
  y
}
decimal.2.binary(c(7, 8, 26))

## [1] "111"   "1000"  "11010"
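For comparison, here is a sketch of the repeated-division method, a different technique from the one above: the remainder of a division by 2 is the last binary digit, and the integer quotient holds the remaining digits. The name dec2bin is our own, and non-negative whole numbers are assumed:

```r
# repeated division by 2; assumes non-negative whole numbers
dec2bin <- function(x) {
  sapply(x, function(v) {
    if (v == 0) return("0")
    digits <- c()
    while (v > 0) {
      digits <- c(v %% 2, digits)  # the remainder is the next digit, from the right
      v <- v %/% 2                 # the integer quotient holds the rest
    }
    paste(digits, collapse = "")
  })
}

res <- dec2bin(c(0, 7, 8, 26))
res
```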

Binary to Decimal

binary.2.decimal <- function(x){
  n <- length(x)
  y <- rep(0, n)
  for(i in 1:n) {
    tmp <- as.numeric(strsplit(x[i], "")[[1]])
    y[i] <- sum(tmp*2^(length(tmp):1-1))
  }
  y
}  
binary.2.decimal(c("111", "1000", "11010"))

## [1]  7  8 26

binary.2.decimal(decimal.2.binary(126))

## [1] 126

decimal.2.binary(binary.2.decimal(c("100101")))

## [1] "100101"
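Base R also has a built-in routine for this direction: strtoi converts strings to integers in a given base. A minimal sketch follows. Note that strtoi returns integers, so it only works for numbers that fit into R's integer type, whereas binary.2.decimal above uses ordinary numeric values:

```r
b <- c("111", "1000", "11010")

# interpret each string as a number written in base 2
d <- strtoi(b, base = 2)
d
```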

is.binary:

is.binary <- function(x) {
  n <- length(x)
  y <- rep(TRUE, n)
  for(i in 1:n) {
      x.vec <- as.numeric(strsplit(x[i], "")[[1]])
      if(all(x.vec==0)) {
          y[i] <- TRUE 
          next
      }
      x.vec <- x.vec[x.vec!=0]
      x.vec <- x.vec[x.vec!=1]
      if(length(x.vec)==0) y[i] <- TRUE 
      else y[i] <- FALSE 
  }
  y
}  
is.binary(c("1001", "0", "11a1"))

## [1]  TRUE  TRUE FALSE
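With a regular expression the same check can be sketched in one line. The pattern ^[01]+$ reads: from the start (^) to the end ($) of the string there are one or more (+) characters, each of them 0 or 1. The name is_binary is our own, and as a design choice this sketch treats an empty string as not binary:

```r
# TRUE if a string consists of 0's and 1's only (and is not empty)
is_binary <- function(x) grepl("^[01]+$", x)

res <- is_binary(c("1001", "0", "11a1", ""))
res
```

Since grepl is vectorized, no loop is needed, and there are no warnings about letters that cannot be turned into numbers.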

addition

Here I am going to reuse the routine we had already written:

binary_addition <- function(x, y) {
# First make x and y of equal length and with one extra 
# slot in case it's needed for carry over
# Fill x and y with 0's as needed. 
  n <- length(x)
  m <- length(y)
  N <- max(n, m)+1
  x <- c(rep(0, N-n), x)
  y <- c(rep(0, N-m), y)  
  s <- rep(0, N) #  for result
  ca <- 0 # for carry over term
  for(i in N:1) {
      n <- x[i]+y[i]+ca
      if(n<=1) {# no carry over
        s[i] <- n
        ca <- 0
      }  
      else {# with carry over
        s[i] <- 0
        ca <- 1
      }
    }
  if(s[1]==0) s <- s[-1]# leading 0 removed if necessary
  s
}
binary_addition(c(1, 0), c(1, 1, 0))

## [1] 1 0 0 0

binary.addition<- function(x, y) {
  n <- length(x)
  m <- length(y)
  if(m!=n) stop("Vectors have to have the same length!")
  s <- rep("0", n)
  for(i in 1:n) {
    x.vec <- as.numeric(strsplit(x[i], "")[[1]])
    y.vec <- as.numeric(strsplit(y[i], "")[[1]])
    tmp <- binary_addition(x.vec, y.vec)
    s[i] <- paste(tmp, collapse="")
  }
  s
}  
binary.addition(c("0", "10", "1001"), c("0", "110", "1101"))

## [1] "0"     "1000"  "10110"

Let’s turn this into an infix addition operator:

'%+b%' <- function(x, y) binary.addition(x, y)
x <- c("10", "1001", "100101", "11101001") 
y <- c("101", "1001", "10101", "1001100") 
binary.2.decimal(x)

## [1]   2   9  37 233

binary.2.decimal(y)

## [1]  5  9 21 76

z <- x %+b% y 
x

## [1] "10"       "1001"     "100101"   "11101001"

binary.2.decimal(z)

## [1]   7  18  58 309

Let’s define a new class of objects “binary numbers”:

as.binary <- function(x) {
  class(x) <- "binary"
  return(x)
}

What methods might be useful here? Let’s write two.

This is how our number will appear when we use print(x):

# print is already a generic function in base R,
# so we only need to add a method for our class
print.binary <- function(x, ...) {
  for(i in seq_along(x)) {
    y <- as.numeric(strsplit(x[i], "")[[1]])
    y <- paste(y, collapse = ".")
    cat(y, "\n")
  }
  invisible(x)
}
x <- as.binary(c("10", "1001", "100101"))
print(x)

## 1.0 
## 1.0.0.1 
## 1.0.0.1.0.1

summary

What should we calculate as summary statistics? Let’s do three:

how many
most frequent (mode, NA if all only once)
percentage of 0’s

# summary is already a generic function in base R,
# so we only need to add a method for our class
summary.binary <- function(x, ...) {
  n <- length(x)
  if(length(unique(x))==length(x)) mode <- NA
  else {
    z <- table(x)
    z <- z[z==max(z)]
    mode <- names(z)
  }  
  y <- paste(x, collapse = "") # one long string
  y <- as.numeric(strsplit(y, "")[[1]]) # vector of 0's and 1's
  y <- round(sum(y==0)/length(y)*100, 1)
  cat("N =", n, "\n")
  cat("Mode =", mode, "\n")
  cat("% 0's =", y, "\n")
}
x <- sample(1:100, size=1000, replace=TRUE)
x <- as.binary(decimal.2.binary(x))
print(x[1:5])

## [1] "1010101" "1000000" "10110"   "1001010" "1001011"

summary(x)

## N = 1000 
## Mode = 1100100 
## % 0's = 45.4
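Notice that print(x[1:5]) above came out in the default format, not in the format of our print method. That suggests that subsetting with [ dropped the class attribute. If we want subsets to remain binary numbers, one option is to add a method for [ as well. The following is a sketch under that assumption. It repeats as.binary so that it runs on its own, and the method name follows the usual generic.class pattern:

```r
as.binary <- function(x) {
  class(x) <- "binary"
  x
}

# hypothetical subsetting method: take the subset of the underlying
# character vector, then put the class back on
"[.binary" <- function(x, i) {
  as.binary(unclass(x)[i])
}

x <- as.binary(c("10", "1001", "100101"))
y <- x[1:2]
class(y)
```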