io1.utf8.md

rio library

One of my favorites is rio. An introduction can be found at https://cran.r-project.org/web/packages/rio/vignettes/rio.html. The list of supported file formats is quite impressive! You use the import function to import data and the export function to export. The routine uses the extension to figure out the file format. So say you have a file called mytestdata.csv in a folder called c:/tmpdata, just run

import("c:/tmpdata/mytestdata.csv")

Example

B <- 2*1e6
x <- round(rnorm(B), 3)
y <- round(rnorm(B), 3)
z <- sample(letters[1:5], size=B, replace=TRUE)
xyz <- data.frame(x, y, z)
head(xyz, 3)

##        x      y z
## 1  0.599 -0.645 e
## 2 -0.575 -0.232 e
## 3 -0.526 -0.303 b

dir.create("c:/tmpdata")
export(xyz, "c:/tmpdata/mytestdata.csv")
rm(xyz)

head(import("c:/tmpdata/mytestdata.csv"), 3)

##        x      y z
## 1  0.599 -0.645 e
## 2 -0.575 -0.232 e
## 3 -0.526 -0.303 b

rio has the ability to read Minitab files. Unfortunately they have to be in the portable format, and Minitab stopped using that some versions ago. So the easiest thing to do is save files in Minitab as .csv.

readr library

This package is specific to rectangular data, such as data from an Excel spreadsheet. It’s main advantage is it’s speed when compared to the corresponding base R routine read.csv:

tm <- proc.time()
head(read.csv("c:/tmpdata/mytestdata.csv"), 2)

##        x      y z
## 1  0.599 -0.645 e
## 2 -0.575 -0.232 e

(proc.time()-tm)[3]

## elapsed 
##    1.42

tm <- proc.time()
head(import("c:/tmpdata/mytestdata.csv"), 2)

##        x      y z
## 1  0.599 -0.645 e
## 2 -0.575 -0.232 e

(proc.time()-tm)[3]

## elapsed 
##    0.08

Note that the data is in the form of a tibble. This is a special kind of data frame which we will talk about later.

data.table library

Similar to read.table but faster and more convenient. The command is called fread:

tm <- proc.time()
head(fread("c:/tmpdata/mytestdata.csv"), 2)

##         x      y z
## 1:  0.599 -0.645 e
## 2: -0.575 -0.232 e

(proc.time()-tm)[3]

## elapsed 
##    0.07

This command is what I would recommend if you deal with Big Data. These days we classify data as follows:

big data (a few hundred thousand rows, about 20 MB)
Big Data (5 million rows, about 1GB)
Bigger Data (over 100 million rows, over 10GB)

In the case of Bigger Data you can no longer have all of it in memory, but it becomes necessary to use a hard drive as memory. A useful package for that is bigmemory.

pdftools library

The pdf format is the most common document format on the internet, so quite often we are faced with the following problem: we want to extract some data from a pdf. The problem is that pdf files are not plain text and so can not be read directly by R. Here we can use the package pdftools.

Example

Consider the report on the World Mortality Rate 2017. It has a large table with mortality information. We begin by downloading it:

download.file("http://academic.uprm.edu/wrolke/esma6835/World-Mortality-2017-Data-Booklet.pdf", "world-mortality.pdf", mode="wb")

and then turn it into a text file:

txt <- pdf_text("world-mortality.pdf")
nchar(txt)

##  [1]   73 1582 5395 2295 2332 3494 5456 4959 4766 3546   11 5683 5588 6350
## [15] 6156 5955 5928 6258 6232 2625 5764 3288    0    0

Notice that the document has 24 pages, and each is read in as one character string.

The data table starts on page 10, which is txt[12], and ends on page 19, which is txt[20]. Let’s begin by making a long character string with each piece of text/numbers separate. We want to split the string at white spaces, which we can do with the reg expression

strsplit(txt, "\\s+")[[1]]

However, notice that some countries have a superscript (which is also separated from the name by a white space) and that the large numbers have one white space in between also. So what we need to do is split if there are two or more white spaces:

txt <- paste(txt[12:20], collapse = " ")
txt <- strsplit(txt, "\\s{2,}")[[1]]

There is a problem, though: some large numbers are written with a single space between (for example Africa 10 596), so now there is a character vector “10 596” which we want to turn into a number:

as.numeric("10 596")

## [1] NA

so we need to remove those white spaces. It would be easy to remove all of them, but then we would turn “Puerto Rico” into “PuertoRico”, and we don’t want that!

While we are at it, let’s also remove the superscripts on some of the country names.

for(i in seq_along(txt)) {
  tmp <- strsplit(txt[i], "\\s+")[[1]]
  if(length(tmp)==1) next # single item, nothing to do
  if(any(is.na(as.numeric(tmp)))) { # some parts are character
     if(!all(is.na(as.numeric(tmp)))) # not all parts are character
       tmp <- tmp[is.na(as.numeric(tmp))] 
             # drop numbers (superscripts)
      txt[i] <- paste(tmp, collapse = " ")   
          # all text, leave space between
  }  
  else
      txt[i] <- paste(tmp, collapse = "")   
         # all numbers, no spaces
}

For some reasons some are not working (maybe there was more than one white space?), so we fix them directly:

k <- seq_along(txt)[txt=="Western Africa"]
txt[k+c(0,1)]

## [1] "Western Africa" "3"

txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="China"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="South-Central Asia"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Malaysia"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Azerbaijan"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Cyprus"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Republic of Moldova"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Northern Europe"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Norway"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Southern Europe"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Serbia"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Spain"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="TFYR Macedonia"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Caribbean"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Guadeloupe"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="NORTHERN AMERICA"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Micronesia"]
txt <- txt[-(k+1)]

How can we pick out the parts of text that we want?

Notice that the information starts with Burundi, so we remove everything before that:

k <- seq_along(txt)[txt=="Burundi"]
txt <- txt[k:length(txt)]

The last country is Tonga, so we remove everything after it’s row:

k <- seq_along(txt)[txt=="Tonga"]
txt <- txt[1:(k+13)]

Unfortunately the table is not contiguous, eventually there is a page break and then the title text repeats. However, this part always starts with the words “World Mortality” and ends with “(13)”, so it is easy to remove all of them:

k <- seq_along(txt)[txt=="World Mortality"]
j <- seq_along(txt)[txt=="(13)"]
k

## [1]  155  569 1049 1517 1983 2451 2917 3385

## [1]  203  614 1095 1562 2029 2496 2963 3430

so we see that the first “World Mortality” is at position 155, and the first “(13)” is at 203, so we can remove everything in between.

But wait: the top of page 13 reads “World Mortality” on the left and “13”" on the right, but on page 14 it is “14”" on the left and “World Mortality” on the right! These alternate, so we need

k[c(2, 4, 6, 8)] <- k[c(2, 4, 6, 8)]-1

Let’s get rid of all of this:

for(i in 1:8) txt[k[i]:j[i]] <- NA
txt <- txt[!is.na(txt)]

Now we can turn this into a data frame:

data.tbl <- as.data.frame(matrix(txt, byrow = TRUE, ncol=14))
for(i in 2:14) 
  data.tbl[, i] <- as.numeric(data.tbl[, i])
colnames(data.tbl) <- c("Country",
  "Deaths", "Rate", "LifeExpectancy.Both", 
  "LifeExpectancy.Males", "LifeExpectancy.Females", 
  "InfantMortality", "UnderFive", "Prob15.60", "Prob0.70", 
  "PercUnder5", "PercUnder5.25", "Perc25.65", "PercOver65")
row.names(data.tbl) <- NULL

Let’s check a few to make sure we got it right:

k <- seq_along(txt)[txt=="Germany"]
txt[k:(k+13)]

##  [1] "Germany" "910"     "11.1"    "80.8"    "78.4"    "83.2"    "3"      
##  [8] "4"       "71"      "170"     "0"       "0"       "14"      "85"

k <- seq_along(txt)[txt=="Puerto Rico"]
txt[k:(k+13)]

##  [1] "Puerto Rico" "29"          "7.9"         "79.7"        "75.8"       
##  [6] "83.6"        "6"           "7"           "96"          "199"        
## [11] "1"           "1"           "21"          "77"

One final problem: every now and then the table has the means for various regions (Middle Africa, etc). There is no other way but to get rid of them one by one. Just going through the list I can find their row numbers: `

not.country <- c(21, 31, 39, 45, 62, 63, 72, 73, 79, 89,
                 101, 120, 121, 132, 144, 157, 165, 166,
                 184, 193, 207, 210)
data.tbl <- data.tbl[-not.country, ]

dump("data.tbl", "world.mortality.2017.R")

Let’s look at the life expectancy, sorted from highest to lowest:

dta <- data.tbl[order(data.tbl[, "LifeExpectancy.Both"], 
          decreasing = TRUE), c(1, 4)]
colnames(dta)[2] <- "Life Expectancy"
kable.nice(dta, do.row.names = FALSE)

Country	Life Expectancy
China, Hong Kong SAR	83.8
China, Macao SAR	83.7
Japan	83.6
Switzerland	83.1
Spain	83.0
Singapore	82.8
Italy	82.8
Australia	82.7
Iceland	82.6
Australia/New Zealand	82.6
France	82.4
Israel	82.3
Sweden	82.3
Canada	82.2
Norway	82.0
Republic of Korea	81.9
Martinique	81.8
Netherlands	81.7
New Zealand	81.7
Luxembourg	81.6
United Kingdom	81.4
Austria	81.4
Ireland	81.3
Finland	81.1
Guadeloupe	81.1
Channel Islands	81.0
Greece	81.0
Portugal	81.0
Belgium	81.0
Slovenia	80.8
Germany	80.8
Malta	80.7
Denmark	80.6
Cyprus	80.3
Réunion	80.1
Mayotte	79.9
China, Taiwan Province of China	79.7
Puerto Rico	79.7
French Guiana	79.7
Cuba	79.6
United States Virgin Islands	79.6
Costa Rica	79.6
Lebanon	79.4
Guam	79.4
Chile	79.3
United States of America	79.2
Czechia	78.6
Curaçao	78.3
Albania	78.2
Qatar	78.0
Panama	77.8
Croatia	77.5
Poland	77.4
Estonia	77.4
Uruguay	77.3
United Arab Emirates	77.1
Maldives	77.0
Brunei Darussalam	77.0
Montenegro	76.9
Mexico	76.9
Bahrain	76.8
Oman	76.8
New Caledonia	76.8
Slovakia	76.7
Bosnia and Herzegovina	76.7
French Polynesia	76.6
Argentina	76.4
Antigua and Barbuda	76.2
China	76.1
Viet Nam	76.1
Ecuador	76.1
Algeria	75.9
Hungary	75.8
Barbados	75.8
Jamaica	75.8
Iran (Islamic Republic of)	75.7
Aruba	75.7
Morocco	75.6
Tunisia	75.5
Turkey	75.5
TFYR Macedonia	75.5
Bahamas	75.5
Romania	75.3
Saint Lucia	75.3
Brazil	75.3
Sri Lanka	75.1
Malaysia	75.1
Thailand	75.1
Serbia	75.1
Nicaragua	75.1
Polynesia	75.1
Samoa	74.8
Peru	74.7
Mauritius	74.6
Kuwait	74.6
Bulgaria	74.6
Lithuania	74.5
Armenia	74.4
Saudi Arabia	74.4
Latvia	74.4
Venezuela (Bolivarian Republic of)	74.4
Jordan	74.2
Colombia	74.2
Dominican Republic	73.7
Grenada	73.5
Honduras	73.4
Seychelles	73.3
State of Palestine	73.3
El Salvador	73.3
Micronesia	73.3
Guatemala	73.2
Georgia	73.1
Saint Vincent and the Grenadines	73.1
Paraguay	73.0
Tonga	72.9
Belarus	72.7
Cabo Verde	72.6
Bangladesh	72.2
Azerbaijan	71.9
Vanuatu	71.9
Libya	71.8
Ukraine	71.8
Dem. People’s Republic of Korea	71.5
Republic of Moldova	71.5
Egypt	71.3
Suriname	71.3
Uzbekistan	71.2
Tajikistan	70.9
Russian Federation	70.9
Kyrgyzstan	70.8
Trinidad and Tobago	70.6
Solomon Islands	70.4
Belize	70.2
Fiji	70.1
Nepal	69.9
Syrian Arab Republic	69.9
Bhutan	69.8
Kazakhstan	69.7
Iraq	69.7
Western Sahara	69.2
Mongolia	69.1
Micronesia (Fed. States of)	69.1
Indonesia	69.0
Philippines	69.0
Bolivia (Plurinational State of)	68.8
Cambodia	68.6
Timor-Leste	68.6
India	68.3
Turkmenistan	67.7
Senegal	66.8
Kenya	66.7
Rwanda	66.7
Sao Tome and Principe	66.5
Myanmar	66.5
Guyana	66.5
Melanesia	66.5
Pakistan	66.3
Lao People’s Democratic Republic	66.3
Kiribati	66.1
Botswana	65.8
Gabon	65.7
Madagascar	65.5
Papua New Guinea	65.4
Ethiopia	65.0
United Republic of Tanzania	65.0
Yemen	64.7
Eritrea	64.6
Sudan	64.3
Congo	64.1
Namibia	63.8
Comoros	63.5
Afghanistan	63.3
Mauritania	63.1
Haiti	63.1
Malawi	62.7
Ghana	62.4
Djibouti	62.3
South Africa	62.0
Liberia	62.0
Zambia	61.4
Angola	61.2
Gambia	61.0
Benin	60.6
Zimbabwe	60.4
Burkina Faso	59.9
Togo	59.9
Niger	59.7
Uganda	59.6
Guinea	59.4
Democratic Republic of the Congo	59.2
Mozambique	57.7
Cameroon	57.6
Mali	57.5
Equatorial Guinea	57.4
Burundi	57.1
Swaziland	57.1
Guinea-Bissau	57.0
South Sudan	56.3
Somalia	55.9
Lesotho	53.7
Côte d’Ivoire	53.1
Nigeria	53.0
Chad	52.6
Central African Republic	51.4
Sierra Leone	51.4

Input/Output revisited

rio library

Example

readr library

data.table library

pdftools library

Example