Input/Output revisited

Libraries used in this section:

library(rio)
library(readr)
library(data.table)
library(pdftools)

There are a number of packages that help with data I/O. Some are specialized for certain data types, others are more general.

rio library

One of my favorites is rio. An introduction can be found at https://cran.r-project.org/web/packages/rio/vignettes/rio.html. The list of supported file formats is quite impressive! You use the import function to import data and the export function to export. The routine uses the extension to figure out the file format. So say you have a file called mytestdata.csv in a folder called c:/tmpdata, just run

import("c:/tmpdata/mytestdata.csv")

Example

B <- 2*1e6
x <- round(rnorm(B), 3)
y <- round(rnorm(B), 3)
z <- sample(letters[1:5], size=B, replace=TRUE)
xyz <- data.frame(x, y, z)
head(xyz, 3)
##        x      y z
## 1  0.599 -0.645 e
## 2 -0.575 -0.232 e
## 3 -0.526 -0.303 b
dir.create("c:/tmpdata")
export(xyz, "c:/tmpdata/mytestdata.csv")
rm(xyz)
head(import("c:/tmpdata/mytestdata.csv"), 3)
##        x      y z
## 1  0.599 -0.645 e
## 2 -0.575 -0.232 e
## 3 -0.526 -0.303 b

rio has the ability to read Minitab files. Unfortunately they have to be in the portable format, and Minitab stopped using that some versions ago. So the easiest thing to do is save files in Minitab as .csv.

readr library

This package is specific to rectangular data, such as data from an Excel spreadsheet. It’s main advantage is it’s speed when compared to the corresponding base R routine read.csv:

tm <- proc.time()
head(read.csv("c:/tmpdata/mytestdata.csv"), 2)
##        x      y z
## 1  0.599 -0.645 e
## 2 -0.575 -0.232 e
(proc.time()-tm)[3]
## elapsed 
##    1.42
tm <- proc.time()
head(import("c:/tmpdata/mytestdata.csv"), 2)
##        x      y z
## 1  0.599 -0.645 e
## 2 -0.575 -0.232 e
(proc.time()-tm)[3]
## elapsed 
##    0.08

Note that the data is in the form of a tibble. This is a special kind of data frame which we will talk about later.

data.table library

Similar to read.table but faster and more convenient. The command is called fread:

tm <- proc.time()
head(fread("c:/tmpdata/mytestdata.csv"), 2)
##         x      y z
## 1:  0.599 -0.645 e
## 2: -0.575 -0.232 e
(proc.time()-tm)[3]
## elapsed 
##    0.07

This command is what I would recommend if you deal with Big Data. These days we classify data as follows:

  • big data (a few hundred thousand rows, about 20 MB)
  • Big Data (5 million rows, about 1GB)
  • Bigger Data (over 100 million rows, over 10GB)

In the case of Bigger Data you can no longer have all of it in memory, but it becomes necessary to use a hard drive as memory. A useful package for that is bigmemory.

pdftools library

The pdf format is the most common document format on the internet, so quite often we are faced with the following problem: we want to extract some data from a pdf. The problem is that pdf files are not plain text and so can not be read directly by R. Here we can use the package pdftools.

Example

Consider the report on the World Mortality Rate 2017. It has a large table with mortality information. We begin by downloading it:

download.file("http://academic.uprm.edu/wrolke/esma6835/World-Mortality-2017-Data-Booklet.pdf", "world-mortality.pdf", mode="wb")

and then turn it into a text file:

txt <- pdf_text("world-mortality.pdf")
nchar(txt)
##  [1]   73 1582 5395 2295 2332 3494 5456 4959 4766 3546   11 5683 5588 6350
## [15] 6156 5955 5928 6258 6232 2625 5764 3288    0    0

Notice that the document has 24 pages, and each is read in as one character string.

The data table starts on page 10, which is txt[12], and ends on page 19, which is txt[20]. Let’s begin by making a long character string with each piece of text/numbers separate. We want to split the string at white spaces, which we can do with the reg expression

strsplit(txt, "\\s+")[[1]]

However, notice that some countries have a superscript (which is also separated from the name by a white space) and that the large numbers have one white space in between also. So what we need to do is split if there are two or more white spaces:

txt <- paste(txt[12:20], collapse = " ")
txt <- strsplit(txt, "\\s{2,}")[[1]]

There is a problem, though: some large numbers are written with a single space between (for example Africa 10 596), so now there is a character vector “10 596” which we want to turn into a number:

as.numeric("10 596")
## [1] NA

so we need to remove those white spaces. It would be easy to remove all of them, but then we would turn “Puerto Rico” into “PuertoRico”, and we don’t want that!

While we are at it, let’s also remove the superscripts on some of the country names.

for(i in seq_along(txt)) {
  tmp <- strsplit(txt[i], "\\s+")[[1]]
  if(length(tmp)==1) next # single item, nothing to do
  if(any(is.na(as.numeric(tmp)))) { # some parts are character
     if(!all(is.na(as.numeric(tmp)))) # not all parts are character
       tmp <- tmp[is.na(as.numeric(tmp))] 
             # drop numbers (superscripts)
      txt[i] <- paste(tmp, collapse = " ")   
          # all text, leave space between
  }  
  else
      txt[i] <- paste(tmp, collapse = "")   
         # all numbers, no spaces
}

For some reasons some are not working (maybe there was more than one white space?), so we fix them directly:

k <- seq_along(txt)[txt=="Western Africa"]
txt[k+c(0,1)]
## [1] "Western Africa" "3"
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="China"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="South-Central Asia"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Malaysia"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Azerbaijan"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Cyprus"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Republic of Moldova"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Northern Europe"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Norway"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Southern Europe"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Serbia"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Spain"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="TFYR Macedonia"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Caribbean"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Guadeloupe"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="NORTHERN AMERICA"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Micronesia"]
txt <- txt[-(k+1)]

How can we pick out the parts of text that we want?

Notice that the information starts with Burundi, so we remove everything before that:

k <- seq_along(txt)[txt=="Burundi"]
txt <- txt[k:length(txt)]

The last country is Tonga, so we remove everything after it’s row:

k <- seq_along(txt)[txt=="Tonga"]
txt <- txt[1:(k+13)]

Unfortunately the table is not contiguous, eventually there is a page break and then the title text repeats. However, this part always starts with the words “World Mortality” and ends with “(13)”, so it is easy to remove all of them:

k <- seq_along(txt)[txt=="World Mortality"]
j <- seq_along(txt)[txt=="(13)"]
k
## [1]  155  569 1049 1517 1983 2451 2917 3385
j
## [1]  203  614 1095 1562 2029 2496 2963 3430

so we see that the first “World Mortality” is at position 155, and the first “(13)” is at 203, so we can remove everything in between.

But wait: the top of page 13 reads “World Mortality” on the left and “13”" on the right, but on page 14 it is “14”" on the left and “World Mortality” on the right! These alternate, so we need

k[c(2, 4, 6, 8)] <- k[c(2, 4, 6, 8)]-1

Let’s get rid of all of this:

for(i in 1:8) txt[k[i]:j[i]] <- NA
txt <- txt[!is.na(txt)]

Now we can turn this into a data frame:

data.tbl <- as.data.frame(matrix(txt, byrow = TRUE, ncol=14))
for(i in 2:14) 
  data.tbl[, i] <- as.numeric(data.tbl[, i])
colnames(data.tbl) <- c("Country",
  "Deaths", "Rate", "LifeExpectancy.Both", 
  "LifeExpectancy.Males", "LifeExpectancy.Females", 
  "InfantMortality", "UnderFive", "Prob15.60", "Prob0.70", 
  "PercUnder5", "PercUnder5.25", "Perc25.65", "PercOver65")
row.names(data.tbl) <- NULL

Let’s check a few to make sure we got it right:

k <- seq_along(txt)[txt=="Germany"]
txt[k:(k+13)]
##  [1] "Germany" "910"     "11.1"    "80.8"    "78.4"    "83.2"    "3"      
##  [8] "4"       "71"      "170"     "0"       "0"       "14"      "85"
k <- seq_along(txt)[txt=="Puerto Rico"]
txt[k:(k+13)]
##  [1] "Puerto Rico" "29"          "7.9"         "79.7"        "75.8"       
##  [6] "83.6"        "6"           "7"           "96"          "199"        
## [11] "1"           "1"           "21"          "77"

One final problem: every now and then the table has the means for various regions (Middle Africa, etc). There is no other way but to get rid of them one by one. Just going through the list I can find their row numbers: `

not.country <- c(21, 31, 39, 45, 62, 63, 72, 73, 79, 89,
                 101, 120, 121, 132, 144, 157, 165, 166,
                 184, 193, 207, 210)
data.tbl <- data.tbl[-not.country, ]
dump("data.tbl", "world.mortality.2017.R")

Let’s look at the life expectancy, sorted from highest to lowest:

dta <- data.tbl[order(data.tbl[, "LifeExpectancy.Both"], 
          decreasing = TRUE), c(1, 4)]
colnames(dta)[2] <- "Life Expectancy"
kable.nice(dta, do.row.names = FALSE)
Country Life Expectancy
China, Hong Kong SAR 83.8
China, Macao SAR 83.7
Japan 83.6
Switzerland 83.1
Spain 83.0
Singapore 82.8
Italy 82.8
Australia 82.7
Iceland 82.6
Australia/New Zealand 82.6
France 82.4
Israel 82.3
Sweden 82.3
Canada 82.2
Norway 82.0
Republic of Korea 81.9
Martinique 81.8
Netherlands 81.7
New Zealand 81.7
Luxembourg 81.6
United Kingdom 81.4
Austria 81.4
Ireland 81.3
Finland 81.1
Guadeloupe 81.1
Channel Islands 81.0
Greece 81.0
Portugal 81.0
Belgium 81.0
Slovenia 80.8
Germany 80.8
Malta 80.7
Denmark 80.6
Cyprus 80.3
Réunion 80.1
Mayotte 79.9
China, Taiwan Province of China 79.7
Puerto Rico 79.7
French Guiana 79.7
Cuba 79.6
United States Virgin Islands 79.6
Costa Rica 79.6
Lebanon 79.4
Guam 79.4
Chile 79.3
United States of America 79.2
Czechia 78.6
Curaçao 78.3
Albania 78.2
Qatar 78.0
Panama 77.8
Croatia 77.5
Poland 77.4
Estonia 77.4
Uruguay 77.3
United Arab Emirates 77.1
Maldives 77.0
Brunei Darussalam 77.0
Montenegro 76.9
Mexico 76.9
Bahrain 76.8
Oman 76.8
New Caledonia 76.8
Slovakia 76.7
Bosnia and Herzegovina 76.7
French Polynesia 76.6
Argentina 76.4
Antigua and Barbuda 76.2
China 76.1
Viet Nam 76.1
Ecuador 76.1
Algeria 75.9
Hungary 75.8
Barbados 75.8
Jamaica 75.8
Iran (Islamic Republic of) 75.7
Aruba 75.7
Morocco 75.6
Tunisia 75.5
Turkey 75.5
TFYR Macedonia 75.5
Bahamas 75.5
Romania 75.3
Saint Lucia 75.3
Brazil 75.3
Sri Lanka 75.1
Malaysia 75.1
Thailand 75.1
Serbia 75.1
Nicaragua 75.1
Polynesia 75.1
Samoa 74.8
Peru 74.7
Mauritius 74.6
Kuwait 74.6
Bulgaria 74.6
Lithuania 74.5
Armenia 74.4
Saudi Arabia 74.4
Latvia 74.4
Venezuela (Bolivarian Republic of) 74.4
Jordan 74.2
Colombia 74.2
Dominican Republic 73.7
Grenada 73.5
Honduras 73.4
Seychelles 73.3
State of Palestine 73.3
El Salvador 73.3
Micronesia 73.3
Guatemala 73.2
Georgia 73.1
Saint Vincent and the Grenadines 73.1
Paraguay 73.0
Tonga 72.9
Belarus 72.7
Cabo Verde 72.6
Bangladesh 72.2
Azerbaijan 71.9
Vanuatu 71.9
Libya 71.8
Ukraine 71.8
Dem. People’s Republic of Korea 71.5
Republic of Moldova 71.5
Egypt 71.3
Suriname 71.3
Uzbekistan 71.2
Tajikistan 70.9
Russian Federation 70.9
Kyrgyzstan 70.8
Trinidad and Tobago 70.6
Solomon Islands 70.4
Belize 70.2
Fiji 70.1
Nepal 69.9
Syrian Arab Republic 69.9
Bhutan 69.8
Kazakhstan 69.7
Iraq 69.7
Western Sahara 69.2
Mongolia 69.1
Micronesia (Fed. States of) 69.1
Indonesia 69.0
Philippines 69.0
Bolivia (Plurinational State of) 68.8
Cambodia 68.6
Timor-Leste 68.6
India 68.3
Turkmenistan 67.7
Senegal 66.8
Kenya 66.7
Rwanda 66.7
Sao Tome and Principe 66.5
Myanmar 66.5
Guyana 66.5
Melanesia 66.5
Pakistan 66.3
Lao People’s Democratic Republic 66.3
Kiribati 66.1
Botswana 65.8
Gabon 65.7
Madagascar 65.5
Papua New Guinea 65.4
Ethiopia 65.0
United Republic of Tanzania 65.0
Yemen 64.7
Eritrea 64.6
Sudan 64.3
Congo 64.1
Namibia 63.8
Comoros 63.5
Afghanistan 63.3
Mauritania 63.1
Haiti 63.1
Malawi 62.7
Ghana 62.4
Djibouti 62.3
South Africa 62.0
Liberia 62.0
Zambia 61.4
Angola 61.2
Gambia 61.0
Benin 60.6
Zimbabwe 60.4
Burkina Faso 59.9
Togo 59.9
Niger 59.7
Uganda 59.6
Guinea 59.4
Democratic Republic of the Congo 59.2
Mozambique 57.7
Cameroon 57.6
Mali 57.5
Equatorial Guinea 57.4
Burundi 57.1
Swaziland 57.1
Guinea-Bissau 57.0
South Sudan 56.3
Somalia 55.9
Lesotho 53.7
Côte d’Ivoire 53.1
Nigeria 53.0
Chad 52.6
Central African Republic 51.4
Sierra Leone 51.4