Libraries used in this section:
library(rio)
library(readr)
library(data.table)
library(pdftools)
There are a number of packages that help with data I/O. Some are specialized for certain data types, others are more general.
One of my favorites is rio. An introduction can be found at https://cran.r-project.org/web/packages/rio/vignettes/rio.html. The list of supported file formats is quite impressive! You use the import function to import data and the export function to export. The routine uses the extension to figure out the file format. So say you have a file called mytestdata.csv in a folder called c:/tmpdata, just run
import("c:/tmpdata/mytestdata.csv")
B <- 2*1e6
x <- round(rnorm(B), 3)
y <- round(rnorm(B), 3)
z <- sample(letters[1:5], size=B, replace=TRUE)
xyz <- data.frame(x, y, z)
head(xyz, 3)
## x y z
## 1 0.599 -0.645 e
## 2 -0.575 -0.232 e
## 3 -0.526 -0.303 b
dir.create("c:/tmpdata")
export(xyz, "c:/tmpdata/mytestdata.csv")
rm(xyz)
head(import("c:/tmpdata/mytestdata.csv"), 3)
## x y z
## 1 0.599 -0.645 e
## 2 -0.575 -0.232 e
## 3 -0.526 -0.303 b
rio has the ability to read Minitab files. Unfortunately they have to be in the portable format, and Minitab stopped using that some versions ago. So the easiest thing to do is save files in Minitab as .csv.
This package is specific to rectangular data, such as data from an Excel spreadsheet. It’s main advantage is it’s speed when compared to the corresponding base R routine read.csv:
tm <- proc.time()
head(read.csv("c:/tmpdata/mytestdata.csv"), 2)
## x y z
## 1 0.599 -0.645 e
## 2 -0.575 -0.232 e
(proc.time()-tm)[3]
## elapsed
## 1.42
tm <- proc.time()
head(import("c:/tmpdata/mytestdata.csv"), 2)
## x y z
## 1 0.599 -0.645 e
## 2 -0.575 -0.232 e
(proc.time()-tm)[3]
## elapsed
## 0.08
Note that the data is in the form of a tibble. This is a special kind of data frame which we will talk about later.
Similar to read.table but faster and more convenient. The command is called fread:
tm <- proc.time()
head(fread("c:/tmpdata/mytestdata.csv"), 2)
## x y z
## 1: 0.599 -0.645 e
## 2: -0.575 -0.232 e
(proc.time()-tm)[3]
## elapsed
## 0.07
This command is what I would recommend if you deal with Big Data. These days we classify data as follows:
In the case of Bigger Data you can no longer have all of it in memory, but it becomes necessary to use a hard drive as memory. A useful package for that is bigmemory.
The pdf format is the most common document format on the internet, so quite often we are faced with the following problem: we want to extract some data from a pdf. The problem is that pdf files are not plain text and so can not be read directly by R. Here we can use the package pdftools.
Consider the report on the World Mortality Rate 2017. It has a large table with mortality information. We begin by downloading it:
download.file("http://academic.uprm.edu/wrolke/esma6835/World-Mortality-2017-Data-Booklet.pdf", "world-mortality.pdf", mode="wb")
and then turn it into a text file:
txt <- pdf_text("world-mortality.pdf")
nchar(txt)
## [1] 73 1582 5395 2295 2332 3494 5456 4959 4766 3546 11 5683 5588 6350
## [15] 6156 5955 5928 6258 6232 2625 5764 3288 0 0
Notice that the document has 24 pages, and each is read in as one character string.
The data table starts on page 10, which is txt[12], and ends on page 19, which is txt[20]. Let’s begin by making a long character string with each piece of text/numbers separate. We want to split the string at white spaces, which we can do with the reg expression
strsplit(txt, "\\s+")[[1]]
However, notice that some countries have a superscript (which is also separated from the name by a white space) and that the large numbers have one white space in between also. So what we need to do is split if there are two or more white spaces:
txt <- paste(txt[12:20], collapse = " ")
txt <- strsplit(txt, "\\s{2,}")[[1]]
There is a problem, though: some large numbers are written with a single space between (for example Africa 10 596), so now there is a character vector “10 596” which we want to turn into a number:
as.numeric("10 596")
## [1] NA
so we need to remove those white spaces. It would be easy to remove all of them, but then we would turn “Puerto Rico” into “PuertoRico”, and we don’t want that!
While we are at it, let’s also remove the superscripts on some of the country names.
for(i in seq_along(txt)) {
tmp <- strsplit(txt[i], "\\s+")[[1]]
if(length(tmp)==1) next # single item, nothing to do
if(any(is.na(as.numeric(tmp)))) { # some parts are character
if(!all(is.na(as.numeric(tmp)))) # not all parts are character
tmp <- tmp[is.na(as.numeric(tmp))]
# drop numbers (superscripts)
txt[i] <- paste(tmp, collapse = " ")
# all text, leave space between
}
else
txt[i] <- paste(tmp, collapse = "")
# all numbers, no spaces
}
For some reasons some are not working (maybe there was more than one white space?), so we fix them directly:
k <- seq_along(txt)[txt=="Western Africa"]
txt[k+c(0,1)]
## [1] "Western Africa" "3"
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="China"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="South-Central Asia"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Malaysia"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Azerbaijan"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Cyprus"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Republic of Moldova"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Northern Europe"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Norway"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Southern Europe"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Serbia"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Spain"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="TFYR Macedonia"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Caribbean"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Guadeloupe"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="NORTHERN AMERICA"]
txt <- txt[-(k+1)]
k <- seq_along(txt)[txt=="Micronesia"]
txt <- txt[-(k+1)]
How can we pick out the parts of text that we want?
Notice that the information starts with Burundi, so we remove everything before that:
k <- seq_along(txt)[txt=="Burundi"]
txt <- txt[k:length(txt)]
The last country is Tonga, so we remove everything after it’s row:
k <- seq_along(txt)[txt=="Tonga"]
txt <- txt[1:(k+13)]
Unfortunately the table is not contiguous, eventually there is a page break and then the title text repeats. However, this part always starts with the words “World Mortality” and ends with “(13)”, so it is easy to remove all of them:
k <- seq_along(txt)[txt=="World Mortality"]
j <- seq_along(txt)[txt=="(13)"]
k
## [1] 155 569 1049 1517 1983 2451 2917 3385
j
## [1] 203 614 1095 1562 2029 2496 2963 3430
so we see that the first “World Mortality” is at position 155, and the first “(13)” is at 203, so we can remove everything in between.
But wait: the top of page 13 reads “World Mortality” on the left and “13”" on the right, but on page 14 it is “14”" on the left and “World Mortality” on the right! These alternate, so we need
k[c(2, 4, 6, 8)] <- k[c(2, 4, 6, 8)]-1
Let’s get rid of all of this:
for(i in 1:8) txt[k[i]:j[i]] <- NA
txt <- txt[!is.na(txt)]
Now we can turn this into a data frame:
data.tbl <- as.data.frame(matrix(txt, byrow = TRUE, ncol=14))
for(i in 2:14)
data.tbl[, i] <- as.numeric(data.tbl[, i])
colnames(data.tbl) <- c("Country",
"Deaths", "Rate", "LifeExpectancy.Both",
"LifeExpectancy.Males", "LifeExpectancy.Females",
"InfantMortality", "UnderFive", "Prob15.60", "Prob0.70",
"PercUnder5", "PercUnder5.25", "Perc25.65", "PercOver65")
row.names(data.tbl) <- NULL
Let’s check a few to make sure we got it right:
k <- seq_along(txt)[txt=="Germany"]
txt[k:(k+13)]
## [1] "Germany" "910" "11.1" "80.8" "78.4" "83.2" "3"
## [8] "4" "71" "170" "0" "0" "14" "85"
k <- seq_along(txt)[txt=="Puerto Rico"]
txt[k:(k+13)]
## [1] "Puerto Rico" "29" "7.9" "79.7" "75.8"
## [6] "83.6" "6" "7" "96" "199"
## [11] "1" "1" "21" "77"
One final problem: every now and then the table has the means for various regions (Middle Africa, etc). There is no other way but to get rid of them one by one. Just going through the list I can find their row numbers: `
not.country <- c(21, 31, 39, 45, 62, 63, 72, 73, 79, 89,
101, 120, 121, 132, 144, 157, 165, 166,
184, 193, 207, 210)
data.tbl <- data.tbl[-not.country, ]
dump("data.tbl", "world.mortality.2017.R")
Let’s look at the life expectancy, sorted from highest to lowest:
dta <- data.tbl[order(data.tbl[, "LifeExpectancy.Both"],
decreasing = TRUE), c(1, 4)]
colnames(dta)[2] <- "Life Expectancy"
kable.nice(dta, do.row.names = FALSE)
Country | Life Expectancy |
---|---|
China, Hong Kong SAR | 83.8 |
China, Macao SAR | 83.7 |
Japan | 83.6 |
Switzerland | 83.1 |
Spain | 83.0 |
Singapore | 82.8 |
Italy | 82.8 |
Australia | 82.7 |
Iceland | 82.6 |
Australia/New Zealand | 82.6 |
France | 82.4 |
Israel | 82.3 |
Sweden | 82.3 |
Canada | 82.2 |
Norway | 82.0 |
Republic of Korea | 81.9 |
Martinique | 81.8 |
Netherlands | 81.7 |
New Zealand | 81.7 |
Luxembourg | 81.6 |
United Kingdom | 81.4 |
Austria | 81.4 |
Ireland | 81.3 |
Finland | 81.1 |
Guadeloupe | 81.1 |
Channel Islands | 81.0 |
Greece | 81.0 |
Portugal | 81.0 |
Belgium | 81.0 |
Slovenia | 80.8 |
Germany | 80.8 |
Malta | 80.7 |
Denmark | 80.6 |
Cyprus | 80.3 |
Réunion | 80.1 |
Mayotte | 79.9 |
China, Taiwan Province of China | 79.7 |
Puerto Rico | 79.7 |
French Guiana | 79.7 |
Cuba | 79.6 |
United States Virgin Islands | 79.6 |
Costa Rica | 79.6 |
Lebanon | 79.4 |
Guam | 79.4 |
Chile | 79.3 |
United States of America | 79.2 |
Czechia | 78.6 |
Curaçao | 78.3 |
Albania | 78.2 |
Qatar | 78.0 |
Panama | 77.8 |
Croatia | 77.5 |
Poland | 77.4 |
Estonia | 77.4 |
Uruguay | 77.3 |
United Arab Emirates | 77.1 |
Maldives | 77.0 |
Brunei Darussalam | 77.0 |
Montenegro | 76.9 |
Mexico | 76.9 |
Bahrain | 76.8 |
Oman | 76.8 |
New Caledonia | 76.8 |
Slovakia | 76.7 |
Bosnia and Herzegovina | 76.7 |
French Polynesia | 76.6 |
Argentina | 76.4 |
Antigua and Barbuda | 76.2 |
China | 76.1 |
Viet Nam | 76.1 |
Ecuador | 76.1 |
Algeria | 75.9 |
Hungary | 75.8 |
Barbados | 75.8 |
Jamaica | 75.8 |
Iran (Islamic Republic of) | 75.7 |
Aruba | 75.7 |
Morocco | 75.6 |
Tunisia | 75.5 |
Turkey | 75.5 |
TFYR Macedonia | 75.5 |
Bahamas | 75.5 |
Romania | 75.3 |
Saint Lucia | 75.3 |
Brazil | 75.3 |
Sri Lanka | 75.1 |
Malaysia | 75.1 |
Thailand | 75.1 |
Serbia | 75.1 |
Nicaragua | 75.1 |
Polynesia | 75.1 |
Samoa | 74.8 |
Peru | 74.7 |
Mauritius | 74.6 |
Kuwait | 74.6 |
Bulgaria | 74.6 |
Lithuania | 74.5 |
Armenia | 74.4 |
Saudi Arabia | 74.4 |
Latvia | 74.4 |
Venezuela (Bolivarian Republic of) | 74.4 |
Jordan | 74.2 |
Colombia | 74.2 |
Dominican Republic | 73.7 |
Grenada | 73.5 |
Honduras | 73.4 |
Seychelles | 73.3 |
State of Palestine | 73.3 |
El Salvador | 73.3 |
Micronesia | 73.3 |
Guatemala | 73.2 |
Georgia | 73.1 |
Saint Vincent and the Grenadines | 73.1 |
Paraguay | 73.0 |
Tonga | 72.9 |
Belarus | 72.7 |
Cabo Verde | 72.6 |
Bangladesh | 72.2 |
Azerbaijan | 71.9 |
Vanuatu | 71.9 |
Libya | 71.8 |
Ukraine | 71.8 |
Dem. People’s Republic of Korea | 71.5 |
Republic of Moldova | 71.5 |
Egypt | 71.3 |
Suriname | 71.3 |
Uzbekistan | 71.2 |
Tajikistan | 70.9 |
Russian Federation | 70.9 |
Kyrgyzstan | 70.8 |
Trinidad and Tobago | 70.6 |
Solomon Islands | 70.4 |
Belize | 70.2 |
Fiji | 70.1 |
Nepal | 69.9 |
Syrian Arab Republic | 69.9 |
Bhutan | 69.8 |
Kazakhstan | 69.7 |
Iraq | 69.7 |
Western Sahara | 69.2 |
Mongolia | 69.1 |
Micronesia (Fed. States of) | 69.1 |
Indonesia | 69.0 |
Philippines | 69.0 |
Bolivia (Plurinational State of) | 68.8 |
Cambodia | 68.6 |
Timor-Leste | 68.6 |
India | 68.3 |
Turkmenistan | 67.7 |
Senegal | 66.8 |
Kenya | 66.7 |
Rwanda | 66.7 |
Sao Tome and Principe | 66.5 |
Myanmar | 66.5 |
Guyana | 66.5 |
Melanesia | 66.5 |
Pakistan | 66.3 |
Lao People’s Democratic Republic | 66.3 |
Kiribati | 66.1 |
Botswana | 65.8 |
Gabon | 65.7 |
Madagascar | 65.5 |
Papua New Guinea | 65.4 |
Ethiopia | 65.0 |
United Republic of Tanzania | 65.0 |
Yemen | 64.7 |
Eritrea | 64.6 |
Sudan | 64.3 |
Congo | 64.1 |
Namibia | 63.8 |
Comoros | 63.5 |
Afghanistan | 63.3 |
Mauritania | 63.1 |
Haiti | 63.1 |
Malawi | 62.7 |
Ghana | 62.4 |
Djibouti | 62.3 |
South Africa | 62.0 |
Liberia | 62.0 |
Zambia | 61.4 |
Angola | 61.2 |
Gambia | 61.0 |
Benin | 60.6 |
Zimbabwe | 60.4 |
Burkina Faso | 59.9 |
Togo | 59.9 |
Niger | 59.7 |
Uganda | 59.6 |
Guinea | 59.4 |
Democratic Republic of the Congo | 59.2 |
Mozambique | 57.7 |
Cameroon | 57.6 |
Mali | 57.5 |
Equatorial Guinea | 57.4 |
Burundi | 57.1 |
Swaziland | 57.1 |
Guinea-Bissau | 57.0 |
South Sudan | 56.3 |
Somalia | 55.9 |
Lesotho | 53.7 |
Côte d’Ivoire | 53.1 |
Nigeria | 53.0 |
Chad | 52.6 |
Central African Republic | 51.4 |
Sierra Leone | 51.4 |