Data Input/Output, Transferring R Objects

Printing Info to Screen

The basic functions to display information on the screen are print and cat:

options(digits=4)
x <- rnorm(5, 100, 30)
print(x)
## [1] 117.96  82.75  84.21  86.88  68.30
cat(x)
## 118 82.75 84.21 86.88 68.3

Both of these have certain advantages:

  • with print you can easily control the number of digits:
print(x, digits=6)
## [1] 117.9571  82.7533  84.2148  86.8807  68.3047
  • with cat you can easily mix text and numeric:
cat("The mean is ", round(mean(x), 1), "\n")
## The mean is  88

The "\n" (newline) is needed so that the cursor moves to the next line. This is sometimes loosely referred to as a carriage return, although strictly speaking the carriage return is a different character ("\r").
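cat puts a single space between its arguments by default, which is why the output above shows two spaces before the number. Here is a minimal sketch of how the sep argument changes this; the value of m is made up for illustration, and capture.output is used only so the result can be inspected as a string:

```r
m <- 88.0021  # illustrative value, not taken from the data above

# Default: the arguments are separated by one space
line1 <- capture.output(cat("The mean is", round(m, 1), "\n"))

# sep="" leaves all the spacing up to you
line2 <- capture.output(cat("The mean is ", round(m, 1), "\n", sep = ""))

print(line1)
print(line2)
```

With sep="" any space has to be written into the text itself, which is usually what you want when building a line piece by piece.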

Another advantage of cat is that one can have different rounding for different numbers:

x <- 1100; y <- 0.00123334
print(c(x, y), 4)
## [1] 1.100e+03 1.233e-03
cat(x, "  ", round(y, 4), "\n")
## 1100    0.0012

Notice that in the case of print R switches to scientific notation. This default behavior can be changed with

options(scipen=999)
print(c(x, y), 4)
## [1] 1100.000000    0.001233
options(scipen=0)
print(c(x, y), 4)
## [1] 1.100e+03 1.233e-03
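If you only want to avoid scientific notation for one particular output, it may be easier not to touch the global option at all. A possible sketch, using the format function with scientific=FALSE (note that format returns character strings, not numbers):

```r
x <- 1100; y <- 0.00123334

# format returns character strings; scientific=FALSE forces fixed notation
fixed <- format(c(x, y), digits = 4, scientific = FALSE)
print(fixed)

# the global option scipen is not changed by this
getOption("scipen")
```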

Sometimes you need a high level of control over the output, for example when writing data to a file that will then be read by a computer program that wants things just so. For this you can use the sprintf command.

sprintf("%f", pi)
## [1] "3.141593"

Here the f stands for floating point, the most common type. Also note that the result of a call to sprintf is a character vector.

Here are some variations:

sprintf("%.3f", pi) # everything before the ., 3 digits after
## [1] "3.142"
sprintf("%1.0f", pi) # 1 space, 0 after
## [1] "3"
sprintf("%5.1f", pi) # 5 spaces total, 1 after
## [1] "  3.1"
sprintf("%05.1f", pi) # same but fill with 0
## [1] "003.1"
sprintf("%+f", pi) # all with + in front
## [1] "+3.141593"
sprintf("% f", pi) # space in front
## [1] " 3.141593"
sprintf("%e", pi) # in scientific notation, small e
## [1] "3.141593e+00"
sprintf("%E", pi) # or large E
## [1] "3.141593E+00"
sprintf("%g", 1e6*pi)
## [1] "3.14159e+06"
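sprintf is vectorized and can take several values in one format string, which is useful for the fixed-width files mentioned above. The following is only a sketch; the record layout (a 5 character integer id, a space, then an amount in 8 characters with 2 decimals) is a made-up example:

```r
# Hypothetical record layout: %5d for the id, %8.2f for the amount
id <- c(7L, 1234L)
amount <- c(3.14159, 1500)

# One formatted string per (id, amount) pair
rec <- sprintf("%5d %8.2f", id, amount)
print(rec)

# Every record has the same width: 5 + 1 + 8 characters
nchar(rec)
```

The resulting character vector could then be written to a file, one record per line, with writeLines.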



Here is another example. In Statistics we often find a p value. These should generally be quoted to three digits. But when the p value is small, or is printed together with much larger numbers as below, R may switch to scientific notation. If you want to avoid that, do this:

x <- 1100; pval <- 0.00123334
c(x, pval)
## [1] 1.100e+03 1.233e-03
sprintf("%.3f", c(x, pval))
## [1] "1100.000" "0.001"
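One drawback of "%.3f" is that a very small p value is shown as "0.000". A common convention is to report such values as "<0.001" instead. Here is a small sketch of that idea; fmt_p and its argument eps are hypothetical names, not part of base R (base R also has format.pval, which serves a similar purpose):

```r
# fmt_p is a hypothetical helper: p values below eps are shown as "<eps",
# all others are rounded to three decimals
fmt_p <- function(p, eps = 0.001) {
  ifelse(p < eps, sprintf("<%.3f", eps), sprintf("%.3f", p))
}

pvals <- fmt_p(c(0.00123334, 0.00001, 0.25))
print(pvals)
```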

Reading in a Vector

Often the easiest thing to do is to copy the data to the clipboard and then simply scan it into R (on Windows):

x <- scan("clipboard")

Note: if you are using a Mac you need to use

x <- scan(pipe("pbpaste"))

Two arguments of scan are often needed:

  • use the argument sep=";" to change the symbol that is being used as a separator. The default is white space; common cases include comma, semi-colon, and newline ("\n").

  • scan assumes that the data is numeric. If it is not, use the argument what="character".
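Both arguments can be tried out without a file or the clipboard, because scan also accepts the data directly through its text argument. A short sketch (the data strings are made up; quiet=TRUE just suppresses the "Read ... items" message):

```r
# Numbers separated by semi-colons
nums <- scan(text = "3.5;4;10.25", sep = ";", quiet = TRUE)
print(nums)

# Character data, separated by white space (the default)
words <- scan(text = "red green blue", what = "character", quiet = TRUE)
print(words)
```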


I need to do this so often I wrote a little routine for it:

getx <- function(sep="") {
  options(warn=-1) # It might give a warning, I don't care
  x <- scan("clipboard", what="character", sep=sep)
  # always read as character
  if(all(!is.na(as.numeric(x)))) # are all elements numeric?
    x <- as.numeric(x) # then make it numeric
  options(warn=0) # reset warning
  x  
}

Notice some features:

  • the routine always reads the data as a character vector, whether it is character or numeric.

  • it then tries to turn it into numeric. If that works, fine, otherwise it stays character. This is done with as.numeric(x), which returns NA if it can’t turn an entry into numeric, so is.na(as.numeric(x)) returns TRUE if x can’t be made numeric.

  • when trying to turn a character into a number R prints a warning. This is good in general to warn you that you are doing something strange. Here, though, it is expected behavior and we don’t need the warning. The routine suppresses them by setting options(warn=-1), and setting it back to the default afterwards.
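An alternative to changing the global warn option is to wrap only the conversion in suppressWarnings, so that nothing has to be reset afterwards. The following sketch shows just the conversion step; as_numeric_if_possible is a hypothetical name, and it takes a character vector rather than reading from the clipboard:

```r
# Hypothetical helper: convert to numeric if every element converts cleanly,
# otherwise return the character vector unchanged
as_numeric_if_possible <- function(x) {
  num <- suppressWarnings(as.numeric(x))  # warnings silenced only here
  if (anyNA(num)) x else num
}

as_numeric_if_possible(c("1", "2.5", "10"))
as_numeric_if_possible(c("1", "two", "3"))
```

Note that, as in the routine above, a vector containing the string "NA" stays character.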

If the data is in a stand-alone file saved on your hard drive you can also read it from there:

x <- scan("c:/folder/file.R")

If it is a webpage use

x <- scan(url("http://somesite.html"))

Notice the use of / in writing folders. A single \ does not work in R strings, even on Windows, because it is the escape character (as in \n). \\ would work but is more work to type!
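A way to avoid typing separators altogether is file.path, which joins its arguments with /. A brief sketch, using the same made-up path as above:

```r
# file.path joins the pieces with "/"
p1 <- file.path("c:", "folder", "file.R")
print(p1)

# With doubled backslashes each \\ stands for one actual backslash
p2 <- "c:\\folder\\file.R"
cat(p2, "\n")
nchar(p2)   # counts single backslashes, so the same length as p1
```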

scan has a lot of arguments:

args(scan)
## function (file = "", what = double(), nmax = -1L, n = -1L, sep = "", 
##     quote = if (identical(sep, "\n")) "" else "'\"", dec = ".", 
##     skip = 0L, nlines = 0L, na.strings = "NA", flush = FALSE, 
##     fill = FALSE, strip.white = FALSE, quiet = FALSE, blank.lines.skip = TRUE, 
##     multi.line = TRUE, comment.char = "", allowEscapes = FALSE, 
##     fileEncoding = "", encoding = "unknown", text, skipNul = FALSE) 
## NULL

The most useful are

  • what
  • sep
  • nmax: maximum number of data values to read (nlines limits the number of lines instead), useful if you don’t know just how large the file is and want to read just some of it to check it out.
  • skip: number of lines at beginning to skip, for example if there is some header.
  • quiet=FALSE: by default R will say how many items have been read; this can be a nuisance if you have a routine that reads in many files. Use quiet=TRUE to turn the message off.
  • blank.lines.skip=TRUE: by default scan does not read in empty lines. This can be a problem if you want to write the file out again later and want it to look as much as possible like the original.
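Here is a short sketch of how skip, nmax and nlines behave, again using the text argument with a made-up string instead of a real file:

```r
# One header line followed by three lines of data
txt <- "some header line\n1 2 3\n4 5 6\n7 8 9"

# Skip the header and read everything
all_vals <- scan(text = txt, skip = 1, quiet = TRUE)

# nmax limits the number of values that are read
first_four <- scan(text = txt, skip = 1, nmax = 4, quiet = TRUE)

# nlines limits the number of lines that are read
first_line <- scan(text = txt, skip = 1, nlines = 1, quiet = TRUE)

print(all_vals)
print(first_four)
print(first_line)
```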

Example: A non-standard input format.

Consider the file at

x <- scan("http://academic.uprm.edu/wrolke/esma6835/sales.txt",
          what="character", sep="\n")
for(i in 1:5) cat(x[i],"\n")
##     152 278,11     202 998,05      89 060,44 
##   1 803 360,69      24 608,24      49 004,89 
##     812 679,54     186 289,80      95 946,42 
##     171 266,69     208 691,32      28 503,93 
##      56 646,34      41 287,10      15 483,96 

These are sales data for some store. We want to find the mean sales amount. So we need to read the data into R, but there are some issues:

  • the data delimits the decimals European-style, using a comma instead of a period.

  • for easier readability the groups of millions and thousands are separated by a space.

So the first number really is 152278.11.

How can we do this? To start we need to read the data as a single character string:

x <- paste0(
  scan("http://academic.uprm.edu/wrolke/esma6835/sales.txt",
       what="character", sep="\n"), collapse="")

Let’s see what we have, at least at the beginning:

substring(x, 1, 100)
## [1] "    152 278,11     202 998,05      89 060,44  1 803 360,69      24 608,24      49 004,89    812 679,"

Next we can replace the , with .:

x <- gsub(",", "\\.", x)
substring(x, 1, 100)
## [1] "    152 278.11     202 998.05      89 060.44  1 803 360.69      24 608.24      49 004.89    812 679."

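Notice the \\. in the replacement. The dot is a special character in regular expressions (it matches any single character), so to match a literal dot in a pattern it has to be written as \\.: one backslash for the regular expression and a second one because \ is itself the escape character in R strings. In the replacement string the escape does no harm but is not required, so "." would work just as well here.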

Next notice that the numbers are always separated by at least two spaces, so we can split them up with

x <- strsplit(x, "  ")[[1]]
x[1:10]
##  [1] ""             ""             "152 278.11"   ""            
##  [5] " 202 998.05"  ""             ""             "89 060.44"   
##  [9] "1 803 360.69" ""

Now we can remove any spaces:

x <- gsub(" ", "", x)
x[1:10]
##  [1] ""           ""           "152278.11"  ""           "202998.05" 
##  [6] ""           ""           "89060.44"   "1803360.69" ""

and get rid of the "":

x <- x[x!=""]
x[1:10]
##  [1] "152278.11"  "202998.05"  "89060.44"   "1803360.69" "24608.24"  
##  [6] "49004.89"   "812679.54"  "186289.80"  "95946.42"   "171266.69"

Almost done:

x <- as.numeric(x)
mean(x)
## [1] 198450
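The steps above can be collected into a single function. The following is only a sketch under some assumptions: parse_sales is a hypothetical name, it takes the lines as a character vector (so it can be tried without the web address), it uses fixed=TRUE instead of escaping, and it splits on runs of two or more spaces with the regular expression " {2,}":

```r
# Hypothetical helper: turn lines like "  152 278,11   202 998,05 "
# into a numeric vector
parse_sales <- function(lines) {
  x <- paste(lines, collapse = "  ")        # one long string
  x <- gsub(",", ".", x, fixed = TRUE)      # decimal comma -> period
  x <- strsplit(x, " {2,}")[[1]]            # split on 2 or more spaces
  x <- gsub(" ", "", x, fixed = TRUE)       # drop the thousands separators
  as.numeric(x[x != ""])                    # remove empty pieces, convert
}

# The first two lines of the file shown above
lines <- c("    152 278,11     202 998,05      89 060,44 ",
           "  1 803 360,69      24 608,24      49 004,89 ")
sales <- parse_sales(lines)
print(sales)
```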

Importing a Data Frame

The standard command to read data from a table into a data frame is read.table.

x <- read.table("c:/folder/file.R")

It has many of the same arguments as scan (for example sep). It also has the argument header=FALSE. If your table has column names use header=TRUE. Row names are handled similarly, with the argument row.names.

Example:

Say the following data is saved in a file named student.data.R:

ID Age GPA Gender
63368 22 2.9 Male
75382 22 2.6 Female
43337 18 2.7 Male
56341 18 2.8 Male
43988 19 3.9 Male
47648 21 2.6 Female
10959 19 3.3 Male
57902 25 2.6 Female
48890 20 3.6 Female
18430 22 3.2 Female

Now we can use

read.table("c:/folder/student.data.R", 
        header=TRUE, row.names = 1)

The argument row.names=1 tells R to use the first column as row names.
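Like scan, read.table has a text argument, so the call can be tried without saving a file first. A sketch using the first three rows of the student data:

```r
# The first rows of the table above, as one string
txt <- "ID Age GPA Gender
63368 22 2.9 Male
75382 22 2.6 Female
43337 18 2.7 Male"

students <- read.table(text = txt, header = TRUE, row.names = 1)
str(students)

# The ID column is now used as row names, so rows can be selected by ID
students["75382", "GPA"]
```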

Transferring Objects from one R to another

Say you have a few data sets and routines you want to send to someone else. The easiest thing to do is use dump and source.

dump(c("data1", "data2", "fun1"), "c:/folder/mystuff.R")

Now to read in the stuff simply use

source("c:/folder/mystuff.R")

I often have to transfer stuff from one R project to another, so I wrote myself these two routines:

dp <- function (x) dump(x, "clipboard")
sc <- function () source("clipboard")
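To see that dump and source really give back the same objects, here is a small round trip through a temporary file. This is a sketch only: the objects data1 and fun1 are made up, and separate environments (src and dest, both hypothetical names) stand in for the two R sessions:

```r
# "Sending" side: an environment holding a data set and a routine
src <- new.env()
src$data1 <- c(1.5, 2, 3.25)
src$fun1 <- function(x) x^2

# Write both objects as R code to a temporary file
tmp <- tempfile(fileext = ".R")
dump(c("data1", "fun1"), file = tmp, envir = src)

# "Receiving" side: read the file into a different environment
dest <- new.env()
source(tmp, local = dest)
ls(dest)

unlink(tmp)  # remove the temporary file
```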

Special File Formats

There are routines to read all sorts of file formats. The most important one is likely read.csv, which can read Excel files saved in the comma delimited format.
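A short sketch of read.csv together with its counterpart write.csv, using a made-up data frame and a temporary file:

```r
# A small made-up data frame
df <- data.frame(id = 1:3, score = c(2.5, 3, 4.25))

# Write it as a comma delimited file, without the row names
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)

# Read it back in; header=TRUE and sep="," are the defaults of read.csv
back <- read.csv(tmp)
print(back)

unlink(tmp)  # remove the temporary file
```

For files that use a semi-colon as separator and a comma as decimal mark there is also read.csv2.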

Packages

There are a number of packages written to help with data I/O. We will discuss some of them later.

Working on Files

R can also be used to create, copy, move and delete files and folders on your hard drive. The routines are

dir.create(…)
dir.exists(…)
file.create(…)
file.exists(…)
file.remove(…)
file.rename(from, to)
file.append(file1, file2)
file.copy(from, to)
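Here is a sketch that runs through most of these routines inside R's temporary folder, so nothing on your hard drive is touched permanently. The folder and file names are made up:

```r
d <- tempfile("io_demo_")        # a fresh folder name inside tempdir()
dir.create(d)
f1 <- file.path(d, "a.txt")
f2 <- file.path(d, "b.txt")

writeLines("first", f1)          # creates a.txt with one line
file.copy(f1, f2)                # b.txt is now a copy of a.txt
file.append(f1, f2)              # append b.txt to the end of a.txt
n_lines <- length(readLines(f1)) # a.txt now has two lines

f3 <- file.path(d, "c.txt")
file.rename(f2, f3)              # b.txt becomes c.txt
renamed_ok <- file.exists(f3) && !file.exists(f2)

file.remove(f1, f3)              # delete the files
unlink(d, recursive = TRUE)      # and the folder itself
cleaned_up <- !dir.exists(d)
```

Most of these functions return TRUE or FALSE to indicate whether the operation worked, so the result can be checked in a routine.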

You can also get a listing of the files in a folder:

head(dir("c:/R"))
## [1] "bin"     "CHANGES" "COPYING" "doc"     "etc"     "include"

For the current working directory (by default the folder from which R was started) use

head(dir(getwd()))
## [1] "_book"               "_bookdown.yml"       "_bookdown_files"    
## [4] "_main.Rmd"           "_main_files"         "Additional Material"
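dir is the same routine as list.files, which can also select files by a pattern. A sketch, again inside a temporary folder with made-up file names:

```r
d <- tempfile("listing_demo_")
dir.create(d)
file.create(file.path(d, c("a.txt", "b.csv", "c.txt")))

# pattern is a regular expression; "\\.txt$" matches names ending in .txt
txt_files <- list.files(d, pattern = "\\.txt$")
print(txt_files)

# full.names=TRUE returns the complete paths instead of just the names
full <- list.files(d, pattern = "\\.txt$", full.names = TRUE)

unlink(d, recursive = TRUE)  # clean up
```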