History, Basic Usage

S

S is a statistical programming language developed primarily by John Chambers and (in earlier versions) Rick Becker and Allan Wilks of Bell Laboratories. The aim of the language, as expressed by John Chambers, is “to turn ideas into software, quickly and faithfully”.

The first working version of S was built in 1976, and operated on the GCOS operating system. At this time, S was unnamed, and suggestions included Interactive SCS (ISCS), Statistical Computing System, and Statistical Analysis System (which was already taken: see SAS System). The name ‘S’ (used with single quotation marks, until 1979) was chosen, as it has the common letter used in statistical computing, and is consistent with other programming languages designed from the same institution at the time (namely the C programming language).

By 1988, many changes were made to S and the syntax of the language. The New S Language (1988 Blue Book) was published to introduce the new features, such as the transition from macros to functions and how functions can be passed to other functions (such as apply).

In the late 1990s AT&T sold the S license to a private company that started to charge a lot of money for it. This lead to

R

R is an implementation of the S programming language combined with lexical scoping semantics inspired by Scheme. There are some important differences between S and R, but much of the code written for S runs unaltered.

R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which Chambers is a member. R is named partly after the first names of the first two R authors and partly as a play on the name of S. The project was conceived in 1992, with an initial version released in 1995 and a stable beta version in 2000.

RStudio

RStudio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics. RStudio was founded by JJ Allaire, creator of the programming language ColdFusion. Hadley Wickham is the Chief Scientist at RStudio.

Work on RStudio started at around December 2010, and the first public beta version (v0.92) was officially announced in February 2011. Version 1.0 was released on 1 November 2016. Version 1.1 was released on 9 October 2017.

R can be run by itself using the command line interface, but it is usually much nicer to use RStudio as a front end. Considering that it is only about three years old it has taken the R world by storm!

Getting Started

Once R (or Rstudio) is opened you will see the >. This is the command prompt. It means R is waiting for you to do something!

Don’t like >? No sweat, you can change it (although I don’t recommend changing the prompt). Here is how:

options()(prompt="@")

In fact, you can change almost anything with the options command.

length(options())
## [1] 67

shows there are 67 of them. Two you might want to remember are

options()["digits"]
## $digits
## [1] 7
pi
## [1] 3.141593
options(digits=5)
pi
## [1] 3.1416
options(digits=15) 
pi
## [1] 3.14159265358979

and

options()$warn
## [1] 0

this tells R to warn about certain things. Those are not errors, just warnings, and usually they are ok. Sometimes though they are not needed and can be a nuisance, and then you can turn them off with

options(warn=-1)

Changing one or more of these options is something one needs to do on occasion inside a function. If so you likely want to reset them to the defaults before leaving the function. Say we have a program that prints results to the screen and we want all the numbers to have 4 digits. Then we can use

# At start of function:
default.options <- options()
options(digits=4)
# now do stuff
# then at the end of the program
options(default.options)

Saving your Stuff

The standard setup is to have separate folders for each project. When starting a new one RStudio creates the folder and within the files

  • foldername.Rproj is the RStudio project file
  • .RData is the R worksheet

In the past the .RData file was the “heart” of R, and typically there was one for each project, with different names. RStudio wants you to essentially forget about it, there is actually an option in Tools > Global Options to just ignore any RData file. If you do need to create a .RData file (maybe to send stuff to someone else) use

save.image("c:/folder/filename.RData")

Notice the use of the backslash as a separator for folders. On Windows systems one usually uses forward slashes (“\”) but these have a special use in R as the “escape character”. One alternative is to use

save.image("c:\\folder\\filename.RData")

but / is generally preferred.

Assignments

The basic assignment character in R is <-

x <- 1
x
## [1] 1

Note think of an arrow pointing left

Note = also works:

x = 2
x
## [1] 2

but mainly for reasons of backward compatibility <- is generally recommended.

RStudio has a lot of useful shortcuts, one of them is ALT - , which enters <- and even the spaces around it!

Check out the list of keyboard shortcuts at Tools > Keyboards shortcuts.

Help!

Help inside of R

R has a help command. Let’s say we want to find out about the mean command:

?mean

This opens a page in the Help pane with all sorts of information regarding the mean command.

Another useful function is arg, which lists the arguments of a routine:

args(splot)
## function (y, x, z, w, use.facets = FALSE, add.line = 0, plot.points = TRUE, 
##     jitter = FALSE, errorbars = FALSE, label_x, label_y, label_z, 
##     main_title, add.text, add.text_x, add.text_y, plotting.symbols = NA, 
##     plotting.size = 1, plotting.colors = NA, ref_x, ref_y, log_x = FALSE, 
##     log_y = FALSE, no.legends = FALSE, return.graph = FALSE) 
## NULL

splot is not a command that is part of R but one I wrote myself to make nice looking scatterplots. Unfortunately args often doesn’t work well for basic R commands:

args(mean)
## function (x, ...) 
## NULL

Help from Outside of R

R has one of the most active and friendly user communities anywhere. So if you run into a problem do this:

Google!

Most of the problems you have, someone else already had. They ended up asking for help on the internet, and someone replied. So all you need to do is find that reply. Start with a google search. Typically you want to include R in your search terms. If at first you don’t succeed, try, try again!

Stackoverflow

Many times your google search will lead you to stackoverflow. This is a website for people working on computer coding problems to help each other. It is not specific to R but the R community is using it to discuss and solve R questions.

If you can not find a solution to your problem, eventually you might want to post a question on stackoverflow. To do so you have to become a member, which is free.

Here are some guidelines for posting questions:

  • did you really try hard enough to find an answer on your own? It is very embarrassing when you post a question and the reply is a link to a simple Google search that has the answer already.

  • try and eliminate anything that does not have to do with your problem.

  • produce a simple and short example of your problem. Anything over 5 or 10 lines of code is almost certainly to long.

ls, rm

use the ls command to get a listing of the current objects in the worksheet:

ls()[1:5]
## [1] "%!in%"    "acorn"    "agesex"   "agesexUS" "aids"

If there are many objects (the worksheet I have open right now has over 250!) you can specify part of the name:

ls(pattern = "^sa")
## [1] "salaries"

the ^ (called “caret”) in front of sa means that ls will list all objects whose name starts with sa. This is what is called a regular expression. These are general (non R) rules for specifying patters. They are a science all their own, and we will need to have a look at them at some point.

If you need some more details on the objects you can use

ls.str(pattern = "^sa")
## salaries : 'data.frame': 25 obs. of  3 variables:
##  $ Salary: int  21700 24000 23800 26000 27200 25100 26700 27600 28300 29100 ...
##  $ Years : int  3 9 10 11 12 4 6 7 9 11 ...
##  $ Level : chr  "Low" "Low" "Low" "Low" ...

Often you create an object that is only needed for a short time. If you want to get rid of it use rm (remove)

x <- 1:10
sum(x^3)
## [1] 3025
rm(x)

you can also remove many objects in one step. Let’s say we have worked on a problem for a while and we created a number of objects. All of them start with “my”. Now we can do this

a <- ls(pattern = "^my")
rm(list=a)

I have a routine called cleanup:

cleanup <- function () 
{
  options(warn=-1)
  oldstuff <- c("f", "f1", "F","n", "m", "a", "A", "b", "B", 
        "i", "j", "plt", "plt1", "plt2", "x", "xx", "y", 
        "z", "u", "v", "xy", "out", "mu")
  rm(list=oldstuff, envir = .GlobalEnv)
  options(warn = 0)
  
}
  • the things in oldstuff (like f and x) are the names I generally use for stuff that I expect to need only for a short time.

  • it has the options(warn=-1) command because many of these names will not be present when I run the cleanup command, and each of those would result in a warning. I already know this, so I don’t need the warning. At the end I reset to the default with options(warn = 0).

  • notice the empty line before the end of the function. This means there will be no return result. Without it the routine would return NULL.

  • notice the argument envir. We will talk about what that means soon.

Naming Conventions

It is good practice to have a consistent style when choosing names for objects, functions etc. There is a detailed description at https://google.github.io/styleguide/Rguide.xml. Here are the most important ones:

  • File Names

File names should end in .R and, of course, be meaningful.

GOOD: predict_ad_revenue.R
BAD: foo.R

  • Identifiers

Don’t use underscores ( _ ) or hyphens ( - ) in identifiers. Identifiers should be named according to the following conventions.

The preferred form for variable names is all lower case letters and words separated with dots.

GOOD: avg.clicks
BAD: avg_Clicks

Make function names verbs.

GOOD: calculate.avg.clicks
BAD: clicks

constants are named like functions but with an initial k.

kbins <- 20
krun <- 1000
  • Spacing

Place spaces around all binary operators (=, +, -, <-, etc.). Exception: Spaces around =’s are optional when passing parameters in a function call.

GOOD: x <- 1
BAD: x<-1

  • Do not place a space before a comma, but always place one after a comma.

GOOD: x <- c(1, 2, 3)
BAD: x <- c(1,2,3)



These conventions always have a bit of personal style, which one might like or not. Here is another style rule:

Place a space before left parenthesis, except in a function call.

GOOD: if (debug)
BAD: if(debug)

Now personally I really don’t like this one and I don’t use it!

Instead of the . between words many people prefer to use camel back:

calculateMeans

That is just fine, I would recommend however you stick with one or the other.