graphs.utf8.md

Graphs

In this section we will discuss some graphs that are part of base R. Later we will study more advanced graphics, but it is still a good idea to know how to draw a base graph.

Barcharts

Say we have the data on the number of freshman, sophomores etc at some college:

head(students)

## [1] "Junior"    "Freshman"  "Freshman"  "Sophomore" "Freshman"  "Sophomore"

So here we have a categorical variable. One popular graph for this type of data is the pie chart,

BUT repeat after me: pie charts are evil!!

A MUCH better alternative are bar charts:

barplot(table(students))

Note that the argument to bar plot has to be a table.

But this is not quite right, Sophomores should come second. In general, when we have a categorical variable which has an ordering we should turn it into a factor:

students.fac <- 
  factor(students, 
         levels = c("Freshman", "Sophomore", 
                 "Junior", "Senior"), 
         ordered = TRUE)
tbl.students.fac <- table(students.fac)
barplot(tbl.students.fac)

There are a number of arguments we find in most graphics routines:

barplot(tbl.students.fac, 
        main = "Our Students",
        xlab = "Class", 
        ylab = "Counts")

and then there are arguments specific to the graph:

barplot(tbl.students.fac, 
        main = "Our Students",
        xlab = "Counts", 
        ylab = "", 
        horiz = TRUE, 
        density = 10,
        col = c("lightblue", "mistyrose",
                "lightcyan", "lavender"))

Notice that we can’t see all the y labels. It would be better to have them horizontal as well. But we also need to make space for them:

par(mai=c(1,2,1,1))
barplot(tbl.students.fac, 
        main = "Our Students",
        xlab = "Counts", 
        ylab = "", 
        horiz = TRUE, 
        density = 10,
        las = 1,
        col = c("lightblue", "mistyrose",
                "lightcyan", "lavender"))

Here I used the par command to set some parameters for how the graph is drawn. There is a very long list of such parameters, see the par help file.

Histogram

x <- rnorm(1000, 100, 20)
hist(x, 
     breaks = 50, 
     main = "IQ",
     xlab = "IQ", 
     ylab = "Counts")

The hist command is useful for its side effect of counting:

range(x)

## [1]  39.46882 170.45319

hist(x, 
  breaks = c(0, 50, 80, 100, 120, 150, 250), 
  plot = FALSE)$counts

## [1]   5 152 357 331 148   7

Often we want to compare a histogram with a theoretical curve, say a probability density. Then we can use the curve command with the argument add=TRUE:

x <- c(rnorm(1500, 100, 20), 
       rnorm(1000, 140, 10))
hist(x, 
     breaks = 50, 
     main = "", 
     freq=FALSE)
f <- function(x) 
   1.5*dnorm(x, 100, 20) + dnorm(x, 140, 10)
I <- integrate(f, 0, 200)$value
curve(f(x)/I, from=min(x), to=max(x), 
      add = TRUE, 
      col = "blue", 
      lwd = 2)

Empirical Distribution Function

plot(ecdf(x))

Boxplot

boxplot(x, horizontal = TRUE)

boxplot(Length~Status, data=mothers)

Normal Probability Plot

qqnorm(x)
qqline(x)

Scatterplot

plot(faithful$Eruptions,
     faithful$Waiting.Time, 
     xlab = "Length of last eruption",
     ylab = "Time until next eruption", 
     main = "Old Faithful Guyser")

Let’s add the least squares regression fit to the graph:

plot(faithful$Eruptions,
     faithful$Waiting.Time,  
     xlab = "Length of last eruption",
     ylab = "Time until next eruption", 
     main = "Old Faithful Guyser")
abline(lm(Waiting.Time~Eruptions,
          data=faithful), lwd=2, lty=2)

There are a lot of different plotting symbols:

plot(0:16, rep(0, 17), ylim=c(0,1), 
     type="n", xlab="", ylab="", axes=F)
for(i in 1:15) {
  for(k in 1:3) {
    points(i, c(0.8, 0.5, 0.2)[k], pch=i+(k-1)*15)
    text(i, c(0.9, 0.6, 0.3)[k], label=i+(k-1)*15, adj = 0) 
  }
}

Notice the setup of an empty graph!

Also, different line types:

plot(0:7, rep(0, 8), ylim=c(0,1), 
     type="n", xlab="", ylab="", axes=F)
for(i in 1:6) {
  segments(i, 0.2, i, 1, lty=i)
  text(i, 0.1, i)
}

Consider the data set mtcars, which is part of base R. It has data on 32 different cars:

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

say we wish to study the relationship of mpg (Miles per Gallon) and hp (Horsepower) but also include information on the number of cylinders (either 4, 6 or 8):

cols <- rep("blue", 32)
cols[mtcars$cyl==6] <- "green"
cols[mtcars$cyl==8] <- "red"
plot(mtcars$hp, mtcars$mpg, 
     xlab = "Horsepower", 
     ylab = "Miles per Gallon",
     col = cols, pch = 20)
legend(250, 33, 
       paste(c(4, 6, 8), "Cylinders"),
       pch = 20, 
       col = c("blue", "green", "red"))

Exercise

Do the graph again, but use the cylinder numbers as plotting symbols:

Let’s add the least squares regression lines for each cylinder separately. We could use the abline(fit) command, but notice that the hp ranges for the three cylinder types are quite different. Using abline would draw the fit all the way across the graph. Here is a better solution:

cls <- c("blue", "green", "red")
plot(mtcars$hp, mtcars$mpg, 
     xlab = "Horsepower", 
     ylab = "Miles per Gallon",
     col = cols, pch = 20)
legend(250, 33, 
       paste(c(4, 6, 8), "Cylinders"),
       pch = 20, 
       col = cls)
for(i in c(4, 6, 8)) {
  fit <- lm(mpg[cyl==i]~hp[cyl==i],
            data=mtcars)
  x <- range(mtcars$hp[mtcars$cyl==i])
  y <- coef(fit)[1] + coef(fit)[2]*x
  segments(x[1], y[1], x[2], y[2],
           col=cls[(i-2)/2])
}

Multiple Graphs

sometimes we want to combine several graphs into one

par(mfrow=c(2, 2))
hist(rnorm(1000), 50, 
     main="Normal Distribution",
     ylab="", xlab="")
hist(rt(1000, 2), 50, 
     main="t Distribution, df=2",
     ylab="", xlab="")
hist(runif(1000), 50, 
     main="Uniform Distribution",
     ylab="", xlab="")
hist(rchisq(1000, 2), 50, 
     main = expression(paste(chi^2, 
            " Distribution, df=2")),
     ylab="", xlab="")

This works fine for rectangular arrays. For more complicated graphs we have the layout command. Let’s create a scatterplot of mpg by hp with marginal boxplots:

layout(matrix(c(2, 0, 1, 3), 2, 2, 
              byrow = TRUE), 
             c(3, 1), c(1, 3), TRUE)
par(mar = c(3, 3, 1, 1))
plot(mtcars$hp, mtcars$mpg)
par(mar = c(0, 3, 1, 1))
boxplot(mtcars$hp, 
        axes = FALSE, horizontal = TRUE)
par(mar = c(3, 0, 1, 1))
boxplot(mtcars$mpg, axes = FALSE)

Functions that create graphs

When I write a paper or a talk or anything else that requires some graphs I always write a routine plot.mytalk, which has everything to recreate every graph. This is a great way to not only make changes if necessary but also to remind myself what I did back years ago when I wrote it.

So I might have the following function to do the graph above:

figure1 <- function() {
  layout(matrix(c(2, 0, 1, 3), 2, 2, 
                byrow = TRUE), 
         c(3, 1), c(1, 3), TRUE)
  par(mar = c(3, 3, 1, 1))
  plot(hp, mpg)
  par(mar = c(0, 3, 1, 1))
  boxplot(hp, axes = FALSE, horizontal = TRUE)
  par(mar = c(3, 0, 1, 1))
  boxplot(mpg, axes = FALSE)
  
}

Note these days I often do this with Rmarkdown instead.

Graphs that don’t look like Graphs

we can use R to make other types of pictures. Consider this diagram, which I use to illustrate the topic of Cause vs Effect:

it is actually done with these commands:

plot(c(10, 100), c(40, 100), axes=F, 
     xlab="", ylab="", type="n")
text(15, 90, "Cause-Effect", 
     cex=1.2, adj = 0)
text(c(20, 40), c(65, 65), c("X", "Y"), 
     cex=1.1, adj=0)
arrows(25, 65, 38, 65)
text(56, 90, "Confounding Variable", 
     cex=1.2, adj=0)
text(c(60, 70, 80), c(50, 75, 50), c("X", "Z", "Y"),
     cex=1.1, adj=0)
arrows(70, 70, 62, 55)  
arrows(70, 70, 78, 55)

An important example of this are maps:

library(maps)
map("usa")
text(-87.623177, 41.881832, "Chicago", cex=1.3)

People have done some very funny things with this. Run this code in the console (you need to install the library cowsay first)

library(cowsay)
say("Hello world!", by = "cow")

Printing Graphs

Often we need to save a graph as a png or a postscript file, so we can include it in a webpage or a latex document. I have to do this often enough I wrote a routine for it:

graph.out <-
function (f, foldername, graphname, format = "png") 
{
  file <- paste0(foldername, graphname, ".", format)
  cat(file)
  if (format == "png") 
      png(file)
  if (format == "pdf") 
      pdf(file)
  if (format == "ps") 
      postscript(file, horizontal = F, pointsize = 17)
  if (format == "eps") {
      setEPS()
      postscript(file, horizontal = F, pointsize = 17)
  }
  f()
  dev.off()
  
}

here f is a function that produces a graph, so it might be called like this:

f <- function(x) hist(rnorm(1000), 50)
graph.out(f, "C:/mygraphs", "myhist")

and now there should be a file myhist.png in the folder C:/mygraphs.