
Graphics with ggplot2

A large part of this chapter is taken from various works of Hadley Wickham, among others The layered grammar of graphics and R for Data Science.

Why ggplot2?

Advantages of ggplot2

consistent underlying grammar of graphics (Wilkinson, 2005)
plot specification at a high level of abstraction
very flexible
theme system for polishing plot appearance
mature and complete graphics system
many users, active mailing list

Grammar of Graphics

In 2005 Wilkinson, Anand, and Grossman published the book “The Grammar of Graphics”. In it they laid out a systematic way to describe any graph in terms of basic building blocks. ggplot2 is an implementation of their ideas.

The use of the word grammar seems a bit strange here. The general dictionary meaning of the word grammar is:

the fundamental principles or rules of an art or science

so it is not only about language.

As our running example we will use the mtcars data set. It is part of base R and has information on 32 cars:

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225	105	2.76	3.460	20.22	1	0	3	1

Say we want to study the relationship of hp and mpg. We have two quantitative variables, and therefore the obvious thing to do is a scatterplot. But there are a number of different ways we can do this:

attach(mtcars)
par(mfrow=c(2, 2))
plot(hp, mpg, main="Basic Graph")
plot(hp, mpg, pch="x", main="Change Plotting Symbol")
plot(hp, mpg, cex=2, main="Change Size")
plot(hp, mpg, main="With Fit");abline(lm(mpg~hp))

The basic idea of the grammar of graphics is to separate out the parts of a graph: there is the basic layout, there is the data that goes into it, and there is the way in which the data is displayed. Finally there are annotations, here the titles, and possibly more things added, such as a fitted line. In ggplot2 you can always change one of these without worrying how that change affects any of the others.

Another way of looking at it is this: in ggplot2 you build a graph like a lasagna, layer by layer.

Take the graph on the lower left. Here I made the plotting symbol bigger (with cex=2). But now the graph doesn’t look nice any more: the first and the last circle don’t fit into the graph. The only way to fix this is to start all over again and widen the y-axis range:

plot(hp, mpg, cex=2, ylim=range(mpg)+c(-1, 1))

and that is a bit of work because I have to figure out how much to widen the range. In ggplot2 that sort of thing is taken care of automatically!

Let’s start by recreating the first graph above.

ggplot(mtcars, aes(hp, mpg)) + 
  geom_point()

This has the following logic:

ggplot sets up the graph
its first argument is the data set (which has to be a dataframe)
aes is the aesthetic mapping. It connects the data to the graph
geom is the geometric object (circle, square, line) to be used in the graph. Here it is points.
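The separation of these pieces can be seen by building the plot in steps. The following is a minimal sketch; the object names p0, p_points and p_lines are illustrative and not part of the original text.

```r
library(ggplot2)

# Data and aesthetic mapping only: this draws the axes but no data yet
p0 <- ggplot(mtcars, aes(x = hp, y = mpg))
p0

# Adding a geom decides how the mapped data is displayed
p_points <- p0 + geom_point()
p_points

# Same data and same mapping, different geometric object
p_lines <- p0 + geom_line()
p_lines
```

Because the mapping lives in p0, swapping the geom does not require touching the data or the aesthetics.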

Note: ggplot2 also has the qplot command, which stands for quick plot:

qplot(hp, mpg, data=mtcars)

This seems much easier at first (and it is), but the qplot command is also very limited, and it has been deprecated since ggplot2 3.4.0. Very quickly you want to do things that aren’t possible with qplot, so I won’t discuss it further here.

Now consider the following variation:

ggplot(mtcars) + 
  geom_point(aes(hp, mpg))

Again it seems to do the same thing, but there is a big difference:

if aes(x, y) is part of ggplot, it applies to all the geoms that come later (unless a different one is specified)
an aes(x, y) that is part of a geom applies only to that geom.
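The difference only shows up once a plot has more than one layer. A small sketch, assuming we add a least squares line as a second layer (p_global and p_local are illustrative names):

```r
library(ggplot2)

# Mapping given in ggplot(): both layers inherit x = hp and y = mpg
p_global <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Mapping given only inside geom_point(): every other layer
# has to be told again which variables to use
p_local <- ggplot(mtcars) +
  geom_point(aes(hp, mpg)) +
  geom_smooth(aes(hp, mpg), method = "lm", se = FALSE)

p_global
p_local
```

Both objects draw the same picture. Leaving out the aes() in the geom_smooth call of p_local would stop with an error about missing aesthetics, because that layer has nothing to inherit.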

How about the problem with the graph above, where we had to increase the y margin?

ggplot(mtcars, aes(hp, mpg)) + 
  geom_point(shape=1, size=5)

So we see that here the axis range is adjusted automatically.

Let’s say we want to identify the cars by the number of cylinders:

ggplot(mtcars, aes(hp, mpg, color=cyl)) + 
  geom_point()

Notice that the legend is a continuous color scale. This is because the variable cyl has values 4, 6, and 8, and so is identified by R as a numeric variable. In reality it is categorical (ever seen a car with 1.7 cylinders?), and so we should change that:

# refer to the column explicitly so this does not depend on attach(mtcars)
mtcars$faccyl <- factor(mtcars$cyl, 
                       levels = c(4, 6, 8), 
                       ordered = TRUE) 
ggplot(mtcars, aes(hp, mpg, color=faccyl)) +
  geom_point()
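If we do not want to add a column to the data set, the conversion can also be done inside aes. This is a sketch of that alternative, not the approach used in the rest of this chapter; p_cyl is an illustrative name.

```r
library(ggplot2)

# cyl is stored as a number, which is why the default legend is continuous
class(mtcars$cyl)

# Converting on the fly gives a discrete legend and leaves mtcars unchanged
p_cyl <- ggplot(mtcars, aes(hp, mpg, color = factor(cyl))) +
  geom_point() +
  labs(color = "cyl")
p_cyl
```

The stored column is more convenient when the same factor is reused in many plots, as it is below.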

We can also change the shape of the plotting symbols:

ggplot(mtcars, aes(hp, mpg, shape=faccyl)) + 
  geom_point()

Or both:

ggplot(mtcars, aes(hp, mpg, shape=faccyl, color=faccyl)) +
  geom_point()

Let’s pretty up the graph a bit with some labels and a title. We will be playing around with this graph for a while, so I will save some intermediate versions:

plt1 <- ggplot(mtcars, aes(hp, mpg, color=faccyl)) +
  geom_point() 
plt2 <- plt1 +
  labs(x = "Horsepower", 
       y = "Miles per Gallon", 
       color = "Cylinders",
       title = "Mileage goes down as Horsepower goes up")
plt2

Say we want to add the least squares regression lines for cars with the same number of cylinders:

plt3 <- plt2 +
  geom_smooth(method = "lm", se = FALSE)
plt3

There is another way to include a categorical variable in a scatterplot. The idea is to do several graphs, one for each value of the categorical variable. These are called facets:

plt3 + 
  facet_wrap(~cyl)

The use of facets also allows us to include two categorical variables:

mtcars$facgear <- 
  factor(mtcars$gear, levels = 3:5, ordered = TRUE)
ggplot(mtcars, aes(hp, mpg, color=faccyl)) +
  geom_point(size = 1) +
  facet_wrap(~facgear) +  # one panel per number of gears
  labs(x = "Horsepower", 
       y = "Miles per Gallon", 
       color = "Cylinders",
       title = "Mileage goes down as Horsepower goes up") + 
  geom_smooth(method = "lm", se = FALSE)

This is almost a bit too much: with just 32 data points there is not really enough data for such a split.

Histograms

x <- rnorm(1000, 100, 30)
df3 <- data.frame(x = x)
bw <- diff(range(x))/50 # use about 50 bins
ggplot(df3, aes(x)) +
  geom_histogram(color = "black", 
                 fill = "white", 
                 binwidth = bw) + 
  labs(x = "x", y = "Counts")

Often we do histograms scaled to integrate to one. Then we can add the theoretical density and/or a nonparametric density estimate:

x <- seq(0, 200, length=250)
df4 <- data.frame(x=x, y=dnorm(x, 100, 30))
ggplot(df3, aes(x)) +
  geom_histogram(aes(y = after_stat(density)), 
        color = "black", 
        fill = "white", 
        binwidth = bw) + 
  labs(x = "x", y = "Density") + 
  geom_line(data = df4, aes(x, y), 
            colour = "blue") +
  geom_density(color = "red")

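Depending on the version of ggplot2, you may notice a red line along the bottom. It should not be there. It appears because geom_density draws the outline of a filled area, baseline included, rather than just the curve.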
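One way to avoid a baseline under the density estimate is to draw it as a plain line instead of as the outline of an area. A minimal sketch, assuming ggplot2 3.3.0 or later (for after_stat); the data frame sim, the seed and the name p_dens are illustrative.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
sim <- data.frame(x = rnorm(1000, 100, 30))

p_dens <- ggplot(sim, aes(x)) +
  geom_histogram(aes(y = after_stat(density)),
                 color = "black", fill = "white", bins = 50) +
  # a line geom fed by the density statistic: only the curve is drawn
  geom_line(stat = "density", color = "red") +
  labs(x = "x", y = "Density")
p_dens
```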

Here is another interesting case: say we have two data sets and we wish to draw the two histograms, one overlaid on the other:

df5 <- data.frame(
  x = c(rnorm(100, 10, 3), rnorm(80, 12, 3)), 
  y = c(rep(1, 100), rep(2, 80)))          
ggplot(df5, aes(x=x)) + 
    geom_histogram(data = subset(df5, y == 1), 
        fill = "red", alpha = 0.2) +
    geom_histogram(data = subset(df5, y == 2), 
        fill = "blue", alpha = 0.2)

Notice the use of alpha. It makes the fill semi-transparent (0 is fully transparent, 1 is opaque), so we can see what is “behind”.
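The same picture can be drawn with a single layer by mapping the group to fill. This is a sketch of an alternative, with illustrative names (two_groups, group, p_overlay) and an arbitrary seed; position = "identity" is needed because by default the bars of different groups are stacked.

```r
library(ggplot2)

set.seed(2)  # arbitrary seed for reproducibility
two_groups <- data.frame(
  x = c(rnorm(100, 10, 3), rnorm(80, 12, 3)),
  group = factor(rep(c(1, 2), times = c(100, 80)))
)

p_overlay <- ggplot(two_groups, aes(x, fill = group)) +
  # identity: draw both histograms from the baseline, one over the other
  geom_histogram(position = "identity", alpha = 0.2, bins = 30) +
  scale_fill_manual(values = c("1" = "red", "2" = "blue"))
p_overlay
```

A side effect of the single layer is that both histograms share the same bins and a legend is drawn automatically.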

Boxplots

y <- rnorm(120, 10, 3)
x <- rep(LETTERS[1:4], each=30)
y[x=="B"] <- y[x=="B"] + rnorm(30, 1)
y[x=="C"] <- y[x=="C"] + rnorm(30, 2)
y[x=="D"] <- y[x=="D"] + rnorm(30, 3)
df6 <- data.frame(x=x, y=y)
ggplot(df6, aes(x, y)) + 
  geom_boxplot()

Strangely enough, doing a boxplot without groups takes a little work. We have to “invent” a categorical variable:

ggplot(df6, aes(x="", y)) + 
  geom_boxplot() + 
  xlab("")

There is a modern version of this graph called a violin plot:

ggplot(df6, aes(x="", y)) + 
  geom_violin() + 
  xlab("")

Barcharts

x <- sample(LETTERS[1:5], 
            size = 1000, 
            replace = TRUE, 
            prob = 6:10)
df7 <- data.frame(x=x)
ggplot(df7, aes(x)) + 
  geom_bar(alpha=0.75, fill="lightblue") +
  xlab("")

Say we want to draw the graph based on percentages. Of course we could just calculate them and then do the graph. Here is another way:

ggplot(df7, aes(x=x)) + 
  geom_bar(aes(y = after_stat(100 * count / sum(count))),
      alpha = 0.75, 
      fill = "lightblue") +
  labs(x="", y="Percentages")

Notice how this works: in geom_bar we use a new aes, but the values in it are not columns of the data frame. They are computed by the layer’s statistic, which counts the observations in each bar.
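For comparison, here is a sketch of the other route mentioned above: calculate the percentages first and then draw them with geom_col, which plots the values as given. The names letters_drawn and pct and the seed are illustrative.

```r
library(ggplot2)

set.seed(3)  # arbitrary seed for reproducibility
letters_drawn <- sample(LETTERS[1:5], size = 1000,
                        replace = TRUE, prob = 6:10)

# One row per category with its share of all observations
counts <- table(letters_drawn)
pct <- data.frame(x = names(counts),
                  percent = 100 * as.numeric(counts) / sum(counts))

ggplot(pct, aes(x, percent)) +
  geom_col(alpha = 0.75, fill = "lightblue") +
  labs(x = "", y = "Percentages")
```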

Finally an example of a contingency table.

df7$y <- sample(c("X", "Y"), 
                size = 1000, 
                replace = TRUE, 
                prob = 2:3)
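ggplot(df7, aes(x=x, fill = y)) + 
  # plot proportions of all observations so that percent labels are correct
  geom_bar(aes(y = after_stat(count / sum(count))),
           position = "dodge") + 
  scale_y_continuous(labels = scales::percent) +
  labs(x="", y="Percentages", fill="Y")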

Axis Ticks and Legend Keys

Let’s return to the basic plot of mpg by hp. Let’s say we want to change the axis tick marks:

ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  scale_x_continuous(breaks = seq(50, 350, by=25)) +
  scale_y_continuous(breaks = seq(0, 50, by=10))

Sometimes we want to do graphs without any tick labels. This is useful, for example, for maps and also for confidential data, so the viewer sees the relationship but can’t tell the sizes:

ggplot(mtcars, aes(hp, mpg)) +
  geom_point() +
  scale_x_continuous(labels = NULL) +
  scale_y_continuous(labels = NULL)

By default ggplot2 draws the legends on the right. We can however change that. We can also change the appearance of the legend. Recall that the basic graph is in plt2. Then

plt2 +
  theme(legend.position = "bottom") +
  guides(color=guide_legend(nrow = 1, 
                           override.aes = list(size=4)))

Special Symbols in Labels etc.

Consider this boxplot:

# df is not defined in the text; here we assume groups A and B of df6
plt <- ggplot(subset(df6, x %in% c("A", "B")), aes(x, y)) +
  geom_boxplot() +
  labs(x="")
plt

Now it turns out that these are measurements of some chemical concentrations, and the name of A is really \(N_0\pi\) and that of B is \(\alpha Co^2\). It would be nice to use these in the graph:

plt +
 scale_x_discrete(labels =
         c(expression(paste("N"[0], pi)),
           expression(paste(alpha, "Co"^2))))

Unfortunately the exact syntax differs when changing labels, annotations, titles etc, and it can sometimes take a bit of work to find the right one.
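As an illustration of how the syntax differs, the sketch below puts the same kind of symbols into a title and into an annotation. It assumes simulated data (conc, with an arbitrary seed) and an invented title text; labs takes an expression directly, while annotate takes a character string together with parse = TRUE.

```r
library(ggplot2)

set.seed(4)  # arbitrary seed for reproducibility
conc <- data.frame(x = rep(c("A", "B"), each = 30),
                   y = rnorm(60, 10, 3))

p_sym <- ggplot(conc, aes(x, y)) +
  geom_boxplot() +
  # titles and axis labels: pass a plotmath expression
  labs(x = "",
       title = expression(paste("Comparing ", N[0] * pi,
                                " and ", alpha * Co^2))) +
  # annotations: pass the plotmath code as text and ask for parsing
  annotate("text", x = "B", y = max(conc$y),
           label = "alpha * Co^2", parse = TRUE)
p_sym
```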

Saving the graph

It is very easy to save a ggplot2 graph. Simply run

ggsave("myplot.pdf")

It will save the last graph that was displayed to disk.
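It is often safer to say explicitly which graph to save and how large it should be. A minimal sketch; the object p_save, the temporary file location and the size of 6 by 4 inches are illustrative choices.

```r
library(ggplot2)

p_save <- ggplot(mtcars, aes(hp, mpg)) +
  geom_point()

# Write to a temporary folder so the sketch leaves no files behind
out_file <- file.path(tempdir(), "myplot.pdf")

# plot = names the graph to save; the file type is taken from the extension
ggsave(out_file, plot = p_save, width = 6, height = 4, units = "in")

file.exists(out_file)
```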

One issue is figure sizing. You need to do this so that a graph looks “good”. Unfortunately this depends on where it ends up. A graph that looks good on a web page might look ugly in a pdf. So it is hard to give any general guidelines.

If you use R markdown, a good place to start is with the chunk arguments fig.width=6 and out.width="70%". In fact, at the top of every R markdown file I have a chunk with

library(knitr)  
opts_chunk$set(fig.width=6, 
               fig.align = "center",  
               out.width = "70%", 
               warning=FALSE, 
               message=FALSE)

so that automatically every graph is sized that way. I also change the default behavior of the chunks to something I like better!