In this section we will discuss some graphs that are part of base R. Later we will study more advanced graphics, but it is still a good idea to know how to draw a base graph.
Say we have the data on the number of freshman, sophomores etc at some college:
head(students)
## [1] "Junior" "Freshman" "Freshman" "Sophomore" "Freshman" "Sophomore"
So here we have a categorical variable. One popular graph for this type of data is the pie chart,
BUT repeat after me: pie charts are evil!!
A MUCH better alternative are bar charts:
barplot(table(students))
Note that the argument to bar plot has to be a table.
But this is not quite right, Sophomores should come second. In general, when we have a categorical variable which has an ordering we should turn it into a factor:
students.fac <-
factor(students,
levels = c("Freshman", "Sophomore",
"Junior", "Senior"),
ordered = TRUE)
tbl.students.fac <- table(students.fac)
barplot(tbl.students.fac)
There are a number of arguments we find in most graphics routines:
barplot(tbl.students.fac,
main = "Our Students",
xlab = "Class",
ylab = "Counts")
and then there are arguments specific to the graph:
barplot(tbl.students.fac,
main = "Our Students",
xlab = "Counts",
ylab = "",
horiz = TRUE,
density = 10,
col = c("lightblue", "mistyrose",
"lightcyan", "lavender"))
Notice that we can’t see all the y labels. It would be better to have them horizontal as well. But we also need to make space for them:
par(mai=c(1,2,1,1))
barplot(tbl.students.fac,
main = "Our Students",
xlab = "Counts",
ylab = "",
horiz = TRUE,
density = 10,
las = 1,
col = c("lightblue", "mistyrose",
"lightcyan", "lavender"))
Here I used the par command to set some parameters for how the graph is drawn. There is a very long list of such parameters, see the par help file.
x <- rnorm(1000, 100, 20)
hist(x,
breaks = 50,
main = "IQ",
xlab = "IQ",
ylab = "Counts")
The hist command is useful for its side effect of counting:
range(x)
## [1] 39.46882 170.45319
hist(x,
breaks = c(0, 50, 80, 100, 120, 150, 250),
plot = FALSE)$counts
## [1] 5 152 357 331 148 7
Often we want to compare a histogram with a theoretical curve, say a probability density. Then we can use the curve command with the argument add=TRUE:
x <- c(rnorm(1500, 100, 20),
rnorm(1000, 140, 10))
hist(x,
breaks = 50,
main = "",
freq=FALSE)
f <- function(x)
1.5*dnorm(x, 100, 20) + dnorm(x, 140, 10)
I <- integrate(f, 0, 200)$value
curve(f(x)/I, from=min(x), to=max(x),
add = TRUE,
col = "blue",
lwd = 2)
plot(ecdf(x))
boxplot(x, horizontal = TRUE)
boxplot(Length~Status, data=mothers)
qqnorm(x)
qqline(x)
plot(faithful$Eruptions,
faithful$Waiting.Time,
xlab = "Length of last eruption",
ylab = "Time until next eruption",
main = "Old Faithful Guyser")
Let’s add the least squares regression fit to the graph:
plot(faithful$Eruptions,
faithful$Waiting.Time,
xlab = "Length of last eruption",
ylab = "Time until next eruption",
main = "Old Faithful Guyser")
abline(lm(Waiting.Time~Eruptions,
data=faithful), lwd=2, lty=2)
There are a lot of different plotting symbols:
plot(0:16, rep(0, 17), ylim=c(0,1),
type="n", xlab="", ylab="", axes=F)
for(i in 1:15) {
for(k in 1:3) {
points(i, c(0.8, 0.5, 0.2)[k], pch=i+(k-1)*15)
text(i, c(0.9, 0.6, 0.3)[k], label=i+(k-1)*15, adj = 0)
}
}
Notice the setup of an empty graph!
Also, different line types:
plot(0:7, rep(0, 8), ylim=c(0,1),
type="n", xlab="", ylab="", axes=F)
for(i in 1:6) {
segments(i, 0.2, i, 1, lty=i)
text(i, 0.1, i)
}
Consider the data set mtcars, which is part of base R. It has data on 32 different cars:
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
say we wish to study the relationship of mpg (Miles per Gallon) and hp (Horsepower) but also include information on the number of cylinders (either 4, 6 or 8):
cols <- rep("blue", 32)
cols[mtcars$cyl==6] <- "green"
cols[mtcars$cyl==8] <- "red"
plot(mtcars$hp, mtcars$mpg,
xlab = "Horsepower",
ylab = "Miles per Gallon",
col = cols, pch = 20)
legend(250, 33,
paste(c(4, 6, 8), "Cylinders"),
pch = 20,
col = c("blue", "green", "red"))
Exercise
Do the graph again, but use the cylinder numbers as plotting symbols:
Let’s add the least squares regression lines for each cylinder separately. We could use the abline(fit) command, but notice that the hp ranges for the three cylinder types are quite different. Using abline would draw the fit all the way across the graph. Here is a better solution:
cls <- c("blue", "green", "red")
plot(mtcars$hp, mtcars$mpg,
xlab = "Horsepower",
ylab = "Miles per Gallon",
col = cols, pch = 20)
legend(250, 33,
paste(c(4, 6, 8), "Cylinders"),
pch = 20,
col = cls)
for(i in c(4, 6, 8)) {
fit <- lm(mpg[cyl==i]~hp[cyl==i],
data=mtcars)
x <- range(mtcars$hp[mtcars$cyl==i])
y <- coef(fit)[1] + coef(fit)[2]*x
segments(x[1], y[1], x[2], y[2],
col=cls[(i-2)/2])
}
sometimes we want to combine several graphs into one
par(mfrow=c(2, 2))
hist(rnorm(1000), 50,
main="Normal Distribution",
ylab="", xlab="")
hist(rt(1000, 2), 50,
main="t Distribution, df=2",
ylab="", xlab="")
hist(runif(1000), 50,
main="Uniform Distribution",
ylab="", xlab="")
hist(rchisq(1000, 2), 50,
main = expression(paste(chi^2,
" Distribution, df=2")),
ylab="", xlab="")
This works fine for rectangular arrays. For more complicated graphs we have the layout command. Let’s create a scatterplot of mpg by hp with marginal boxplots:
layout(matrix(c(2, 0, 1, 3), 2, 2,
byrow = TRUE),
c(3, 1), c(1, 3), TRUE)
par(mar = c(3, 3, 1, 1))
plot(mtcars$hp, mtcars$mpg)
par(mar = c(0, 3, 1, 1))
boxplot(mtcars$hp,
axes = FALSE, horizontal = TRUE)
par(mar = c(3, 0, 1, 1))
boxplot(mtcars$mpg, axes = FALSE)
When I write a paper or a talk or anything else that requires some graphs I always write a routine plot.mytalk, which has everything to recreate every graph. This is a great way to not only make changes if necessary but also to remind myself what I did back years ago when I wrote it.
So I might have the following function to do the graph above:
figure1 <- function() {
layout(matrix(c(2, 0, 1, 3), 2, 2,
byrow = TRUE),
c(3, 1), c(1, 3), TRUE)
par(mar = c(3, 3, 1, 1))
plot(hp, mpg)
par(mar = c(0, 3, 1, 1))
boxplot(hp, axes = FALSE, horizontal = TRUE)
par(mar = c(3, 0, 1, 1))
boxplot(mpg, axes = FALSE)
}
Note these days I often do this with Rmarkdown instead.
we can use R to make other types of pictures. Consider this diagram, which I use to illustrate the topic of Cause vs Effect:
it is actually done with these commands:
plot(c(10, 100), c(40, 100), axes=F,
xlab="", ylab="", type="n")
text(15, 90, "Cause-Effect",
cex=1.2, adj = 0)
text(c(20, 40), c(65, 65), c("X", "Y"),
cex=1.1, adj=0)
arrows(25, 65, 38, 65)
text(56, 90, "Confounding Variable",
cex=1.2, adj=0)
text(c(60, 70, 80), c(50, 75, 50), c("X", "Z", "Y"),
cex=1.1, adj=0)
arrows(70, 70, 62, 55)
arrows(70, 70, 78, 55)
An important example of this are maps:
library(maps)
map("usa")
text(-87.623177, 41.881832, "Chicago", cex=1.3)
People have done some very funny things with this. Run this code in the console (you need to install the library cowsay first)
library(cowsay)
say("Hello world!", by = "cow")
Often we need to save a graph as a png or a postscript file, so we can include it in a webpage or a latex document. I have to do this often enough I wrote a routine for it:
graph.out <-
function (f, foldername, graphname, format = "png")
{
file <- paste0(foldername, graphname, ".", format)
cat(file)
if (format == "png")
png(file)
if (format == "pdf")
pdf(file)
if (format == "ps")
postscript(file, horizontal = F, pointsize = 17)
if (format == "eps") {
setEPS()
postscript(file, horizontal = F, pointsize = 17)
}
f()
dev.off()
}
here f is a function that produces a graph, so it might be called like this:
f <- function(x) hist(rnorm(1000), 50)
graph.out(f, "C:/mygraphs", "myhist")
and now there should be a file myhist.png in the folder C:/mygraphs.