In statistics we distinguish between quantitative (= numerical) and categorical (= qualitative) data. The main difference is that arithmetic makes sense for quantitative data, for example calculating the mean.
Note that a data set is not necessarily quantitative just because it consists of digits. Digits are often used as labels, for example in student ID numbers.
Sometimes a variable can be treated as either categorical or quantitative. Say we have a variable “number of times a student failed a course”. For some purposes one can treat it as quantitative, for example to find the mean. For others one can treat it as categorical, for example to make a table or to use it as the grouping variable in a boxplot.
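As a small illustration of these distinctions, the sketch below uses made-up values (the vectors `section` and `failed` are hypothetical and not part of any data set used here). It shows that R will compute a mean of digit labels even though the result has no interpretation, and that a count variable can be summarized either way.

```r
# Hypothetical data, for illustration only
section <- c(1, 2, 1, 3, 2, 1)           # digits used as labels for course sections
failed <- c(0, 1, 0, 2, 0, 1, 3, 0)      # times each student failed a course

# R computes a mean of the labels, but the number has no interpretation
mean(section)

# treating the labels as categories is the sensible choice
section.tbl <- table(factor(section))
section.tbl

# the count variable as quantitative: find the mean
failed.mean <- mean(failed)
failed.mean

# the same variable as categorical: a frequency table
failed.tbl <- table(failed)
failed.tbl
```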
Consider the upr admissions data. Here are some simple things to do when looking at this kind of data:
gender.tbl <- table(upr$Gender)
names(gender.tbl) <- c("Female", "Male")
Percentage <- round(gender.tbl/sum(gender.tbl)*100, 1)
cbind(Counts=gender.tbl, Percentage)
## Counts Percentage
## Female 11487 48.5
## Male 12179 51.5
tbl <- table(upr$Gender, upr$Class.Facultad)
tbl
##
## ADEM ARTES CIAG CIENCIAS INGE
## F 1401 2570 1038 4127 2351
## M 1091 1554 1288 2887 5359
In a contingency table percentages can be calculated in three ways: relative to the grand total, to the row totals, or to the column totals.
First we find the grand, row and column totals:
# overall total
ot <- sum(tbl)
ot
## [1] 23666
# row total
rt <- apply(tbl, 1, sum)
rt
## F M
## 11487 12179
# column total
ct <- apply(tbl, 2, sum)
ct
## ADEM ARTES CIAG CIENCIAS INGE
## 2492 4124 2326 7014 7710
Then we use each to find percentages:
# append the row and column totals to the table
tmp <- cbind(tbl, Total=rt)
tmp <- rbind(tmp, Total=c(ct, sum(ct)))
# percentages based on the grand total
round(tmp/ot*100, 1)
## ADEM ARTES CIAG CIENCIAS INGE Total
## F 5.9 10.9 4.4 17.4 9.9 48.5
## M 4.6 6.6 5.4 12.2 22.6 51.5
## Total 10.5 17.4 9.8 29.6 32.6 100.0
# percentages based on the row totals
round(tmp/c(rt, ot)*100, 1)
## ADEM ARTES CIAG CIENCIAS INGE Total
## F 12.2 22.4 9.0 35.9 20.5 100
## M 9.0 12.8 10.6 23.7 44.0 100
## Total 10.5 17.4 9.8 29.6 32.6 100
# percentages based on the column totals
t(round(t(tmp)/c(ct, ot)*100, 1))
## ADEM ARTES CIAG CIENCIAS INGE Total
## F 56.2 62.3 44.6 58.8 30.5 48.5
## M 43.8 37.7 55.4 41.2 69.5 51.5
## Total 100.0 100.0 100.0 100.0 100.0 100.0
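The same three kinds of percentages can also be obtained with base R's `prop.table` and `addmargins`. The following is a sketch that rebuilds the counts shown above by hand, so it does not assume access to the upr data frame; the object names `tbl2`, `grand.pct`, `row.pct` and `col.pct` are ours.

```r
# counts copied from the contingency table above
tbl2 <- matrix(c(1401, 2570, 1038, 4127, 2351,
                 1091, 1554, 1288, 2887, 5359),
               nrow = 2, byrow = TRUE,
               dimnames = list(Gender = c("F", "M"),
                               Faculty = c("ADEM", "ARTES", "CIAG",
                                           "CIENCIAS", "INGE")))
tbl2 <- as.table(tbl2)

# percentages of the grand total
grand.pct <- round(prop.table(tbl2) * 100, 1)
# percentages within each row (gender)
row.pct <- round(prop.table(tbl2, 1) * 100, 1)
# percentages within each column (faculty)
col.pct <- round(prop.table(tbl2, 2) * 100, 1)
row.pct

# counts with row and column totals added
addmargins(tbl2)
```

The second argument of `prop.table` selects the margin: 1 for rows, 2 for columns, and nothing for the whole table.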
The command CrossTable in the gmodels package will do all of them:
library(gmodels)
CrossTable(upr$Gender, upr$Class.Facultad)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 23666
##
##
## | upr$Class.Facultad
## upr$Gender | ADEM | ARTES | CIAG | CIENCIAS | INGE | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## F | 1401 | 2570 | 1038 | 4127 | 2351 | 11487 |
## | 30.297 | 161.341 | 7.334 | 153.350 | 517.240 | |
## | 0.122 | 0.224 | 0.090 | 0.359 | 0.205 | 0.485 |
## | 0.562 | 0.623 | 0.446 | 0.588 | 0.305 | |
## | 0.059 | 0.109 | 0.044 | 0.174 | 0.099 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## M | 1091 | 1554 | 1288 | 2887 | 5359 | 12179 |
## | 28.576 | 152.174 | 6.917 | 144.637 | 487.851 | |
## | 0.090 | 0.128 | 0.106 | 0.237 | 0.440 | 0.515 |
## | 0.438 | 0.377 | 0.554 | 0.412 | 0.695 | |
## | 0.046 | 0.066 | 0.054 | 0.122 | 0.226 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 2492 | 4124 | 2326 | 7014 | 7710 | 23666 |
## | 0.105 | 0.174 | 0.098 | 0.296 | 0.326 | |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
Sometimes you want only some of these. Say we want just the row percentages. The arguments of CrossTable show which parts can be switched off:
args(CrossTable)
## function (x, y, digits = 3, max.width = 5, expected = FALSE,
## prop.r = TRUE, prop.c = TRUE, prop.t = TRUE, prop.chisq = TRUE,
## chisq = FALSE, fisher = FALSE, mcnemar = FALSE, resid = FALSE,
## sresid = FALSE, asresid = FALSE, missing.include = FALSE,
## format = c("SAS", "SPSS"), dnn = NULL, ...)
## NULL
CrossTable(upr$Gender, upr$Class.Facultad,
prop.c=FALSE, prop.chisq = FALSE, prop.t=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 23666
##
##
## | upr$Class.Facultad
## upr$Gender | ADEM | ARTES | CIAG | CIENCIAS | INGE | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## F | 1401 | 2570 | 1038 | 4127 | 2351 | 11487 |
## | 0.122 | 0.224 | 0.090 | 0.359 | 0.205 | 0.485 |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## M | 1091 | 1554 | 1288 | 2887 | 5359 | 12179 |
## | 0.090 | 0.128 | 0.106 | 0.237 | 0.440 | 0.515 |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 2492 | 4124 | 2326 | 7014 | 7710 | 23666 |
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##
##
library(ggplot2)
ggplot(upr, aes(Class.Facultad)) +
geom_bar(alpha=0.75, fill="lightblue") +
xlab("")
ggplot(upr, aes(Class.Facultad, fill=Gender)) +
geom_bar(position="dodge", alpha=0.75)
As with the tables, graphs can be done based on percentages:
ggplot(upr, aes(Class.Facultad, fill=Gender)) +
geom_bar(aes(y = after_stat(count/sum(count)*100)),
position="dodge",
alpha=0.75) +
ylab("Percentage")
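For percentages within each gender we have to work a bit:
# the first five values are the counts for women, the last five those for men,
# so the row totals and the labels have to be repeated in the same order
tmp1 <- c(tmp[1, 1:5], tmp[2, 1:5])/rep(rt, each=5)*100
df <- data.frame(Percentage = tmp1,
Gender=rep(c("Female", "Male"), each=5),
Class=names(tmp1))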
ggplot(df, aes(x = Class,
y = Percentage,
fill = Gender)) +
geom_bar(position = "dodge",
stat = "identity")
Notice the use of stat="identity", which is needed when the data is already in the form of a table.
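Another way to get such a data frame is to let R do the reshaping. The following sketch rebuilds the counts by hand and converts the table of row percentages to long format with `as.data.frame`. The object names `counts` and `pct.df` are ours, the column names are chosen to match the plot above, and the plot is only drawn if ggplot2 is installed. `geom_col` is used here as a shorthand for `geom_bar` with stat="identity".

```r
# counts copied from the contingency table of gender by faculty
counts <- as.table(matrix(c(1401, 2570, 1038, 4127, 2351,
                            1091, 1554, 1288, 2887, 5359),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(Gender = c("Female", "Male"),
                                          Class = c("ADEM", "ARTES", "CIAG",
                                                    "CIENCIAS", "INGE"))))

# row percentages, then one row per (Gender, Class) combination
pct.df <- as.data.frame(prop.table(counts, 1) * 100,
                        responseName = "Percentage")
pct.df

# draw the dodged bar chart if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(pct.df, aes(x = Class, y = Percentage, fill = Gender)) +
          geom_col(position = "dodge"))
}
```

Because the labels come from the table's own dimension names, they cannot get out of step with the values.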
round(mean(upr$Freshmen.GPA), 3)
## [1] NA
The result is NA because there are missing values, so we remove them with na.rm=TRUE:
round(mean(upr$Freshmen.GPA, na.rm=TRUE), 3)
## [1] 2.733
round(median(upr$Freshmen.GPA, na.rm=TRUE), 3)
## [1] 2.83
round(sd(upr$Freshmen.GPA, na.rm=TRUE), 3)
## [1] 0.779
round(quantile(upr$Freshmen.GPA,
probs = c(0.1, 0.25, 0.75, 0.9),
na.rm=TRUE), 3)
## 10% 25% 75% 90%
## 1.71 2.32 3.28 3.65
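To see how missing values affect these summaries, here is a minimal sketch with a made-up vector `gpa` (hypothetical values, not taken from the upr data). `mean` returns NA as soon as one value is missing, unless na.rm=TRUE is used.

```r
gpa <- c(2.5, 3.1, NA, 3.8, 1.9)       # hypothetical GPAs, one of them missing

mean(gpa)                              # NA: the missing value propagates
n.missing <- sum(is.na(gpa))           # how many values are missing
n.missing
gpa.mean <- mean(gpa, na.rm = TRUE)    # mean of the four observed values
gpa.mean
summary(gpa)                           # quartiles, mean and the number of NA's
```

It is usually a good idea to check the number of missing values before removing them, since a large number may itself be worth investigating.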
bw <- diff(range(upr$Freshmen.GPA, na.rm = TRUE))/50 # use about 50 bins
ggplot(upr, aes(Freshmen.GPA)) +
geom_histogram(color = "black",
fill = "white",
binwidth = bw) +
labs(x = "Freshmen GPA", y = "Counts")
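The bin width above is fixed by asking for about 50 bins. As a rough sketch of how bin width and number of bins relate, the code below uses simulated data (the vector `x` is a hypothetical stand-in for the GPAs). It also shows the Freedman-Diaconis rule, available as `nclass.FD` in the grDevices package, as one possible data-driven alternative; which choice works best depends on the data.

```r
set.seed(111)
x <- rnorm(1000, mean = 2.7, sd = 0.8)   # simulated stand-in for the GPAs

# fixed number of bins: width = range / number of bins
n.bins <- 50
bw <- diff(range(x)) / n.bins
bw

# data-driven alternative: let the Freedman-Diaconis rule suggest a bin count
n.fd <- nclass.FD(x)
bw.fd <- diff(range(x)) / n.fd
c(bins = n.fd, binwidth = bw.fd)
```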
ggplot(upr, aes(x="", y=Freshmen.GPA)) +
geom_boxplot() +
xlab("")
ggplot(upr, aes(factor(Year), Freshmen.GPA)) +
geom_boxplot() +
xlab("Year")
round(cor(upr$Year, upr$Freshmen.GPA,
use="complete.obs"), 3)
## [1] 0.097
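The argument use="complete.obs" tells `cor` to drop every pair in which either value is missing. A minimal sketch with made-up vectors (hypothetical data, chosen so that the complete pairs lie exactly on a line):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, NA, 8, 10)                 # hypothetical data, one missing value

cor(x, y)                               # NA because of the missing value
r <- cor(x, y, use = "complete.obs")    # uses only the four complete pairs
r
```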
ggplot(upr, aes(Year, Freshmen.GPA)) +
geom_point() +
scale_x_continuous(breaks = 2003:2013) +
labs(x="Year", y="GPA after Freshmen Year")
This is not so good because many of the dots are plotted on top of each other and can't be seen. Here is a better solution:
ggplot(upr, aes(Year, Freshmen.GPA)) +
geom_jitter(shape=".", width=0.1, height = 0) +
scale_x_continuous(breaks = 2003:2013) +
labs(x="Year", y="GPA after Freshmen Year")
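Jittering adds a small amount of random noise to the plotted positions so that identical values no longer sit on top of each other. The sketch below mimics the idea by hand with `runif` on made-up years (the vectors `year` and `year.jit` are hypothetical). It is meant only to illustrate the principle, not to reproduce the internals of ggplot2.

```r
set.seed(222)
year <- rep(2003:2005, each = 4)        # hypothetical, heavily tied x values

# add uniform noise of at most 0.1 in either direction
year.jit <- year + runif(length(year), min = -0.1, max = 0.1)

# number of distinct positions before and after jittering
c(before = length(unique(year)), after = length(unique(year.jit)))
```

Because the noise is small compared to the spacing of one year, each point can still be matched to its year by rounding.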
An important graph is the normal probability plot, which plots the sample quantiles against the theoretical quantiles of a normal distribution. If the data come from a normal distribution the points should fall close to a straight line:
x <- rnorm(20)
df <- data.frame(x)
ggplot(data=df, aes(sample=x)) +
geom_qq() + geom_qq_line()
Here is an example where the normal assumption fails:
x <- rexp(20)
df <- data.frame(x=x)
ggplot(data=df, aes(sample=x)) +
geom_qq() + geom_qq_line()
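To make clear what is being plotted, the following sketch computes the points of a normal probability plot by hand: the sorted data against standard normal quantiles at the probabilities given by `ppoints`. It then checks the result against base R's `qqnorm`. The data are simulated and the names `theo` and `samp` are ours.

```r
set.seed(333)
x <- rnorm(20)
n <- length(x)

# quantiles of a standard normal at n plotting positions
theo <- qnorm(ppoints(n))
# the sample quantiles are just the sorted observations
samp <- sort(x)

# base R computes the same pairs (returned in the original data order)
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theo)
all.equal(sort(qq$y), samp)

head(data.frame(theoretical = theo, sample = samp))
```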