General Statistics

Descriptive Statistics

In Statistics we distinguish between quantitative (= numerical) and categorical (= qualitative) data. The main difference is that arithmetic, for example calculating the mean, makes sense only for quantitative data.

Note that a data set is not necessarily quantitative just because it consists of digits. For example, digits are often used as labels, as in zip codes or student ID numbers.

Sometimes a variable can be treated as either categorical or quantitative. Say we have a variable “number of times a student failed a course”. For some purposes one can treat this as quantitative, for example to find the mean. For others we can treat it as categorical, for example to make a table or to use it as the grouping variable in a boxplot.
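As a small illustration of treating the same variable both ways, here is a sketch with made-up data. The vectors fails and zip are hypothetical and are not part of the upr data.

```r
# hypothetical data: number of times each of 8 students failed a course
fails <- c(0, 1, 0, 2, 0, 1, 3, 0)

# treated as quantitative: arithmetic makes sense
mean_fails <- mean(fails)
mean_fails

# treated as categorical: count how often each value occurs
fails_tbl <- table(fails)
fails_tbl

# digits used as labels: storing them as character prevents
# meaningless arithmetic, but they can still be tabulated
zip <- c("00680", "00681", "00680")
zip_tbl <- table(zip)
zip_tbl
```

Storing labels such as zip codes as character (or factor) is one simple way to keep R from doing arithmetic on them by accident.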


Consider the upr admissions data. Here are some simple things to do when looking at this kind of data:

Tables

gender.tbl <- table(upr$Gender)
names(gender.tbl) <- c("Female", "Male")
Percentage <- round(gender.tbl/sum(gender.tbl)*100, 1)
cbind(Counts=gender.tbl, Percentage)
##        Counts Percentage
## Female  11487       48.5
## Male    12179       51.5

Contingency Tables

tbl <- table(upr$Gender, upr$Class.Facultad)
tbl
##    
##     ADEM ARTES CIAG CIENCIAS INGE
##   F 1401  2570 1038     4127 2351
##   M 1091  1554 1288     2887 5359

In a contingency table percentages can be calculated in three ways: based on the grand total, on the row totals, or on the column totals.

First we find the grand, row and column totals:

# overall total
ot <- sum(tbl)
ot
## [1] 23666
# row total
rt <- apply(tbl, 1, sum)
rt
##     F     M 
## 11487 12179
# column total
ct <- apply(tbl, 2, sum)
ct
##     ADEM    ARTES     CIAG CIENCIAS     INGE 
##     2492     4124     2326     7014     7710

Then we use each of them to find percentages:

  • by grand total
tmp <- cbind(tbl, Total=rt)
tmp <- rbind(tmp, Total=c(ct, sum(ct)))
round(tmp/ot*100, 1)
##       ADEM ARTES CIAG CIENCIAS INGE Total
## F      5.9  10.9  4.4     17.4  9.9  48.5
## M      4.6   6.6  5.4     12.2 22.6  51.5
## Total 10.5  17.4  9.8     29.6 32.6 100.0
  • by row total
round(tmp/c(rt, ot)*100, 1)
##       ADEM ARTES CIAG CIENCIAS INGE Total
## F     12.2  22.4  9.0     35.9 20.5   100
## M      9.0  12.8 10.6     23.7 44.0   100
## Total 10.5  17.4  9.8     29.6 32.6   100
  • by column total
t(round(t(tmp)/c(ct, ot)*100, 1))
##        ADEM ARTES  CIAG CIENCIAS  INGE Total
## F      56.2  62.3  44.6     58.8  30.5  48.5
## M      43.8  37.7  55.4     41.2  69.5  51.5
## Total 100.0 100.0 100.0    100.0 100.0 100.0
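
Base R's prop.table computes the same three sets of percentages directly. The sketch below rebuilds the table from the counts printed above, so that it runs without the upr data. The name counts is introduced here only for illustration.

```r
# counts copied from the contingency table shown above
counts <- as.table(matrix(
  c(1401, 2570, 1038, 4127, 2351,
    1091, 1554, 1288, 2887, 5359),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("F", "M"),
                  Faculty = c("ADEM", "ARTES", "CIAG", "CIENCIAS", "INGE"))))

# percentages by grand total, row total and column total
pct_grand <- round(prop.table(counts) * 100, 1)
pct_row   <- round(prop.table(counts, margin = 1) * 100, 1)
pct_col   <- round(prop.table(counts, margin = 2) * 100, 1)

pct_grand
pct_row
pct_col

# addmargins appends the row and column totals to the table of counts
addmargins(counts)
```

Unlike the tables above, prop.table does not add the Total row and column. If those are needed, addmargins can be applied to the table of counts before dividing.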

The command CrossTable in the gmodels package will do all of them:

library(gmodels)
CrossTable(upr$Gender, upr$Class.Facultad)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  23666 
## 
##  
##              | upr$Class.Facultad 
##   upr$Gender |      ADEM |     ARTES |      CIAG |  CIENCIAS |      INGE | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            F |      1401 |      2570 |      1038 |      4127 |      2351 |     11487 | 
##              |    30.297 |   161.341 |     7.334 |   153.350 |   517.240 |           | 
##              |     0.122 |     0.224 |     0.090 |     0.359 |     0.205 |     0.485 | 
##              |     0.562 |     0.623 |     0.446 |     0.588 |     0.305 |           | 
##              |     0.059 |     0.109 |     0.044 |     0.174 |     0.099 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            M |      1091 |      1554 |      1288 |      2887 |      5359 |     12179 | 
##              |    28.576 |   152.174 |     6.917 |   144.637 |   487.851 |           | 
##              |     0.090 |     0.128 |     0.106 |     0.237 |     0.440 |     0.515 | 
##              |     0.438 |     0.377 |     0.554 |     0.412 |     0.695 |           | 
##              |     0.046 |     0.066 |     0.054 |     0.122 |     0.226 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total |      2492 |      4124 |      2326 |      7014 |      7710 |     23666 | 
##              |     0.105 |     0.174 |     0.098 |     0.296 |     0.326 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 

Sometimes you want only some of these. Say we want just the row percentages. The arguments of CrossTable show which parts of the output can be switched off:

args(CrossTable)
## function (x, y, digits = 3, max.width = 5, expected = FALSE, 
##     prop.r = TRUE, prop.c = TRUE, prop.t = TRUE, prop.chisq = TRUE, 
##     chisq = FALSE, fisher = FALSE, mcnemar = FALSE, resid = FALSE, 
##     sresid = FALSE, asresid = FALSE, missing.include = FALSE, 
##     format = c("SAS", "SPSS"), dnn = NULL, ...) 
## NULL
CrossTable(upr$Gender, upr$Class.Facultad,
         prop.c=FALSE, prop.chisq = FALSE, prop.t=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  23666 
## 
##  
##              | upr$Class.Facultad 
##   upr$Gender |      ADEM |     ARTES |      CIAG |  CIENCIAS |      INGE | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            F |      1401 |      2570 |      1038 |      4127 |      2351 |     11487 | 
##              |     0.122 |     0.224 |     0.090 |     0.359 |     0.205 |     0.485 | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
##            M |      1091 |      1554 |      1288 |      2887 |      5359 |     12179 | 
##              |     0.090 |     0.128 |     0.106 |     0.237 |     0.440 |     0.515 | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## Column Total |      2492 |      4124 |      2326 |      7014 |      7710 |     23666 | 
## -------------|-----------|-----------|-----------|-----------|-----------|-----------|
## 
## 

Bar Charts

library(ggplot2)
ggplot(upr, aes(Class.Facultad)) + 
  geom_bar(alpha=0.75, fill="lightblue") +
  xlab("")

ggplot(upr, aes(Class.Facultad, fill=Gender)) + 
  geom_bar(position="dodge", alpha=0.75) 

As with the tables, graphs can be based on percentages:

  • grand total
ggplot(upr, aes(Class.Facultad, fill=Gender)) + 
  # after_stat() replaces the deprecated ..count.. notation
  geom_bar(aes(y = after_stat(count / sum(count) * 100)),
           position="dodge", 
           alpha=0.75) +
  ylab("Percentage")

  • row total

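For this one we have to work a bit:

# divide each count by the total of its own row (Female or Male)
tmp1 <- c(tmp[1, 1:5], tmp[2, 1:5])/rep(rt, each = 5)*100
# the first five values belong to Female, the last five to Male
df <- data.frame(Percentage = tmp1,
                 Gender = rep(c("Female", "Male"), each = 5),
                 Class = names(tmp1))
ggplot(df, aes(x = Class, 
               y = Percentage,
               fill = Gender)) + 
    geom_bar(position = "dodge",
             stat = "identity") 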

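Notice the use of stat = "identity". It is needed if the data is already in the form of a table, so that the bar heights are taken from the y variable instead of being counted from the rows.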
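
An alternative route to the same graph is to let prop.table and as.data.frame build the data frame. This is a minimal sketch, not the method used above: it rebuilds the counts from the contingency table so that it runs without the upr data, the names counts, row_pct and df_row are made up for illustration, and it uses geom_col, which is ggplot2's shorthand for geom_bar with stat = "identity".

```r
library(ggplot2)

# counts copied from the contingency table shown earlier
counts <- as.table(matrix(
  c(1401, 2570, 1038, 4127, 2351,
    1091, 1554, 1288, 2887, 5359),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("F", "M"),
                  Class = c("ADEM", "ARTES", "CIAG", "CIENCIAS", "INGE"))))

# row percentages: each row (gender) sums to 100
row_pct <- prop.table(counts, margin = 1) * 100

# long format: one row per Gender/Class combination
df_row <- as.data.frame(row_pct, responseName = "Percentage")
df_row

p <- ggplot(df_row, aes(x = Class, y = Percentage, fill = Gender)) +
  geom_col(position = "dodge")
print(p)
```

Because the Gender labels come from the table itself, there is no need to line up the labels and the percentages by hand.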

Numerical Summaries

round(mean(upr$Freshmen.GPA), 3)
## [1] NA

The result is NA (not an error) because there are missing values. With the argument na.rm=TRUE they are removed before the calculation:

round(mean(upr$Freshmen.GPA, na.rm=TRUE), 3)
## [1] 2.733
round(median(upr$Freshmen.GPA, na.rm=TRUE), 3)
## [1] 2.83
round(sd(upr$Freshmen.GPA, na.rm=TRUE), 3)
## [1] 0.779
round(quantile(upr$Freshmen.GPA, 
               probs = c(0.1, 0.25, 0.75, 0.9),
               na.rm=TRUE), 3)
##  10%  25%  75%  90% 
## 1.71 2.32 3.28 3.65
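
The effect of missing values is easy to see on a small example. The vector gpa below is made up for illustration and is not taken from the upr data.

```r
# hypothetical GPAs with one missing value
gpa <- c(2.0, 3.0, NA, 4.0, 3.5)

# any NA makes the mean NA
mean_with_na <- mean(gpa)
mean_with_na

# how many values are missing?
n_missing <- sum(is.na(gpa))
n_missing

# drop the missing values before averaging
mean_no_na <- mean(gpa, na.rm = TRUE)
mean_no_na

# summary reports the number of NAs along with the usual statistics
summary(gpa)
```

It is usually worth checking how many values are missing before removing them, because a large number of NAs can make a summary unrepresentative.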

Histogram and Boxplot

bw <- diff(range(upr$Freshmen.GPA, na.rm = TRUE))/50 # use about 50 bins
ggplot(upr, aes(Freshmen.GPA)) +
  geom_histogram(color = "black", 
                 fill = "white", 
                 binwidth = bw) + 
  labs(x = "Freshmen GPA", y = "Counts")

ggplot(upr, aes(x="", y=Freshmen.GPA)) + 
  geom_boxplot() + 
  xlab("")

ggplot(upr, aes(factor(Year), Freshmen.GPA)) + 
  geom_boxplot() +
  xlab("Year")
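
The numbers behind a boxplot can be computed by hand. The sketch below uses made-up data and assumes the usual 1.5 × IQR rule for flagging outliers, which as far as I know is also the default of geom_boxplot (argument coef = 1.5).

```r
# hypothetical data with one unusually large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# box: first quartile, median, third quartile
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q

# interquartile range and the fences for the whiskers
iqr <- unname(q[3] - q[1])
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

# observations outside the fences are drawn as individual points
outliers <- x[x < lower_fence | x > upper_fence]
outliers

# whiskers extend to the most extreme observations inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])
whiskers
```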

Two Quantitative Variables

round(cor(upr$Year, upr$Freshmen.GPA, 
    use="complete.obs"), 3)
## [1] 0.097
ggplot(upr, aes(Year, Freshmen.GPA)) + 
  geom_point() +
  scale_x_continuous(breaks = 2003:2013) +
  labs(x="Year", y="GPA after Freshmen Year")

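This is not so good because many of the dots are plotted on top of each other, so we can't see them. Here is a better solution, which moves each dot sideways by a small random amount and uses a smaller plotting symbol: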

ggplot(upr, aes(Year, Freshmen.GPA)) + 
  geom_jitter(shape=".", width=0.1, height = 0) +
  scale_x_continuous(breaks = 2003:2013) +
  labs(x="Year", y="GPA after Freshmen Year")
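
The argument use="complete.obs" in the call to cor above plays the same role as na.rm=TRUE for the mean: pairs with a missing value are dropped. A minimal sketch with made-up vectors x and y:

```r
# hypothetical paired data, one y value is missing
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, NA, 8, 10)

# with the default settings a single NA makes the correlation NA
r_default <- cor(x, y)
r_default

# use only the pairs where both values are present
r_complete <- cor(x, y, use = "complete.obs")
r_complete
```

Here the four complete pairs lie exactly on a straight line, so the correlation of the complete pairs is 1.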

Normal Probability Plot

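An important graph is the normal probability plot (also called a normal Q-Q plot), which plots the sample quantiles against the corresponding theoretical quantiles of a normal distribution. If the data come from a normal distribution, the points should fall close to a straight line: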

x <- rnorm(20)
df <- data.frame(x)
ggplot(data=df, aes(sample=x)) +
   geom_qq() + geom_qq_line()       

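Here is an example where the normal assumption fails. The data come from an exponential distribution, which is skewed to the right, so the points curve away from the line: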

x <- rexp(20)
df <- data.frame(x=x)
ggplot(data=df, aes(sample=x)) +
   geom_qq() + geom_qq_line()
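
To see what such a plot contains, the coordinates can be computed by hand. This is a sketch in base R: it assumes the plotting positions returned by ppoints, which is what base R's qqnorm uses (and, as far as I know, the default of geom_qq as well). The names sample_q and theo_q are made up for illustration.

```r
set.seed(1)          # make the random numbers reproducible
x <- rnorm(20)
n <- length(x)

# sample quantiles: the ordered observations
sample_q <- sort(x)

# theoretical quantiles of the standard normal at the plotting positions
theo_q <- qnorm(ppoints(n))

# the same coordinates as computed by qqnorm (without drawing the plot)
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theo_q)
all.equal(sort(qq$y), sample_q)

# the normal probability plot drawn by hand
plot(theo_q, sample_q,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
```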