At UPRM (as at most universities) a big problem is retaining students from year to year. That is, many first-year students never return for a second year, and so on. Some years ago our Chancellor put together a group of professors and asked us to find ways to improve the situation. Among other things, we tried to see whether it was possible to identify those students who were at high risk of not returning for the second year, and those at risk of not graduating. We asked the Registrar's office for some data and received the data set upr (now part of RESMA3.RData). Let's see what is in there:

dim(upr)
## [1] 23666    16

so the data set has 23666 records and 16 variables. Those are

colnames(upr)
##  [1] "ID.Code"        "Year"           "Gender"         "Program.Code"  
##  [5] "Highschool.GPA" "Aptitud.Verbal" "Aptitud.Matem"  "Aprov.Ingles"  
##  [9] "Aprov.Matem"    "Aprov.Espanol"  "IGS"            "Freshmen.GPA"  
## [13] "Graduated"      "Year.Grad."     "Grad..GPA"      "Class.Facultad"

So, how could this data be used to tell us something about students at risk? Here are some ideas:

  1. Compare the high school and freshman GPAs:
ggplot(data=upr, aes(Highschool.GPA, Freshmen.GPA)) +
  geom_point() +
  geom_smooth(method = "lm", se=FALSE)

It appears there is a positive correlation between the two. Of course we can also find the least squares regression line:

summary(lm(Freshmen.GPA~Highschool.GPA, data=upr))
## 
## Call:
## lm(formula = Freshmen.GPA ~ Highschool.GPA, data = upr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0675 -0.3826  0.0864  0.4985  2.1763 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -0.85869    0.04897  -17.54   <2e-16
## Highschool.GPA  0.98155    0.01332   73.67   <2e-16
## 
## Residual standard error: 0.7022 on 23449 degrees of freedom
##   (215 observations deleted due to missingness)
## Multiple R-squared:  0.188,  Adjusted R-squared:  0.1879 
## F-statistic:  5428 on 1 and 23449 DF,  p-value: < 2.2e-16
  2. Why use just one predictor? Why not use more? We can do a multiple regression analysis:
summary(lm(Freshmen.GPA~Highschool.GPA+Aptitud.Verbal+Aptitud.Matem, data=upr))
## 
## Call:
## lm(formula = Freshmen.GPA ~ Highschool.GPA + Aptitud.Verbal + 
##     Aptitud.Matem, data = upr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2938 -0.3700  0.0881  0.4825  2.0421 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -1.860e+00  5.663e-02  -32.84   <2e-16
## Highschool.GPA  8.960e-01  1.328e-02   67.48   <2e-16
## Aptitud.Verbal  1.604e-03  6.893e-05   23.26   <2e-16
## Aptitud.Matem   6.023e-04  5.541e-05   10.87   <2e-16
## 
## Residual standard error: 0.6862 on 23447 degrees of freedom
##   (215 observations deleted due to missingness)
## Multiple R-squared:  0.2245, Adjusted R-squared:  0.2244 
## F-statistic:  2262 on 3 and 23447 DF,  p-value: < 2.2e-16
  3. It might well be that what puts a student in the Arts at risk is not a big problem in the Sciences, so one might do an ANOVA:
summary(aov(Highschool.GPA~Class.Facultad, data = upr))
##                   Df Sum Sq Mean Sq F value Pr(>F)
## Class.Facultad     4  630.7  157.68    1714 <2e-16
## Residuals      23661 2176.7    0.09
  4. Of course the ultimate success in college is graduating, so one might be interested in the relationship between the high school GPA and whether or not somebody graduated. People do need some time to graduate, though, so we should include only the years 2003-2008 (say):
dta <- upr[upr$Year<=2008, 
           c("Highschool.GPA", "Graduated")]
dta$GradInd <- ifelse(dta$Graduated=="Si", 1, 0)
plt <- ggplot(dta, aes(Highschool.GPA, GradInd)) +
  geom_jitter(width=0, height=0.1)
plt

Here the outcome variable is discrete, so a simple linear regression won't work. Instead one can try to predict the probability of success:

fit <- glm(GradInd~Highschool.GPA, 
           family=binomial,
           data=dta)
x <- seq(2, 4, length=100)
df <- data.frame(x=x, 
         y=predict(fit, data.frame(Highschool.GPA=x), 
                                type="response"))
plt +
  geom_line(data=df, aes(x, y),
            color="blue", size=1.2)

Now it turns out that all of these analyses (and many more) are special cases of a general approach to statistics called the Linear Model!
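
For instance, an ANOVA is just a regression on a factor variable: aov fits a linear model internally, so lm gives the same F test. Here is a small illustration with simulated data (not the upr data; the group names and means are made up for the example):

```r
# Simulated example: three groups with slightly different means
set.seed(111)
grp <- factor(rep(c("A", "B", "C"), each = 20))
y <- rnorm(60, mean = c(2.8, 3.0, 3.3)[as.numeric(grp)], sd = 0.4)

summary(aov(y ~ grp))            # classic ANOVA table
summary(lm(y ~ grp))$fstatistic  # same F statistic, same degrees of freedom
```

The F value and degrees of freedom in the ANOVA table match the overall F statistic of the regression exactly, because both come from the same fitted linear model.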