At UPRM (and most other universities) a big problem is retaining students from year to year: many first-year students never return for a second year, and so on. Some years ago our Chancellor put together a group of professors and asked us to find ways to improve the situation. Among other things, we tried to see whether it was possible to identify those students who were at high risk of not returning for the second year, and those at risk of not graduating. We asked the Registrar's office for some data and received the data set upr (now part of RESMA3.RData). Let's see what is in there:
dim(upr)
## [1] 23666 16
So the data set has 23666 records and 16 variables. They are:
colnames(upr)
## [1] "ID.Code" "Year" "Gender" "Program.Code"
## [5] "Highschool.GPA" "Aptitud.Verbal" "Aptitud.Matem" "Aprov.Ingles"
## [9] "Aprov.Matem" "Aprov.Espanol" "IGS" "Freshmen.GPA"
## [13] "Graduated" "Year.Grad." "Grad..GPA" "Class.Facultad"
So, how could this data be used to tell us something about students at risk? Here are some ideas:
ggplot(data = upr, aes(Highschool.GPA, Freshmen.GPA)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
It appears there is a positive correlation between these two variables. Of course, we can find the least squares regression line:
summary(lm(Freshmen.GPA~Highschool.GPA, data=upr))
##
## Call:
## lm(formula = Freshmen.GPA ~ Highschool.GPA, data = upr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0675 -0.3826 0.0864 0.4985 2.1763
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.85869 0.04897 -17.54 <2e-16
## Highschool.GPA 0.98155 0.01332 73.67 <2e-16
##
## Residual standard error: 0.7022 on 23449 degrees of freedom
## (215 observations deleted due to missingness)
## Multiple R-squared: 0.188, Adjusted R-squared: 0.1879
## F-statistic: 5428 on 1 and 23449 DF, p-value: < 2.2e-16
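Reading off the coefficients above, the fitted line is Freshmen.GPA ≈ -0.859 + 0.982 · Highschool.GPA. As a quick sanity check (a sketch using the rounded coefficients printed above, not a re-fit of the model), we can compute the predicted freshman GPA for a few hypothetical high school GPAs:

```r
# Coefficients copied from the regression summary above
b0 <- -0.85869
b1 <-  0.98155
hs <- c(2.5, 3.0, 3.5, 4.0)   # some hypothetical high school GPAs
round(b0 + b1 * hs, 2)        # predicted freshman GPAs
## [1] 1.60 2.09 2.58 3.07
```

Since the slope is essentially 1, on average a student's freshman GPA comes out almost a full point below their high school GPA.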
summary(lm(Freshmen.GPA~Highschool.GPA+Aptitud.Verbal+Aptitud.Matem, data=upr))
##
## Call:
## lm(formula = Freshmen.GPA ~ Highschool.GPA + Aptitud.Verbal +
## Aptitud.Matem, data = upr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2938 -0.3700 0.0881 0.4825 2.0421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.860e+00 5.663e-02 -32.84 <2e-16
## Highschool.GPA 8.960e-01 1.328e-02 67.48 <2e-16
## Aptitud.Verbal 1.604e-03 6.893e-05 23.26 <2e-16
## Aptitud.Matem 6.023e-04 5.541e-05 10.87 <2e-16
##
## Residual standard error: 0.6862 on 23447 degrees of freedom
## (215 observations deleted due to missingness)
## Multiple R-squared: 0.2245, Adjusted R-squared: 0.2244
## F-statistic: 2262 on 3 and 23447 DF, p-value: < 2.2e-16
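To see what the multiple regression says, we can again plug the printed (rounded) estimates into the fitted equation. This is just a sketch with a made-up student profile, not output from the model object:

```r
# Coefficients copied from the regression summary above
b <- c(intercept = -1.860, hs = 0.896,
       verbal = 1.604e-3, matem = 6.023e-4)
# A hypothetical student: HS GPA 3.5, Aptitud.Verbal 600, Aptitud.Matem 650
pred <- b["intercept"] + b["hs"] * 3.5 + b["verbal"] * 600 + b["matem"] * 650
round(unname(pred), 2)
## [1] 2.63
```

Notice also that R-squared improved only modestly (from 0.188 to 0.2245), so the two aptitude scores add relatively little predictive power beyond the high school GPA.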
We can also ask whether incoming high school GPAs differ across groups of programs, with an analysis of variance:
summary(aov(Highschool.GPA~Class.Facultad, data = upr))
## Df Sum Sq Mean Sq F value Pr(>F)
## Class.Facultad 4 630.7 157.68 1714 <2e-16
## Residuals 23661 2176.7 0.09
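The F statistic in this table is just the ratio of the two mean squares, and we can reconstruct it from the sums of squares and degrees of freedom shown (the residual mean square 0.09 is printed rounded):

```r
ss    <- c(630.7, 2176.7)   # sums of squares from the ANOVA table above
dfree <- c(4, 23661)        # degrees of freedom
ms    <- ss / dfree         # mean squares: 157.68 and about 0.092
round(ms[1] / ms[2])        # the F statistic
## [1] 1714
```

With an F statistic this large (and a p-value below 2e-16), the mean high school GPAs clearly differ between the groups.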
Or consider graduation. Because students need several years to finish, we restrict attention to those who entered in 2008 or earlier:
dta <- upr[upr$Year <= 2008, c("Highschool.GPA", "Graduated")]
dta$GradInd <- ifelse(dta$Graduated == "Si", 1, 0)
plt <- ggplot(dta, aes(Highschool.GPA, GradInd)) +
  geom_jitter(width = 0, height = 0.1)
plt
Here the outcome variable is binary (graduated or not), so a simple linear regression won't work. Instead one can try to predict the probability of success, using logistic regression:
fit <- glm(GradInd ~ Highschool.GPA,
           family = binomial,
           data = dta)
x <- seq(2, 4, length = 100)
df <- data.frame(x = x,
                 y = predict(fit, data.frame(Highschool.GPA = x),
                             type = "response"))
plt +
  geom_line(data = df, aes(x, y),
            color = "blue", size = 1.2)
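What does predict(..., type = "response") compute here? Logistic regression models the log-odds of success as a linear function of the predictor, so probabilities come from the inverse logit transform. Here is a sketch with made-up coefficients, chosen for illustration only (the actual estimates from fit are not shown above):

```r
# Hypothetical coefficients, for illustration only
b0 <- -6
b1 <-  2
x <- seq(2, 4, length = 5)
p <- 1 / (1 + exp(-(b0 + b1 * x)))   # inverse logit: always between 0 and 1
round(p, 3)
## [1] 0.119 0.269 0.500 0.731 0.881
```

Whatever the coefficients, this transform keeps the fitted curve between 0 and 1, which is why the blue line in the plot bends instead of running off like a straight regression line would.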
Now it turns out that all of these analyses (and many more) are special cases of a general approach to statistics called the Linear Model!