This is the same data set we considered in the previous exercises.
Problem 1 Find the best model to predict Score from GPA and Distance.
Problem 2 Analyse the data with score as the response and Gender and Years as factors (predictors).
Problem 3 Find a 90% interval estimate for the score of a male student with a GPA of 2.15. Is this an interpolation or an extrapolation?
Problem 4 Find a 95% interval estimate for the score of a 20 year old male student who lives 2 miles from school. (code the variable Gender, ignore the issue of parallel lines)
attach(studentsurvey)
Problem 1 Find the best model to predict Score from GPA and Distance.
In Exercise Problems 3 we found a linear model in GPA and a log model in Distance, so let’s try this:
mlr(Score, cbind(GPA, log(Distance + 1)))
## The least squares regression equation is:
## Score = 3.089 + 1.329 GPA + 0.024
## R^2 = 11%
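The plots look good, so there is no problem with the assumptions.
Notice that the name of the second predictor, log(Distance + 1), is missing from the equation. This happens because cbind only names a column automatically when the argument is a plain variable name, not an expression. If we want to fix that we can do this: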
X <- cbind(GPA, log(Distance + 1))
colnames(X) <- c("GPA", "log(Distance + 1)")
mlr(Score, X)
## The least squares regression equation is:
## Score = 3.089 + 1.329 GPA + 0.024 log(Distance + 1)
## R^2 = 11%
Can we simplify the model?
mallows(Score, X)
## Number of Variables Cp GPA log(Distance + 1)
## 1 1.04 X
## 2 3 X X
The smallest Cp is for the model with GPA only, so this is the best model.
slr(GPA, Score)
## The least squares regression equation is:
## GPA = 1.845 + 0.083 Score
## R^2 = 11%
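For readers without the course routines, the Cp comparison above can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated stand-ins (not the studentsurvey data), the helper `mallows_cp` is a hypothetical name, and it uses the usual definition Cp = SSE_p / MSE_full - n + 2p, where p counts the coefficients including the intercept. With this definition the full model always has Cp = p, which agrees with the value 3 in the table above.

```r
# Simulated stand-in data (hypothetical, NOT the studentsurvey data)
set.seed(1)
n <- 100
gpa <- runif(n, 1, 4)
dist <- rexp(n, rate = 0.2)
score <- 3 + 1.3 * gpa + rnorm(n, sd = 2)
dat <- data.frame(score, gpa, logdist = log(dist + 1))

# Full model and its mean squared error
full <- lm(score ~ gpa + logdist, data = dat)
mse_full <- sum(resid(full)^2) / df.residual(full)

# Cp = SSE_p / MSE_full - n + 2 * p, p = number of coefficients incl. intercept
mallows_cp <- function(fit) {
  p <- length(coef(fit))
  sum(resid(fit)^2) / mse_full - n + 2 * p
}

cp <- c(
  gpa     = mallows_cp(lm(score ~ gpa, data = dat)),
  logdist = mallows_cp(lm(score ~ logdist, data = dat)),
  full    = mallows_cp(full)
)
print(round(cp, 2))
```

The model with the smallest Cp is preferred, exactly as in the table produced by the course routine.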
Problem 2 Analyse the data with score as the response and Gender and Years as factors (predictors).
This is a twoway ANOVA problem. In Exercise Problems 2 we already looked at the boxplots and the summary statistics. Next we need to consider any possible interaction:
In problem 3 of the Exercise Problems 2 we found a statistically significant correlation between Score and GPA. Let’s find a good model.
iplot(Score, Gender, Year)
There seems to be some interaction here. Can we test for it? We would need repeated measurements:
table(Gender, Year)
## Year
## Gender Freshman Junior Senior Sophomore
## Female 28 23 26 34
## Male 38 24 33 43
which we have: every combination of Gender and Year has many observations. So
twoway(Score, Gender, Year)
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 35.2 35.23 8.742 0.00342
## z 3 11.4 3.79 0.940 0.42212
## x:z 3 7.9 2.63 0.652 0.58255
## Residuals 241 971.3 4.03
## [,1]
## Gender p = 0.0034
## Year p = 0.4221
## Interaction p = 0.5825
The plots look good, so there are no problems with the assumptions.
The test for interaction has p = 0.5825, so there is no evidence of interaction. We can refit without it:
twoway(Score, Gender, Year, with.interaction=FALSE)
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 35.2 35.23 8.780 0.00335
## z 3 11.4 3.79 0.944 0.42013
## Residuals 244 979.2 4.01
## [,1]
## Gender p = 0.0033
## Year p = 0.4201
The test for Year has p = 0.42, so there is no evidence that Year affects Score, and we can drop that term as well. We are now back to a oneway ANOVA of Score by Gender, which we already analysed in Exercise Problems 2 and 3.
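The same sequence of steps (check for replication, fit with interaction, refit without it) can be sketched with base R's `aov`. This is a minimal illustration on simulated data, not the studentsurvey data, and it is not claimed to be what the course routine `twoway` does internally. Note that with unequal cell counts `aov` reports sequential sums of squares, so the order of the terms can matter.

```r
# Simulated stand-in data (hypothetical, NOT the studentsurvey data)
set.seed(2)
n <- 200
gender <- factor(sample(c("Female", "Male"), n, replace = TRUE))
year <- factor(sample(c("Freshman", "Sophomore", "Junior", "Senior"),
                      n, replace = TRUE))
score <- 6 + 0.8 * (gender == "Male") + rnorm(n, sd = 2)
dat <- data.frame(score, gender, year)

# Replication check: a test for interaction needs more than one
# observation per Gender-Year cell
print(table(dat$gender, dat$year))

# Model with interaction, then the additive model without it
with_int <- aov(score ~ gender * year, data = dat)
no_int <- aov(score ~ gender + year, data = dat)
print(summary(with_int))
print(summary(no_int))
```

If the interaction row has a large p value, the additive fit is the one to interpret, as was done above.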
Problem 3 Find a 90% interval estimate for the score of a male student with a GPA of 2.15. Is this an interpolation or an extrapolation?
We have a quantitative response (Score), a quantitative predictor (GPA) and a categorical predictor (Gender), so this is a regression problem with a dummy variable.
dlr(Score, GPA, Gender)
## The least squares regression equation is:
## Score = 6 + 0.287 GPA - 5.162 Gender + 1.877 GPA*Gender
## R^2 = 19.6
Do we need the product term?
independent.lines <- dlr(Score, GPA, Gender, return.model=TRUE)
parallel.lines <- dlr(Score, GPA, Gender, additive=TRUE,
                      return.model=TRUE)
nested.models.test(parallel.lines, independent.lines)
## H0: both models are equally good.
## p value= 0.000
This gives p = 0.000, so the product term is indeed necessary.
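The comparison of parallel lines against independent lines is a partial F test between nested models, which base R provides through `anova` applied to two `lm` fits. The sketch below uses simulated data (not the studentsurvey data), and it is an assumption that this corresponds to what `nested.models.test` computes.

```r
# Simulated stand-in data (hypothetical, NOT the studentsurvey data)
set.seed(3)
n <- 150
gender <- factor(sample(c("Female", "Male"), n, replace = TRUE))
gpa <- runif(n, 0.7, 4)
score <- 6 + 0.3 * gpa +
  ifelse(gender == "Male", -5 + 1.9 * gpa, 0) + rnorm(n, sd = 2)
dat <- data.frame(score, gpa, gender)

# Parallel lines: common slope, different intercepts
parallel <- lm(score ~ gpa + gender, data = dat)
# Independent lines: the gpa:gender product term allows different slopes
separate <- lm(score ~ gpa * gender, data = dat)

# Partial F test; a small p value says the product term is needed
cmp <- anova(parallel, separate)
print(cmp)
```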
Next we want a 90% interval estimate for the score of a male student with a GPA of 2.15:
dlr.predict(Score, GPA, Gender, newx=2.15,
            newz="Male", interval="PI", conf.level=90)
## GPA Gender Fit Lower Upper
## 1 2.15 Male 5.49 2.45 8.53
Is this an interpolation or an extrapolation?
range(GPA[Gender=="Male"])
## [1] 0.71 3.46
This shows that 2.15 lies within the range of GPA values observed for males, so this is an interpolation.
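A base R version of the prediction interval and the interpolation check might look like the sketch below. It runs on simulated data (not the studentsurvey data), so the numbers will not match the output above. Note that `predict` expects the level as a proportion (0.90).

```r
# Simulated stand-in data (hypothetical, NOT the studentsurvey data)
set.seed(4)
n <- 150
gender <- factor(sample(c("Female", "Male"), n, replace = TRUE))
gpa <- runif(n, 0.7, 4)
score <- 6 + 0.3 * gpa +
  ifelse(gender == "Male", -5 + 1.9 * gpa, 0) + rnorm(n, sd = 2)
dat <- data.frame(score, gpa, gender)

# Independent-lines model (with the product term)
fit <- lm(score ~ gpa * gender, data = dat)

# 90% prediction interval for one male student with GPA 2.15
new_student <- data.frame(
  gpa = 2.15,
  gender = factor("Male", levels = levels(dat$gender))
)
pi90 <- predict(fit, newdata = new_student,
                interval = "prediction", level = 0.90)
print(round(pi90, 2))

# Interpolation check: is 2.15 inside the observed GPA range for males?
male_range <- range(dat$gpa[dat$gender == "Male"])
is_interpolation <- 2.15 >= male_range[1] && 2.15 <= male_range[2]
print(is_interpolation)
```

A prediction interval (rather than a confidence interval) is the right choice here because the question asks about the score of an individual student, not the mean score of all such students.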
Problem 4 Find a 95% interval estimate for the score of a 20 year old male student who lives 2 miles from school. (code the variable Gender, ignore the issue of parallel lines)
As before we need to remove observation #220 from the data set and we need to use the log transform on Distance:
X <- data.frame(Distance=log(Distance[-220]+1), Age=Age[-220], Gender=ifelse(Gender[-220]=="Male", 1, 0))
mlr(Score[-220] , X)
## The least squares regression equation is:
## Score[-220] = 8.905 + 0.082 Distance - 0.118 Age - 0.745 Gender
## R^2 = 3.8%
When doing prediction there is usually no reason to simplify the model, so we won’t use Mallows' Cp. Now we can find the interval:
newx <- cbind(Distance=log(2)+1, Age=20, Gender=1)
mlr.predict(Score[-220], X, newx=newx, interval="PI")
## Distance Age Gender Fit Lower Upper
## 1.693147 20 1 5.95 1.99 9.91
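One detail is worth checking by hand. The model was fit on log(Distance + 1), so a distance of 2 miles corresponds to log(2 + 1) ≈ 1.0986 on the transformed scale, whereas the Distance value 1.693147 in the output equals log(2) + 1. The sketch below plugs both values into the fitted equation using the rounded coefficients printed above, so the results only approximate the fit. The helper `point_fit` is a hypothetical name, and the interval limits cannot be recovered this way.

```r
# Rounded coefficients copied from the fitted equation above
b <- c(intercept = 8.905, distance = 0.082, age = -0.118, gender = -0.745)

# Point prediction from the fitted equation (hypothetical helper)
point_fit <- function(logdist, age, male) {
  unname(b["intercept"] + b["distance"] * logdist +
           b["age"] * age + b["gender"] * male)
}

# Transformed distance as entered in newx versus as used for the fit
as_entered <- point_fit(log(2) + 1, age = 20, male = 1)
as_transformed <- point_fit(log(2 + 1), age = 20, male = 1)
print(round(c(as_entered = as_entered, as_transformed = as_transformed), 2))
```

Because the coefficient of the transformed distance is small, the two point predictions differ by only about 0.05, so the interval is affected very little.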