Exercises 4

Case Study: Survey of Students

This is the same data set we considered in the previous exercises.

Problem 1 Find the best model to predict Score from GPA and Distance.

Problem 2 Analyse the data with score as the response and Gender and Years as factors (predictors).

Problem 3 Find a 90% interval estimate for the score of a male student with a GPA of 2.15. Is this an interpolation or an extrapolation?

Problem 4 Find a 95% interval estimate for the score of a 20 year old male student who lives 2 miles from school. (code the variable Gender, ignore the issue of parallel lines)

attach(studentsurvey)

Problem 1 Find the best model to predict Score from GPA and Distance.

In Exercise Problems 3 we found a linear model in GPA and a log model in Distance, so let’s try this:

mlr(Score, cbind(GPA, log(Distance + 1)))

## The least squares regression equation is: 
##  Score  =  3.089 + 1.329 GPA + 0.024  
## R^2 = 11%

the plots look good, so no problem with the assumptions.

Notice that the name of the variable Distance is missing. If we want to fix that we can do this:

X <- cbind(GPA, log(Distance + 1))
colnames(X) <- c("GPA", "log(Distance + 1)") 
mlr(Score, X)

## The least squares regression equation is: 
##  Score  =  3.089 + 1.329 GPA + 0.024 log(Distance + 1) 
## R^2 = 11%

Can we simplify the model?

mallows(Score, X)

##  Number of Variables Cp   GPA log(Distance + 1)
##  1                   1.04 X                    
##  2                   3    X   X

the smallest C_p is for the model with GPA only, so this is best.
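To make the C_p column less of a black box, here is a minimal sketch of how Mallows' C_p can be computed by hand in base R. It uses simulated stand-in data, not the student survey, and the names `x1`, `x2`, `y` and `mallows_cp` are hypothetical; the course routine `mallows` may be implemented differently. The sketch uses the standard formula C_p = SSE_p / MSE_full - n + 2p, where p counts the coefficients including the intercept.

```r
# Simulated stand-in data (NOT the student survey): y depends on x1 only.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + 1.5 * x1 + rnorm(n)

# The full model supplies the estimate of the error variance.
full     <- lm(y ~ x1 + x2)
mse_full <- sum(resid(full)^2) / full$df.residual

# Mallows' C_p for a candidate fit; p counts coefficients incl. intercept.
mallows_cp <- function(fit) {
  p <- length(coef(fit))
  sum(resid(fit)^2) / mse_full - n + 2 * p
}

cp <- c(x1_only = mallows_cp(lm(y ~ x1)),
        x2_only = mallows_cp(lm(y ~ x2)),
        both    = mallows_cp(full))
print(round(cp, 2))
```

By construction the full model always has C_p equal to its number of coefficients, which is why the two-variable row in the output above shows exactly 3.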

slr(GPA, Score)

## The least squares regression equation is: 
##  GPA  = 1.845 + 0.083 Score 
## R^2 = 11%

Note that slr(GPA, Score) fits GPA as the response, as the printed equation shows. To predict Score from GPA the arguments should be in the same order as in mlr, that is slr(Score, GPA). The R^2 of 11% is the same either way, but the intercept and slope are not.

Problem 2 Analyse the data with score as the response and Gender and Years as factors (predictors).

this is a twoway ANOVA problem. In Exercise Problems 2 we already looked at the boxplots and the summary statistics. Next we need to consider any possible interaction:

In problem 3 of the Exercise Problems 2 we found a statistically significant correlation between Score and GPA. Let’s find a good model.

iplot(Score, Gender, Year)

there seems to be interaction here. Can we test for it? We would need repeated measurements:

table(Gender, Year)

##         Year
## Gender   Freshman Junior Senior Sophomore
##   Female       28     23     26        34
##   Male         38     24     33        43

which we have. So

twoway(Score, Gender, Year)

##              Df Sum Sq Mean Sq F value  Pr(>F)
## x             1   35.2   35.23   8.742 0.00342
## z             3   11.4    3.79   0.940 0.42212
## x:z           3    7.9    2.63   0.652 0.58255
## Residuals   241  971.3    4.03                
##                   [,1]
## Gender  p =     0.0034
## Year  p =       0.4221
## Interaction p = 0.5825

the plots look good, so no problems with the assumptions.

The test for interaction has p=0.5825, so there is no evidence of interaction. We can refit without it:

twoway(Score, Gender, Year, with.interaction=FALSE)

##              Df Sum Sq Mean Sq F value  Pr(>F)
## x             1   35.2   35.23   8.780 0.00335
## z             3   11.4    3.79   0.944 0.42013
## Residuals   244  979.2    4.01                
##               [,1]
## Gender  p = 0.0033
## Year  p =   0.4201

The test for Year has p=0.42, so there is no evidence that Year affects Score and we can drop this term as well. We are now back to a one-way ANOVA of Score by Gender, which we already analysed in Exercise Problems 2 and 3.
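The same sequence of steps (fit with interaction, check the interaction test, refit without it) can be sketched in base R with `aov`. This is only an illustration on simulated data with hypothetical factors `A` and `B`; it is not the course routine `twoway` and does not reproduce the numbers above.

```r
# Simulated stand-in data: two factors with 10 observations per cell,
# so the interaction can be tested.
set.seed(2)
d <- expand.grid(rep = 1:10,
                 A = c("a1", "a2"),
                 B = c("b1", "b2", "b3", "b4"))
d$y <- 5 + ifelse(d$A == "a2", 1, 0) + rnorm(nrow(d))  # only A has an effect

# Step 1: model with interaction; rows are A, B, A:B, Residuals.
with_int <- aov(y ~ A * B, data = d)
tab <- summary(with_int)[[1]]
print(tab)
p_int <- tab[["Pr(>F)"]][3]   # p value of the interaction term

# Step 2: refit the additive model (interaction dropped).
no_int  <- aov(y ~ A + B, data = d)
tab_add <- summary(no_int)[[1]]
print(tab_add)
```

Dropping the interaction moves its 3 degrees of freedom into the residuals, which matches the change from 241 to 244 residual degrees of freedom in the output above.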

Problem 3 Find a 90% interval estimate for the score of a male student with a GPA of 2.15. Is this an interpolation or an extrapolation?

We have a quantitative response (Score), a quantitative predictor (GPA) and a categorical predictor (Gender), so this is a regression problem with a dummy variable.

dlr(Score, GPA, Gender)

## The least squares regression equation is: 
##  Score  =  6 + 0.287 GPA - 5.162 Gender + 1.877 GPA*Gender 
## R^2 = 19.6

do we need the product term?

independent.lines <- dlr(Score, GPA, Gender, return.model=TRUE)
parallel.lines <- dlr(Score, GPA, Gender, additive=TRUE,
                 return.model=TRUE)
nested.models.test(parallel.lines, independent.lines)

## H0: both models are equally good.
## p value= 0.000

gives p=0.000, so the product term is indeed necessary.
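For readers who want to see what such a comparison involves, a rough base R analogue is the partial F test produced by `anova` on two nested `lm` fits. The sketch below uses simulated data with hypothetical names `x`, `g` and `y`, and it is an assumption that `nested.models.test` performs a comparable test.

```r
# Simulated stand-in data: the two groups have different slopes.
set.seed(3)
n <- 120
x <- runif(n, 1, 4)
g <- factor(rep(c("F", "M"), each = n / 2))
y <- 6 + 0.3 * x + ifelse(g == "M", -5 + 1.8 * x, 0) + rnorm(n)

parallel    <- lm(y ~ x + g)   # common slope, different intercepts
independent <- lm(y ~ x * g)   # adds the product term x:g

# Partial F test: does the product term improve the fit?
cmp <- anova(parallel, independent)
print(cmp)
p_value <- cmp[["Pr(>F)"]][2]
```

A small p value says the model with the product term fits significantly better, so the lines should not be forced to be parallel.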

Next we want a 90% interval estimate for the score of a male student with a GPA of 2.15:

dlr.predict(Score, GPA, Gender, newx=2.15, 
            newz="Male", interval="PI", conf.level=90)

##    GPA Gender  Fit Lower Upper
## 1 2.15   Male 5.49  2.45  8.53

Is this an interpolation or an extrapolation?

range(GPA[Gender=="Male"])

## [1] 0.71 3.46

shows that 2.15 is in the range of data values for males, so this is an interpolation.
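The two steps of this problem, a 90% prediction interval and the range check, can be sketched in base R as follows. The data are simulated and the names `x`, `g`, `y` are hypothetical. Note that `predict` takes the level as a proportion (`level = 0.90`), unlike the `conf.level=90` argument of `dlr.predict`.

```r
# Simulated stand-in data with a separate line for each group.
set.seed(4)
n <- 120
x <- runif(n, 0.7, 3.5)
g <- factor(rep(c("Female", "Male"), each = n / 2))
y <- 6 + 0.3 * x + ifelse(g == "Male", -5 + 1.8 * x, 0) + rnorm(n)
d <- data.frame(x, g, y)

fit <- lm(y ~ x * g, data = d)

# 90% prediction interval for a new "Male" observation at x = 2.15.
new  <- data.frame(x = 2.15, g = factor("Male", levels = levels(d$g)))
pi90 <- predict(fit, newdata = new, interval = "prediction", level = 0.90)
print(pi90)

# Interpolation check: is the new x inside the observed range for that group?
rng <- range(d$x[d$g == "Male"])
is_interpolation <- new$x >= rng[1] && new$x <= rng[2]
print(is_interpolation)
```

The range is taken within the group of interest, as in the solution above, because the fitted line for males is only supported by the male observations.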

Problem 4 Find a 95% interval estimate for the score of a 20 year old male student who lives 2 miles from school. (code the variable Gender, ignore the issue of parallel lines)

As before we need to remove observation #220 from the data set and we need to use the log transform on Distance:

X <- data.frame(Distance=log(Distance[-220]+1), Age=Age[-220], Gender=ifelse(Gender[-220]=="Male", 1, 0))
mlr(Score[-220] , X)

## The least squares regression equation is: 
##  Score[-220]  =  8.905 + 0.082 Distance - 0.118 Age - 0.745 Gender 
## R^2 = 3.8%

When doing prediction there is usually no reason to simplify the model, so we won’t use Mallows' C_p. Now

newx <- cbind(Distance=log(2)+1, Age=20, Gender=1)
mlr.predict(Score[-220], X, newx=newx, interval="PI")

##  Distance Age Gender  Fit Lower Upper
##  1.693147  20      1 5.95  1.99  9.91

Note that the value of Distance in newx is log(2) + 1 = 1.693, but the model was fit on log(Distance + 1), so for a distance of 2 miles the transformed value should be log(2 + 1) = 1.099. Because the coefficient of Distance is only 0.082, this changes the fit by about 0.082 × (1.693 − 1.099) ≈ 0.05, so the interval is nearly the same.
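
One way to avoid transforming the new value by hand is to put the transform into the model formula, so that `predict` applies it to the raw value automatically. The sketch below shows this in base R on simulated data with hypothetical names `dist`, `age` and `y`; it is not the course routine `mlr.predict`.

```r
# Simulated stand-in data (NOT the student survey).
set.seed(5)
n <- 200
d <- data.frame(dist = rexp(n, 1 / 3),
                age  = sample(18:25, n, replace = TRUE))
d$y <- 8 + 0.1 * log(d$dist + 1) - 0.1 * d$age + rnorm(n)

# Transform inside the formula: new data are given in raw units.
fit_formula <- lm(y ~ log(dist + 1) + age, data = d)
p1 <- predict(fit_formula, newdata = data.frame(dist = 2, age = 20),
              interval = "prediction")   # default level is 0.95

# Transform by hand: the new value must be log(2 + 1), not log(2) + 1.
d$ldist  <- log(d$dist + 1)
fit_hand <- lm(y ~ ldist + age, data = d)
p2 <- predict(fit_hand, newdata = data.frame(ldist = log(2 + 1), age = 20),
              interval = "prediction")

print(p1)
print(p2)
```

Both calls give the same interval, which confirms that the formula version applies log(dist + 1) to the new raw distance.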