Exercise Problems 2

Case Study: Survey of Students

This is a fake data set I made up for this exercise. It is supposed to be from a survey of students at some College. The data is in studentsurvey.

The variables are

colnames(studentsurvey)
## [1] "Score"    "Gender"   "Year"     "GPA"      "Distance" "Major"   
## [7] "Age"

Score is a combination of several questions designed to measure how “happy” the students are to study at the College. A high number means more happiness. Distance is how far they live from the College. Make sure your answers are complete.

Problem 1 Is there a relationship between Score and Gender?

Problem 2 Is there a relationship between Score and Year?

Problem 3 Is there a relationship between Score and GPA?

Problem 4 Is there a relationship between Score and Distance?

Problem 5 Is there a relationship between Score and Age?

Problem 6 Is there a relationship between Gender and Major?

Problem 7 Is there a relationship between Gender and Age?



 attach(studentsurvey) 

Problem 1 Is there a relationship between Score and Gender? Score is a quantitative variable and Gender is a categorical variable with two values, so this is a problem for ANOVA (with only two groups this is the two sample t test).

bplot(Score,Gender) 

The boxplot shows a slight difference between the genders. There are a few slight outliers, but they are no problem.

stat.table(Score, Gender)
##        Sample Size Mean Standard Deviation
## Female         111  6.7                1.7
## Male           138  5.9                2.2

The test:

oneway(Score, Gender)

## p value of test of equal means: p = 0.0033 
## Smallest sd:  1.7    Largest sd : 2.2 
## A 95% confidence interval for the difference in group means is (0.3, 1.2)
  1. Parameters of interest: means of scores of men and women
  2. Method of analysis: two sample t test
  3. Assumptions of Method: residuals have a normal distribution, or sample sizes are large enough
  4. \(\alpha\) = 0.05
  5. Null hypothesis H0: \(\mu_1 = \mu_2\) (groups have the same mean)
  6. Alternative hypothesis Ha: \(\mu_1 \ne \mu_2\) (groups have different means)
  7. p value = 0.0033
  8. 0.0033 < 0.05, so we reject H0. There is some evidence that the group means are not the same; the women tend to score higher than the men.
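The test statistic behind these steps can be sketched in base R from the summary table alone. This is only an illustration: it assumes the pooled (equal variance) form of the two sample t test, which may or may not be exactly what oneway computes, and it uses the rounded means and standard deviations from stat.table, so the p value and confidence interval will only approximate the reported ones.

```r
# Summary statistics copied from the stat.table output (rounded to one decimal).
n <- c(Female = 111, Male = 138)
m <- c(Female = 6.7, Male = 5.9)
s <- c(Female = 1.7, Male = 2.2)

# Pooled variance and standard error of the difference in means
# (assumption: equal variance version of the two sample t test).
df  <- sum(n) - 2
sp2 <- sum((n - 1) * s^2) / df
se  <- sqrt(sp2 * sum(1 / n))

# Test statistic and two-sided p value.
diff_means <- unname(m["Female"] - m["Male"])
t_stat     <- diff_means / se
p_value    <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval for the difference in group means.
ci <- diff_means + c(-1, 1) * qt(0.975, df) * se

print(round(c(t = t_stat, df = df, p = p_value), 4))
print(round(ci, 1))
```

Because the inputs are rounded, small differences from the oneway output are expected; with the raw data, t.test(Score ~ Gender, var.equal = TRUE) would be the base R counterpart.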

Assumptions:

  1. Normal residuals: normal plot looks ok.
  2. equal variance: 3*1.71 = 5.13 > 2.21, ok
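The equal variance check used here compares three times the smallest standard deviation with the largest one. A minimal sketch of that rule as a helper function follows; the name equal_var_ok is made up for this illustration and is not part of the course functions.

```r
# Rule of thumb as read from the worked check above:
# equal variance is considered acceptable when 3 * (smallest sd) > (largest sd).
# equal_var_ok is a hypothetical helper name.
equal_var_ok <- function(sds) 3 * min(sds) > max(sds)

equal_var_ok(c(1.71, 2.21))  # standard deviations from Problem 1
equal_var_ok(c(1, 4))        # hypothetical values that fail the rule
```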

Problem 2 Is there a relationship between Score and Year? Score is a quantitative variable and Year is a categorical variable, so this is a problem for ANOVA.

bplot(Score, Year)

The table of summary statistics is

stat.table(Score, Year)
##           Sample Size Mean Standard Deviation
## Junior             47  6.1                2.0
## Freshman           66  6.6                1.8
## Sophomore          77  6.2                2.2
## Senior             59  6.1                2.1

The test:

oneway(Score, Year)

## p value of test of equal means: p = 0.4855 
## Smallest sd:  1.8    Largest sd : 2.2
  1. Parameters of interest: means of scores by year
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, or sample sizes are large enough
  4. \(\alpha\) = 0.05
  5. Null hypothesis H0: \(\mu_1 = \mu_2 = \mu_3 = \mu_4\) (groups have the same mean)
  6. Alternative hypothesis Ha: \(\mu_i \ne \mu_j\) for some \(i \ne j\) (some groups have different means)
  7. p value = 0.4855
  8. 0.4855 > 0.05, so we fail to reject H0; there is no evidence that the group means are not the same.

Assumptions:

  1. Normal residuals: normal plot looks ok.
  2. equal variance: 3*1.8 = 5.4 > 2.2, ok
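The one-way ANOVA F statistic can also be reconstructed from the summary table. The sketch below uses only the rounded group sizes, means and standard deviations shown by stat.table, so its p value will be close to, but not identical with, the 0.4855 reported by oneway.

```r
# Group summaries copied from the stat.table output (rounded).
n <- c(Junior = 47, Freshman = 66, Sophomore = 77, Senior = 59)
m <- c(6.1, 6.6, 6.2, 6.1)
s <- c(2.0, 1.8, 2.2, 2.1)

k <- length(n)   # number of groups
N <- sum(n)      # total sample size

# Between-group and within-group sums of squares.
grand_mean <- sum(n * m) / N
ss_between <- sum(n * (m - grand_mean)^2)
ss_within  <- sum((n - 1) * s^2)

# F statistic and its p value on (k - 1, N - k) degrees of freedom.
f_stat  <- (ss_between / (k - 1)) / (ss_within / (N - k))
p_value <- pf(f_stat, k - 1, N - k, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_value), 4))
```

With the raw data, summary(aov(Score ~ Year)) would be the base R counterpart.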

Problem 3 Is there a relationship between Score and GPA? Score and GPA are both quantitative variables, so this is a problem for Pearson’s correlation coefficient.

The marginal plot shows some increase in Score as the GPA increases.

mplot(Score, GPA)

There are a few slight outliers, but they are no problem.

The test:

pearson.cor(Score, GPA, rho.null = 0)

## p value of test H0: rho=0 vs. Ha: rho <> 0:  0.000
  1. Parameter of interest: Pearson’s correlation coefficient \(\rho\)
  2. Method of analysis: test based on normal theory
  3. Assumptions of Method: relationship is linear, there are no outliers
  4. \(\alpha\) = 0.05
  5. H0: \(\rho = 0\) (no relationship between Score and GPA)
  6. Ha: \(\rho \ne 0\) (some relationship between Score and GPA)
  7. p = 0.000
  8. p = 0.000 < 0.05, so we reject H0. There is a relationship between Score and GPA; apparently students with a higher GPA are happier.
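For readers without the course functions, base R's cor.test carries out a test of H0: rho = 0. The sketch below uses simulated data, since studentsurvey is not reproduced here; the variable names and the simulation settings are assumptions for illustration only. It also shows the t statistic that cor.test uses, which is not necessarily the same computation as pearson.cor.

```r
# Simulated stand-in data (hypothetical, NOT the studentsurvey data).
set.seed(1)
n     <- 249
gpa   <- runif(n, 2, 4)
score <- 2 + 1.2 * gpa + rnorm(n, sd = 2)

# Sample correlation and the t statistic for H0: rho = 0.
r        <- cor(score, gpa)
t_stat   <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_manual <- 2 * pt(-abs(t_stat), df = n - 2)

# The same test via base R.
res <- cor.test(score, gpa)

print(c(r = r, t = t_stat, p_manual = p_manual, p_cor_test = res$p.value))
```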

Problem 4 Is there a relationship between Score and Distance? Score and Distance are both quantitative variables, so this is a problem for the Pearson’s Correlation Coefficient.

The marginal plot shows some slight outliers. The log transform fixes this.

mplot(Score, Distance)

mplot(Score, log(Distance+1))

Some students apparently live very close to the school, so Distance=0. For this reason we use log(Distance+1).
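A short illustration of why the shift by 1 is needed, using made-up distances rather than the survey data:

```r
# Hypothetical distances, including a student who lives at distance 0.
distance <- c(0, 0.5, 3, 40)

log(distance)      # the first value is -Inf, which breaks plots and correlations
log(distance + 1)  # the first value is 0, all values are finite
log1p(distance)    # base R shortcut for log(1 + x), same values
```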

The test:

pearson.cor(Score, log(Distance+1), rho.null = 0)

## p value of test H0: rho=0 vs. Ha: rho <> 0:  0.7305
  1. Parameter of interest: Pearson’s correlation coefficient \(\rho\)
  2. Method of analysis: test based on normal theory
  3. Assumptions of Method: relationship is linear, there are no outliers
  4. \(\alpha=0.05\)
  5. H0: \(\rho =0\) (no relationship between Score and Distance)
  6. Ha: \(\rho \ne 0\) (some relationship between Score and Distance)
  7. p = 0.7305
  8. \(p>\alpha\), so we fail to reject H0, there is no evidence of a relationship between the Score and the Distance.

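Problem 5 Is there a relationship between Score and Age? Score and Age are both quantitative variables, so this is a problem for Pearson’s correlation coefficient. The marginal plot shows one severe outlier: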

length(Score)
## [1] 249
length(Age)
## [1] 249
mplot(Score, Age)

Unfortunately the log transform does not help, so the only way to proceed is to eliminate the outlier.

which(Age==max(Age))
## [1] 220

The test:

pearson.cor(Age[-220], Score[-220], rho.null = 0)

## p value of test H0: rho=0 vs. Ha: rho <> 0:  0.3052
  1. Parameter of interest: Pearson’s correlation coefficient \(\rho\)
  2. Method of analysis: test based on normal theory
  3. Assumptions of Method: relationship is linear, there are no outliers
  4. \(\alpha = 0.05\)
  5. H0: \(\rho=0\) (no relationship between Score and Age)
  6. Ha: \(\rho \ne 0\) (some relationship between Score and Age)
  7. p = 0.3052
  8. \(p > \alpha\), so we fail to reject H0, there is no evidence of a relationship between the Score and the Age.
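The outlier removal above relies on which() to find the position and on negative indexing to drop it from both vectors. A toy example with made-up numbers shows the mechanics and how much a single extreme point can move the correlation:

```r
# Hypothetical toy data: ages near 20 plus one extreme value.
age   <- c(19, 20, 21, 20, 19, 22, 20, 21, 55)
score <- c(6, 7, 5, 6, 8, 5, 7, 6, 9)

# Position of the largest age (here the last observation).
i <- which(age == max(age))

# Correlation with and without that observation.
cor(age, score)          # positive, driven by the single outlier
cor(age[-i], score[-i])  # negative once the outlier is dropped
```

The same index must be removed from both variables, otherwise the pairs no longer line up.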

Problem 6 Is there a relationship between Gender and Major? Gender and Major are both categorical variables, so this is a problem for the chi-square test of independence.

 table(Gender, Major) 
##         Major
## Gender   Biology English Physics Psychology Spanish
##   Female      19      23      23         24      22
##   Male        24      24      31         29      30
chi.ind.test(table(Gender, Major))
## p value of test p=0.9664
  1. Parameters of interest: measure of association
  2. Method of analysis: chi-square test of independence
  3. Assumptions of Method: all expected counts greater than 5
  4. \(\alpha = 0.05\)
  5. H0: Classifications are independent = Gender and Major are independent
  6. Ha: Classifications are dependent = Gender and Major are not independent
  7. p = 0.9664
  8. 0.9664 > 0.05, there is no evidence of a relationship between gender and major.
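Since the full table of counts is printed above, the test can be reproduced in base R without the data set. The sketch assumes chi.ind.test is the standard Pearson chi-square test of independence; chisq.test is the base R function for that test and also returns the expected counts needed for the assumption check.

```r
# Counts copied from the table(Gender, Major) output.
counts <- matrix(
  c(19, 23, 23, 24, 22,
    24, 24, 31, 29, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Gender = c("Female", "Male"),
    Major  = c("Biology", "English", "Physics", "Psychology", "Spanish")
  )
)

res <- chisq.test(counts)

# Assumption check: all expected counts should be greater than 5.
print(round(res$expected, 1))
print(all(res$expected > 5))

# Test statistic, degrees of freedom and p value.
print(round(c(res$statistic, res$parameter, p = res$p.value), 4))
```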

Problem 7 Is there a relationship between Gender and Age? Age is a quantitative variable and Gender is a categorical variable with two values, so this is a problem for ANOVA.

The boxplot shows a few serious outliers. One could try transformations, but because the outliers are from especially small and large observations these won’t work.

bplot(Age, Gender)

bplot(log(Age), Gender)

Solution 1: non-parametric method
The table of summary statistics is

stat.table(Age, Gender, Mean=FALSE )
##        Sample Size Median IQR
## Female         111     20   1
## Male           138     20   2

Now

kruskalwallis(Age, Gender)
## p value of test of equal means: p = 0.261060279952181
  1. Parameters of interest: 2 medians
  2. Method of analysis: Kruskal-Wallis
  3. Assumptions of Method: none
  4. \(\alpha=0.05\)
  5. Null hypothesis H0: M1=M2 (group medians are the same)
  6. Alternative hypothesis Ha: M1\(\ne\)M2 (group medians are not the same)
  7. p value = 0.2611
  8. 0.2611 > 0.05, so we fail to reject H0; there is no evidence that the group medians are different.
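In base R the corresponding function is kruskal.test. The sketch below runs it on simulated ages (hypothetical data with one artificial outlier, not the survey data) and compares it with the Wilcoxon rank sum test, which for two groups should give the same p value when the normal approximation is used without continuity correction.

```r
# Simulated stand-in data (hypothetical, NOT the studentsurvey data).
set.seed(2)
age    <- c(round(rnorm(111, 20, 1)), round(rnorm(138, 20, 1)))
gender <- factor(rep(c("Female", "Male"), times = c(111, 138)))
age[220] <- 60  # one artificial outlier

# Rank-based tests are barely affected by how extreme the outlier is.
kw <- kruskal.test(age ~ gender)
wx <- wilcox.test(age ~ gender, exact = FALSE, correct = FALSE)

print(c(kruskal = kw$p.value, wilcoxon = wx$p.value))
```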

Solution 2: remove outlier

which(Age==max(Age))
stat.table(Age[-220], Gender[-220])
##        Sample Size Mean Standard Deviation
## Female         111 19.9                  1
## Male           137 20.0                  1
oneway(Age[-220], Gender[-220])

## p value of test of equal means: p = 0.3817 
## Smallest sd:  1    Largest sd : 1 
## A 95% confidence interval for the difference in group means is (-0.4, 0.1)
  1. Parameters of interest: means of age by gender
  2. Method of analysis: ANOVA
  3. Assumptions of Method: residuals have a normal distribution, or sample sizes are large enough
  4. \(\alpha\) = 0.05
  5. Null hypothesis H0: \(\mu_1 = \mu_2\) (groups have the same mean)
  6. Alternative hypothesis Ha: \(\mu_1 \ne \mu_2\) (groups have different means)
  7. p value = 0.3817
  8. 0.3817 > 0.05, so we fail to reject H0; there is no evidence that the group means are not the same.

Assumptions:

  1. Normal residuals: ok.
  2. equal variance: smallest sd = 1, largest sd = 1, 3*1 = 3 > 1, ok