Exercise Problems 2

Case Study: Survey of Students

This is a fake data set I made up for this exercise. It is supposed to be from a survey of students at some College. The data is in studentsurvey.

The variables are

colnames(studentsurvey)

## [1] "Score"    "Gender"   "Year"     "GPA"      "Distance" "Major"   
## [7] "Age"

Score is a combination of several questions designed to measure how “happy” they are to study at the College. A high number means more happiness. Distance is how far they live from the College. Make sure your answers are complete.

Problem 1 Is there a relationship between Score and Gender?

Problem 2 Is there a relationship between Score and Year?

Problem 3 Is there a relationship between Score and GPA?

Problem 4 Is there a relationship between Score and Distance?

Problem 5 Is there a relationship between Score and Age?

Problem 6 Is there a relationship between Gender and Major?

Problem 7 Is there a relationship between Gender and Age?

attach(studentsurvey)

## The following object is masked from babe:
## 
##     Year

## The following object is masked from wrinccensus (pos = 10):
## 
##     Gender

## The following object is masked from wrinccensus (pos = 17):
## 
##     Gender

## The following object is masked from longjump (pos = 18):
## 
##     Year

## The following object is masked from longjump (pos = 27):
## 
##     Year

## The following object is masked from wrinccensus (pos = 29):
## 
##     Gender

## The following object is masked from wrinccensus (pos = 30):
## 
##     Gender

## The following object is masked from wrinccensus (pos = 32):
## 
##     Gender

## The following object is masked from wrinccensus (pos = 33):
## 
##     Gender

Problem 1 Is there a relationship between Score and Gender? Score is a quantitative variable and Gender is a categorical variable with two values, so this is a problem for ANOVA.

bplot(Score,Gender)

The boxplot shows a slight difference between the genders. There are a few slight outliers, but they are no problem.

stat.table(Score, Gender)

##        Sample Size Mean Standard Deviation
## Female         111  6.7                1.7
## Male           138  5.9                2.2

The test:

oneway(Score, Gender)

## p value of test of equal means: p = 0.0033 
## Smallest sd:  1.7    Largest sd : 2.2 
## A 95% confidence interval for the difference in group means is (0.3, 1.2)

Parameters of interest: means of scores of men and women
Method of analysis: ANOVA (with only two groups this is equivalent to the equal-variance two-sample t test)
Assumptions of Method: residuals have a normal distribution, or sample sizes are large enough
\(\alpha\) = 0.05
Null hypothesis H₀: \(\mu_1 = \mu_2\) (groups have the same mean)
Alternative hypothesis H_a: \(\mu_1 \ne \mu_2\) (groups have different means)
p value = 0.0033
0.0033 < 0.05, so we reject H₀. There is evidence that the group means are not the same; the women tend to score higher than the men.
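
As a rough cross-check, the test can be approximated from the summary table alone. The following is a sketch in base R, not the `oneway` routine used above; the variable names are illustrative, and it assumes the equal-variance (pooled) form of the two-sample t test. Because the inputs are rounded to one decimal, the results will only be close to the output above, not identical.

```r
# Summary statistics copied from the stat.table output (rounded to one decimal)
n <- c(Female = 111, Male = 138)
m <- c(Female = 6.7, Male = 5.9)
s <- c(Female = 1.7, Male = 2.2)

# Pooled variance and standard error of the difference in means
df  <- sum(n) - 2
sp2 <- sum((n - 1) * s^2) / df
se  <- sqrt(sp2 * sum(1 / n))

# t statistic, two-sided p value, and 95% confidence interval
diff_means <- unname(m["Female"] - m["Male"])
t_stat     <- diff_means / se
p_value    <- 2 * pt(-abs(t_stat), df)
ci         <- diff_means + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, df = df, p = p_value), 4)
round(ci, 1)
```

The p value from this approximation is also well below 0.05, and the interval excludes 0, so the conclusion is the same.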

Assumptions:

Normal residuals: normal plot looks ok.
equal variance: 3*1.71 = 5.13 > 2.21, ok
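
The equal-variance check above is a rule of thumb: the largest group standard deviation should be less than three times the smallest. A minimal sketch, with `equal_variance_ok` as a hypothetical helper name that is not part of the course software:

```r
# Rule of thumb: equal variance is considered acceptable if the largest
# group standard deviation is less than 3 times the smallest one.
equal_variance_ok <- function(sds) {
  max(sds) < 3 * min(sds)
}

# Standard deviations of Score by Gender, as used in the check above
equal_variance_ok(c(Female = 1.71, Male = 2.21))
```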

Problem 2 Is there a relationship between Score and Year? Score is a quantitative variable and Year is a categorical variable, so this is a problem for ANOVA.

bplot(Score, Year)

The table of summary statistics is

stat.table(Score, Year)

##           Sample Size Mean Standard Deviation
## Junior             47  6.1                2.0
## Freshman           66  6.6                1.8
## Sophomore          77  6.2                2.2
## Senior             59  6.1                2.1

The test:

oneway(Score, Year)

## p value of test of equal means: p = 0.4855 
## Smallest sd:  1.8    Largest sd : 2.2

Parameters of interest: means of scores by year
Method of analysis: ANOVA
Assumptions of Method: residuals have a normal distribution, or sample sizes are large enough
\(\alpha\) = 0.05
Null hypothesis H₀: \(\mu_1 = \mu_2 = \mu_3 = \mu_4\) (groups have the same mean)
Alternative hypothesis H_a: \(\mu_i \ne \mu_j\) for some \(i \ne j\) (some groups have different means)
p value = 0.4855
0.4855 > 0.05, so we fail to reject H₀; there is no evidence that the group means are not the same.

Assumptions:

Normal residuals: looks ok.
equal variance: 3*1.8 = 5.4 > 2.2, ok
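
The one-way ANOVA F test can also be approximated from the summary table. This is a sketch in base R under the usual equal-variance ANOVA model, not the `oneway` routine itself; the variable names are illustrative, and since the table is rounded to one decimal the p value will only be near the reported 0.4855.

```r
# Summary statistics copied from the stat.table output (rounded)
n <- c(Junior = 47, Freshman = 66, Sophomore = 77, Senior = 59)
m <- c(Junior = 6.1, Freshman = 6.6, Sophomore = 6.2, Senior = 6.1)
s <- c(Junior = 2.0, Freshman = 1.8, Sophomore = 2.2, Senior = 2.1)

k <- length(n)                 # number of groups
N <- sum(n)                    # total sample size
grand_mean <- sum(n * m) / N   # weighted mean of the group means

# Between-group and within-group sums of squares
ss_between <- sum(n * (m - grand_mean)^2)
ss_within  <- sum((n - 1) * s^2)

# F statistic and its p value on (k - 1, N - k) degrees of freedom
f_stat  <- (ss_between / (k - 1)) / (ss_within / (N - k))
p_value <- pf(f_stat, k - 1, N - k, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 4)
```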

Problem 3 Is there a relationship between Score and GPA? Score and GPA are both quantitative variables, so this is a problem for Pearson’s correlation coefficient.

The marginal plot shows some increase in Score as the GPA increases.

mplot(Score, GPA)

There are a few slight outliers, but they are no problem.

The test:

pearson.cor(Score, GPA, rho.null = 0)

## p value of test H0: rho=0 vs. Ha: rho <> 0:  0.000

Parameter of interest: Pearson’s correlation coefficient \(\rho\)
Method of analysis: test based on normal theory
Assumptions of Method: relationship is linear, there are no outliers
\(\alpha\) = 0.05
H₀: \(\rho = 0\) (no relationship between Score and GPA)
H_a: \(\rho \ne 0\) (some relationship between Score and GPA)
p = 0.000
p = 0.000 to three decimals, which is less than 0.05, so we reject H₀: there is a relationship between the Score and the GPA. Judging from the plot, students with a higher GPA appear to be happier.
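
For readers without the course software, base R's `cor.test` runs a comparable test of H₀: \(\rho = 0\). The sketch below uses simulated data, since the survey data are not reproduced here; the vectors `gpa` and `score` and all numbers in them are made up for illustration, and `cor.test` is a stand-in that need not match `pearson.cor` exactly.

```r
# Simulated stand-in data (NOT the survey data): a weak positive
# linear relationship plus noise, with the same sample size of 249
set.seed(1)
n     <- 249
gpa   <- round(runif(n, 2, 4), 2)
score <- 2 + 1.2 * gpa + rnorm(n, sd = 2)

# Two-sided test of H0: rho = 0 for Pearson's correlation coefficient
res <- cor.test(score, gpa, alternative = "two.sided", method = "pearson")

res$estimate   # sample correlation r
res$p.value    # p value to compare with alpha = 0.05
```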

Problem 4 Is there a relationship between Score and Distance? Score and Distance are both quantitative variables, so this is a problem for the Pearson’s Correlation Coefficient.

The marginal plot shows some slight outliers. The log transform fixes this.

mplot(Score, Distance)

mplot(Score, log(Distance+1))

Some students apparently live very close to the school, so Distance=0, and log(0) is not finite. For this reason we use log(Distance+1).
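
The effect of adding 1 before taking the log can be seen on a small vector. The distances below are hypothetical values chosen only to include a zero:

```r
# Hypothetical distances (not from the survey), including a zero
distance <- c(0, 0.5, 2, 10, 60)

log(distance)       # the zero entry becomes -Inf
log(distance + 1)   # finite everywhere; a distance of 0 maps to 0
log1p(distance)     # base R shortcut for log(x + 1)
```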

The test:

pearson.cor(Score, log(Distance+1), rho.null = 0)

## p value of test H0: rho=0 vs. Ha: rho <> 0:  0.7305

Parameter of interest: Pearson’s correlation coefficient \(\rho\)
Method of analysis: test based on normal theory
Assumptions of Method: relationship is linear, there are no outliers
\(\alpha=0.05\)
H₀: \(\rho =0\) (no relationship between Score and Distance)
H_a: \(\rho \ne 0\) (some relationship between Score and Distance)
p = 0.7305
\(p>\alpha\), so we fail to reject H₀; there is no evidence of a relationship between the Score and the Distance.

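Problem 5 Is there a relationship between Score and Age? Score and Age are both quantitative variables, so this is a problem for Pearson’s correlation coefficient. The marginal plot shows one severe outlier: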

length(Score)

## [1] 249

length(Age)

## [1] 249

mplot(Score, Age)

Unfortunately the log transform does not help, so the only way to proceed is to eliminate the outlier.

which(Age==max(Age))

## [1] 220

The test:

pearson.cor(Age[-220], Score[-220], rho.null = 0)

## p value of test H0: rho=0 vs. Ha: rho <> 0:  0.3052

Parameter of interest: Pearson’s correlation coefficient \(\rho\)
Method of analysis: test based on normal theory
Assumptions of Method: relationship is linear, there are no outliers
\(\alpha = 0.05\)
H₀: \(\rho=0\) (no relationship between Score and Age)
H_a: \(\rho \ne 0\) (some relationship between Score and Age)
p = 0.3052
\(p > \alpha\), so we fail to reject H₀, there is no evidence of a relationship between the Score and the Age.
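
The outlier removal in this problem relies on dropping the same position from both vectors, so that the remaining (Age, Score) pairs stay matched. A small sketch with made-up values, where 58 plays the role of the outlier:

```r
# Hypothetical paired observations (not the survey data)
age   <- c(19, 20, 21, 58, 20)
score <- c(6, 7, 5, 6, 8)

# Position of the largest age; which() returns every position that ties
idx <- which(age == max(age))
idx

# A negative index drops that position; use it on BOTH vectors
age_clean   <- age[-idx]
score_clean <- score[-idx]

length(age_clean) == length(score_clean)
```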

Problem 6 Is there a relationship between Gender and Major? Gender and Major are both categorical variables, so this is a problem for the chi-square test of independence.

table(Gender, Major)

##         Major
## Gender   Biology English Physics Psychology Spanish
##   Female      19      23      23         24      22
##   Male        24      24      31         29      30

chi.ind.test(table(Gender, Major))

## p value of test p=0.9664

Parameters of interest: measure of association
Method of analysis: chi-square test of independence
Assumptions of Method: all expected counts greater than 5
\(\alpha = 0.05\)
H₀: Classifications are independent = Gender and Major are independent
H_a: Classifications are dependent = Gender and Major are not independent
p = 0.9664
0.9664 > 0.05, so we fail to reject H₀; there is no evidence of a relationship between gender and major.
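
Because the full table of counts is printed above, this test can be reproduced with base R's `chisq.test`, which also exposes the expected counts needed to check the assumption. This is a sketch using the base R function rather than `chi.ind.test`; the object name `counts` is illustrative.

```r
# Counts copied from the table(Gender, Major) output above
counts <- matrix(
  c(19, 23, 23, 24, 22,
    24, 24, 31, 29, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Gender = c("Female", "Male"),
    Major  = c("Biology", "English", "Physics", "Psychology", "Spanish")
  )
)

res <- chisq.test(counts)

round(res$expected, 1)   # expected counts under independence
min(res$expected) > 5    # assumption: all expected counts greater than 5
round(res$p.value, 4)    # p value of the test
```

The smallest expected count is about 19, so the assumption holds, and the p value comes out at about 0.966, in line with the output above.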

Problem 7 Is there a relationship between Gender and Age? Age is a quantitative variable and Gender is a categorical variable with two values, so this is a problem for ANOVA.

The boxplot shows a few serious outliers. One could try transformations, but because the outliers are from especially small and large observations these won’t work.

bplot(Age, Gender)

bplot(log(Age), Gender)

Solution 1: non-parametric method
The table of summary statistics is

stat.table(Age, Gender, Mean=FALSE )

##        Sample Size Median IQR
## Female         111     20   1
## Male           138     20   2

Now

kruskalwallis(Age, Gender)

## p value of test of equal means: p = 0.261060279952181

Parameters of interest: 2 medians
Method of analysis: Kruskal-Wallis
Assumptions of Method: none
\(\alpha=0.05\)
Null hypothesis H₀: M₁=M₂ (group medians are the same)
Alternative hypothesis H_a: M₁\(\ne\)M₂ (group medians are not the same)
p value = 0.2611
0.2611 > 0.05, so we fail to reject H₀; there is no evidence that the group medians differ.
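
Base R offers `kruskal.test` for the same kind of rank-based comparison. The sketch below uses simulated ages, not the survey data, and all values in it are made up. It also illustrates that with only two groups the Kruskal-Wallis test gives the same p value as the Wilcoxon rank-sum test in its normal approximation without continuity correction.

```r
# Simulated stand-in data (NOT the survey data), same group sizes
set.seed(2)
age    <- c(sample(18:23, 111, replace = TRUE),
            sample(18:23, 138, replace = TRUE))
gender <- factor(rep(c("Female", "Male"), times = c(111, 138)))

# Kruskal-Wallis rank test
kw <- kruskal.test(age ~ gender)

# Wilcoxon rank-sum test, normal approximation, no continuity correction
wx <- wilcox.test(age ~ gender, exact = FALSE, correct = FALSE)

c(kruskal = kw$p.value, wilcoxon = wx$p.value)
```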

Solution 2: remove outlier

which(Age==max(Age))

stat.table(Age[-220], Gender[-220])

##        Sample Size Mean Standard Deviation
## Female         111 19.9                  1
## Male           137 20.0                  1

oneway(Age[-220], Gender[-220])

## p value of test of equal means: p = 0.3817 
## Smallest sd:  1    Largest sd : 1 
## A 95% confidence interval for the difference in group means is (-0.4, 0.1)

Parameters of interest: means of age by gender
Method of analysis: ANOVA
Assumptions of Method: residuals have a normal distribution, or sample sizes are large enough
\(\alpha\) = 0.05
Null hypothesis H₀: \(\mu\)₁ = \(\mu\)₂ (groups have the same mean)
Alternative hypothesis H_a: \(\mu\)₁ \(\ne\) \(\mu\)₂ (groups have different means)
p value = 0.3817
0.3817 > 0.05, so we fail to reject H₀; there is no evidence that the group means are not the same.

Assumptions:

Normal residuals: ok
equal variance: smallest sd 1, largest sd 1, 3*1 = 3 > 1, ok