This is a fake data set I made up for this exercise. It is supposed to come from a survey of students at some college. The data is in studentsurvey.
The variables are
colnames(studentsurvey)
## [1] "Score" "Gender" "Year" "GPA" "Distance" "Major"
## [7] "Age"
Score is a combination of several questions designed to measure how “happy” the students are to study at the College. A high number means more happiness. Distance is how far they live from the College. Make sure your answers are complete.
Problem 1 Is there a relationship between Score and Gender?
Problem 2 Is there a relationship between Score and Year?
Problem 3 Is there a relationship between Score and GPA?
Problem 4 Is there a relationship between Score and Distance?
Problem 5 Is there a relationship between Score and Age?
Problem 6 Is there a relationship between Gender and Major?
Problem 7 Is there a relationship between Gender and Age?
attach(studentsurvey)
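attach() puts the column names on the search path, where they can mask or be masked by objects of the same name from other attached data sets. A minimal sketch of an alternative, using with() and $ on a small simulated stand-in (the data frame `survey` and its values are hypothetical, not the real studentsurvey):

```r
# Hypothetical stand-in for studentsurvey; the values are simulated.
set.seed(1)
survey <- data.frame(
  Score  = round(rnorm(20, mean = 6, sd = 2), 1),
  Gender = rep(c("Female", "Male"), each = 10)
)

# with() evaluates the expression inside the data frame and leaves
# the search path untouched, so nothing can be masked.
group_means <- with(survey, tapply(Score, Gender, mean))
print(group_means)

# The same columns can also be reached explicitly with $.
overall_mean <- mean(survey$Score)
print(overall_mean)
```

The rest of this page keeps the attach() style, so the variable names can be used directly.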
Problem 1 Is there a relationship between Score and Gender? Score is a quantitative variable and Gender is a categorical variable with two values, so this is a problem for ANOVA.
bplot(Score,Gender)
The boxplot shows a slight difference between the genders. There are a few slight outliers, but they are no problem.
stat.table(Score, Gender)
## Sample Size Mean Standard Deviation
## Female 111 6.7 1.7
## Male 138 5.9 2.2
The test:
oneway(Score, Gender)
## p value of test of equal means: p = 0.0033
## Smallest sd: 1.7 Largest sd : 2.2
## A 95% confidence interval for the difference in group means is (0.3, 1.2)
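For readers without the course functions, the same kind of analysis can be sketched in base R. The data below are simulated: the group sizes, means and standard deviations are borrowed from the summary table, but the individual values are made up, so the p value will not match the one above. With two groups, the one-way ANOVA and the pooled-variance t test give the same p value, and the t test also reports a confidence interval for the difference in means.

```r
# Simulated scores (hypothetical values, not the real survey data).
set.seed(42)
Score  <- c(rnorm(111, mean = 6.7, sd = 1.7),
            rnorm(138, mean = 5.9, sd = 2.2))
Gender <- factor(rep(c("Female", "Male"), times = c(111, 138)))

# One-way ANOVA: F test of equal group means.
fit <- aov(Score ~ Gender)
p_anova <- summary(fit)[[1]][["Pr(>F)"]][1]

# Pooled-variance two-sample t test: same p value when there are
# two groups, plus a 95% confidence interval for the difference.
tt <- t.test(Score ~ Gender, var.equal = TRUE)
p_ttest <- tt$p.value
ci <- tt$conf.int

print(c(anova = p_anova, t_test = p_ttest))
print(ci)
```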
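Assumptions: ANOVA requires roughly normal residuals and equal group variances. Normality has to be judged from a normal plot of the residuals. The smallest and largest standard deviations (1.7 and 2.2) are close, so the equal-variance assumption is reasonable.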
Problem 2 Is there a relationship between Score and Year? Score is a quantitative variable and Year is a categorical variable, so this is a problem for ANOVA.
bplot(Score, Year)
The table of summary statistics is
stat.table(Score, Year)
## Sample Size Mean Standard Deviation
## Junior 47 6.1 2.0
## Freshman 66 6.6 1.8
## Sophomore 77 6.2 2.2
## Senior 59 6.1 2.1
The test:
oneway(Score, Year)
## p value of test of equal means: p = 0.4855
## Smallest sd: 1.8 Largest sd : 2.2
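Assumptions: ANOVA requires roughly normal residuals and equal group variances. Normality has to be judged from a normal plot of the residuals. The smallest and largest standard deviations (1.8 and 2.2) are close, so the equal-variance assumption is reasonable.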
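The two ANOVA assumptions can be checked numerically in base R. This is only a sketch on simulated data (group sizes taken from the summary table, scores made up): the ratio of the largest to the smallest group standard deviation should be near 1, and the points of a normal plot of the residuals should lie close to a straight line. Here the straightness is summarized by the correlation of the plot coordinates, which is an informal device chosen for this example, not a formal test.

```r
# Simulated data (hypothetical values, not the real survey data).
set.seed(7)
Year  <- factor(rep(c("Freshman", "Sophomore", "Junior", "Senior"),
                    times = c(66, 77, 47, 59)))
Score <- rnorm(length(Year), mean = 6.2, sd = 2)

fit <- aov(Score ~ Year)
res <- residuals(fit)

# Equal variance: compare the largest and smallest group sd.
sds <- tapply(Score, Year, sd)
sd_ratio <- max(sds) / min(sds)

# Normal residuals: coordinates of the normal plot, without drawing
# it (plot.it = FALSE). A correlation near 1 means a straight line.
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)

print(round(sds, 2))
print(c(sd_ratio = sd_ratio, qq_cor = qq_cor))
```

Dropping `plot.it = FALSE` draws the normal plot itself, which is usually the better way to judge normality.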
Problem 3 Is there a relationship between Score and GPA? Score and GPA are both quantitative variables, so this is a problem for Pearson’s correlation coefficient.
The marginal plot shows some increase in Score as the GPA increases.
mplot(Score, GPA)
There are a few slight outliers, but they are no problem.
The test:
pearson.cor(Score, GPA, rho.null = 0)
## p value of test H0: rho=0 vs. Ha: rho <> 0: 0.000
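In base R the corresponding test is cor.test(). A minimal sketch on simulated data (the relationship between GPA and Score below is invented for illustration, so the numbers will differ from the output above):

```r
# Simulated data (hypothetical values, not the real survey data).
set.seed(3)
GPA   <- round(runif(249, min = 2, max = 4), 2)
Score <- 2 + 1.2 * GPA + rnorm(249, sd = 1.8)

# Test H0: rho = 0 against the two-sided alternative rho != 0.
ct <- cor.test(Score, GPA, method = "pearson",
               alternative = "two.sided")
r <- unname(ct$estimate)   # sample correlation coefficient
p <- ct$p.value            # p value of the test

print(c(r = r, p = p))
```

A p value printed as 0.000 only means that it is smaller than 0.0005, not that it is exactly zero.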
Problem 4 Is there a relationship between Score and Distance? Score and Distance are both quantitative variables, so this is a problem for Pearson’s correlation coefficient.
The marginal plot shows some slight outliers. The log transform fixes them.
mplot(Score, Distance)
mplot(Score, log(Distance+1))
Some students apparently live very close to the school (Distance=0), and log(0) is not finite. For this reason we use log(Distance+1).
The test:
pearson.cor(Score, log(Distance+1), rho.null = 0)
## p value of test H0: rho=0 vs. Ha: rho <> 0: 0.7305
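The effect of adding 1 before taking logs is easy to see on a few made-up distances (the vector below is hypothetical). Base R also has log1p(), which computes log(1 + x) directly:

```r
# Hypothetical distances, including one student at distance 0.
Distance <- c(0, 0.5, 2, 10, 150)

raw_log     <- log(Distance)       # -Inf for the zero entry
shifted_log <- log(Distance + 1)   # finite everywhere
same_log    <- log1p(Distance)     # built-in log(1 + x)

print(rbind(raw_log, shifted_log, same_log))
```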
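Problem 5 Is there a relationship between Score and Age? Score and Age are both quantitative variables, so this is again a problem for Pearson’s correlation coefficient. The marginal plot shows one severe outlier: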
length(Score)
## [1] 249
length(Age)
## [1] 249
mplot(Score, Age)
Unfortunately the log transform does not help, so the only way to proceed is to eliminate the outlier.
which(Age==max(Age))
## [1] 220
The test:
pearson.cor(Age[-220], Score[-220], rho.null = 0)
## p value of test H0: rho=0 vs. Ha: rho <> 0: 0.3052
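Rather than typing the index by hand, the position of the outlier can be stored in a variable and reused. A sketch on simulated data with one planted outlier (all values hypothetical):

```r
# Simulated data (hypothetical values) with one planted outlier.
set.seed(11)
Age   <- round(rnorm(30, mean = 20, sd = 1))
Score <- round(rnorm(30, mean = 6, sd = 2), 1)
Age[17] <- 65

# which.max() returns the position of the (first) largest value.
out  <- which.max(Age)
keep <- -out               # negative index drops that observation

# Run the test with and without the outlier to see its influence.
ct_all  <- cor.test(Age, Score)
ct_trim <- cor.test(Age[keep], Score[keep])

print(out)
print(c(with_outlier = ct_all$p.value,
        without_outlier = ct_trim$p.value))
```

The same index must be removed from both variables, otherwise the pairs no longer line up.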
Problem 6 Is there a relationship between Gender and Major? Gender and Major are both categorical variables, so this is a problem for the chi-square test of independence.
table(Gender, Major)
## Major
## Gender Biology English Physics Psychology Spanish
## Female 19 23 23 24 22
## Male 24 24 31 29 30
chi.ind.test(table(Gender, Major))
## p value of test p=0.9664
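Because the full table of counts is shown above, the test can be reproduced in base R with chisq.test(). The counts below are copied from that table; the object names are my own. The expected counts are also worth a look, since the chi-square approximation is usually considered reliable when they are all at least 5.

```r
# Counts copied from the Gender by Major table above.
counts <- matrix(
  c(19, 23, 23, 24, 22,
    24, 24, 31, 29, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Gender = c("Female", "Male"),
    Major  = c("Biology", "English", "Physics", "Psychology", "Spanish")
  )
)

res <- chisq.test(counts)

print(round(res$expected, 1))   # expected counts under independence
print(res$parameter)            # degrees of freedom: (2-1)*(5-1) = 4
print(round(res$p.value, 4))    # approximately 0.966
```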
Problem 7 Is there a relationship between Gender and Age? Age is a quantitative variable and Gender is a categorical variable with two values, so this is a problem for ANOVA.
The boxplot shows a few serious outliers. One could try transformations, but because the outliers occur among both especially small and especially large observations, these won’t work.
bplot(Age, Gender)
bplot(log(Age), Gender)
Solution 1: non-parametric method
The table of summary statistics is
stat.table(Age, Gender, Mean=FALSE )
## Sample Size Median IQR
## Female 111 20 1
## Male 138 20 2
Now
kruskalwallis(Age, Gender)
## p value of test of equal means: p = 0.261060279952181
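The base R counterpart is kruskal.test(). The sketch below uses simulated ages with one planted outlier (all values hypothetical) and illustrates why a rank-based method is a sensible choice here: making the outlier ten times larger does not change its rank, so the test result stays exactly the same.

```r
# Simulated data (hypothetical values) with one planted outlier.
set.seed(5)
Gender <- factor(rep(c("Female", "Male"), times = c(111, 138)))
Age    <- round(rnorm(249, mean = 20, sd = 1))
Age[220] <- 60

# Medians and IQRs per group, the summaries that go with this test.
meds <- tapply(Age, Gender, median)
iqrs <- tapply(Age, Gender, IQR)

kw <- kruskal.test(Age ~ Gender)

# Exaggerate the outlier: the ranks, and hence the test, are unchanged.
Age2 <- Age
Age2[220] <- 600
kw2 <- kruskal.test(Age2 ~ Gender)

print(rbind(median = meds, IQR = iqrs))
print(c(p_original = kw$p.value, p_exaggerated = kw2$p.value))
```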
Solution 2: remove outlier
which(Age==max(Age))
stat.table(Age[-220], Gender[-220])
## Sample Size Mean Standard Deviation
## Female 111 19.9 1
## Male 137 20.0 1
oneway(Age[-220], Gender[-220])
## p value of test of equal means: p = 0.3817
## Smallest sd: 1 Largest sd : 1
## A 95% confidence interval for the difference in group means is (-0.4, 0.1)