The method we will study now was first developed by Ronald A. Fisher.
Chasnoff and others obtained several measures and responses for newborn babies whose mothers were classified by degree of cocaine use.
The study was conducted in the Perinatal Center for Chemical Dependence at Northwestern University Medical School. The measurement given here is the length of the newborn.
Source: Cocaine abuse during pregnancy: correlation between prenatal care and perinatal outcome
Authors: SN MacGregor, LG Keith, JA Bachicha, and IJ Chasnoff
Obstetrics and Gynecology 1989;74:882-885
Basic Question: does the cocaine use of the mother influence the health of the baby?
head(mothers)
## Status Length
## 1 Drug Free 44.3
## 2 Drug Free 45.3
## 3 Drug Free 46.9
## 4 Drug Free 47.0
## 5 Drug Free 47.2
## 6 Drug Free 47.8
attach(mothers)
table(Status)
## Status
## Drug Free First Trimester Throughout
## 39 19 36
So we have three groups: Drug Free, First Trimester and Throughout. We are interested in the population mean lengths; specifically, we want to know whether they are the same.
Note: For historical reasons a categorical predictor is often called a factor
As always we begin with a graph. When comparing a quantitative and a categorical variable this should be the boxplot:
bplot(Length, Status)
We see some evidence of shorter babies if the mother was using cocaine longer.
Note that the categorical variable Status has an ordering:
Drug Free < First Trimester < Throughout
Any graphs or tables should use this ordering! Here we get lucky because it is also the alphabetical ordering; otherwise we would need to change it.
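The notes use the course's own R commands, but the idea of keeping the substantive ordering of the factor levels (rather than whatever order a sort produces) can be sketched in a few lines of plain Python. The counts are the ones from the table above:

```python
# Keep the factor levels in their substantive order, not alphabetical order.
# Level names and counts are taken from the mothers data set above.
levels = ["Drug Free", "First Trimester", "Throughout"]

# Suppose a summary arrived in some arbitrary order:
counts = {"Throughout": 36, "Drug Free": 39, "First Trimester": 19}

# Sort by the substantive ordering before making any table or graph:
ordered = sorted(counts.items(), key=lambda kv: levels.index(kv[0]))
print(ordered)
```

Here the alphabetical and the substantive orderings happen to agree, but the explicit `levels` list is what guarantees the right order in general.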
But there is also something very strange in what this graph says about our dataset. What is it?
Next a table with the summary statistics. Usually this includes the sample size, mean and standard deviations. We can use the command
stat.table(Length, Status)
## Sample Size Mean Standard Deviation
## Drug Free 39 51.1 2.9
## First Trimester 19 49.3 2.5
## Throughout 36 48.0 3.6
By default this rounds to the one digit behind the decimal point. If we need to round differently use
stat.table(Length, Status, ndigit=3)
## Sample Size Mean Standard Deviation
## Drug Free 39 51.1 2.897
## First Trimester 19 49.3 2.502
## Throughout 36 48.0 3.601
Finally the test. The standard method for analyzing this type of data is called ANOVA (ANalysis Of VAriance). If there is one factor (predictor) this is often called a one-way ANOVA problem. We can run the test with the command
oneway(Length, Status)
## p value of test of equal means: p = 0.000
## Smallest sd: 2.5 Largest sd : 3.6
The details of the test are:
In step 3 we have the assumptions:
The first assumption is about the so-called residuals. These are the observations minus their group means, so we could calculate
x1 <- Length[Status == "Drug Free"]
x1
## [1] 44.3 45.3 46.9 47.0 47.2 47.8 47.8 48.5 48.8 49.6 50.1 50.2 50.4 50.5
## [15] 50.5 50.8 50.9 50.9 51.2 51.3 51.4 51.4 51.7 51.7 51.7 51.7 51.8 52.1
## [29] 52.6 52.9 53.0 53.5 53.7 54.3 55.4 55.4 55.9 56.2 56.5
x1 - mean(x1)
## [1] -6.8 -5.8 -4.2 -4.1 -3.9 -3.3 -3.3 -2.6 -2.3 -1.5 -1.0 -0.9 -0.7 -0.6
## [15] -0.6 -0.3 -0.2 -0.2 0.1 0.2 0.3 0.3 0.6 0.6 0.6 0.6 0.7 1.0
## [29] 1.5 1.8 1.9 2.4 2.6 3.2 4.3 4.3 4.8 5.1 5.4
and the same for the other groups.
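The same calculation is easy to reproduce outside of the course's R package. As a quick sketch in Python, using the Drug Free values printed above:

```python
# Residuals = observations minus their group mean, for the Drug Free group.
# Values copied from the printout above.
x1 = [44.3, 45.3, 46.9, 47.0, 47.2, 47.8, 47.8, 48.5, 48.8, 49.6, 50.1,
      50.2, 50.4, 50.5, 50.5, 50.8, 50.9, 50.9, 51.2, 51.3, 51.4, 51.4,
      51.7, 51.7, 51.7, 51.7, 51.8, 52.1, 52.6, 52.9, 53.0, 53.5, 53.7,
      54.3, 55.4, 55.4, 55.9, 56.2, 56.5]
m = sum(x1) / len(x1)            # group mean, 51.1 here
residuals = [x - m for x in x1]

# Residuals within a group always sum to zero (up to rounding).
print(round(m, 1), round(residuals[0], 1))   # 51.1 -6.8
```

Note the built-in check: within each group the residuals must sum to zero, which is a handy way to catch a wrong group mean.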
Now we could use the nplot command to do the graph. It is also done automatically by the oneway command!
How about the equal variance? Here we have the following rule of thumb: it is ok if the largest stdev is not more than 3 times the smallest. From the printout we see that
Smallest sd: 2.5 Largest sd : 3.6
and
3*2.5 = 7.5 > 3.6
so this is ok as well.
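This rule of thumb is simple enough to write as a one-line helper; a minimal sketch in Python (the function name is ours, not part of the course's R package):

```python
# Rule of thumb for the equal-variance assumption:
# ok if the largest standard deviation is at most 3 times the smallest.
def equal_variance_ok(sds):
    return max(sds) <= 3 * min(sds)

# Standard deviations of the three groups from the printout above:
print(equal_variance_ok([2.9, 2.5, 3.6]))   # 3 * 2.5 = 7.5 > 3.6, so True
```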
If the normal plot is ok but there is a problem with the equal variance we can instead run
oneway(Length, Status, var.equal=FALSE)
The oneway command has an argument ndigit which can be used to change how much rounding is done. The default is 1 digit behind the decimal point.
run.app(anova)
This app shows why this method is called the Analysis of Variance
What to do:
Just run the movie. As the means move farther apart the boxplots do as well. In the beginning the overall variance is close to the within-group variances, but as the means move farther apart the overall variance grows away from the within-group variances. So by comparing the within-group variances to the overall variance we can see whether the means are the same or not.
We can choose to have all three groups different, or two the same and the third different. An interesting fact to observe is this: if all groups are different and n=20 the p-value is 0.04 when \(\mu =0.6\), but if only group C is different the p-value is already 0.01, so the test considers this “more different”.
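The comparison of within-group to between-group variation is exactly what the ANOVA F statistic measures. A minimal from-scratch sketch in Python on tiny made-up numbers (the data are purely illustrative, not from any of the data sets above):

```python
# One-way ANOVA F statistic computed from scratch, to show the
# "analysis of variance" idea: between-group variation (MSB)
# compared against within-group variation (MSW).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]

k = len(groups)                                   # number of groups
n = sum(len(g) for g in groups)                   # total sample size
grand_mean = sum(sum(g) for g in groups) / n

# Between-group sum of squares: group means around the grand mean.
ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: observations around their group means.
ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

msb = ssb / (k - 1)        # between-group mean square
msw = ssw / (n - k)        # within-group mean square
F = msb / msw              # large F => group means likely differ
print(F)                   # 3.0 for these numbers
```

The p-value then comes from the F distribution with k-1 and n-k degrees of freedom, which is what the oneway command reports.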
ANOVA in the form described here does not use any information regarding any ordering of the factor levels. So for example consider the following two situations:
Data Set 1: Data on the prices of T-shirts and where they were bought. Is there a difference in the prices depending on the store location?
oneway(Price, Store)
## p value of test of equal means: p = 0.0046
## Smallest sd: 2.2 Largest sd : 3
Data Set 2: Data on the prices of T-shirts and their sizes. Is there a difference in the prices depending on the size?
oneway(Price, Size)
## p value of test of equal means: p = 0.0046
## Smallest sd: 2.2 Largest sd : 3
So the test has the same p value in both cases (0.0046), but in data set 2 there is a natural ordering of the factor “size”, and the data is not consistent with this ordering, so here we should be much more cautious with any conclusions than in the case of data set 1.
The flammability of fabric used in children’s sleepwear is tested by placing a flame under a piece of fabric until it ignites. The flame is then removed, and the fabric stops burning. The length of the charred portion of the fabric is measured. In the data set pieces of the same cloth were sent to five different laboratories, which then carried out this experiment eleven times.
head(flammability)
## Labs Length
## 1 Lab 1 2.9
## 2 Lab 1 3.1
## 3 Lab 1 3.1
## 4 Lab 1 3.7
## 5 Lab 1 3.1
## 6 Lab 1 4.2
attach(flammability)
table(Labs)
## Labs
## Lab 1 Lab 2 Lab 3 Lab 4 Lab 5
## 11 11 11 11 11
Lab : Values: “Lab 1”, .., “Lab 5” are text, therefore categorical
Length: Values: 2.9, 3.1… are numerical, therefore quantitative
one categorical and one quantitative variable → ANOVA
Again we start with the boxplot:
bplot(Length,Labs)
There seem to be some differences between labs. There are no outliers.
Next the table of summary statistics:
stat.table(Length, Labs, ndigit=2)
## Sample Size Mean Standard Deviation
## Lab 1 11 3.34 0.45
## Lab 2 11 3.60 0.46
## Lab 3 11 3.30 0.37
## Lab 4 11 3.00 0.29
## Lab 5 11 3.65 0.43
and finally the test:
oneway(Length, Labs)
## p value of test of equal means: p = 0.0033
## Smallest sd: 0.3 Largest sd : 0.5
Assumptions:
normal residuals ok (see normal probability plot)
equal variance: smallest stdev is 0.29, largest is 0.46
3*0.29 = 0.87 > 0.46, ok
The Survey of Study Habits and Attitudes (SSHA) is a psychological test that measures the motivation, attitude towards school, and the study habits of students. Scores range from 0 to 200. We have the scores of 20 male and 18 female first-year students at a private college.
head(studyhabits)
## Gender Score
## 1 Men 108
## 2 Men 140
## 3 Men 114
## 4 Men 91
## 5 Men 180
## 6 Men 115
attach(studyhabits)
table(Gender)
## Gender
## Men Women
## 20 18
Basic question: is there a difference between the study habits of men and women?
Gender : Values: “Men”, “Women” are text, therefore categorical
Score: Values: 115, 126… are numerical, therefore quantitative
one categorical and one quantitative variable → ANOVA
bplot(Score, Gender)
Apparently the women tend to score just a little higher than the men, but the difference is not very large.
stat.table(Score, Gender, ndigit=2)
## Sample Size Mean Standard Deviation
## Men 20 121.25 32.85
## Women 18 141.06 26.44
oneway(Score, Gender)
## p value of test of equal means: p = 0.0495
## Smallest sd: 26.4 Largest sd : 32.9
## A 95% confidence interval for the difference in group means is (-39.3, -0.3)
Assumptions:
a. normal residuals
b. equal variance: smallest stdev is 26.4, largest is 32.9
3*26.4 = 79.2 > 32.9, ok
In some ways this is a special case of the ANOVA because the categorical variable has only two values (Men and Women). This case is also known as the 2-sample problem.
The main advantage of having a 2-sample problem is that the formulas are much simpler, but because we are not doing any calculations by hand we don’t have to worry about that. There is one type of question we can answer in this situation:
what is a confidence interval for the difference in group means?
In fact the oneway command gives the answer, a confidence interval for the difference of the group means.
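Such an interval can also be computed from the summary statistics alone. A rough sketch in Python, using the study habits numbers above; note it uses the normal quantile 1.96 instead of the exact t quantile, so it comes out slightly narrower than the interval oneway prints:

```python
from math import sqrt

# Approximate CI for the difference of two group means, from summary
# statistics (study habits data above). The normal quantile 1.96 is used
# in place of the exact t quantile, so this is only an approximation.
def diff_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)   # standard error of the difference
    d = mean1 - mean2
    return d - z * se, d + z * se

lo, hi = diff_ci(121.25, 32.85, 20, 141.06, 26.44, 18)
print(round(lo, 1), round(hi, 1))   # close to the (-39.3, -0.3) from oneway
```

Since the whole interval is below zero, it agrees with the test: at the 5% level the mean scores of men and women differ.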
Is it harder to maintain your balance while you are concentrating?
Nine elderly (6 men and 3 women) and eight young men were subjects in this experiment. Each subject stood barefoot on a “force platform” and was asked to maintain a stable upright position and to react as quickly as possible to an unpredictable noise by pressing a handheld button. The noise came randomly and the subject concentrated on reacting as quickly as possible. The platform automatically measured how much each subject swayed in millimeters in both the forward/backward and the side-to-side directions.
head(balance)
## forward_backward side_side Age_Group
## 1 19 14 elderly
## 2 30 41 elderly
## 3 20 18 elderly
## 4 19 11 elderly
## 5 29 16 elderly
## 6 25 24 elderly
attach(balance)
For now we will consider the side-to-side movement.
Age_group : Values: “elderly”, “young” are text, therefore categorical
side_side: Values: 14, 41… are numerical, therefore quantitative
one categorical and one quantitative variable→ ANOVA
As before we can start with a boxplot:
bplot(side_side, Age_Group)
Clearly the elderly have more movement than the young. This is also seen in the table of summary statistics:
stat.table(side_side, Age_Group)
## Sample Size Mean Standard Deviation
## elderly 9 22.2 10.3
## young 8 15.1 3.9
Let’s say that instead of testing whether the two groups have equal means we would like to find an estimate for the difference in means.
Now obviously we have
\(\overline{X}_\text{Elderly} - \overline{X}_\text{Young} = 22.2-15.1 = 7.1\)
The oneway command will automatically find a 95% confidence interval if the categorical variable has just two values. If we want a 90% confidence interval for the difference of the group means, we can run
oneway(side_side, Age_Group, conf.level=90)
## p value of test of equal means: p = 0.0864
## Smallest sd: 3.9 Largest sd : 10.3
## A 90% confidence interval for the difference in group means is (0.4, 13.8)
The method that finds the confidence interval has the same assumptions as the test:
normal residuals: some of the points fall away from the straight line, but only just a little, and that is ok.
equal variance:
Here s1 = 10.3 and s2 = 3.9, so 3*3.9 = 11.7 > 10.3, so this is (just barely) ok
How about the forward_backward movement?
Age_group : Values: “elderly”, “young” are text, therefore categorical
forward_backward: Values: 19, 30… are numerical, therefore quantitative
one categorical and one quantitative variable→ ANOVA
bplot(forward_backward, Age_Group)
We see there is an outlier in the elderly group. Checking the data set we see it is observation #9. Outliers are often problematic. Here are some consequences:
stat.table(forward_backward, Age_Group)
## Sample Size Mean Standard Deviation
## elderly 9 26.3 9.8
## young 8 18.1 4.1
but if we remove obs#9 we find
stat.table(forward_backward[-9], Age_Group[-9])
## Sample Size Mean Standard Deviation
## elderly 8 23.4 4.4
## young 8 18.1 4.1
so both the mean and the standard deviation for the elderly have changed considerably.
By the way, as we learned in 3101, sometimes it is better to use the median as a measure of “average”, and if so we should also use the interquartile range as a measure for “spread”. So we might use a table of the form
stat.table(forward_backward, Age_Group, Mean=FALSE)
## Sample Size Median IQR
## elderly 9 24 9.0
## young 8 17 6.5
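The contrast between the sensitive (mean, sd) and the robust (median, IQR) measures is easy to demonstrate. A sketch in Python with made-up numbers shaped like the elderly forward/backward sways (the 50 plays the role of the outlying observation #9; these are NOT the actual data):

```python
import statistics

# Toy data with one outlier, to show how mean and sd react to it
# while median barely moves. Not the actual balance data.
sway = [19, 30, 20, 19, 29, 25, 24, 21, 50]

with_outlier = (statistics.mean(sway), statistics.stdev(sway))
without = (statistics.mean(sway[:-1]), statistics.stdev(sway[:-1]))
print(with_outlier, without)   # both mean and sd drop once the 50 is removed

# The median is robust: it changes very little.
med_with = statistics.median(sway)
med_without = statistics.median(sway[:-1])
print(med_with, med_without)
```

In this toy example removing the outlier cuts the standard deviation by more than half, while the median shifts only slightly; this mirrors what happened in the table above.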
Outliers are also an indication of a problem with the normal assumption, as we can see in the normal plot:
oneway(forward_backward, Age_Group)
## p value of test of equal means: p = 0.0435
## Smallest sd: 4.1 Largest sd : 9.8
## A 95% confidence interval for the difference in group means is (0.4, 16.1)
Finally, they influence the result of the 2-sample method. Say we want to find a 95% confidence interval for the difference in forward_backward motion:
oneway(forward_backward, Age_Group)
## p value of test of equal means: p = 0.0435
## Smallest sd: 4.1 Largest sd : 9.8
## A 95% confidence interval for the difference in group means is (0.4, 16.1)
but
oneway(forward_backward[-9], Age_Group[-9])
## p value of test of equal means: p = 0.0264
## Smallest sd: 4.1 Largest sd : 4.4
## A 95% confidence interval for the difference in group means is (0.7, 9.8)
and this interval has an upper limit only about half as large.
Which of these intervals is correct? The first one is wrong because the normal assumption is false, but the second one has the problem that we had to change the data, and that is something we should almost never do. So neither is very good. We will come back to this problem later.
Name | First Nobel Prize | Second Nobel Prize
---|---|---
Linus Pauling | Chemistry 1954 (studies of molecular structure and the chemical bond) | Peace 1962 (fight against atomic testing)
Frederick Sanger | Chemistry 1958 (determining the structure of the insulin molecule) | Chemistry 1980 (biochemical studies of nucleic acids)
Marie Sklodowska Curie | Physics 1903 (discovery of radioactivity in uranium) | Chemistry 1911 (discovery of the elements radium and polonium)
John Bardeen | Physics 1956 (development of the transistor effect) | Physics 1972 (theory of superconductivity)
Patients with advanced cancers of the stomach, bronchus, colon, ovary or breast were treated with ascorbate. The purpose of the study was to determine if patient survival differed with respect to the organ affected by the cancer.
Cameron, E. and Pauling, L. (1978) Supplemental ascorbate in the supportive treatment of cancer: re-evaluation of prolongation of survival times in terminal human cancer. Proceedings of the National Academy of Science USA, 75, 4538-4542.
head(cancersurvival)
## Survival Cancer
## 1 124 Stomach
## 2 42 Stomach
## 3 25 Stomach
## 4 45 Stomach
## 5 412 Stomach
## 6 51 Stomach
attach(cancersurvival)
table(Cancer)
## Cancer
## Breast Bronchus Colon Ovary Stomach
## 11 17 17 6 13
Cancer : Values: “Stomach”, .., “Breast” are text, therefore categorical
Survival: Values: 124, 42… are numerical, therefore quantitative
one categorical and one quantitative variable→ ANOVA
bplot(Survival, Cancer)
Here we have a dataset with many outliers. Usually this means that the assumption of normal residuals will fail, and indeed:
oneway(Survival, Cancer)
## p value of test of equal means: p = 0.000
## Smallest sd: 209.9 Largest sd : 1239
So what do we do now? We will see in a bit.
Note that the equal variance assumption is also violated. Having both assumptions violated is something that happens quite often!
Let’s return for a moment to the Mothers and Cocaine Use data. There we wrote the null hypothesis as follows:
H0: \(\mu_1 = \mu_2 = \mu_3\) (groups have the same means)
and this is the natural null hypothesis if we think in terms of comparing group means. But there is another way to think of this problem:
Does the drug status of the mother influence the length of the baby?
Now if the answer is no then there are no differences between the groups (with respect to the length of the baby!), and so the null hypothesis
H0: \(\mu_1 = \mu_2 = \mu_3\)
is equivalent to saying
H0: “Drug Status” and “Length” are independent (one does not influence the other)
and we again see the same null hypothesis of independence as before in the case of categorical data!