(or better Yule-Simpson’s Paradox)
This is the famous Berkeley data set on sex discrimination. In the fall quarter of 1973, 8,442 men and 4,321 women applied for admission to the graduate school.
Source: Freedman, D., Pisani, R., Purves, R. and Adhikari, A. (1991) Statistics (2nd edition). W. W. Norton.
First we will look at the overall admittance numbers:
attach(berkeleyadmissions)
berkeleyadmissions[1:2, 1:3]
##   Overall  Sex Admitted
## 1    Men: 8442     3738
## 2  Women: 4321     1494
Let’s find the percentages:
round(c(3738/8442, 1494/4321)*100, 1)
## [1] 44.3 34.6
which shows a sizable difference in admission rates. We can also do the test:
chi.ind.test(berkeleyadmissions[1:2, 2:3])
## p value of test p=0.000
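The function chi.ind.test belongs to the course materials. As a rough cross-check, the same question can be put to base R's chisq.test. This is only a sketch: it assumes the test should compare admitted against rejected applicants for each sex, it derives the rejected counts as applied minus admitted, and the variable names are illustrative.

```r
# Counts taken from the overall table above
applied  <- c(Men = 8442, Women = 4321)
admitted <- c(Men = 3738, Women = 1494)

# 2 x 2 contingency table: rows = sex, columns = admission outcome
overall <- cbind(Admitted = admitted, Rejected = applied - admitted)
overall

# Chi-square test of independence between sex and admission outcome
overall_test <- chisq.test(overall)
overall_test$p.value
```

With counts this large and a gap of almost ten percentage points, the p-value should be far below any usual significance level.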
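Now let’s consider the data broken down by major. The table covers six majors (rows 1 to 6 correspond to majors A to F), which together account for 2,691 of the male and 1,835 of the female applicants: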
berM <- berkeleyadmissions[ ,5:6]
berM
##   Men.Applied Men.Admitted
## 1         825          512
## 2         560          353
## 3         325          120
## 4         417          138
## 5         191           53
## 6         373           22
round(berM[ ,2]/berM[ ,1]*100, 2)
## [1] 62.06 63.04 36.92 33.09 27.75 5.90
berF <- berkeleyadmissions[ ,7:8]
berF
##   Women.Applied Women.Admitted
## 1           108             89
## 2            25             17
## 3           593            202
## 4           375            131
## 5           393             94
## 6           341             24
round(berF[,2]/berF[,1]*100, 2)
## [1] 82.41 68.00 34.06 34.93 23.92 7.04
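and suddenly any hint of discrimination against women is gone: in four of the six majors the women's admission rate is higher than the men's, and in the other two the gap is less than four percentage points.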
A formal hypothesis test for this is possible but outside the scope of this course.
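For the curious, one common choice for such a stratified analysis is the Cochran-Mantel-Haenszel test, available in base R as mantelhaen.test. The sketch below is not necessarily the test the course has in mind. It re-enters the counts from the tables above, and the object names are made up for illustration.

```r
# Counts per major, copied from the tables above
men_applied    <- c(825, 560, 325, 417, 191, 373)
men_admitted   <- c(512, 353, 120, 138,  53,  22)
women_applied  <- c(108,  25, 593, 375, 393, 341)
women_admitted <- c( 89,  17, 202, 131,  94,  24)

# Build a 2 x 2 x 6 array: outcome by sex, one 2 x 2 slice per major.
# rbind() gives one column per major; array() fills each slice column-wise,
# so the row order below must be admitted/rejected for men, then for women.
by_major <- array(
  rbind(men_admitted,   men_applied   - men_admitted,
        women_admitted, women_applied - women_admitted),
  dim = c(2, 2, 6),
  dimnames = list(Outcome = c("Admitted", "Rejected"),
                  Sex     = c("Men", "Women"),
                  Major   = LETTERS[1:6])
)

# Test for an association between sex and outcome after stratifying by major
cmh <- mantelhaen.test(by_major)
cmh$p.value
```

A large p-value here would be read as no evidence of an association between sex and admission once major is held fixed. The test assumes a roughly common odds ratio across the majors, which major A strains, so the result is indicative only.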
So, we have a paradox:
We found strong evidence (p-value = 0.000) of a relationship between the gender of an applicant and whether or not they were admitted to graduate school.
When we broke down the data further by the major of the applicant, this relationship went away.
How is this possible?
Actually, we already know the answer: this is again an issue caused by confusing Cause-Effect with Latent Variable.
There is clearly a relationship between acceptance and gender. But saying it is due to sex discrimination is saying we have a cause-effect relationship. Instead we now know that it arises from the latent variable Major.
Can we understand this in the Berkeley Admissions case?
Majors A and B are very popular with the men - 1385 men applied vs. 133 women. Majors A and B are also easy to get into - about 2 out of 3 of the applicants get accepted (62% of the men and 80% of the women). So although women are accepted at least as often as men in these majors, about 8 times as many men are accepted (865 vs. 106) because about 10 times as many applied.
Majors C-F are more popular with the women - 1306 men applied vs. 1702 women. But Majors C-F are hard to get into - about 1 in 4 of the applicants (men or women) get accepted. So these majors don’t add much to the total student body.
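The arithmetic behind this argument can be checked with a short sketch. It re-enters the counts from the tables above, groups the majors into A-B and C-F, and compares the rates within each group with the rates pooled over all six majors. The variable names are illustrative and not part of the course data set.

```r
# Counts per major (A to F), copied from the tables above
major          <- c("A", "B", "C", "D", "E", "F")
men_applied    <- c(825, 560, 325, 417, 191, 373)
men_admitted   <- c(512, 353, 120, 138,  53,  22)
women_applied  <- c(108,  25, 593, 375, 393, 341)
women_admitted <- c( 89,  17, 202, 131,  94,  24)

# Split the majors into the "easy" group A-B and the "hard" group C-F
group <- ifelse(major %in% c("A", "B"), "A-B", "C-F")

# Sum applicants and admissions within each group
totals <- rowsum(cbind(men_applied, men_admitted,
                       women_applied, women_admitted), group)
totals

# Admission rates (in percent) within each group
round(100 * totals[, "men_admitted"]   / totals[, "men_applied"], 1)
round(100 * totals[, "women_admitted"] / totals[, "women_applied"], 1)

# Admission rates pooled over all six majors
pooled <- colSums(totals)
round(100 * pooled[c("men_admitted", "women_admitted")] /
            pooled[c("men_applied", "women_applied")], 1)
```

Within each group the women's rate is at least as high as the men's, yet the pooled rates again favor the men (about 44.5% vs. 30.4%), because most men applied to the easy majors and most women to the hard ones.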
If in an observational study (as opposed to a clinical trial with random assignment to “treatment” and “control”) we find a relationship (association) between two variables, it is usually very hard (impossible?) to decide whether it is due to a cause-effect relationship or whether a latent variable is responsible for it. In the Berkeley case it turned out that Major was a latent variable. A list of other potential latent variables includes:
and so on.
Note that we could determine here that Major is a latent variable explaining the relationship between Gender and Acceptance only because we had the data to do so! So generally in a study you want to “measure” as many variables as possible, because you won’t know ahead of time which of them might turn out to be important.