Consider the variable Gender. Clearly this is categorical data. Usually the first thing one would do is simply count how many of each type there are:
attach(wrinccensus)
table(Gender)
## Gender
## Female Male
## 9510 14281
According to a table from the US Department of Education there were 19,980,000 students in US colleges in the fall of 2010. Their breakdown by race was as follows:
##
## American Indian Asian Black Hispanic
## 196000 1282000 3039000 2741000
## White
## 12722000
If a table is used for presentation purposes it should usually include a little more information and maybe a better ordering, for example by size. Also, big numbers are often expressed in bigger units:
Race | Number (in 1000) | Percentage |
---|---|---|
White | 12722 | 63.7 |
Black | 3039 | 15.2 |
Hispanic | 2741 | 13.7 |
Asian | 1282 | 6.4 |
American Indian | 196 | 1.0 |
In order to compute the percentages we need to divide by the total and multiply by 100. The total is found using the sum command:
x <- c(12722, 3039, 2741, 1282, 196)
round(x/sum(x)*100,1)
## [1] 63.7 15.2 13.7 6.4 1.0
Percentages are usually rounded to one digit behind the decimal, like above.
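The ordering and the percentages for the presentation table can also be produced in one go. The following is a minimal sketch in base R; the object names `counts`, `perc` and `race_summary` are made up for this illustration and are not part of the course data.

```r
# Counts in thousands, copied from the table above
counts <- c("American Indian" = 196, Asian = 1282, Black = 3039,
            Hispanic = 2741, White = 12722)

# Order the groups from largest to smallest
counts <- sort(counts, decreasing = TRUE)

# prop.table() divides each entry by the total
perc <- round(prop.table(counts) * 100, 1)

# Combine counts and percentages; the group names become the row names
race_summary <- data.frame(Number = counts, Percentage = perc)
race_summary
```

Sorting before computing the percentages keeps the two columns in the same order without any manual indexing.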
As we said before, some categorical variables have a built-in (natural) ordering, for example t-shirt size (small, medium, large, x-large) or grades (A,B, …). Such an ordering can also be used.
attach(wrinccensus)
tbl <- table(Satisfaction)
tbl
## Satisfaction
## 1 2 3 4 5
## 3096 2783 3854 7683 6375
perc <- round(tbl/sum(tbl)*100, 1)
perc
## Satisfaction
## 1 2 3 4 5
## 13.0 11.7 16.2 32.3 26.8
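When the categories are text rather than numbers, R sorts them alphabetically unless told otherwise. One way to keep a natural ordering is to store the variable as a factor with explicit levels. The sketch below uses a small invented t-shirt sample (the data and the names `sizes`, `sizes_f`, `size_tbl` are hypothetical):

```r
# Hypothetical sample of t-shirt sizes
sizes <- c("large", "small", "medium", "x-large", "medium", "small", "medium")

# Without explicit levels the table is sorted alphabetically
table(sizes)

# Declaring the levels fixes the order used by table()
sizes_f <- factor(sizes,
                  levels = c("small", "medium", "large", "x-large"),
                  ordered = TRUE)
size_tbl <- table(sizes_f)
size_tbl
```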
A very popular Choice: Pie Charts
but:
[Death to Pie Charts](http://www.storytellingwithdata.com/blog/2011/07/death-to-pie-charts)
Much better: Bar charts
barchart(race.table)
barchart(Gender)
Sometimes we want to change the ordering of the bars. We can do this with the change.order routine:
barchart(change.order(Gender, new.order=2:1))
Note: to show the graph based on percentages, use the argument percentage = "grand":
barchart(Gender, percentage = "grand")
Let’s do the graph based on percentages. Also, let’s change the order of the bars to run from largest to smallest. Here we use the table as the argument to barchart, so we need to change the order in the table:
race.table[c(5, 3, 4, 2, 1)]
##
## White Black Hispanic Asian
## 12722000 3039000 2741000 1282000
## American Indian
## 196000
so this puts things in the right order.
barchart(race.table[c(5, 3, 4, 2, 1)],
percentage = "grand")
attach(wrinccensus)
barchart(Job.Level)
barchart(change.order(Job.Level,
c("7", "6", "5", "4", "3", "2", "1")),
percentage = "grand")
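The barchart and change.order routines come with the course material. For readers without them, a similar percentage bar chart can be sketched with base R's `barplot`. This is an alternative shown under that assumption, not the routine used above, and the names `counts` and `perc` are chosen only for this example.

```r
# Counts copied from the race table above
counts <- c("American Indian" = 196000, Asian = 1282000, Black = 3039000,
            Hispanic = 2741000, White = 12722000)

# Percentages of the grand total, largest group first
perc <- sort(counts / sum(counts) * 100, decreasing = TRUE)

# las = 2 turns the labels so that long group names fit
barplot(perc, ylab = "Percentage", las = 2, cex.names = 0.8)
```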
Example This is a nice professional table from the website of the CDC (Centers for Disease Control) about the dangers of smoking:
Decide based on the background of the data which number is more relevant/important/interesting.
Some of the things to consider are:
If the data is a random sample from a larger population percentages are often better:
Example: of 150 randomly selected people in a phone survey, 85 said they would vote for candidate A in the next election → use 57% instead.
Example: in a company with 150 employees, 85 said they like their job → use these numbers.
For small numbers use frequencies, for large numbers use percentages.
When using percentages it has to be clear what the totals were:
Example: an advertisement in the newspaper reads: “Almost 70% of the participants in a scientific study said they prefer Coke over Pepsi”.
Now if this study had 1000 participants and about 700 of those said they like Coke better than Pepsi, that is quite impressive. On the other hand, if it had 3 participants, two of whom liked Coke (2 out of 3 = 67%, “almost” 70%), then this may not be so interesting!
When comparing groups of unequal sizes, percentages are almost always necessary:
Example: in a survey of the employees in a company they were asked whether they liked their current position:
Gender | Yes | No |
---|---|---|
Male | 123 | 85 |
Female | 88 | 61 |
At first glance it seems that men are happier with their position than women (123 vs 88), but notice that there are more men than women in total (208 vs 149), so even if they are equally happy we would expect more men than women to say yes. Changing to percentages gives
Gender | Happy (%) |
---|---|
Male | 59.1 |
Female | 59.1 |
Notice another advantage of the table with percentages: because there are only the two options yes and no, we need only the percentage of one; the other is simply 100 minus that value:
Gender | Unhappy (%) |
---|---|
Male | 40.9 |
Female | 40.9 |
and in general the smaller a table, the better (as long as it has all the information).
These are just guidelines, there can always be exceptions if there is a good reason.
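One way to make sure a percentage is never reported without its total is to print both together. The helper below is a hypothetical sketch (`pct_with_base` is not part of R or of the course routines), applied to the two versions of the Coke example:

```r
# Hypothetical helper: a percentage together with the counts behind it
pct_with_base <- function(k, n) {
  sprintf("%.1f%% (%d of %d)", 100 * k / n, as.integer(k), as.integer(n))
}

pct_with_base(700, 1000)  # large study
pct_with_base(2, 3)       # tiny study, almost the same percentage
```

Both calls give a figure close to 70%, but the counts in parentheses show at once how much weight each one deserves.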
When doing a calculation the rule (generally) is to round to 1 digit more than the data.
By default R prints up to 7 significant digits:
x
## [1] 4.7 6.6 7.2 7.8 7.8 9.1 10.5 11.4 12.0 13.9 17.2
mean(x)
## [1] 9.836364
and so you need to round this. The data has one digit behind the decimal, so you should have 2:
round(mean(x), 2)
## [1] 9.84
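The difference between digits behind the decimal and significant digits is easy to mix up. As a short sketch using the same data, `round` counts digits behind the decimal point, `signif` counts significant digits, and the number of digits printed by default is stored in the `digits` option:

```r
x <- c(4.7, 6.6, 7.2, 7.8, 7.8, 9.1, 10.5, 11.4, 12.0, 13.9, 17.2)
m <- mean(x)

getOption("digits")  # number of significant digits printed, 7 unless changed
m_round  <- round(m, 2)   # two digits behind the decimal point
m_signif <- signif(m, 2)  # two significant digits in total
c(rounded = m_round, significant = m_signif)
```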
Many of the routines we use do some rounding already, usually to one digit behind the decimal:
stat.table(x)
## Sample Size Mean Standard Deviation
## x 11 9.8 3.6
and you can use the ndigit argument to change how much:
stat.table(x, ndigit=2)
## Sample Size Mean Standard Deviation
## x 11 9.84 3.62
Some examples:
## [1] 68 85 92 95 95 107 114 135 143 156 169
stat.table(x, ndigit=1)
## Sample Size Mean Standard Deviation
## x 11 114.5 32.1
## [1] 5.507 6.638 8.237 8.246 10.457 11.032 11.476 11.557 13.698 13.999
## [11] 14.996
stat.table(x, ndigit=4)
## Sample Size Mean Standard Deviation
## x 11 10.5312 3.0834
## [1] 3100 5500 7000 7200 8500 9000 9100 9500 9500 10400 10700
stat.table(x, ndigit=-1)
## Sample Size Mean Standard Deviation
## x 10 8140 2270
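A negative number of digits rounds to the left of the decimal point, so -1 rounds to the nearest 10. The sketch below reproduces the last example with base R's `round`. Note that rounding the sample size of 11 in the same way gives 10, which would explain the sample size shown in the output above if stat.table applies the same rounding to every column; that is an inference from the output, not something stated in the course material.

```r
x <- c(3100, 5500, 7000, 7200, 8500, 9000, 9100, 9500, 9500, 10400, 10700)

mean_10 <- round(mean(x), -1)    # mean rounded to the nearest 10
sd_10   <- round(sd(x), -1)      # standard deviation rounded to the nearest 10
n_10    <- round(length(x), -1)  # 11 observations, rounded to the nearest 10
c(n = n_10, mean = mean_10, sd = sd_10)
```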
Special Cases:
Percentages are usually rounded to one digit behind the decimal (34.8%).
Probabilities and proportions are usually rounded to 3 digits (0.348).
Cocaine addiction is hard to break. Addicts need cocaine to feel any pleasure, so perhaps giving them an antidepressant drug will help. A 3-year study with 72 chronic cocaine users compared an antidepressant called desipramine with standard treatment for cocaine addiction (lithium) and a placebo. One third of the subjects, chosen at random, received each drug. After 3 years it was determined for each addict whether he/she was drug free or had relapsed.
The data are from D.M. Barnes, “Breaking the Cycle of Addiction”, Science, 241 (1988).
head(drugaddiction)
## Drug Relapse
## 1 Desipramine Yes
## 2 Desipramine Yes
## 3 Desipramine Yes
## 4 Desipramine Yes
## 5 Desipramine Yes
## 6 Desipramine Yes
So here for each subject we have two variables, “Drug” with values “Desipramine”, “Lithium” and “Placebo”, and “Relapse” with values “Yes” and “No”. Both variables are categorical.
Usually the first thing to do with this type of data is to just count each combination of values and write them up in a contingency table:
attach(drugaddiction)
table(Drug, Relapse)
## Relapse
## Drug No Yes
## Desipramine 14 10
## Lithium 6 18
## Placebo 4 20
If the table is for publication you probably want to add some row and column totals:
Drug | No | Yes | Totals |
---|---|---|---|
Desipramine | 14 | 10 | 24 |
Lithium | 6 | 18 | 24 |
Placebo | 4 | 20 | 24 |
Totals | 24 | 48 | 72 |
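The row and column totals do not have to be added by hand. Base R's `addmargins` appends them to a table, labeled `Sum`. In this sketch the data is rebuilt from the counts above so that the example runs on its own; the vectors `drug` and `relapse` are stand-ins for the columns of drugaddiction.

```r
# Rebuild the 72 observations from the counts in the contingency table
drug <- rep(c("Desipramine", "Lithium", "Placebo"), each = 24)
relapse <- c(rep(c("No", "Yes"), times = c(14, 10)),
             rep(c("No", "Yes"), times = c(6, 18)),
             rep(c("No", "Yes"), times = c(4, 20)))

tbl <- table(Drug = drug, Relapse = relapse)

# Add a "Sum" row and a "Sum" column
tbl_tot <- addmargins(tbl)
tbl_tot
```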
Often instead of the totals (frequencies) these tables might be based on percentages. Here, though, there are three types of percentages:
Percentages based on Grand Total:
tbl <- table(Drug, Relapse)
round(tbl/sum(tbl)*100, 1)
## Relapse
## Drug No Yes
## Desipramine 19.4 13.9
## Lithium 8.3 25.0
## Placebo 5.6 27.8
Drug | No | Yes | Totals |
---|---|---|---|
Desipramine | 19.4 | 13.9 | 33.3 |
Lithium | 8.3 | 25.0 | 33.3 |
Placebo | 5.6 | 27.8 | 33.3 |
Totals | 33.3 | 66.7 | 100.0 |
Percentages based on Row Totals:
round(tbl/c(24, 24, 24)*100, 1)
## Relapse
## Drug No Yes
## Desipramine 58.3 41.7
## Lithium 25.0 75.0
## Placebo 16.7 83.3
Drug | No | Yes | Totals |
---|---|---|---|
Desipramine | 58.3 | 41.7 | 100 |
Lithium | 25.0 | 75.0 | 100 |
Placebo | 16.7 | 83.3 | 100 |
Totals | 33.3 | 66.7 | 100 |
Percentages based on Column Totals:
round(tbl[, 1]/24*100, 1)
## Desipramine Lithium Placebo
## 58.3 25.0 16.7
round(tbl[, 2]/48*100, 1)
## Desipramine Lithium Placebo
## 20.8 37.5 41.7
Drug | No | Yes | Totals |
---|---|---|---|
Desipramine | 58.3 | 20.8 | 33.3 |
Lithium | 25.0 | 37.5 | 33.3 |
Placebo | 16.7 | 41.7 | 33.3 |
Totals | 100.0 | 100.0 | 100.0 |
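All three types of percentages can also be obtained from base R's `prop.table`, which avoids typing the totals by hand. Its `margin` argument selects the total to divide by: none for the grand total, 1 for row totals, 2 for column totals. A minimal sketch, with the table entered directly from the counts above:

```r
tbl <- as.table(matrix(c(14, 6, 4, 10, 18, 20), nrow = 3,
                       dimnames = list(Drug = c("Desipramine", "Lithium", "Placebo"),
                                       Relapse = c("No", "Yes"))))

grand_pct <- round(prop.table(tbl) * 100, 1)              # grand total
row_pct   <- round(prop.table(tbl, margin = 1) * 100, 1)  # row totals
col_pct   <- round(prop.table(tbl, margin = 2) * 100, 1)  # column totals

row_pct
```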
Which of these 4 tables is the most interesting? It depends on the story behind the data and the result you wish to highlight. Here it is probably the third table (percentages based on row totals), which shows clearly that the “relapse rate” for desipramine is much lower (41.7%) than for either lithium (75%) or the placebo (83.3%).
attach(rogaine)
rog.tbl <- table(Growth, Group)
rog.tbl
## Group
## Growth Control Treatment
## No Growth 423 301
## New Vellus 150 172
## Min Growth 114 178
## Mod Growth 29 58
## Den Growth 1 5
round(rog.tbl/sum(rog.tbl)*100, 1)
## Group
## Growth Control Treatment
## No Growth 29.6 21.0
## New Vellus 10.5 12.0
## Min Growth 8.0 12.4
## Mod Growth 2.0 4.1
## Den Growth 0.1 0.3
clsum <- c(423+301, 150+172, 114+178, 29+58, 1+5)
clsum
## [1] 724 322 292 87 6
round(rog.tbl/clsum*100, 1)
## Group
## Growth Control Treatment
## No Growth 58.4 41.6
## New Vellus 46.6 53.4
## Min Growth 39.0 61.0
## Mod Growth 33.3 66.7
## Den Growth 16.7 83.3
There is an easier way to find the row and column totals:
apply(rog.tbl, 1, sum)
## No Growth New Vellus Min Growth Mod Growth Den Growth
## 724 322 292 87 6
apply(rog.tbl, 2, sum)
## Control Treatment
## 717 714
round(rog.tbl[, 1]/717*100, 1)
## No Growth New Vellus Min Growth Mod Growth Den Growth
## 59.0 20.9 15.9 4.0 0.1
round(rog.tbl[, 2]/714*100, 1)
## No Growth New Vellus Min Growth Mod Growth Den Growth
## 42.2 24.1 24.9 8.1 0.7
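The totals can also be found with `rowSums` and `colSums`, and `sweep` divides every column by its own total in a single step, so the two columns do not have to be handled separately. A sketch in base R, with the table typed in from the counts above:

```r
rog.tbl <- as.table(matrix(c(423, 150, 114, 29, 1,
                             301, 172, 178, 58, 5), nrow = 5,
                           dimnames = list(
                             Growth = c("No Growth", "New Vellus", "Min Growth",
                                        "Mod Growth", "Den Growth"),
                             Group = c("Control", "Treatment"))))

rowSums(rog.tbl)  # same totals as apply(rog.tbl, 1, sum)
colSums(rog.tbl)  # same totals as apply(rog.tbl, 2, sum)

# Divide each column (dimension 2) by its column total
col_pct <- round(sweep(rog.tbl, 2, colSums(rog.tbl), "/") * 100, 1)
col_pct
```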
The standard graph for this data is a multiple bar chart, drawn with the same command as before.
There are always two versions, depending on which variable is used to group the bars:
barchart(Drug, Relapse)
barchart(Relapse, Drug)
Again we can base everything on percentages:
barchart(Drug, Relapse, percentage = "row")
barchart(Relapse, Drug, percentage = "col")
barchart(Growth, Group)
If we want to change the order of the bars we can again use change.order:
barchart(change.order(Growth, c(3, 5, 2, 1, 4)),
change.order(Group, 1:2),
percentage = "row")
not that this makes much sense here!
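For comparison, a multiple bar chart based on row percentages can be sketched with base R's `barplot` and `beside = TRUE`. This is an alternative to the course's barchart routine, shown here with the drug data entered from its counts; the name `row_pct` is chosen for this example only.

```r
tbl <- as.table(matrix(c(14, 6, 4, 10, 18, 20), nrow = 3,
                       dimnames = list(Drug = c("Desipramine", "Lithium", "Placebo"),
                                       Relapse = c("No", "Yes"))))

# Percentages within each drug group
row_pct <- prop.table(tbl, margin = 1) * 100

# barplot() draws one group of bars per column, so transpose the table
# to get one group per drug with a No bar and a Yes bar in each
barplot(t(row_pct), beside = TRUE, legend.text = TRUE,
        ylab = "Percentage within drug group")
```

Leaving out `t()` gives the other grouping, with one group of bars for No and one for Yes.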
Data is from O’Carroll PW, Alkon E, Weiss B. Drowning mortality in Los Angeles County, 1976 to 1984, JAMA, 1988 Jul 15;260(3):380-3.
Drowning is the fourth leading cause of unintentional injury death in Los Angeles County. We examined data collected by the Los Angeles County Coroner’s Office on drownings that occurred in the county from 1976 through 1984. There were 1587 drownings (1130 males and 457 females) during this nine-year period.
kable.nice(drownings)
Location | Male | Female |
---|---|---|
Private Swimming Pool | 488 | 219 |
Bathtub | 115 | 132 |
Ocean | 231 | 40 |
Freshwater bodies | 155 | 19 |
Hottubs | 16 | 15 |
Reservoirs | 32 | 2 |
Other Pools | 46 | 14 |
Pails, basins, toilets | 7 | 4 |
Other | 40 | 12 |
Let’s draw a bar chart that compares where men and women drowned. Because there are many more men than women, this has to be done with percentages:
round(drownings[, "Male"]/sum(drownings[, "Male"])*100, 1)
## Private Swimming Pool Bathtub Ocean
## 43.2 10.2 20.4
## Freshwater bodies Hottubs Reservoirs
## 13.7 1.4 2.8
## Other Pools Pails, basins, toilets Other
## 4.1 0.6 3.5
round(drownings[, "Female"]/sum(drownings[, "Female"])*100, 1)
## Private Swimming Pool Bathtub Ocean
## 47.9 28.9 8.8
## Freshwater bodies Hottubs Reservoirs
## 4.2 3.3 0.4
## Other Pools Pails, basins, toilets Other
## 3.1 0.9 2.6
or with a barchart:
barchart(drownings, percentage = "col")
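Both columns of percentages can be computed at once and put side by side, which makes the differences easier to read. The sketch below assumes that drownings is a numeric matrix with the locations as row names (consistent with the indexing used above, but an assumption), and re-enters the counts so that it runs on its own. The names `col_pct` and `gap` are chosen for this illustration.

```r
drownings <- matrix(c(488, 115, 231, 155, 16, 32, 46, 7, 40,
                      219, 132, 40, 19, 15, 2, 14, 4, 12), ncol = 2,
                    dimnames = list(
                      c("Private Swimming Pool", "Bathtub", "Ocean",
                        "Freshwater bodies", "Hottubs", "Reservoirs",
                        "Other Pools", "Pails, basins, toilets", "Other"),
                      c("Male", "Female")))

# Percentages within each gender (column totals)
col_pct <- round(prop.table(drownings, margin = 2) * 100, 1)

# Difference in percentage points, largest gaps listed first
gap <- col_pct[, "Female"] - col_pct[, "Male"]
col_sorted <- col_pct[order(abs(gap), decreasing = TRUE), ]
col_sorted
```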