Chapter 4: Descriptive Statistics

Categorical Data

Case Study: wrinccensus

Consider the variable Gender. Clearly this is categorical data. Usually the first thing one would do is simply count how many of each type there are:

attach(wrinccensus) 
table(Gender)
## Gender
## Female   Male 
##   9510  14281

Case Study: Race and Education

According to a table from the US Department of Education there were 19,980,000 students in US colleges in the fall of 2010. Their breakdown by race was as follows:

## 
## American Indian           Asian           Black        Hispanic 
##          196000         1282000         3039000         2741000 
##           White 
##        12722000

If a table is used for presentation purposes it should usually include a little more information and maybe a better ordering, for example by size. Also, big numbers are often expressed in bigger units:

                  Number (in 1000)   Percentage
White                        12722         63.7
Black                         3039         15.2
Hispanic                      2741         13.7
Asian                         1282          6.4
American Indian                196          1.0

In order to compute the percentages we need to divide by the total and multiply by 100. The total is found using the sum command:

x <- c(12722, 3039, 2741, 1282, 196)
round(x/sum(x)*100,1)
## [1] 63.7 15.2 13.7  6.4  1.0

Percentages are usually rounded to one digit after the decimal point, as above.
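As a rough sketch, the presentation table above could be assembled in base R as follows. The names `counts` and `pres` are made up for this illustration, and this is not necessarily how the table in these notes was produced:

```r
# Counts in thousands, taken from the table above
counts <- c("American Indian" = 196, Asian = 1282, Black = 3039,
            Hispanic = 2741, White = 12722)

# Order the categories from largest to smallest
counts <- sort(counts, decreasing = TRUE)

# Combine counts and percentages into one table;
# the category names become the row names
pres <- data.frame(Number = counts,
                   Percentage = round(counts / sum(counts) * 100, 1))
pres
```

Sorting before computing the percentages keeps the two columns aligned with the same category names.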

As we said before, some categorical variables have a built-in (natural) ordering, for example t-shirt size (small, medium, large, x-large) or grades (A, B, …). When such an ordering exists, the table can follow it instead of being ordered by size.
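One way to make R respect a natural ordering is to store the variable as a factor with explicitly ordered levels. The following is a minimal sketch with made-up t-shirt data (the vector `size` is hypothetical, not part of any data set used in these notes):

```r
# Hypothetical t-shirt sizes, recorded in no particular order
size <- c("large", "small", "medium", "x-large", "small", "medium", "medium")

# Without explicit levels, table() sorts the categories alphabetically
table(size)

# Declaring the levels in their natural order fixes the order of the table
size <- factor(size, levels = c("small", "medium", "large", "x-large"))
size.tbl <- table(size)
size.tbl
```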

Case Study: Satisfaction in WRInc

attach(wrinccensus)
tbl <- table(Satisfaction)
tbl
## Satisfaction
##    1    2    3    4    5 
## 3096 2783 3854 7683 6375
perc <- round(tbl/sum(tbl)*100, 1)
perc
## Satisfaction
##    1    2    3    4    5 
## 13.0 11.7 16.2 32.3 26.8

Graphs for Categorical Data

A very popular Choice: Pie Charts

but see:

Death to Pie Charts (http://www.storytellingwithdata.com/blog/2011/07/death-to-pie-charts)

Much better: Bar charts

Case Study: Race and Education

barchart(race.table)

Case Study: WRInc

barchart(Gender)

Sometimes we want to change the ordering of the bars. We can do this with the change.order routine:

barchart(change.order(Gender, new.order=2:1))

Note: to show the graph based on percentages use the argument percentage = "grand":

barchart(Gender, percentage = "grand")

Case Study: Race and Education

Let’s do the graph based on percentages. Also, let’s change the order of the bars so they run from largest to smallest. Here we use the table as the argument to barchart, so we need to change the order in the table:

race.table[c(5, 3, 4, 2, 1)]
## 
##           White           Black        Hispanic           Asian 
##        12722000         3039000         2741000         1282000 
## American Indian 
##          196000

so this puts things in the right order.

barchart(race.table[c(5, 3, 4, 2, 1)],
         percentage = "grand")
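For readers who do not have the barchart routine used in these notes, a similar graph can be sketched with base R's barplot function. This is an alternative, not the routine above; the vector names `race.counts` and `race.perc` are chosen for this illustration only:

```r
# Counts from the race.table output above
race.counts <- c("American Indian" = 196000, Asian = 1282000, Black = 3039000,
                 Hispanic = 2741000, White = 12722000)

# Percentages of the grand total, ordered from largest to smallest
race.perc <- sort(race.counts / sum(race.counts) * 100, decreasing = TRUE)

# Draw the bar chart; las = 2 turns the axis labels so they are less likely to overlap
barplot(race.perc, ylab = "Percentage", las = 2)
```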

Case Study: WRInc

attach(wrinccensus)
barchart(Job.Level)

barchart(change.order(Job.Level, 
                      c("7", "6", "5", "4", "3", "2", "1")), 
         percentage = "grand")

Example: This is a nice professional table from the website of the CDC (Centers for Disease Control) about the dangers of smoking (table not reproduced here).

Totals (Frequencies) vs. Percentages

Decide based on the background of the data which number is more relevant/important/interesting.



Some of the things to consider are:

If the data is a random sample from a larger population, percentages are often better:

Example: Of 150 randomly selected people in a phone survey, 85 said they would vote for candidate A in the next election → use 57% instead.

Example: In a company with 150 employees, 85 said they like their job → use these numbers.

For small numbers use frequencies; for large numbers use percentages.

When using percentages it has to be clear what the totals were:

Example: An advertisement in the newspaper reads: “Almost 70% of the participants in a scientific study said they prefer Coke over Pepsi”.

Now if this study had 1000 participants and about 700 of those said they like Coke better than Pepsi, that is quite impressive. On the other hand, if it had 3 participants, two of whom liked Coke (2 out of 3 = 67%, “almost” 70%), then this may not be so interesting!

When comparing groups of unequal sizes, percentages are almost always necessary:

Example: In a survey, the employees of a company were asked whether they liked their current position:

          Yes    No
Male      123    85
Female     88    61

At first glance it seems that men are happier with their position than women (123 vs. 88), but notice that there are more men than women in total (208 vs. 149), so even if they are equally happy we would expect more men than women to say yes. Changing to percentages gives

          Happy
Male       59.1
Female     59.1

Notice another advantage of the table with percentages: because there are only the two options yes and no, we need only the percentage of one; the other is simply 100 minus that percentage:

          Unhappy
Male         40.9
Female       40.9

and in general the smaller a table, the better (as long as it has all the information).
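The comparison above can be sketched in R as follows. The counts are entered by hand, assuming the totals quoted in the text (123 of 208 men and 88 of 149 women answered yes); the names `job` and `job.perc` are made up for this illustration:

```r
# Counts as described in the text: 123 of 208 men and 88 of 149 women said yes
job <- matrix(c(123, 85,
                 88, 61),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Male", "Female"),
                              Happy = c("Yes", "No")))

# Row percentages: each row is divided by its own total
job.perc <- round(prop.table(job, margin = 1) * 100, 1)
job.perc
```

Dividing each row by its own total is what makes groups of unequal size comparable.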

These are just guidelines; there can always be exceptions if there is a good reason.

Rounding

When doing a calculation the rule (generally) is to round to 1 digit more than the data.

By default R prints results with 7 significant digits:

x
##  [1]  4.7  6.6  7.2  7.8  7.8  9.1 10.5 11.4 12.0 13.9 17.2
mean(x)
## [1] 9.836364

and so you need to round this. The data has one digit after the decimal point, so the mean should have two:

round(mean(x), 2)
## [1] 9.84
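It may help to separate rounding a value from merely printing fewer digits. The sketch below uses only base R; the variable name `m` is chosen for this illustration:

```r
x <- c(4.7, 6.6, 7.2, 7.8, 7.8, 9.1, 10.5, 11.4, 12.0, 13.9, 17.2)

# How many significant digits R prints by default in this session
getOption("digits")

# round() changes the stored value itself
m <- round(mean(x), 2)
m

# print(..., digits = ) only changes how many significant digits are shown;
# the value of mean(x) is not altered
print(mean(x), digits = 3)
```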

Many of the routines we use do some rounding already, usually to one digit behind the decimal:

stat.table(x)
##   Sample Size Mean Standard Deviation
## x          11  9.8                3.6

and you can use the ndigit argument to change how much:

stat.table(x, ndigit=2)
##   Sample Size Mean Standard Deviation
## x          11 9.84               3.62

Some examples:

x
##  [1]  68  85  92  95  95 107 114 135 143 156 169
stat.table(x, ndigit=1)
##   Sample Size  Mean Standard Deviation
## x          11 114.5               32.1

x
##  [1]  5.507  6.638  8.237  8.246 10.457 11.032 11.476 11.557 13.698 13.999
## [11] 14.996
stat.table(x, ndigit=4)
##   Sample Size    Mean Standard Deviation
## x          11 10.5312             3.0834

x
##  [1]  3100  5500  7000  7200  8500  9000  9100  9500  9500 10400 10700
stat.table(x, ndigit=-1)
##   Sample Size Mean Standard Deviation
## x          10 8140               2270

Note that in the last example the sample size is printed as 10 although x has 11 values; presumably the count was rounded to the nearest ten along with the other columns.

Special Cases:

  • Percentages are usually rounded to one digit after the decimal point (34.8%)

  • Probabilities and proportions are usually rounded to 3 digits (0.348)
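The rounding conventions above can be tried out with base R's round function, whose second argument may be negative. This is a small sketch using the last data set from the examples and the 85-out-of-150 survey figure from earlier in the chapter; the variable names are made up:

```r
x <- c(3100, 5500, 7000, 7200, 8500, 9000, 9100, 9500, 9500, 10400, 10700)

# A negative digits argument rounds to the left of the decimal point:
# digits = -1 rounds to the nearest 10, digits = -2 to the nearest 100
m10 <- round(mean(x), -1)
m100 <- round(mean(x), -2)
c(m10, m100)

# The same quantity reported as a proportion (3 digits)
# and as a percentage (1 digit)
p <- 85 / 150
p3 <- round(p, 3)
pct1 <- round(p * 100, 1)
c(p3, pct1)
```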

Contingency Tables

Case Study: Treatment of Drug Addiction

Cocaine addiction is hard to break. Addicts need cocaine to feel any pleasure, so perhaps giving them an antidepressant drug will help. A 3-year study with 72 chronic cocaine users compared an antidepressant called desipramine with standard treatment for cocaine addiction (lithium) and a placebo. One third of the subjects, chosen at random, received each drug. After 3 years it was determined for each addict whether he or she was drug-free or had relapsed.

The data are from D.M. Barnes, “Breaking the Cycle of Addiction”, Science, 241 (1988).

head(drugaddiction)
##          Drug Relapse
## 1 Desipramine     Yes
## 2 Desipramine     Yes
## 3 Desipramine     Yes
## 4 Desipramine     Yes
## 5 Desipramine     Yes
## 6 Desipramine     Yes

So here for each subject we have two variables, “Drug” with values “Desipramine”, “Lithium” and “Placebo”, and “Relapse” with values “Yes” and “No”. Both variables are categorical.

Usually the first thing to do with this type of data is to just count each combination of values and write them up in a contingency table:

attach(drugaddiction)
table(Drug, Relapse)  
##              Relapse
## Drug          No Yes
##   Desipramine 14  10
##   Lithium      6  18
##   Placebo      4  20

If the table is for publication you probably want to add some row and column totals:

              No   Yes   Totals
Desipramine   14    10       24
Lithium        6    18       24
Placebo        4    20       24
Totals        24    48       72
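Base R can add such totals with the addmargins function. The sketch below rebuilds the table by hand from the counts shown above so that it runs without the drugaddiction data set; the name `tbl.tot` is made up for this illustration:

```r
# Counts from the contingency table above
tbl <- as.table(matrix(c(14, 10,
                          6, 18,
                          4, 20),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(Drug = c("Desipramine", "Lithium", "Placebo"),
                                       Relapse = c("No", "Yes"))))

# addmargins() appends row and column sums, labelled "Sum"
tbl.tot <- addmargins(tbl)
tbl.tot
```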

Often instead of the totals (frequencies) these tables might be based on percentages. Here, though, there are three types of percentages:

Percentages based on Grand Total:

tbl <- table(Drug, Relapse)
round(tbl/sum(tbl)*100, 1)
##              Relapse
## Drug            No  Yes
##   Desipramine 19.4 13.9
##   Lithium      8.3 25.0
##   Placebo      5.6 27.8

                No    Yes   Totals
Desipramine   19.4   13.9     33.3
Lithium        8.3   25.0     33.3
Placebo        5.6   27.8     33.3
Totals        33.3   66.7    100.0

Percentages based on Row Totals:

round(tbl/c(24, 24, 24)*100, 1)
##              Relapse
## Drug            No  Yes
##   Desipramine 58.3 41.7
##   Lithium     25.0 75.0
##   Placebo     16.7 83.3

                No    Yes   Totals
Desipramine   58.3   41.7      100
Lithium       25.0   75.0      100
Placebo       16.7   83.3      100
Totals        33.3   66.7      100

Percentages based on Column Totals:

round(tbl[, 1]/24*100, 1)
## Desipramine     Lithium     Placebo 
##        58.3        25.0        16.7
round(tbl[, 2]/48*100, 1)
## Desipramine     Lithium     Placebo 
##        20.8        37.5        41.7

                 No     Yes   Totals
Desipramine    58.3    20.8     33.3
Lithium        25.0    37.5     33.3
Placebo        16.7    41.7     33.3
Totals        100.0   100.0    100.0
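All three kinds of percentages can also be obtained with base R's prop.table function, whose margin argument selects the total to divide by. This is a sketch of an alternative to the hand calculations above, with the table rebuilt from the counts so that it runs on its own:

```r
# Counts from the contingency table above
tbl <- as.table(matrix(c(14, 10,
                          6, 18,
                          4, 20),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(Drug = c("Desipramine", "Lithium", "Placebo"),
                                       Relapse = c("No", "Yes"))))

# No margin: divide every cell by the grand total
grand.perc <- round(prop.table(tbl) * 100, 1)

# margin = 1: divide every cell by its row total
row.perc <- round(prop.table(tbl, margin = 1) * 100, 1)

# margin = 2: divide every cell by its column total
col.perc <- round(prop.table(tbl, margin = 2) * 100, 1)

grand.perc
row.perc
col.perc
```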

Which of these 4 tables is the most interesting? It depends on the story behind the data and the result you wish to highlight. Here it is probably the third table (percentages based on row totals), which shows clearly that the “relapse rate” for Desipramine is much smaller (41.7%) than for either Lithium (75%) or the Placebo (83.3%).

Case Study: Treatment for Hair Loss, Rogaine

attach(rogaine)
rog.tbl <- table(Growth, Group)
rog.tbl
##             Group
## Growth       Control Treatment
##   No Growth      423       301
##   New Vellus     150       172
##   Min Growth     114       178
##   Mod Growth      29        58
##   Den Growth       1         5

  • Percentages based on grand total:

round(rog.tbl/sum(rog.tbl)*100, 1)
##             Group
## Growth       Control Treatment
##   No Growth     29.6      21.0
##   New Vellus    10.5      12.0
##   Min Growth     8.0      12.4
##   Mod Growth     2.0       4.1
##   Den Growth     0.1       0.3

  • Percentages based on row totals:

clsum <- c(423+301, 150+172, 114+178, 29+58, 1+5)
clsum
## [1] 724 322 292  87   6
round(rog.tbl/clsum*100, 1)
##             Group
## Growth       Control Treatment
##   No Growth     58.4      41.6
##   New Vellus    46.6      53.4
##   Min Growth    39.0      61.0
##   Mod Growth    33.3      66.7
##   Den Growth    16.7      83.3

There is an easier way to find the row totals:

apply(rog.tbl, 1, sum)
##  No Growth New Vellus Min Growth Mod Growth Den Growth 
##        724        322        292         87          6

  • Percentages based on column totals:

apply(rog.tbl, 2, sum)
##   Control Treatment 
##       717       714
round(rog.tbl[, 1]/717*100, 1)
##  No Growth New Vellus Min Growth Mod Growth Den Growth 
##       59.0       20.9       15.9        4.0        0.1
round(rog.tbl[, 2]/714*100, 1)
##  No Growth New Vellus Min Growth Mod Growth Den Growth 
##       42.2       24.1       24.9        8.1        0.7
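Dividing the table by the vector of row totals works because R stores a table column by column and recycles the vector down each column, so every cell meets the total of its own row. The same shortcut does not work for column totals, which is why each column was handled separately above. One base R alternative is the sweep function, sketched here with the table rebuilt from the counts shown earlier (the names `row.tot`, `col.tot` and `rog.col` are made up):

```r
# Counts from the rog.tbl output above
rog.tbl <- as.table(matrix(c(423, 301,
                             150, 172,
                             114, 178,
                              29,  58,
                               1,   5),
                           nrow = 5, byrow = TRUE,
                           dimnames = list(Growth = c("No Growth", "New Vellus",
                                                      "Min Growth", "Mod Growth",
                                                      "Den Growth"),
                                           Group = c("Control", "Treatment"))))

# Row and column totals
row.tot <- rowSums(rog.tbl)
col.tot <- colSums(rog.tbl)

# sweep() divides each column (margin 2) by its own total
rog.col <- round(sweep(rog.tbl, 2, col.tot, "/") * 100, 1)
rog.col
```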

Graphs for Contingency Tables

The standard graph for this data is a multiple bar chart. It is done with the same command as before.

There are always two versions, depending on which way the bars are grouped together:

barchart(Drug, Relapse)

barchart(Relapse, Drug)

Again we can base everything on percentages:

barchart(Drug, Relapse, percentage = "row")

barchart(Relapse, Drug, percentage = "col")
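A comparable multiple bar chart can be sketched with base R's barplot function by setting beside = TRUE. This is an alternative to the barchart routine used in these notes, not a description of how that routine works; the table is rebuilt by hand and the name `bar.pos` is made up:

```r
# Counts from the drug addiction contingency table
tbl <- as.table(matrix(c(14, 10,
                          6, 18,
                          4, 20),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(Drug = c("Desipramine", "Lithium", "Placebo"),
                                       Relapse = c("No", "Yes"))))

# Row percentages: relapse rates within each drug
row.perc <- prop.table(tbl, margin = 1) * 100

# barplot() draws one group of bars per column of its matrix argument,
# so the table is transposed to get one group per drug
bar.pos <- barplot(t(row.perc), beside = TRUE, legend.text = TRUE,
                   ylab = "Percentage")
```

Leaving out the transpose groups the bars by relapse status instead, which gives the second version of the graph.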

Case Study: Rogaine

barchart(Growth, Group)

If we want to change the order of the bars we can again use change.order:

barchart(change.order(Growth, c(3, 5, 2, 1, 4)), 
        change.order(Group, 1:2),
        percentage = "row")

Not that this particular reordering makes much sense here!

Case Study: Drownings in Los Angeles

Data is from O’Carroll PW, Alkon E, Weiss B. Drowning mortality in Los Angeles County, 1976 to 1984, JAMA, 1988 Jul 15;260(3):380-3.

Drowning is the fourth leading cause of unintentional injury death in Los Angeles County. We examined data collected by the Los Angeles County Coroner’s Office on drownings that occurred in the county from 1976 through 1984. There were 1587 drownings (1130 males and 457 females) during this nine-year period.

kable.nice(drownings)

                          Male   Female
Private Swimming Pool      488      219
Bathtub                    115      132
Ocean                      231       40
Freshwater bodies          155       19
Hottubs                     16       15
Reservoirs                  32        2
Other Pools                 46       14
Pails, basins, toilets       7        4
Other                       40       12

Let’s do a barchart that shows how the places of drowning differ between the genders. Because there are many more men than women, this has to be done with percentages:

round(drownings[, "Male"]/sum(drownings[, "Male"])*100, 1)
##  Private Swimming Pool                Bathtub                  Ocean 
##                   43.2                   10.2                   20.4 
##      Freshwater bodies                Hottubs             Reservoirs 
##                   13.7                    1.4                    2.8 
##            Other Pools Pails, basins, toilets                  Other 
##                    4.1                    0.6                    3.5
round(drownings[, "Female"]/sum(drownings[, "Female"])*100, 1)
##  Private Swimming Pool                Bathtub                  Ocean 
##                   47.9                   28.9                    8.8 
##      Freshwater bodies                Hottubs             Reservoirs 
##                    4.2                    3.3                    0.4 
##            Other Pools Pails, basins, toilets                  Other 
##                    3.1                    0.9                    2.6

or with a barchart:

barchart(drownings, percentage = "col")
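The two column-percentage calculations above can also be done in one step with apply. The sketch below assumes that drownings is a numeric matrix with the row and column names shown in the table; it is rebuilt by hand here so the code runs on its own, and the name `drown.perc` is made up:

```r
# Counts from the drownings table above
drownings <- matrix(c(488, 219,
                      115, 132,
                      231,  40,
                      155,  19,
                       16,  15,
                       32,   2,
                       46,  14,
                        7,   4,
                       40,  12),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(c("Private Swimming Pool", "Bathtub", "Ocean",
                                      "Freshwater bodies", "Hottubs", "Reservoirs",
                                      "Other Pools", "Pails, basins, toilets", "Other"),
                                    c("Male", "Female")))

# Apply the same percentage calculation to each column (margin 2)
drown.perc <- apply(drownings, 2,
                    function(counts) round(counts / sum(counts) * 100, 1))
drown.perc
```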