Percentiles and Boxplots

Percentiles (Measures of Location)

Case Study: Population Sizes of States and Puerto Rico

According to the 2010 US Census the population of Puerto Rico was 3725789.

us.population.2010

##               Alabama                Alaska               Arizona 
##               4779736                710231               6392017 
##              Arkansas            California              Colorado 
##               2915918              37253956               5029196 
##           Connecticut              Delaware  District of Columbia 
##               3574097                897934                601723 
##               Florida               Georgia                Hawaii 
##              18801310               9687653               1360301 
##                 Idaho              Illinois               Indiana 
##               1567582              12830632               6483802 
##                  Iowa                Kansas              Kentucky 
##               3046355               2853118               4339367 
##             Louisiana                 Maine              Maryland 
##               4533372               1328361               5773552 
##         Massachusetts              Michigan             Minnesota 
##               6547629               9883640               5303925 
##           Mississippi              Missouri               Montana 
##               2967297               5988927                989415 
##              Nebraska                Nevada         New Hampshire 
##               1826341               2700551               1316470 
##            New Jersey            New Mexico              New York 
##               8791894               2059179              19378102 
##        North Carolina          North Dakota                  Ohio 
##               9535483                672591              11536504 
##              Oklahoma                Oregon          Pennsylvania 
##               3751351               3831074              12702379 
##          Rhode Island        South Carolina          South Dakota 
##               1052567               4625364                814180 
##             Tennessee                 Texas                  Utah 
##               6346105              25145561               2763885 
##               Vermont              Virginia            Washington 
##                625741               8001024               6724540 
##         West Virginia             Wisconsin               Wyoming 
##               1852994               5686986                563626

Previously we found the mean population size for the states to be 6053834, so that PR’s population was lower than average. Here is a different way to compare PR to the states: If we order them from smallest to largest and add PR we find:

563626 601723 625741 672591 710231 814180 897934 989415 1052567 1316470 1328361 1360301 1567582 1826341 1852994 2059179 2700551 2763885 2853118 2915918 2967297 3046355 3574097 3725789 3751351 3831074 4339367 4533372 4625364 4779736 5029196 5303925 5686986 5773552 5988927 6346105 6392017 6483802 6547629 6724540 8001024 8791894 9535483 9687653 9883640 11536504 12702379 12830632 18801310 19378102 25145561 37253956

so PR’s population is the 24th smallest. Of the 52 numbers, 23 are smaller than PR’s, and 23 out of 52 is 23/52*100% = 44.2%. We say that

PR is at the 44.2nd percentile.
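The count can be reproduced in R. The following is a minimal sketch: the vector name `pop` is chosen here for illustration, and its values are typed in from the sorted list above.

```r
# Populations of the 50 states, DC and Puerto Rico (2010 Census),
# copied from the sorted list above
pop <- c(563626, 601723, 625741, 672591, 710231, 814180, 897934, 989415,
         1052567, 1316470, 1328361, 1360301, 1567582, 1826341, 1852994,
         2059179, 2700551, 2763885, 2853118, 2915918, 2967297, 3046355,
         3574097, 3725789, 3751351, 3831074, 4339367, 4533372, 4625364,
         4779736, 5029196, 5303925, 5686986, 5773552, 5988927, 6346105,
         6392017, 6483802, 6547629, 6724540, 8001024, 8791894, 9535483,
         9687653, 9883640, 11536504, 12702379, 12830632, 18801310, 19378102,
         25145561, 37253956)
pr <- 3725789                              # Puerto Rico

n.below <- sum(pop < pr)                   # number of values smaller than PR's
pct.below <- 100 * n.below / length(pop)   # as a percentage of all 52 values
n.below                                    # 23
round(pct.below, 1)                        # 44.2
```

The comparison `pop < pr` yields TRUE or FALSE for each value, and `sum` counts the TRUEs, which is the same idea as `sum(Income<22800)` in the WRInc example.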

Definition:

The $p^{th}$ percentile of a data set is the value that has at most $p\%$ of the data below it and at most $(100-p)\%$ above it.

Example: Consider the first employee in our WRInc data set. She has an income of $22800.

attach(wrinccensus)
sum(Income<22800)

## [1] 2880

shows that there are 2880 employees with a lower income. 2880 out of 23791 means she is at the $2880/23791 \times 100 = 12.1^{st}$ percentile.

So $12.1\%$ have an income lower than hers, and the remaining $(100-12.1)\%=87.9\%$ have an income equal to or higher than hers.

The R command to find a percentile is quantile.

Case Study: Babe Ruth’s Homeruns

Find the $67^{th}$ percentile of the data:

attach(babe)
quantile(Homeruns, 0.67)

##   67% 
## 47.76
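The answer 47.76 is not a number of homeruns anyone can hit in a season. It arises because `quantile` takes a proportion between 0 and 1 and, by default, interpolates between neighboring ordered observations. The sketch below redoes the calculation by hand. It assumes that the `babe` data set holds Ruth's 15 seasonal homerun totals with the Yankees (1920 to 1934); the values are typed in here by hand, and they do reproduce the outputs shown in this section.

```r
# Assumed content of babe$Homeruns: Ruth's seasons 1920-1934 (typed in by hand)
homeruns <- c(54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22)

x <- sort(homeruns)
n <- length(x)
p <- 0.67

h <- (n - 1) * p + 1        # fractional position in the ordered data, 10.38
lower <- x[floor(h)]        # 10th smallest value
upper <- x[ceiling(h)]      # 11th smallest value
by.hand <- lower + (h - floor(h)) * (upper - lower)

by.hand                     # interpolated value
quantile(homeruns, p)       # R's default method (type = 7) agrees
```

Other interpolation rules are available through the `type` argument of `quantile`, so different software can report slightly different percentiles for the same data.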

Case Study: WRInc

Find the $10^{th}$ and the $90^{th}$ percentile of the WRInc incomes:

quantile(Income, c(0.1, 0.9))

##   10%   90% 
## 22000 45900

Quartiles, Five-Number Summary and IQR

The quartiles of a data set are defined as

$1^{st}$ quartile Q₁ = $25^{th}$ percentile

$3^{rd}$ quartile Q₃ = $75^{th}$ percentile

Using these we can also find the Interquartile Range

IQR = Q₃ - Q₁

and the five number summary:

Minimum | Q₁ | Median | Q₃ | Maximum

Case Study: Babe Ruth’s Homeruns

fivenumber(Homeruns)

##  Minimum Q1 Median   Q3 Maximum
##       22 38     46 51.5      60
## IQR =  13.5
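`fivenumber` does not appear to be part of base R (it is presumably supplied with the course materials). As a hedged alternative, the same numbers can be obtained with the base R functions `quantile` and `IQR`. The homerun values below are an assumption, typed in by hand as Ruth's 15 seasonal totals for 1920 to 1934.

```r
# Assumed content of babe$Homeruns (typed in by hand)
homeruns <- c(54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22)

# Minimum, Q1, median, Q3 and maximum are the 0%, 25%, 50%, 75% and 100% points
five <- quantile(homeruns, c(0, 0.25, 0.5, 0.75, 1))
names(five) <- c("Minimum", "Q1", "Median", "Q3", "Maximum")
five

IQR(homeruns)   # Q3 - Q1
```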

Case Study: WRInc

Find the five-number summary and the IQR of the incomes of WRInc:

fivenumber(Income)

##  Minimum    Q1 Median    Q3 Maximum
##     8000 26600  32400 39200   88400
## IQR =  12600

What is the meaning of these percentiles?

Q₁ = P₂₅ = $26600, so 25% (or 1 in 4) of the employees make less than $26600.
Median = $32400, so half of the employees make less than $32400, half make more.
Q₃ = P₇₅ = $39200, so 25% (or 1 in 4) of the employees make more than $39200.

What is the meaning of the IQR? It is a third way to calculate a measure of variation, after the range and the standard deviation.

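Example: The standard deviation of the incomes is s = 9424, and the IQR is 12600. Both describe the spread of the incomes, but they are computed differently, so the two numbers need not agree.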

Case Study: WRInc

Find the five-number summary and the IQR of the Years of WRInc:

fivenumber(Years)

##  Minimum Q1 Median Q3 Maximum
##        0  4      7 11      43
## IQR =  7

Now we have several formulas (methods) for finding an “average” (mean, median) and a measure of variation (range/4, s, IQR). How do you decide which to use?

Use the range only if you can’t find either of the other two, for example if you only know the smallest and the largest observation, or if you have to do a quick calculation in your head.
Decide whether to use the mean or the median as we discussed before.
If you use the mean, also use the standard deviation. If you use the median, use IQR.

Case Study: Weights of Mammals

Weights of the bodies of 62 mammals (in kg)

We saw before that a few outliers can have a HUGE effect on the standard deviation:

attach(brainsize) 
sd(body.wt.kg)

## [1] 898.971

sd(body.wt.kg[body.wt.kg<1000])

## [1] 119.4329

If we want to ignore the outliers we can use the median. But then we should also ignore the outliers in the calculation of a measure of variation, which happens if we use the IQR:

IQR(body.wt.kg)

## [1] 54.065

IQR(body.wt.kg[body.wt.kg<1000])

## [1] 39.755
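The same contrast shows up on a tiny example. The numbers below are made up purely for illustration: nine well-behaved values and one extreme value added to them.

```r
x <- 1:9               # made-up data without outliers
x.out <- c(x, 100)     # the same data plus one extreme value

sd.ratio <- sd(x.out) / sd(x)       # factor by which the standard deviation grows
iqr.ratio <- IQR(x.out) / IQR(x)    # factor by which the IQR grows

sd.ratio     # more than 10
iqr.ratio    # close to 1
```

A single extreme observation inflates the standard deviation more than tenfold here, while the IQR barely moves.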

Boxplot

From the five number summary we can construct another graph for quantitative data, the boxplot:

Case Study: Babe Ruth’s Homeruns

bplot(Homeruns)

Note that by its definition the box contains 50% of the data.
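`bplot` appears to be a helper that comes with the course materials. A comparable graph can be drawn with base R's `boxplot`, sketched here under the assumption that the homerun data are Ruth's 15 seasonal totals for 1920 to 1934 (typed in by hand).

```r
# Assumed content of babe$Homeruns (typed in by hand)
homeruns <- c(54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22)

# Draw a horizontal boxplot. The values used to draw the whiskers, the box
# and the median line are returned (invisibly) in the component `stats`.
b <- boxplot(homeruns, horizontal = TRUE, xlab = "Homeruns")
b$stats
```

For this data the five values in `b$stats` coincide with the five-number summary, because no observation is far enough from the box to be drawn separately.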

Case Study: Simon Newcomb’s Measurements of the Speed of Light

Simon Newcomb made a series of measurements of the speed of light between July and September 1880. He measured the time in seconds that a light signal took to pass from his laboratory on the Potomac River to a mirror at the base of the Washington Monument and back, a total distance of 7400 m. His first measurement was 0.000024828 seconds, or 24,828 nanoseconds (10⁹ nanoseconds = 1 second).

attach(newcomb)
bplot(Measurement, orientation="Horizontal")

Observations marked with a dot are (possible) outliers, that is “unusual” observations.
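The text does not say how `bplot` decides which observations to mark. A common convention, and the one used by base R's `boxplot`, is to flag any observation more than 1.5 × IQR below Q₁ or above Q₃. The sketch below applies that rule to made-up numbers; whether `bplot` uses exactly this rule is an assumption.

```r
x <- c(12, 15, 16, 17, 18, 19, 20, 21, 22, 45)   # made-up data

q <- unname(quantile(x, c(0.25, 0.75)))   # Q1 and Q3
iqr <- q[2] - q[1]
lower.fence <- q[1] - 1.5 * iqr
upper.fence <- q[2] + 1.5 * iqr

flagged <- x[x < lower.fence | x > upper.fence]
flagged                  # observations outside the fences
boxplot.stats(x)$out     # points base R's boxplot would draw separately
```

Here only the value 45 lies outside the fences. The value 12 is the smallest observation but is still close enough to the box to be covered by the whisker.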

One of the effects of outliers is that we then often get a difference between the mean and the median:

mean(Measurement)

## [1] 24826.21

median(Measurement)

## [1] 24827

How did Newcomb handle this problem? After careful consideration he dropped the 24756 and found the mean of the other 65 observations (24827.3), an answer much closer to the median than the original mean. Eliminating data from the analysis is something that should be done with great care! At the very least one needs to be honest about this and discuss the issue, just like Newcomb.

Notice also that the effect of outliers is even greater on the standard deviation, but not so much on the IQR:

sd(Measurement)

## [1] 10.74532

sd(Measurement[Measurement>24756])

## [1] 6.249308

IQR(Measurement)

## [1] 6.75

IQR(Measurement[Measurement>24756])

## [1] 7

Case Study: WRInc

bplot(Income, orientation="Horizontal")

bplot(Years, orientation="Horizontal")

Both variables have many outliers.

Case Study: Mammals

attach(brainsize)
bplot(body.wt.kg, orientation="Horizontal")

There are at least two, and maybe more, substantial outliers.

The handling of outliers is one of the more difficult and dangerous jobs in Statistics:

Case Study: Ozone Hole over South Pole

In 1985 British scientists reported a hole in the ozone layer of the earth’s atmosphere over the South Pole.

This news is disturbing, because ozone protects us from cancer-causing ultraviolet radiation. The British report was at first disregarded, because it was based on ground instruments looking up.

More comprehensive observations from satellite instruments looking down had shown nothing unusual.

Then, examination of the satellite data revealed that the South Pole ozone readings were so low that the computer software used to analyze the data had automatically set these values aside as suspicious outliers.

Readings dating back to 1979 were reanalyzed and showed a large and growing hole in the ozone layer that is unexplained and considered dangerous.

Computers analyzing large volumes of data are often programmed to suppress outliers as protection against errors in the data. As the example of the hole in the ozone illustrates, suppressing an outlier without investigating it can conceal valuable information.

(From Moore and McCabe)

Sometimes it is the outliers that are the most interesting feature of a data set!

There is a nice alternative to the boxplot called the violin plot:

bplot(Income)

bplot(Income, do.violin = TRUE)

In addition to the box this also gives us some information on how many observations we have at various levels.
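The outline of a violin plot is usually a smoothed estimate of the distribution (a kernel density estimate), drawn once and mirrored. The following base R sketch illustrates only that idea. It is not how `bplot` is implemented, and the data are simulated for illustration.

```r
set.seed(1)
# Simulated data with a large group near 30 and a smaller group near 55
x <- c(rnorm(80, mean = 30, sd = 5), rnorm(20, mean = 55, sd = 4))

d <- density(x)   # kernel density estimate: d$x are levels, d$y are heights

# Empty plot, then the density drawn to the right and mirrored to the left
plot(NA, xlim = c(-1, 1) * max(d$y), ylim = range(d$x),
     xaxt = "n", xlab = "", ylab = "x")
polygon(c(d$y, rev(-d$y)), c(d$x, rev(d$x)), col = "grey85")
```

The shape is wide where many observations fall and narrow where there are few, so a second cluster of values shows up as a second bulge, something a plain box cannot show.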

Case Study: Drug Use of Mothers and the Health of the Newborn

Chasnoff and others obtained several measures and responses for newborn babies whose mothers were classified by degree of cocaine use. The study was conducted in the Perinatal Center for Chemical Dependence at Northwestern University Medical School. The measurement given here is the length of the newborn.

Source: SN MacGregor, LG Keith, JA Bachicha, and IJ Chasnoff, “Cocaine abuse during pregnancy: correlation between prenatal care and perinatal outcome”, Obstetrics and Gynecology 1989;74:882-885

Here we have two variables, Length (quantitative) and Status (categorical). Another way (and in many ways a more natural way) to look at this data is as quantitative measurements from different groups. For this type of data we might first compute the summary statistics for each group separately:

attach(mothers)
stat.table(Length, Status)

##                 Sample Size Mean Standard Deviation
## Drug Free                39 51.1                2.9
## First Trimester          19 49.3                2.5
## Throughout               36 48.0                3.6

Note that the discussion on Mean vs. Median still holds: If there are outliers it might be better to use the median and IQR here.
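Group-wise medians and IQRs can be computed in base R with `tapply`. The sketch below uses made-up lengths, not the study data; only the variable and group names are taken from the example above.

```r
# Made-up values for illustration only, NOT the data from the study
Length <- c(50, 52, 51, 49, 53,
            48, 50, 49, 47, 51,
            46, 48, 49, 45, 52)
Status <- rep(c("Drug Free", "First Trimester", "Throughout"), each = 5)

group.median <- tapply(Length, Status, median)   # one median per group
group.iqr <- tapply(Length, Status, IQR)         # one IQR per group
group.median
group.iqr
```

`tapply` splits the first argument by the groups in the second and applies the function to each piece, so any summary (`mean`, `sd`, `median`, `IQR`) can be used the same way.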

The standard graph for this data is a multiple boxplot. Note that all the boxes are on the same scale!

bplot(Length,Status)
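In base R a multiple boxplot is drawn with the formula interface of `boxplot`, which puts one box per group on a common axis. This is a sketch with made-up lengths, not the study data.

```r
# Made-up values for illustration only, NOT the data from the study
Length <- c(50, 52, 51, 49, 53,
            48, 50, 49, 47, 51,
            46, 48, 49, 45, 52)
Status <- rep(c("Drug Free", "First Trimester", "Throughout"), each = 5)

# "Length ~ Status" reads: Length, split by Status
b <- boxplot(Length ~ Status, ylab = "Length")
b$names   # the groups, in the order in which the boxes are drawn
```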

Case Study: WRInc

Recall that when we looked at the relationship between job level and income we drew a scatterplot, but found it to be a poor graph for this data. Here is a better one:

bplot(Income, Job.Level)

This shows the increase in income much more clearly.