According to the 2010 US Census the population of Puerto Rico was 3725789.
us.population.2010
## Alabama Alaska Arizona
## 4779736 710231 6392017
## Arkansas California Colorado
## 2915918 37253956 5029196
## Connecticut Delaware District of Columbia
## 3574097 897934 601723
## Florida Georgia Hawaii
## 18801310 9687653 1360301
## Idaho Illinois Indiana
## 1567582 12830632 6483802
## Iowa Kansas Kentucky
## 3046355 2853118 4339367
## Louisiana Maine Maryland
## 4533372 1328361 5773552
## Massachusetts Michigan Minnesota
## 6547629 9883640 5303925
## Mississippi Missouri Montana
## 2967297 5988927 989415
## Nebraska Nevada New Hampshire
## 1826341 2700551 1316470
## New Jersey New Mexico New York
## 8791894 2059179 19378102
## North Carolina North Dakota Ohio
## 9535483 672591 11536504
## Oklahoma Oregon Pennsylvania
## 3751351 3831074 12702379
## Rhode Island South Carolina South Dakota
## 1052567 4625364 814180
## Tennessee Texas Utah
## 6346105 25145561 2763885
## Vermont Virginia Washington
## 625741 8001024 6724540
## West Virginia Wisconsin Wyoming
## 1852994 5686986 563626
Previously we found the mean population size for the states to be 6053834, so that PR’s population was lower than average. Here is a different way to compare PR to the states: If we order them from smallest to largest and add PR we find:
563626 601723 625741 672591 710231 814180 897934 989415 1052567 1316470 1328361 1360301 1567582 1826341 1852994 2059179 2700551 2763885 2853118 2915918 2967297 3046355 3574097 3725789 3751351 3831074 4339367 4533372 4625364 4779736 5029196 5303925 5686986 5773552 5988927 6346105 6392017 6483802 6547629 6724540 8001024 8791894 9535483 9687653 9883640 11536504 12702379 12830632 18801310 19378102 25145561 37253956
so PR’s population is the 24th smallest. Of the 52 numbers, 23 are smaller than PR’s, and 23 out of 52 is 23/52*100% = 44.2%. We say that
PR is at the 44.2nd percentile.
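The counting above can also be done in R. The following is a minimal sketch: the vector `states` simply retypes the 51 values listed above (50 states plus the District of Columbia), and the names `states`, `pr`, `n_below` and `pct` are made up for this illustration.

```r
# The 51 state populations from the list above (2010 Census), without PR
states <- c(563626, 601723, 625741, 672591, 710231, 814180, 897934, 989415,
            1052567, 1316470, 1328361, 1360301, 1567582, 1826341, 1852994,
            2059179, 2700551, 2763885, 2853118, 2915918, 2967297, 3046355,
            3574097, 3751351, 3831074, 4339367, 4533372, 4625364, 4779736,
            5029196, 5303925, 5686986, 5773552, 5988927, 6346105, 6392017,
            6483802, 6547629, 6724540, 8001024, 8791894, 9535483, 9687653,
            9883640, 11536504, 12702379, 12830632, 18801310, 19378102,
            25145561, 37253956)
pr <- 3725789                              # population of Puerto Rico

all_values <- c(states, pr)                # 52 numbers in total
n_below <- sum(all_values < pr)            # how many are smaller than PR
pct <- n_below / length(all_values) * 100  # percentage of values below PR

n_below
round(pct, 1)
```

Here `n_below` is 23 and `round(pct, 1)` is 44.2, in agreement with the calculation by hand.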
Definition:
The \(p^{th}\) percentile of a data set is the value that has at most \(p\%\) of the data below it and at most \((100-p)\%\) above it.
Example: Consider the first employee in our WRInc data set. She has an income of $22800.
attach(wrinccensus)
sum(Income<22800)
## [1] 2880
shows that there are 2880 employees with a lower income. 2880 out of the 23791 employees in the data set means she is at the \(2880/23791 \times 100 = 12.1^{st}\) percentile.
So \(12.1\%\) of the employees have an income lower than hers, and the remaining \((100-12.1)\%=87.9\%\) have an income at least as high as hers.
The R command to find a percentile is quantile.
Find the 67th percentile of the Homeruns in the babe data set:
attach(babe)
quantile(Homeruns, 0.67)
## 67%
## 47.76
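Why is the answer 47.76, a number that cannot be a home run count? By default, quantile interpolates between the two neighboring ordered observations. The sketch below shows this by hand. It assumes that Homeruns holds Babe Ruth's 15 season totals with the Yankees; the babe data set itself is not shown here, so the vector `homeruns` is an assumption, although it does reproduce the output above.

```r
# Assumed data: Babe Ruth's 15 season home run totals with the Yankees
homeruns <- c(54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22)

x <- sort(homeruns)
n <- length(x)
p <- 0.67

# Default rule of quantile (type = 7): position h = 1 + (n - 1) * p
h <- 1 + (n - 1) * p        # 10.38, between the 10th and 11th ordered values
lo <- floor(h)

# Linear interpolation between x[lo] and x[lo + 1]
by_hand <- x[lo] + (h - lo) * (x[lo + 1] - x[lo])

by_hand
quantile(homeruns, p)
```

Both lines give 47.76. quantile has a `type` argument with other rules; different software packages may therefore give slightly different percentiles for small data sets.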
Find the \(10^{th}\) and the \(90^{th}\) percentile of the WRInc incomes:
quantile(Income, c(0.1, 0.9))
## 10% 90%
## 22000 45900
The quartiles of a data set are defined as
1st quartile Q1 = 25th percentile
3rd quartile Q3 = 75th percentile
Using these we can also find the Interquartile Range
IQR = Q3 - Q1
and the five number summary:
Minimum | Q1 | Median | Q3 | Maximum
fivenumber(Homeruns)
## Minimum Q1 Median Q3 Maximum
## 22 38 46 51.5 60
## IQR = 13.5
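fivenumber is a routine written for this course. If it is not available, the same numbers can be found with base R commands. This is a sketch only: the vector `homeruns` is an assumption (Babe Ruth's 15 season totals with the Yankees, which reproduce the output above), and `five` is a name made up for the illustration.

```r
# Assumed data: Babe Ruth's 15 season home run totals with the Yankees
homeruns <- c(54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22)

# Minimum, Q1, Median, Q3 and Maximum are the 0th, 25th, 50th, 75th
# and 100th percentiles
five <- quantile(homeruns, c(0, 0.25, 0.5, 0.75, 1))
names(five) <- c("Minimum", "Q1", "Median", "Q3", "Maximum")
five

# Interquartile range, Q3 - Q1
IQR(homeruns)
```

This gives 22, 38, 46, 51.5 and 60 for the five number summary and 13.5 for the IQR.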
Find the 5-number-summary and the IQR of the incomes of WRInc:
fivenumber(Income)
## Minimum Q1 Median Q3 Maximum
## 8000 26600 32400 39200 88400
## IQR = 12600
What is the meaning of these percentiles?
Q1 = P25 = $26600, so 25% (or 1 in 4) of the employees make less than $26600.
Median = $32400, so half of the employees make less than $32400, half make more.
Q3 = P75 = $39200, so 25% (or 1 in 4) of the employees make more than $39200.
What is the meaning of the IQR? It is a third measure of variation, after the range and the standard deviation.
Example: The standard deviation of the incomes is s = 9424, whereas the IQR is 12600. The two numbers measure variation in different ways, so they should not be compared with each other directly.
Find the 5-number-summary and the IQR of the Years of WRInc:
fivenumber(Years)
## Minimum Q1 Median Q3 Maximum
## 0 4 7 11 43
## IQR = 7
Now we have several formulas (methods) for finding an “average” (mean, median) and a variation (range/4, s, IQR). How do you decide which to use?
Use the range only if you can’t find either of the other two, for example if you only know the smallest and the largest observation, or if you have to do a quick calculation in your head.
Decide whether to use the mean or the median as we discussed before.
If you use the mean, also use the standard deviation. If you use the median, use IQR.
Example: Body weights of 62 mammals (in kg)
We saw before that a few outliers can have a HUGE effect on the standard deviation:
attach(brainsize)
sd(body.wt.kg)
## [1] 898.971
sd(body.wt.kg[body.wt.kg<1000])
## [1] 119.4329
If we want to ignore the outliers we can use the median. But then we should also ignore the outliers in the calculation of a measure of variation, which happens if we use the IQR:
IQR(body.wt.kg)
## [1] 54.065
IQR(body.wt.kg[body.wt.kg<1000])
## [1] 39.755
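The same effect can be seen in a tiny made-up data set that you can type in yourself. The numbers below are hypothetical and chosen only for illustration; `x` and `clean` are names invented for this sketch.

```r
# Hypothetical data: nine ordinary values and one huge outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1000)
clean <- x[x < 1000]          # the same data without the outlier

# The standard deviation changes dramatically ...
sd(x)
sd(clean)

# ... but the IQR hardly changes at all
IQR(x)
IQR(clean)
```

The IQR is 4.5 with the outlier and 4 without it, whereas the standard deviation with the outlier is more than 100 times as large as without it.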
From the five number summary we can construct another graph for quantitative data, the boxplot:
bplot(Homeruns)
Note that, by its definition, the box contains the middle 50% of the data.
Simon Newcomb made a series of measurements of the speed of light between July and September 1880. He measured the time in seconds that a light signal took to pass from his laboratory on the Potomac River to a mirror at the base of the Washington Monument and back, a total distance of 7400m. His first measurement was 0.000024828 seconds, or 24,828 nanoseconds (\(10^9\) nanoseconds = 1 second).
attach(newcomb)
bplot(Measurement, orientation="Horizontal")
Observations marked with a dot are (possible) outliers, that is “unusual” observations.
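How does a boxplot decide which observations get a dot? The notes do not say which rule bplot uses, so the following is an assumption: a common convention, also used by the boxplot command in base R, marks every observation that is more than 1.5 × IQR below Q1 or above Q3. Here is a sketch with hypothetical data; the names `lower`, `upper` and `outliers` are made up for the illustration.

```r
# Hypothetical data with one unusually large value
x <- c(10, 12, 13, 14, 15, 16, 18, 40)

q1 <- quantile(x, 0.25, names = FALSE)
q3 <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Assumed rule: observations beyond these "fences" are possible outliers
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

outliers <- x[x < lower | x > upper]
outliers

# Base R's own version of the rule (it uses slightly different quartiles)
boxplot.stats(x)$out
```

In this example both approaches flag only the observation 40.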
One of the effects of outliers is that we then often get a difference between the mean and the median:
mean(Measurement)
## [1] 24826.21
median(Measurement)
## [1] 24827
How did Newcomb handle this problem? After careful consideration he dropped the 24756 and found the mean of the other 65 observations (24827.3), an answer much closer to the median than the original mean. Eliminating data from the analysis is something that should be done with great care! At the very least one needs to be honest about this and discuss the issue, just like Newcomb.
Notice also that the effect of outliers is even greater on the standard deviation, but not so much on the IQR:
sd(Measurement)
## [1] 10.74532
sd(Measurement[Measurement>24756])
## [1] 6.249308
IQR(Measurement)
## [1] 6.75
IQR(Measurement[Measurement>24756])
## [1] 7
bplot(Income, orientation="Horizontal")
bplot(Years, orientation="Horizontal")
Both variables have many outliers.
attach(brainsize)
bplot(body.wt.kg, orientation="Horizontal")
Here there are at least two, and maybe more, substantial outliers.
The handling of outliers is one of the more difficult and dangerous jobs in Statistics:
In 1985 British scientists reported a hole in the ozone layer of the earth’s atmosphere over the South Pole.
This news is disturbing, because ozone protects us from cancer-causing ultraviolet radiation. The British report was at first disregarded, because it was based on ground instruments looking up.
More comprehensive observations from satellite instruments looking down had shown nothing unusual.
Then, examination of the satellite data revealed that the South Pole ozone readings were so low that the computer software used to analyze the data had automatically set these values aside as suspicious outliers.
Readings dating back to 1979 were reanalyzed and showed a large and growing hole in the ozone layer that is unexplained and considered dangerous.
Computers analyzing large volumes of data are often programmed to suppress outliers as protection against errors in the data. As the example of the hole in the ozone illustrates, suppressing an outlier without investigating it can conceal valuable information.
(From Moore and McCabe)
Sometimes it is the outliers that are the most interesting feature of a data set!
There is a nice alternative to the boxplot called the violin plot:
bplot(Income)
bplot(Income, do.violin = TRUE)
In addition to the box this also gives us some information on how many observations we have at various levels.
Chasnoff and others obtained several measures and responses for newborn babies whose mothers were classified by degree of cocaine use. The study was conducted in the Perinatal Center for Chemical Dependence at Northwestern University Medical School. The measurement given here is the length of the newborn.
Source: Cocaine abuse during pregnancy: correlation between prenatal care and perinatal outcome Authors: SN MacGregor, LG Keith, JA Bachicha, and IJ Chasnoff
Obstetrics and Gynecology 1989;74:882-885
Here we have two variables, Length (quantitative) and Status (categorical). Another way (and in many ways a more natural way) to look at this data is as quantitative measurements from different groups. For this type of data we might first compute the summary statistics for each group separately:
attach(mothers)
stat.table(Length, Status)
## Sample Size Mean Standard Deviation
## Drug Free 39 51.1 2.9
## First Trimester 19 49.3 2.5
## Throughout 36 48.0 3.6
Note that the discussion on Mean vs. Median still holds: If there are outliers it might be better to use the median and IQR here.
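stat.table is a routine written for this course and reports the mean and the standard deviation. If you prefer the median and the IQR for each group, base R can do the job. The sketch below uses a small hypothetical data set, not the real mothers data; the names `babies` and `groups` are made up for the illustration.

```r
# Hypothetical data: lengths of 12 newborns in three groups
babies <- data.frame(
  Length = c(51, 53, 49, 52, 48, 50, 47, 49, 46, 44, 50, 48),
  Status = rep(c("Drug Free", "First Trimester", "Throughout"), each = 4)
)

# Split the lengths into one vector per group
groups <- split(babies$Length, babies$Status)

# Apply the same summary to every group
data.frame(
  Sample.Size = sapply(groups, length),
  Median      = sapply(groups, median),
  IQR         = sapply(groups, IQR)
)
```

Replacing median and IQR by mean and sd gives the same kind of table as stat.table.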
The standard graph for this data is a multiple boxplot. Note that all the boxes are on the same scale!
bplot(Length,Status)
Recall that when we looked at the relationship between job level and income we drew a scatterplot, but then noticed that it was a bad graph for this data. Here is a better one:
bplot(Income, Job.Level)
This shows the increase in income much more clearly.