The standard graph for one quantitative variable is the histogram:
attach(wrinccensus)
hplot(Income)
It can be useful to draw a couple of histograms, with different numbers of bins:
hplot(Income, n=25)
Now that we have numbers we can do arithmetic:
According to the 2010 US Census the population of Puerto Rico was 3725789. How does this compare to the rest of the US?
Here is the data:
us.population.2010
## Alabama Alaska Arizona
## 4779736 710231 6392017
## Arkansas California Colorado
## 2915918 37253956 5029196
## Connecticut Delaware District of Columbia
## 3574097 897934 601723
## Florida Georgia Hawaii
## 18801310 9687653 1360301
## Idaho Illinois Indiana
## 1567582 12830632 6483802
## Iowa Kansas Kentucky
## 3046355 2853118 4339367
## Louisiana Maine Maryland
## 4533372 1328361 5773552
## Massachusetts Michigan Minnesota
## 6547629 9883640 5303925
## Mississippi Missouri Montana
## 2967297 5988927 989415
## Nebraska Nevada New Hampshire
## 1826341 2700551 1316470
## New Jersey New Mexico New York
## 8791894 2059179 19378102
## North Carolina North Dakota Ohio
## 9535483 672591 11536504
## Oklahoma Oregon Pennsylvania
## 3751351 3831074 12702379
## Rhode Island South Carolina South Dakota
## 1052567 4625364 814180
## Tennessee Texas Utah
## 6346105 25145561 2763885
## Vermont Virginia Washington
## 625741 8001024 6724540
## West Virginia Wisconsin Wyoming
## 1852994 5686986 563626
So how does Puerto Rico compare? One way to answer this question is to find the average population size:
We want just one number to describe all the numbers in the data set.
How do we calculate an “average”?
Usual answer: mean
Example Three of your friends are 19, 20 and 23 years old. What is their average age?
Answer: (19+20+23)/3 = 62/3 = 20.7
or
x <- c(19, 20, 23)
round(mean(x), 1)
## [1] 20.7
In Statistics the mean is important enough to have its own symbol: \(\bar X\) (say: x bar)
We find
mean(us.population.2010)
## [1] 6053834
PR had a population of \(3725789\), so ours is lower than average.
Note According to our rules we should round to one digit behind the decimal. For large numbers, though, we often round in the other direction, keeping just a few significant digits, so here I might end up using 6,054,000!
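In R the signif command rounds to a given number of significant digits, which is handy for large numbers like these (a small sketch):
signif(mean(us.population.2010), 4)
## [1] 6054000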
Many still consider Babe Ruth the greatest baseball player of all time. After the 1919 season he moved to the New York Yankees, where he played from 1920 until 1934. Here are the numbers of home runs he hit in those years:
Year | Homeruns |
---|---|
1920 | 54 |
1921 | 59 |
1922 | 35 |
1923 | 41 |
1924 | 46 |
1925 | 25 |
1926 | 47 |
1927 | 60 |
1928 | 54 |
1929 | 46 |
1930 | 49 |
1931 | 46 |
1932 | 41 |
1933 | 34 |
1934 | 22 |
What was his home run average while with the Yankees?
attach(babe)
round(mean(Homeruns), 1)
## [1] 43.9
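If the babe data set is not available, the same numbers can be typed in directly from the table (a quick sketch):
homeruns <- c(54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22)
round(mean(homeruns), 1)
## [1] 43.9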
Advice
The most important thing you can do in this class (and, more importantly, in life!) after you have done some calculation is to ask yourself:
Does my answer make sense?
If you find that the average age of your three friends in the example above is 507.9, you have to know that this answer is wrong.
Example Which of the following are obviously not correct for the mean of Babe Ruth's home runs, and why?
There are other methods for computing an “average”, though. For example:
Median: the observation “in the middle” of the ordered data set:
22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60
With n = 15 observations the one in the middle is the 8th, so Median = 46.
What if the Babe had left the Yankees a year earlier? Then the 22 from 1934 drops out, we have an even number of observations, and the median is the mean of the two middle ones:
25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60
Median = (46+46)/2 = 46
Using R:
median(Homeruns)
## [1] 46
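We can also check the shortened data set in R (assuming, as in the table above, that babe has a Year column):
median(Homeruns[Year != 1934])
## [1] 46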
Let’s find the mean and median of the salaries of the WRInc employees:
mean(Income)
## [1] 33373.13
median(Income)
## [1] 32400
Here there is a difference of almost $1000 between the mean and the median. So which one is the right “average”?
Weights of the bodies of 62 mammals (in kg)
Animal | body.wt.kg | brain.wt.g |
---|---|---|
African elephant | 6654.000 | 5712.00 |
African giant pouched rat | 1.000 | 6.60 |
Arctic Fox | 3.385 | 44.50 |
Arctic ground squirrel | 0.920 | 5.70 |
Asian elephant | 2547.000 | 4603.00 |
Baboon | 10.550 | 179.50 |
Big brown bat | 0.023 | 0.30 |
Brazilian tapir | 160.000 | 169.00 |
Cat | 3.300 | 25.60 |
Chimpanzee | 52.160 | 440.00 |
Chinchilla | 0.425 | 6.40 |
Cow | 465.000 | 423.00 |
Desert hedgehog | 0.550 | 2.40 |
Donkey | 187.100 | 419.00 |
Eastern American mole | 0.075 | 1.20 |
Echidna | 3.000 | 25.00 |
European hedgehog | 0.785 | 3.50 |
Galago | 0.200 | 5.00 |
Genet | 1.410 | 17.50 |
Giant armadillo | 60.000 | 81.00 |
Giraffe | 529.000 | 680.00 |
Goat | 27.660 | 115.00 |
Golden hamster | 0.120 | 1.00 |
Gorilla | 207.000 | 406.00 |
Gray seal | 85.000 | 325.00 |
Gray wolf | 36.330 | 119.50 |
Ground squirrel | 0.101 | 4.00 |
Guinea pig | 1.040 | 5.50 |
Horse | 521.000 | 655.00 |
Jaguar | 100.000 | 157.00 |
Kangaroo | 35.000 | 56.00 |
Lesser short-tailed shrew | 0.005 | 0.14 |
Little brown bat | 0.010 | 0.25 |
Man | 62.000 | 1320.00 |
Mole rat | 0.122 | 3.00 |
Mountain beaver | 1.350 | 8.10 |
Mouse | 0.023 | 0.40 |
Musk shrew | 0.048 | 0.33 |
N. American opossum | 1.700 | 6.30 |
Nine-banded armadillo | 3.500 | 10.80 |
Okapi | 250.000 | 490.00 |
Owl monkey | 0.480 | 15.50 |
Patas monkey | 10.000 | 115.00 |
Phalanger | 1.620 | 11.40 |
Pig | 192.000 | 180.00 |
Rabbit | 2.500 | 12.10 |
Raccoon | 4.288 | 39.20 |
Rat | 0.280 | 1.90 |
Red fox | 4.235 | 50.40 |
Rhesus monkey | 6.800 | 179.00 |
Rock hyrax (Hetero. b) | 0.750 | 12.30 |
Rock hyrax (Procavia hab) | 3.600 | 21.00 |
Roe deer | 83.000 | 98.20 |
Sheep | 55.500 | 175.00 |
Slow loris | 1.400 | 12.50 |
Star nosed mole | 0.060 | 1.00 |
Tenrec | 0.900 | 2.60 |
Tree hyrax | 2.000 | 12.30 |
Tree shrew | 0.104 | 2.50 |
Vervet | 4.190 | 58.00 |
Water opossum | 3.500 | 3.90 |
Yellow-bellied marmot | 4.050 | 17.00 |
attach(brainsize)
round(mean(body.wt.kg), 3)
## [1] 199.889
round(median(body.wt.kg), 3)
## [1] 3.342
Here we find Mean = 199.889 and Median = 3.342!
So what is the AVERAGE???
The reason for this huge difference is obvious: there are two mammals that are much larger than the rest, the African and the Asian elephants. Observations like these that are “unusual” are often called outliers.
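We can see the effect of the two elephants directly by recomputing the mean without them (a sketch; only the elephants weigh more than 1000 kg, and we will use the same cutoff again below):
round(mean(body.wt.kg[body.wt.kg < 1000]), 3)
## [1] 53.202
The mean drops from about 200 kg to about 53 kg, while the median barely moves.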
Often the mean and the median are very similar if the histogram of the data is symmetric, that is, it looks the same from right to left as from left to right:
compared to, for example, the following, which is called skewed to the right:
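We can generate data with these two shapes ourselves via simulation; a small sketch, assuming hplot works on any numeric vector as above:
x.symmetric <- rnorm(1000, 10, 3)  # normal data: symmetric and bell-shaped
hplot(x.symmetric)
x.skewed <- rexp(1000, 1)          # exponential data: a long tail to the right
hplot(x.skewed)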
Let’s have a look at the years:
hplot(Years)
The histogram is a bit skewed to the right, so we would expect the mean to be higher than the median.
Let’s see:
round(mean(Years), 1)
## [1] 7.6
round(median(Years), 1)
## [1] 7
and indeed it is, by a bit.
Whether the mean or the median is a better measure of “average” is NOT a simple question. It often depends on the question asked:
Example 1: What is the weight of a “typical” mammal? Median = 3.34 kg.
Example 2: Say we randomly choose 50 mammals. These are to be transported by ship. How large a ship do we need (what carrying capacity)?
Now if we use the median we find \(50 \times 3.3 = 165\) kg, but if one of the 50 animals is an elephant we are sunk (literally!). So we should use
estimated total weight = 50 × mean weight = \(50 \times 199.9 = 9995\) kg.
Example The government has just released the data for a study of Puerto Rican households. One of the variables was household income.
- You read in El Nuevo Dia that the mean income in PR is $23100.
- You hear on the local news that the median income in PR is $20400.
Which of these numbers is better?
Without any explanation of what the number will be used for, this question has no answer; both the mean and the median are perfectly good ways to calculate an “average”.
Misuse of Statistics: Mean vs. Median
Say the owner of a McDonalds wants to compute the “average” hourly wage for the people working there. Do you think she will use the mean or the median? What if it is the Union that wants to find the “average”?
A statistician is standing with one foot in an icebucket and the other foot in a burning fire. He says: on average I feel fine.
A “measure of central tendency” is a good start for describing a set of numbers, but it does not tell the whole story. Consider the two examples in the next graph:
Here we have two data sets, both have a mean of 10 but they are clearly very different, with different “spreads”. We would like to have some way to measure this “spread-out-ness”.
Range: the simplest measure of variation is the range of the observations, defined as largest minus smallest observation.
Example In the graph above the x data seems to go from about 0 to about 19, so the range is 19-0=19. The y data seems to go from about 7 to about 13, so the range is 13-7=6.
Example For Babe Ruth's home runs we find range = 60-22 = 38.
Note Some textbooks and/or computer programs define the range as the pair of numbers (smallest, largest).
range(Homeruns)
## [1] 22 60
The standard deviation is the most important measure of variation, so it is very important that you learn what it is and what it is telling you.
Consider the following example. Say we have done a survey. We went to a number of locations, and among other things we asked people their age. We found:
Mall: 3 7 13 14 16 18 20 22 23 24 25 27 33 34 40
Plaza: 3 23 26 38 39 40 43 44 46 72
Let’s look at the data with a graph:
Now it seems the variation of the Plaza ages (call them y) is a bit larger than the variation of the Mall ages (call them x). But the means of the two data sets are also different. If we want to concentrate on the variation we can eliminate the difference in the means by subtracting them from each observation:
\(\overline{X} = (3+7+..+40)/15 = 319/15 = 21.27\)
\(\overline{Y} = (3+23+..+72)/10 = 374/10 = 37.40\)
and with this we get:
\(x-\overline{X}\): -18.27 -14.27 -8.27 -7.27 -5.27 -3.27 -1.27 0.73 1.73 2.73 3.73 5.73 11.73 12.73 18.73
\(y-\overline{Y}\): -34.4 -14.4 -11.4 0.6 1.6 2.6 5.6 6.6 8.6 34.6
Let’s look at these numbers with a graph again:
and it is now more obvious that the variation of the y's is a little bit larger than that of the x's.
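By the way, these calculations are easy to redo in R by typing in the two data sets:
mall <- c(3, 7, 13, 14, 16, 18, 20, 22, 23, 24, 25, 27, 33, 34, 40)
plaza <- c(3, 23, 26, 38, 39, 40, 43, 44, 46, 72)
round(mall - mean(mall), 2)   # reproduces the x deviations above
round(plaza - mean(plaza), 2) # reproduces the y deviations above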
Notice that the mean of the x-\(\overline{X}\) numbers (and of course also the y-\(\overline{Y}\) numbers) is now 0.
Because these new numbers are centered at 0, a larger variation means “farther away from 0”, so how about as a measure of variation the “mean of x-\(\overline{X}\)”, that is
\[\text{Mean} \left( x - \overline{ X} \right)\]
But no, that won’t work because
\[\text{Mean} \left( x - \overline{ X} \right) = 0\] always! (Not obvious? Try it out!)
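We can try it out in R, for example with the ages of the three friends:
x <- c(19, 20, 23)
round(mean(x - mean(x)), 10)  # 0, no matter what data we use
## [1] 0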
The problem is that some (actually about half) of the x-\(\overline{X}\) are negative, the others are positive, so in the sum they just cancel out.
So somehow we need to get rid of the - signs. One way to do that would be to use absolute values: |x-\(\overline{X}\)|. It turns out, though, that for some mathematical reasons it is better to use squares:
\[\text{Mean} \left( x - \overline{ X} \right)^2 \] Another change from the “obvious” is that we should divide this by n-1 instead of n, and with this we have the famous formula for the Variance:
\[s^2=\frac{1}{n-1}\sum\left( x - \overline{ X} \right)^2\]
So in essence the variance is the
mean square distance from the sample mean
One problem with having squared everything is that now the units are in “squares”. For example, if our data is the age of people, the variance is in age\(^2\).
Usually we want everything in the same units, and this is easy to do by taking square roots, and so we finally have the formula for the Standard Deviation:
\[s = \sqrt{\frac{1}{n-1}\sum\left( x - \overline{ X} \right)^2}\]
The R command to find a standard deviation is sd.
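For example, we can verify the formula on the ages of the three friends from before:
x <- c(19, 20, 23)
sum((x - mean(x))^2)/(length(x) - 1)  # the variance from the formula
## [1] 4.333333
var(x)                                # R's built-in variance: same value
## [1] 4.333333
sd(x)                                 # the standard deviation: sqrt of the variance
## [1] 2.081666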
Example What was the standard deviation of Babe Ruth's home runs?
round(sd(Homeruns), 1)
## [1] 11.2
Most of the time when you need to find a standard deviation you also want to find the mean, so it is better to use
stat.table(Homeruns)
## Sample Size Mean Standard Deviation
## Homeruns 15 43.9 11.2
Now we have two ways to measure the “spread-out-ness”, range and standard deviation. The two are on different scales, though: for Babe Ruth's home runs we found range = 38 and s = 11.2. As a rule of thumb we often have
s is close to range/4
range/4 = 38/4 = 9.5, s = 11.2.
Let’s have a look at Satisfaction:
range(Satisfaction)
## [1] 1 5
(5-1)/4
## [1] 1
round(sd(Satisfaction), 1)
## [1] 1.3
Weights of the bodies of 62 mammals (in kg)
We saw before that a few outliers can have a HUGE effect on the mean. The same is true (actually even worse!) for the standard deviation:
sd(body.wt.kg)
## [1] 898.971
sd(body.wt.kg[body.wt.kg<1000])
## [1] 119.4329
If we want to ignore the outliers in the calculation of an average, we can use the median. What can we do if we want to find a measure of variation?
We will see in a little bit!
In the discussion of the standard deviation we saw that if we want to compare two sets of numbers, subtracting the mean is a good idea because then the two sets are both centered at 0. Now we go a step further and also divide by the standard deviation, getting the z scores:
\[ z = \frac{x-\overline{X}}{s} \]
The idea here is that no matter what scale the original data is on, the z scores are of the same “size” and can therefore be compared directly.
Note z scores are usually rounded to three digits behind the decimal.
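R has no command called zscore, but it is easy to write one ourselves (a little sketch; the name zscore is made up, and R's built-in scale function does essentially the same thing):
zscore <- function(x) round((x - mean(x))/sd(x), 3)
zscore(c(19, 20, 23))  # the ages of the three friends
## [1] -0.801 -0.320  1.121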
Example Say you have taken two exams. In exam 1 you got 13 out of 20 points and in exam 2 you got 58 out of 100 points. In which exam did you do better?
At first glance you might say exam 1, because if we want to re-scale exam 1 to also have a total of 100 we need to multiply by 5 (20*5 = 100), so your “equivalent” score is
13*5 = 65 > 58
But “doing better” often means doing better with respect to how everyone else did. So let’s say
\(\overline{X}_1 = 10.1\), \(s_1 =4.5\)
\(\overline{X}_2 = 45.7\), \(s_2 =16.5\)
Let’s find your respective z scores: \[ z_1 = \frac{13-10.1}{4.5}=0.644 \\ z_2 = \frac{58-45.7}{16.5}=0.745 \\ \]
and because your z score in exam 2 is higher, that is the exam you did better on.
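Of course R will do the arithmetic for us:
round((13 - 10.1)/4.5, 3)
## [1] 0.644
round((58 - 45.7)/16.5, 3)
## [1] 0.745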
Clearly if x is close to the mean, the z score will be close to 0. It turns out that often z is somewhere between -2 and +2. Both of your z scores are a bit larger than 0 but not by much, so they probably are B's!
What is the z score of PR’s population of \(3725789\)?
We have
mean(us.population.2010)
## [1] 6053834
sd(us.population.2010)
## [1] 6823984
so
\[ \begin{aligned} &\overline{X}=6053834\\ &s=6823984\\ &z = \frac{3725789-6053834}{6823984}=-0.341 \\ \end{aligned} \] and so PR’s z score is -0.341.
Of course we can use R as well:
round((3725789 - mean(us.population.2010))/sd(us.population.2010), 3)
## [1] -0.341
Above we learned how to compute the mean and the standard deviation of a data set. Now we will look at the following question: what are these numbers actually telling us about the data?
Example You read in the newspaper about a study on the age when a criminal committed his first crime. They found that the mean age was 18.3 with a standard deviation of 2.6 years. What is this telling you?
The information “mean age was 18.3”, or with our notation \(\overline{X}=18.3\), is pretty easy to understand - somewhere around age 18 people start to commit crimes. But what about “with a standard deviation of 2.6 years”?
For this we can use the empirical rule
if a data set has a bell-shaped histogram, then 95% of the observations fall into the interval
\[ \left( \overline{X}-2s \text{, }\overline{X}+2s \right) \]
Notice the connection to the z scores. We previously said the z score is usually between -2 and 2, so z=2 would indicate a score almost at the maximum. But then
\[ \begin{aligned} &2 = z = \frac{ x-\overline{X} }{s} \\ &2s = x-\overline{X} \\ &x = \overline{X}+2s \end{aligned} \]
Example Back to the example. We have \(\overline{X}\)=18.3 and s=2.6, so
18.3 - 2*2.6
## [1] 13.1
18.3 + 2*2.6
## [1] 23.5
so 95% of the criminals are between 13.1 and 23.5 years old when they first commit a crime.
Knowing the mean and the standard deviation and using the empirical rule makes it possible to make a guess about the size of the actual observations.
Above we said that s is often close to range/4. The reason for this is explained by the empirical rule: \((\overline{X}-2s, \overline{X}+2s)\) contains 95% of the data, so \(\overline{X}-2s\) should be close to the smallest observation and \(\overline{X}+2s\) should be close to the largest observation. So
range = largest-smallest is close to
\[ (\overline{X}+2s) - (\overline{X}-2s) = 4s \] or
s is close to range/4
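We can check this on Babe Ruth's home runs (a quick sketch; the exact mean and standard deviation differ a little from the rounded values above):
round(c(mean(Homeruns) - 2*sd(Homeruns), mean(Homeruns) + 2*sd(Homeruns)), 1)
## [1] 21.4 66.4
which is indeed not far from the smallest (22) and largest (60) observations.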
Again back to the example with the criminals. For the empirical rule to work the data should have a bell-shaped histogram. Do you think this is true for this example?
Example In some class the professor tells you that on the exam the mean score was 78 with a standard deviation of 12. There are 60 students in the class.
So if the empirical rule holds we would expect
0.95*60
## [1] 57
or 57 of the 60 students to have scores somewhere between
78 - 2*12
## [1] 54
and
78 + 2*12
## [1] 102
points on the exam.
Of course if the highest possible score was 100 that means we should expect someone to have gotten 100!
Let's check whether the empirical rule holds for the Income data of the WRInc data set.
hplot(Income)
The histogram is reasonably bell-shaped.
What does the empirical rule say?
mean(Income) - 2*sd(Income)
## [1] 14524.37
mean(Income) + 2*sd(Income)
## [1] 52221.9
so we should have about 95% of the incomes between $14524 and $52221.
Let’s check:
round(sum(Income>14524 & Income<52221)/length(Income)*100, 1)
## [1] 95.9
looks about right!
Consider this data set (just called x):
hplot(x)
Now
left <- mean(x)-2*sd(x)
right <- mean(x)+2*sd(x)
c(left, right)
## [1] -46.35211 45.82627
round(sum(left<x & x<right)/length(x)*100, 1)
## [1] 98.2
and here it is much more than 95%.
Another example:
hplot(x)
Now
left <- mean(x)-2*sd(x)
right <- mean(x)+2*sd(x)
c(left, right)
## [1] 0.04769152 0.95814345
round(sum(left<x & x<right)/length(x)*100, 1)
## [1] 98.5
and again it is much more than 95%.