Statistics is the Science of Uncertainty: it lets us learn useful information in situations where the available information is incomplete.

Example

(From the book An Introduction to the Bootstrap by Efron and Tibshirani) Below we have the results of a small experiment, in which 7 out of 16 mice were randomly selected to receive a new medical treatment, while the remaining 9 mice were assigned to the control group. The treatment was intended to prolong survival after surgery:

Treatment   Control
94          52
197         104
16          146
38          10
99          50
141         31
23          40
            27
            46
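
To follow along with the calculations below, the data can be entered in R as two vectors, named treatment and control to match the code that follows:

treatment <- c(94, 197, 16, 38, 99, 141, 23)
control <- c(52, 104, 146, 10, 50, 31, 40, 27, 46)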

The obvious question is: does the new treatment increase survival times?

How can we answer this question? The first thing we can try is to calculate the mean survival times:

round(c(mean(treatment), mean(control)), 1)
## [1] 86.9 56.2

so the mice in the treatment group lived about 30.7 days longer than those in the control group.

But why the mean? Why not the median or some other measure of average?

Is there some theoretical justification for the mean as the best way to calculate an average? Is it always best?

Very good, but we don't really care about these 16 mice; they are dead anyway. These 16 mice were just a random sample from the population of all mice who might receive this treatment or this control, and what we really want to know is whether the treatment increases survival in a statistically significant way.

Some standard terminology:

Population: all of the entities (people, events, things etc.) that are the focus of a study

Census: a study that includes all the entities of a population.

Sample: any subset of the population

Random sample: a sample found through some randomization (flip of a coin, random numbers on a computer, etc.)

Simple Random Sample (SRS): each “entity” in the population has an equal chance of being chosen for the sample (see the short R illustration after this list).

Stratified Sample: first divide the population into subgroups, then do an SRS in each subgroup.

Bias: a systematic difference between a sample and its population

Statistically Significant: not due to random chance.

Parameter: any numerical quantity associated with a population

Statistic: any numerical quantity associated with a sample
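
As a quick aside, here is how a simple random sample might be drawn in R (a small toy illustration, not part of the mouse example):

population <- 1:100              # labels for the 100 entities of a (toy) population
sample(population, size = 10)    # an SRS of size 10: each entity is equally likely to be chosen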


Here is our question again: from the data we know that the difference of the sample means (a Statistic) is 30.7 days.

What we really want to know is whether the corresponding difference of the population means (a Parameter) is positive.

In other words we want to use the information in the sample to make an inference for the corresponding population.

So, how do we find out whether or not the difference of 30.7 days above is statistically significant? Consider the following boxplot:
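
One way such a boxplot can be drawn in base R (a sketch; the code behind the original figure is not shown):

boxplot(list(Treatment = treatment, Control = control),
        ylab = "Survival time (days)")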

In addition to the average of a data set, the boxplot also gives us an idea of the variation in the data.

So, how can we find the variance of the difference of the mean survival times? First we can find the sample standard deviation:

\[s=\sqrt{\frac1{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}\]

The idea behind this formula is simple:

  • \(x_i-\bar{x}\) is the deviation (distance) of each individual observation from the mean (these are sometimes called the residuals or errors)

  • squaring the residuals gets rid of minus signs (but so would taking absolute values)

  • s is essentially the square root of the mean of these squared deviations, except that we divide by n-1 rather than n (a quick check against R’s built-in sd follows this list).
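
As a quick check that the formula above is what R’s sd function computes (using the treatment group):

n <- length(treatment)
sqrt(sum((treatment - mean(treatment))^2)/(n - 1))   # the formula above
sd(treatment)                                        # R's built-in version, same value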

Finding s within each group we get:

round(c(sd(treatment), sd(control)), 2)
## [1] 66.77 42.42

But why the sample standard deviation? Why not some other measure of “variation”?

This is the standard deviation of the individual observations.

From here we can find the standard errors of the sample means with \(s/\sqrt n\) (why?)

round(c(sd(treatment)/sqrt(length(treatment)), 
        sd(control)/sqrt(length(control))), 2)
## [1] 25.24 14.14

Finally we can find the standard error of the difference of the means:

standard error of difference = \(\sqrt{25.24^2+14.14^2}=28.9\).
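
The same number can be obtained directly in R (a quick check, combining the two standard errors as in the formula above):

se.diff <- sqrt(sd(treatment)^2/length(treatment) + sd(control)^2/length(control))
round(se.diff, 1)
## [1] 28.9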

Why this formula? It is essentially a way of averaging the two standard errors, so why not just use (25.24+14.14)/2 = 19.69?

So we have a sample mean difference of 30.7 with a standard error of 28.9; that is, the sample mean difference is 30.7/28.9 ≈ 1.06 standard deviations above 0. From probability theory we know that anything within 2 standard deviations might well be due to random fluctuation.
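
For reference, the 2-standard-deviation rule of thumb comes from the normal distribution, where roughly 95% of the probability lies within 2 standard deviations of the mean; a quick check in R:

round(pnorm(2) - pnorm(-2), 3)
## [1] 0.954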

But why 2 standard deviations? Why not 1 or 3 or 4.55?

It seems we can’t say that there is a statistically significant difference between the treatment and the control. Does that mean there is no difference? Actually, no: if the difference in means of 30.7 days, with standard deviations of about 50 in each group, were to persist as we collected more data, what sample size would be needed to find a statistically significant difference? The graph below shows how many standard deviations the difference would be above 0 as a function of the sample size (equal for both groups):
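
One way such a graph might be produced (a sketch; the code behind the original figure is not shown), using the average of the two observed group standard deviations, about 55, as a common value:

s <- mean(c(sd(treatment), sd(control)))   # common standard deviation, about 55
n <- 5:40                                  # candidate sample sizes per group
ratio <- 30.7/(s*sqrt(2/n))                # standard deviations above 0
plot(n, ratio, type = "l",
     xlab = "Sample size (each group)",
     ylab = "Standard deviations above 0")
abline(h = 2, lty = 2)                     # the 2 standard deviation threshold
min(n[ratio > 2])
## [1] 26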

so we would need about 26 mice in each group.

Does this mean the treatment really is better than the control, and we just didn’t use enough mice in our study? Again, not necessarily: maybe the difference in means of 30.7 would shrink if we used more mice, and we would never pass the threshold of 2 standard deviations. We can’t know that until we run a larger experiment. The graph above just gives us an idea of how large such a new experiment should be.