Now we can go back to Statistics. To begin with, recall the following:
- **Population**: all of the entities (people, events, things, etc.) that are the focus of a study.
- **Sample**: any subset of the population.
- **Parameter**: any numerical quantity associated with a population.
- **Statistic**: any numerical quantity associated with a sample.
After our discussion of probability, we can now be a little more precise.
Say we roll a fair die until the first time we get a six. We are always in one of two situations:
We have not actually rolled any die; maybe we don't even have a die, and we are just studying this exercise theoretically. That is, we are studying the population of all possible outcomes of the experiment "number of rolls of a fair die until the first six". We then have a theoretical description of this experiment, namely its distribution.
For the first few values of the number of rolls needed we find:
Rolls Needed | Probability |
---|---|
1 | 0.167 |
2 | 0.139 |
3 | 0.116 |
4 | 0.096 |
5 | 0.080 |
6 | 0.067 |
There are formulas for all sorts of numbers for various distributions. For ours, with success probability \(p=1/6\), we have

\[P(X=k)=(1-p)^{k-1}p, \qquad \mu=\frac{1}{p}=6, \qquad \sigma=\frac{\sqrt{1-p}}{p}=\sqrt{30}\approx 5.48\]

and so on. But because they are computed for the whole population, they are parameters.
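These values can also be found with R's built-in routines for the geometric distribution; a quick sketch (note that dgeom counts the failures before the first six, hence the shift by one):

p <- 1/6
round(dgeom(0:5, p), 3)  # P(X = k) for k = 1, ..., 6 rolls
1/p  # population mean: 6
sqrt(1 - p)/p  # population standard deviation: about 5.48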
On the other hand, we can study this exercise by actually rolling a fair die many times and observing what happens. Actually (a lot faster and less work!) we can run a simulation. The distribution describing this experiment is called a geometric distribution, and we can generate data with rgeom(B, 1/6)+1 (the +1 is needed because rgeom counts the failures before the first six, not the number of rolls):
B <- 10000  # number of simulated repetitions of the experiment
x <- rgeom(B, 1/6) + 1  # simulated numbers of rolls until the first six
Now we can use this data set to estimate probabilities:
round(table(x)/B, 3)
## x
## 1 2 3 4 5 6 7 8 9 10 11 12
## 0.165 0.130 0.116 0.100 0.078 0.069 0.060 0.045 0.043 0.033 0.028 0.021
## 13 14 15 16 17 18 19 20 21 22 23 24
## 0.018 0.017 0.014 0.008 0.011 0.008 0.006 0.006 0.005 0.003 0.002 0.002
## 25 26 27 28 29 30 31 32 33 34 35 36
## 0.002 0.002 0.002 0.001 0.002 0.001 0.001 0.000 0.000 0.000 0.000 0.000
## 37 39 41 44 45
## 0.000 0.000 0.000 0.000 0.000
and of course we can find the summary statistics:
round(mean(x), 2)
## [1] 6.05
round(sd(x), 2)
## [1] 5.45
round(quantile(x, c(0.75, 0.95)), 2)
## 75% 95%
## 8 17
Because these are computed from a sample, they are statistics.
The really powerful idea here is to combine these two approaches. Say we have a die that we suspect is not fair, and we wish to test this. So we roll the die and then compare the empirical results with the theoretical ones. For our die we find (for one run of the simulation):
Rolls Needed | Theory | Sample |
---|---|---|
1 | 0.167 | 0.165 |
2 | 0.139 | 0.130 |
3 | 0.116 | 0.116 |
4 | 0.096 | 0.100 |
5 | 0.080 | 0.078 |
6 | 0.067 | 0.069 |
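Such a comparison table can be built in R from the simulated x above; a sketch (the sample column will of course vary from run to run):

k <- 1:6
theory <- round(dgeom(k - 1, 1/6), 3)  # theoretical P(X = k)
observed <- round(as.vector(table(x)[as.character(k)])/B, 3)  # observed proportions
data.frame(Rolls = k, Theory = theory, Sample = observed)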
or we can do this by looking at some summaries:

Summary | Theory | Sample |
---|---|---|
Mean | 6.00 | 6.05 |
Standard Deviation | 5.48 | 5.45 |
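The same comparison in R, again reusing the simulated x (a small sketch):

round(rbind(Theory = c(Mean = 6, SD = sqrt(30)),
            Sample = c(Mean = mean(x), SD = sd(x))), 2)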
It seems our die is pretty much a fair die.
The most important feature of the scientific method is that any scientific theory has to be falsifiable; that is, it has to be possible to carry out experiments and compare their results to predictions made by the theory. If they agree, the theory looks good; if not, we need to change the theory or even find a new one. But how do we decide whether or not they "agree"? That is one place where Statistics comes into play. In short:
- predictions made using this theory: \(P(X=1)=0.167\), \(\mu=6.0\), …
- compare the predictions with the results of the experiment: \(P(X=1)=0.167\) (theory) vs. \(P(X=1)=0.182\) (experiment); \(\mu=6\) (theory) vs. \(\overline{X}=6.05\) (experiment)
- do they agree, or is the theory bad? (one formal way to decide is sketched below)
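Statistics provides formal procedures for this kind of decision, for example a chi-square goodness-of-fit test. A minimal sketch, reusing the simulated x and lumping all runs with six or more rolls into one category so that the probabilities sum to 1:

obs <- table(factor(pmin(x, 6), levels = 1:6))  # counts for 1-5 rolls, 6+ lumped together
probs <- c(dgeom(0:4, 1/6), 1 - pgeom(4, 1/6))  # fair-die probabilities, tail lumped
chisq.test(obs, p = probs)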
Note: most "theories" we look at are not big scientific theories but simple things like "our new drug works better than the currently available one". Well, if this is a new drug for cancer, maybe it is a pretty big theory after all!
Say a certain population is known to be described by a normal distribution with mean 100 and standard deviation 30. From textbooks we can find the following information: the first quartile is \(q_{0.25}\approx 79.8\), \(P(X<80)\approx 0.252\) and \(P(85<X<115)\approx 0.383\).
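These textbook values can also be computed directly in R; a quick sketch:

qnorm(0.25, 100, 30)  # first quartile, about 79.8
pnorm(80, 100, 30)  # P(X < 80), about 0.252
pnorm(115, 100, 30) - pnorm(85, 100, 30)  # P(85 < X < 115), about 0.383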
We can get simulated data from a normal distribution with mean 100 and standard deviation 30 with
x <- rnorm(10000, 100, 30)
round(mean(x), 1)
## [1] 99.8
round(sd(x), 1)
## [1] 29.7
round(quantile(x, 0.25), 1)
## 25%
## 80
sum(x<80)/10000
## [1] 0.2501
sum(85 < x & x < 115)/10000
## [1] 0.3855
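Once again the statistics computed from the simulated sample are quite close to the corresponding population parameters.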