Now we can go back to Statistics. To begin with, recall the following:
- **Population**: all of the entities (people, events, things, etc.) that are the focus of a study.
- **Sample**: any subset of the population.
- **Parameter**: any numerical quantity associated with a population.
- **Statistic**: any numerical quantity associated with a sample.
After our discussion of probability, we can now be a little more precise.
Say we roll a fair die until the first time we get a six. We are always in one of two situations:
We have not actually rolled any die; maybe we don't even have a die, and we are just studying this exercise theoretically. That is, we are studying the population of all possible outcomes of the experiment "number of rolls of a fair die until the first six". We then have a theoretical description of this experiment, namely its distribution.
For the first few values of the number of rolls needed we find:
Rolls Needed | Probability |
---|---|
1 | 0.167 |
2 | 0.139 |
3 | 0.116 |
4 | 0.096 |
5 | 0.080 |
6 | 0.067 |
There are formulas for all sorts of numbers for various distributions. For ours, with success probability \(p=1/6\), we have

\[P(X=k)=(1-p)^{k-1}p, \qquad \mu=\frac{1}{p}=6, \qquad \sigma=\frac{\sqrt{1-p}}{p}=\sqrt{30}\approx 5.48\]

and so on. But because they are computed for the whole population, they are parameters.
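These values can also be found with R's built-in routines for the geometric distribution; a quick sketch (note that dgeom counts the failures before the first six, hence the shift by one):

p <- 1/6
round(dgeom(0:5, p), 3)  # P(X = k) for k = 1, ..., 6 rolls
1/p  # population mean: 6
sqrt(1 - p)/p  # population standard deviation: about 5.48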
On the other hand, we can study this exercise by actually rolling a fair die many times and observing what happens. Actually (a lot faster and less work!) we can run a simulation. The distribution describing this experiment is called a geometric distribution, and we can generate data with rgeom(B, 1/6)+1 (the +1 is needed because rgeom counts the failures before the first six, not the number of rolls):
B <- 10000  # number of simulated repetitions of the experiment
x <- rgeom(B, 1/6) + 1  # simulated numbers of rolls until the first six
Now we can use this data set to estimate probabilities:
round(table(x)/B, 3)
## x
## 1 2 3 4 5 6 7 8 9 10 11 12
## 0.165 0.130 0.116 0.100 0.078 0.069 0.060 0.045 0.043 0.033 0.028 0.021
## 13 14 15 16 17 18 19 20 21 22 23 24
## 0.018 0.017 0.014 0.008 0.011 0.008 0.006 0.006 0.005 0.003 0.002 0.002
## 25 26 27 28 29 30 31 32 33 34 35 36
## 0.002 0.002 0.002 0.001 0.002 0.001 0.001 0.000 0.000 0.000 0.000 0.000
## 37 39 41 44 45
## 0.000 0.000 0.000 0.000 0.000
and of course we can find the summary statistics:
round(mean(x), 2)
## [1] 6.05
round(sd(x), 2)
## [1] 5.45
round(quantile(x, c(0.75, 0.95)), 2)
## 75% 95%
## 8 17
Because these are computed from a sample, they are statistics.
The really powerful idea here is to combine these two approaches. Say we have a die that we suspect is not fair, and we wish to test this. So we roll the die and then compare the empirical results with the theoretical ones. For our die we find (for one run of the simulation):
Rolls Needed | Theory | Sample |
---|---|---|
1 | 0.167 | 0.165 |
2 | 0.139 | 0.130 |
3 | 0.116 | 0.116 |
4 | 0.096 | 0.100 |
5 | 0.080 | 0.078 |
6 | 0.067 | 0.069 |
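Such a comparison table can be built in R from the simulated x above; a sketch (the sample column will of course vary from run to run):

k <- 1:6
theory <- round(dgeom(k - 1, 1/6), 3)  # theoretical P(X = k)
observed <- round(as.vector(table(x)[as.character(k)])/B, 3)  # observed proportions
data.frame(Rolls = k, Theory = theory, Sample = observed)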
or we can do this by looking at some summaries:

Summary | Theory | Sample |
---|---|---|
Mean | 6.00 | 6.05 |
Standard Deviation | 5.48 | 5.45 |
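The same comparison in R, again reusing the simulated x (a small sketch):

round(rbind(Theory = c(Mean = 6, SD = sqrt(30)),
            Sample = c(Mean = mean(x), SD = sd(x))), 2)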
It seems our die is pretty much a fair die.
The most important feature of the scientific method is that any scientific theory has to be falsifiable; that is, it has to be possible to carry out experiments and compare their results to predictions made by the theory. If they agree, the theory looks good; if not, we need to change the theory or even find a new one. But how do we decide whether or not they "agree"? That is one place where Statistics comes into play. In short:
- predictions made using this theory: \(P(X=1)=0.167\), \(\mu=6.0\), …
- compare the predictions with the results of the experiment: \(P(X=1)=0.167\) (theory) vs. \(P(X=1)=0.182\) (experiment); \(\mu=6\) (theory) vs. \(\overline{X}=6.05\) (experiment)
- do they agree, or is the theory bad? (one formal way to decide is sketched below)
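Statistics provides formal procedures for this kind of decision, for example a chi-square goodness-of-fit test. A minimal sketch, reusing the simulated x and lumping all runs with six or more rolls into one category so that the probabilities sum to 1:

obs <- table(factor(pmin(x, 6), levels = 1:6))  # counts for 1-5 rolls, 6+ lumped together
probs <- c(dgeom(0:4, 1/6), 1 - pgeom(4, 1/6))  # fair-die probabilities, tail lumped
chisq.test(obs, p = probs)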
Note: most "theories" we look at are not big scientific theories but simple things like "our new drug works better than the currently available one". Well, if this is a new drug for cancer, maybe it is a pretty big theory after all!
Say a certain population is known to be described by a normal distribution with mean 100 and standard deviation 30. From textbooks we can find the following information: the first quartile is \(q_{0.25}\approx 79.8\), \(P(X<80)\approx 0.252\) and \(P(85<X<115)\approx 0.383\).
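These textbook values can also be computed directly in R; a quick sketch:

qnorm(0.25, 100, 30)  # first quartile, about 79.8
pnorm(80, 100, 30)  # P(X < 80), about 0.252
pnorm(115, 100, 30) - pnorm(85, 100, 30)  # P(85 < X < 115), about 0.383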
We can get simulated data from a normal distribution with mean 100 and standard deviation 30 with
x <- rnorm(10000, 100, 30)
round(mean(x), 1)
## [1] 99.8
round(sd(x), 1)
## [1] 29.7
round(quantile(x, 0.25), 1)
## 25%
## 80
sum(x<80)/10000
## [1] 0.2501
sum(85 < x & x < 115)/10000
## [1] 0.3855
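Once again the statistics computed from the simulated sample are quite close to the corresponding population parameters.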