Chapter 5: Probability

Why Probability?

Case Study: Is this Coin Fair?

Let’s say you have a specific coin and you want to see whether it is actually a fair coin, or whether it comes up heads more than tails. So you sit down and flip the coin 1000 times and get \(N\) heads.

What do you think \(N\) should be for this to NOT be a fair coin? Because (for some reason) we only care about the case where we get more heads than tails, obviously we need \(N>500\) before we would conclude that this is an unfair coin. But just as clearly, (say) \(N=510\) would not be convincing enough to call this an unfair coin.

But why not?

Simply because a fair coin can easily give \(510\) heads in \(1000\) flips; that is, the probability of \(510\) heads in \(1000\) flips of a fair coin is not so small.

Actually, that probability is very small (\(0.0207\)), but there are many possible outcomes (478, 501, 512, 523 and so on), and so the probability of any one of them has to be small. For example, 500 heads in 1000 flips has a probability of 0.0252, even though it is the most likely outcome. And relative to 0.0252, 0.0207 is large.
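
These values are easy to check in R: the number of heads in 1000 flips of a fair coin has a binomial distribution, and dbinom computes exactly these probabilities:

dbinom(510, 1000, 0.5)  # P(exactly 510 heads) = 0.0207
dbinom(500, 1000, 0.5)  # P(exactly 500 heads) = 0.0252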

Because of this, what we will calculate is the probability of \(N\) or more heads in 1000 flips of a fair coin. Now we find

N or more   Probability
500         0.513
505         0.388
510         0.274
515         0.180
520         0.109
525         0.061
530         0.031
535         0.015
540         0.006
545         0.002
550         0.001

and so somewhere around \(N=530\) these probabilities get very small; it is then more reasonable to think that the coin is actually not fair.
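
The table itself is just as easy to reproduce: pbinom gives the probability of at most a given number of heads, so one minus it gives the probability of \(N\) or more. A quick sketch:

N <- seq(500, 550, by = 5)
round(1 - pbinom(N - 1, 1000, 0.5), 3)  # P(N or more heads), matching the table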

The topics we have discussed so far in this class (making graphs, computing things like the mean or the median) go under the heading of descriptive statistics. What we want to do now is make a guess about what the true “state of nature” is based on the available information, namely decide whether this is a fair coin or not. This type of problem goes under the heading of inferential statistics. As we just saw, this generally means calculating some probabilities.

By the way, eventually we will call the probabilities in the table p-values.

Statistically Significant

In Statistics you often hear statements like:

“The new drug is statistically significantly better than the old one”

What does this mean?

The statement “The new drug is statistically significantly better than the old one” means that an experiment was carried out and the new drug performed better than the old one. More than that, if the same experiment were repeated (with different subjects), we would expect the new drug to perform better again with high probability (90%? 99%?).

Probability Distributions

A probability distribution is a model for a real-life experiment.

Example We roll a fair die:

Roll   Probability
1      1/6
2      1/6
3      1/6
4      1/6
5      1/6
6      1/6
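
As a quick sanity check one can simulate this experiment in R; with many rolls the observed proportions should come out close to 1/6:

rolls <- sample(1:6, 10000, replace = TRUE)  # simulate 10000 rolls of a fair die
table(rolls) / 10000                         # observed proportions, each near 1/6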

Example We flip a fair coin 5 times. Consider the number of heads.

Number of Heads   Probability
0                 0.031
1                 0.156
2                 0.312
3                 0.312
4                 0.156
5                 0.031
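
The number of heads in 5 flips is again binomial, so this table can be reproduced with dbinom:

round(dbinom(0:5, 5, 0.5), 3)  # 0.031 0.156 0.312 0.312 0.156 0.031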

Example We roll a fair die until the first six appears. How many rolls are needed?

This one is a bit more complicated because in theory it might take a very long time until we get a six (though in practice it won’t). Let’s do some calculations:

Prob(six on first roll) = 1/6

Prob(first six on second roll) = Prob(no six on first roll, a six on second roll) = \(5/6\times 1/6\)

Prob(first six on third roll) = Prob(no six on first two rolls, a six on third roll) = \(5/6 \times 5/6 \times 1/6\)

It’s easy to guess the general case:

Prob(first six on kth roll) = \((5/6)^{k-1}\times 1/6\), k=1,2,...

So, for example,

Prob(first six on the 5th roll) = P(X=5) = \((5/6)^{5-1}\times 1/6 = 0.0804\)
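
This distribution is called the geometric distribution, and R has a function for it. One small twist: R's dgeom counts the number of failures before the first success, not the total number of rolls, so the 5th roll corresponds to 4 failures:

dgeom(4, 1/6)    # P(4 non-sixes, then a six on roll 5) = 0.0804
(5/6)^4 * (1/6)  # the same probability computed directly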

App:

To illustrate this experiment run the app

run.app(geometric)

Probability distributions describe populations. The distribution in the first example tells us everything we might want to know about the population of all possible outcomes of the experiment “roll a fair die”. There are formulas for all sorts of things. For example, say we roll the die a million times and keep track of the rolls. Then we find the mean. What would it be? Theory tells us it is 3.5. How about the standard deviation? It would be 1.7. How do we know this? Because we can do some math and calculate them from the distribution.
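
The math in this case is just a weighted average; here is what the calculation looks like in R:

x <- 1:6                            # possible rolls
p <- rep(1/6, 6)                    # their probabilities
mu <- sum(x * p)                    # population mean: 3.5
sigma <- sqrt(sum((x - mu)^2 * p))  # population standard deviation: about 1.7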

There is a fairly small list of “basic” distributions that cover a wide variety of everyday experiments and random phenomena. Here is one:

Case Study: Bernoulli trials

The most basic experiment possible is one that has only two possible outcomes:

  • flip a coin - heads or tails
  • roll a die - get a six or don’t
  • take a class - pass or fail
  • person smokes - yes or no
  • person has open heart surgery - person survives or dies

Any such experiment is called a Bernoulli trial, named after the Swiss mathematician Jakob Bernoulli (1655-1705).

Note: often one of the two outcomes is called a “success” and the other one a “failure”. But for the calculations it makes no difference which is which. This sometimes leads to a bit of nastiness:

  • person has open heart surgery - person survives (=failure) or dies (=success)

Usually we “code” one outcome as \(x=0\) (=failure) and the other as \(x=1\) (=success). Then the distribution is given by

x   Prob(x)
0   1-p
1   p

Note that this does not describe the experiment completely but only up to the value of p. This lets us “tailor” the distribution to specific experiments:

Example: Flip a fair coin, Heads = success = 1: p = 0.5

Example: Roll a fair die, “get a six” = success = 1: p=1/6

Example: Choose employee from WRInc, employee is female = success = 1, p=9510/23791=0.3997

Example: Choose people from some population, person has a genetic condition = success = 1: p=0.015
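
In R a Bernoulli trial can be simulated with rbinom using size = 1; a small sketch using the coin example's p = 0.5:

rbinom(10, size = 1, prob = 0.5)  # ten Bernoulli trials, coded as 0 (failure) / 1 (success)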

The number p then is a number that belongs to the population described by the distribution: it is a parameter!

Once we have a distribution we can find formulas for the population mean and standard deviation. For the Bernoulli distribution we have

  • Population Mean: \(\mu =p\)

  • Population Standard Deviation: \(\sigma = \sqrt{p(1-p)}\)
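
Both formulas follow directly from the distribution table above: \(\mu = 0\times (1-p) + 1\times p = p\), and \(\sigma^2 = (0-p)^2 (1-p) + (1-p)^2 p = p(1-p)(p+1-p) = p(1-p)\), so \(\sigma = \sqrt{p(1-p)}\).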

Notice how both the mean and the standard deviation depend on p alone. This will have some consequences later.

Continuous Distributions

Case Study: Weight of Subjects

Consider a weight loss study. At the beginning and at the end all subjects are weighed, and the difference in weight is found. What can we say about the population of weight differences?

One big difference from the examples above is that now we will get a great many different outcomes. In fact, if we had a very precise scale, all the numbers might be different. In such a case a table like those above no longer works (it would be HUGE). Instead we typically look at graphs such as this one:

The idea behind such a curve is simple: the probability for an outcome to be in a certain range is the area under the curve!

So if this curve describes the differences in weights (after-before), the people who lost weight are left of 0, so the probability that a randomly selected person on this program loses weight would be this area:

How can such an area be found? To do this in general we need calculus and more, but for some standard cases R can do it for us. In this case it would be 0.812, so a little over 80 percent of the participants can expect to lose some weight.
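
As a minimal sketch of how this works in R: if the weight differences followed, say, a normal curve with mean -2 and standard deviation 2.25 (hypothetical numbers, chosen only for illustration, not the actual study values), the area to the left of 0 would be

pnorm(0, mean = -2, sd = 2.25)  # P(difference < 0), about 0.81 with these made-up values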