Estimation

Point Estimation

Here is the type of problem we often see in Statistics:

Example We want to know the percentage \(\pi\) of students at the Colegio who like the food in the Cafeteria. \(\pi\) is the percentage for all students, that is for the whole population, so \(\pi\) is a parameter.

For each student there are two possibilities: he/she likes the food or he/she does not. So if we randomly select a student, the response is a Bernoulli trial. If we randomly select n students, we have a sample of n Bernoulli trials.

We know that for a Bernoulli trial with parameter \(\pi\) the population mean \(\mu = \pi\).

If we did a good job when selecting the students for the sample, then we would hope that the sample mean \(\overline{X}\) is close to the population mean \(\mu\), that is

\(\overline{X} \sim \mu = \pi\) (roughly)

and so we can estimate the parameter \(\pi\) with \(\overline{X}\)

Confused? For once I made something simple look complicated on purpose. All I just told you is this: if you ask 500 people whether they like Coca-Cola and 300 say yes, your best guess for the proportion of people who like Coca-Cola is 300/500! Here \(\overline{X} = 300/500 = 0.6\).
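
We can check this logic with a quick simulation; here 0.6 plays the role of the (in real life unknown) true proportion:

x <- rbinom(500, 1, 0.6)  # 500 Bernoulli trials, 1 = “likes it”
mean(x)  # sample proportion, should come out close to 0.6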

The point is that this simple idea is useful in much more complicated cases as well.



Often in Statistics we use Greek letters for parameters and regular letters with a hat or a bar for estimators.



What we have done is the following:

  • we decided that the population can be described by a certain distribution, but with an unknown parameter

  • probability theory gives us a formula that connects the parameter to a population quantity (\(\mu = \pi\))

  • Statistics gives us the corresponding sample statistic (\(\overline{X}\))

  • we combine the two to estimate the parameter (\(\pi\sim \overline{X}\))

Each population parameter has a corresponding statistic and vice versa.

Sample mean - population mean
Sample percentage - population percentage
Sample median - population median
Sample standard deviation - population standard deviation
Sample 1st quartile - population 1st quartile
etc ….

Each of these sample numbers is called a point estimate.
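
For a numeric sample x in R all of these are one-liners:

mean(x)  # sample mean
median(x)  # sample median
sd(x)  # sample standard deviation
quantile(x, 0.25)  # sample 1st quartile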

Example: for data from a geometric distribution we have

\[ \begin{aligned} \mu &= \frac{1}{\pi} \\ \sigma &= \frac{\sqrt{1-\pi }}{\pi} \end{aligned} \]

B <- 10000
p <- 0.5
x <- rgeom(B, p) + 1  # rgeom counts the failures before the first success, so add 1 to get the number of trials
round(c(mean(x), 1/p, sd(x), sqrt(1-p)/p), 3)
## [1] 2.003 2.000 1.428 1.414
B <- 10000
p <- 0.1
x <- rgeom(B, p) +1 
round(c(mean(x), 1/p, sd(x), sqrt(1-p)/p), 3)
## [1]  9.853 10.000  9.515  9.487
B <- 10000
p <- 0.8
x <- rgeom(B, p) + 1 
round(c(mean(x), 1/p, sd(x), sqrt(1-p)/p), 3)
## [1] 1.250 1.250 0.559 0.559

So let’s say you know that the following is a sample from a geometric distribution:

x
##  [1]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
## [24]  2  2  2  2  2  3  3  3  4  4  4  4  4  4  4  5  5  5  5  6  7  7  8
## [47]  9 10 14 14

and you want to estimate the success probability p. We have

\[ \mu=1/p \rightarrow p=1/\mu \sim 1/ \bar{x} \] therefore a point estimate for p is

round(1/mean(x), 3)
## [1] 0.303

In principle we could also have used the formula for \(\sigma\), but then we would have to solve a quadratic equation…
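
Just to see what that would look like: \(\sigma = \sqrt{1-p}/p\) is equivalent to \(\sigma^2 p^2 + p - 1 = 0\), and the quadratic formula gives \(p = \left(-1+\sqrt{1+4\sigma^2}\right)/(2\sigma^2)\). Plugging in the sample standard deviation:

s <- sd(x)
round((-1 + sqrt(1 + 4*s^2))/(2*s^2), 3)  # alternative point estimate for p
## [1] 0.266

Notice this is not the same as 0.303; different statistics generally lead to somewhat different estimates of the same parameter.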

Interval Estimation

In real life a point estimate is rarely enough; usually we also need some estimate of the error in our estimate.

Example Again we did a survey of the undergraduate students at the Colegio. For that we interviewed 150 randomly selected students, and found a (sample) mean GPA of 2.53. We really want to know the (population) mean GPA of all the undergraduates at the Colegio.

Now the “2.53” is the mean GPA specifically for the sample we collected; if we repeated the whole process and got a different sample, we would also get a different sample mean. Let’s say the (population) mean GPA for all the undergraduates is \(\mu_{GPA}\). It is pretty clear that \(\mu_{GPA} \ne 2.53\), but hopefully \(\mu_{GPA}\) is close to 2.53. Only, how close?

One way to answer such questions is to find an interval estimate rather than a point estimate. Specifically we will consider a type of interval estimate called a confidence interval.

We will learn about confidence intervals using the mean \(\mu\) as an example. Here the formal definition is

A 100(1-\(\alpha\))% confidence interval for the population mean \(\mu\) is given by

\[ \overline{X} \pm t_{n-1, \alpha/2}\frac{s}{\sqrt{n}} \]

First notice that the interval is given in the form point estimate \(\pm\) error, which is quite often true in Statistics, although not always.

We already know all the ingredients of this formula with the exception of \(t_{n-1, \alpha/2}\). You can think of this as a mathematical adjustment factor. At any rate, R will do all of this for us:

Example

Say in our survey we found a sample mean GPA of 2.53 with a standard deviation of 0.65. Find a 90% confidence interval for the mean GPA.

This is done with

one.sample.t(2.53, shat=0.65, n=150, conf.level=90, ndigit=3)
## A 90% confidence interval for the population mean is (2.442, 2.618)

and so our 90% confidence interval is

\((2.442, 2.618)\)
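
By the way, there is nothing mysterious going on here; we can reproduce the interval with base R. We have \(\alpha = 0.10\), so we need the quantile \(t_{149, 0.05}\), which R calls qt(0.95, 149):

round(2.53 + c(-1, 1)*qt(0.95, 149)*0.65/sqrt(150), 3)
## [1] 2.442 2.618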

Note the \(90\%\) is called the confidence level.



What does that mean: a 90% confidence interval for the mean is \((2.442, 2.618)\)? The interpretation is this:

suppose that over the next year statisticians (and other people using statistics) all over the world compute 100,000 90% confidence intervals, many for the mean, others maybe for medians or standard deviations or other parameters. Then about 90%, or about 90,000, of those intervals will actually contain the parameter that is supposed to be estimated; the other 10,000 or so will not.

Let’s illustrate all this:

App:

run.app(confint)

This app generates 50 observations from a N(100, 25) distribution and finds the 90% confidence interval. By clicking on the run button we get a few examples.

We can change the input parameters and see the effects.

Then we can switch to the many examples tab, where we get the interval for 100 examples. For each we see the interval and whether it is good or bad. Finally we see the percentage of good intervals, which should match (somewhat) the chosen confidence level.
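
If you do not have the app at hand, a small simulation shows the same thing (a minimal sketch, reading N(100, 25) as mean 100 and standard deviation 25):

B <- 1000
good <- 0
for (i in 1:B) {
  x <- rnorm(50, 100, 25)  # a sample whose true mean we know is 100
  I <- mean(x) + c(-1, 1)*qt(0.95, 49)*sd(x)/sqrt(50)  # 90% confidence interval
  if (I[1] < 100 & 100 < I[2]) good <- good + 1  # did the interval catch 100?
}
good/B  # should come out close to 0.90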


But why would we be willing to accept a 10% chance of being “wrong”, that is, of getting an interval that does not contain the true parameter? Well, we don’t have to; after all, we chose to compute a 90% confidence interval. Instead we could have found a 99% confidence interval and only leave a 1% chance of being “wrong”.

limits90 <- one.sample.t(2.53, shat=0.65, n=150, 
     conf.level=90, ndigit=3, return.result = TRUE)
limits99 <- one.sample.t(2.53, shat=0.65, n=150, 
     conf.level=99, ndigit=3, return.result = TRUE)
limits90
##   Low  High 
## 2.442 2.618
limits99
##   Low  High 
## 2.392 2.668

Notice the lengths of these intervals:

limits90[2]-limits90[1]
## [1] 0.176
limits99[2]-limits99[1]
## [1] 0.276

So the \(99\%\) interval is larger!

Here are five of them:

## Length of  68 % confidence interval:  0.106 
## Length of  90 % confidence interval:  0.176 
## Length of  95 % confidence interval:  0.210 
## Length of  99 % confidence interval:  0.276 
## Length of  99.5 % confidence interval:  0.302
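
These lengths come straight from the formula: a 100(1-\(\alpha\))% interval has length \(2 t_{n-1, \alpha/2}\, s/\sqrt{n}\), so we can compute them all at once (up to rounding in the third digit this reproduces the lengths above):

conf.levels <- c(68, 90, 95, 99, 99.5)
alpha <- 1 - conf.levels/100
round(2*qt(1 - alpha/2, 149)*0.65/sqrt(150), 3)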

So finding confidence intervals involves a trade-off: if we make the probability of being wrong smaller we (almost always) make the interval larger.

The only way to make an interval smaller without changing the confidence level is to get a larger data set!
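
For example, keeping the confidence level and standard deviation fixed but quadrupling the sample size cuts the length of the interval roughly in half, because of the \(\sqrt{n}\) in the denominator:

n <- c(150, 600)
round(2*qt(0.95, n - 1)*0.65/sqrt(n), 3)  # lengths of the 90% intervals; the second is about half the first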