Basic Concepts of Statistics

Statistics is of course a whole branch of science all by itself. Here we will just introduce a few basic ideas.

Probability theory and Statistics have in common that both start with a probability model. Typically such models have parameters, for example the success probability p of a Bernoulli random variable (rv) or the rate \(\lambda\) of an exponential distribution. In probability theory we then have problems such as: if p=0.2, what is the mean of the Bernoulli rv? In other words, we assume we know the parameters and then ask questions about possible outcomes.

In Statistics it is exactly the other way around: we have already observed outcomes from an rv and we are asking what the parameters might be.

The Likelihood Function

Definition (3.7.1)

Let \(\pmb{X}=(X_1,..,X_n)\) be a random vector with joint density \(f(x_1,..,x_n\vert \pmb{\theta})\) where \(\pmb{\theta}\) is a vector of parameters. Then

\[L(\pmb{\theta} \vert x_1,..,x_n)=f(x_1,..,x_n\vert \pmb{\theta})\] is called the likelihood function.

If \(X_1,..,X_n\) are independent, then

\[L(\pmb{\theta}\vert \pmb{x})=\prod_{i=1}^n f(x_i\vert \pmb{\theta})\]

Example (3.7.2)

\(X_i\sim N(\mu, \sigma^2)\) independent, then

\[ \begin{aligned} &L(\mu, \sigma^2\vert \pmb{x})=\prod_{i=1}^n f(x_i\vert \mu, \sigma^2) = \\ &\prod_{i=1}^n \frac1{\sqrt{2\pi \sigma^2}} \exp \left\{-\frac1{2\sigma^2}(x_i-\mu)^2 \right\} = \\ & (2\pi\sigma^2)^{-n/2} \exp \left\{-\frac1{2\sigma^2} \sum_{i=1}^n(x_i-\mu)^2 \right\} \end{aligned} \]
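This likelihood is easy to evaluate in R. Here is a small sketch for simulated data; the choices n=20, \(\mu=5\) and \(\sigma=2\) are made up purely for illustration:

set.seed(111)
x=rnorm(20, mean=5, sd=2)   # simulated data
like=function(mu, sigma2, x) {
   (2*pi*sigma2)^(-length(x)/2)*exp(-sum((x-mu)^2)/(2*sigma2))
}
like(5, 4, x)   # likelihood at the true parameter values
like(0, 4, x)   # much smaller: mu=0 is not plausible for these data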

Point Estimation

Definition (3.7.3)

A function \(T(\pmb{x})\) of the data alone (not of the unknown parameters) is called a statistic; when it is used to estimate a parameter it is called an estimator.

A statistic \(T(\pmb{x})\) is called an unbiased estimator of \(\pmb{\theta}\) if

\[E[T(\pmb{X})]=\pmb{\theta}\]

Example (3.7.4)

\(X_i\sim N(\mu, \sigma^2)\), then \(T(\pmb{x})=\bar{x}\) is an unbiased estimator of \(\mu\) because

\[E[T(\pmb{X})] = E[\bar{X}]=E[\frac1n \sum_{i=1}^n X_i] = \frac1n \sum_{i=1}^n E[X_i]=\mu\]
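Unbiasedness can also be illustrated by simulation: averaging the sample mean over many repeated data sets should recover \(\mu\). The settings below (\(\mu=5\), \(\sigma=2\), n=10) are arbitrary choices for this sketch:

set.seed(112)
xbar=replicate(1e4, mean(rnorm(10, mean=5, sd=2)))   # 10000 sample means
mean(xbar)   # should be close to the true mu=5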

Maximum Likelihood

Definition (3.7.5)

Let \(\pmb{\hat{\theta}}=\text{argmax} \left\{L(\pmb{\theta}\vert \pmb{x}); \pmb{\theta} \right\}\), then \(\pmb{\hat{\theta}}\) is called the maximum likelihood estimator (MLE) of \(\pmb{\theta}\).

Example (3.7.6)

\[ \begin{aligned} &L(\mu, \sigma^2\vert \pmb{x}) = (2\pi\sigma^2)^{-n/2} \exp \left\{-\frac1{2\sigma^2} \sum_{i=1}^n(x_i-\mu)^2 \right\}\\ &\log L(\mu, \sigma^2\vert \pmb{x}) = -\frac{n}2\log (2\pi\sigma^2) -\frac1{2\sigma^2} \sum_{i=1}^n(x_i-\mu)^2\\ &\frac{d\log L(\mu, \sigma^2\vert \pmb{x})}{d \mu} = \frac1{\sigma^2} \sum_{i=1}^n(x_i-\mu) = 0\\ &\hat{\mu}=\bar{x} \end{aligned} \]
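The same answer can be found numerically by maximizing the log-likelihood. A minimal sketch for simulated data, with \(\sigma^2=4\) treated as known and n=25 and \(\mu=5\) made up for illustration:

set.seed(113)
x=rnorm(25, mean=5, sd=2)   # simulated data, sigma^2=4 treated as known
logl=function(mu) {
   -length(x)/2*log(2*pi*4) - sum((x-mu)^2)/(2*4)
}
optimize(logl, interval=c(0, 10), maximum=TRUE)$maximum
mean(x)   # the numerical maximizer agrees with the sample mean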

Confidence Interval

Definition (3.7.7)

A (random) interval of the form \((L(\pmb{X}),U(\pmb{X}))\) is called a \((1-\alpha)100\%\) confidence interval for \(\theta\) if

\[P(L(\pmb{X})<\theta<U(\pmb{X}))\ge 1-\alpha\]

Example (3.7.8)

\(X_i\sim N(\mu, \sigma^2)\), \(\sigma^2\) known, then

\[ \begin{aligned} &L(\pmb{x}) = \bar{x}-z_{\alpha/2}\sigma/\sqrt{n}\\ &U(\pmb{x}) = \bar{x}+z_{\alpha/2}\sigma/\sqrt{n}\\ \end{aligned} \] is a \((1-\alpha)100\%\) confidence interval for \(\mu\). Here \(z_\alpha\) is the upper \(\alpha\) percentile of a standard normal distribution, that is, \(P(Z>z_\alpha)=\alpha\) for \(Z\sim N(0,1)\).

\[ \begin{aligned} &P(L(\pmb{X})<\theta<U(\pmb{X})) = \\ &P(\bar{X}-z_{\alpha/2}\sigma/\sqrt{n}<\mu<\bar{X}+z_{\alpha/2}\sigma/\sqrt{n}) = \\ &P(\mu-z_{\alpha/2}\sigma/\sqrt{n}<\bar{X}<\mu+z_{\alpha/2}\sigma/\sqrt{n}) = \\ &P(-z_{\alpha/2}<\sqrt{n}\frac{\bar{X}-\mu}{\sigma}<z_{\alpha/2}) =\\ &2\Phi(z_{\alpha/2})-1 = 2(1-\alpha/2)-1=1-\alpha \end{aligned} \]

where \(\Phi\) is the cdf of a standard normal distribution.
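The coverage can also be checked by a small simulation sketch; the settings n=10, \(\mu=5\), \(\sigma=2\) are arbitrary choices for illustration:

set.seed(114)
B=1e4; n=10; mu=5; sigma=2; alpha=0.05
cover=replicate(B, {
   xbar=mean(rnorm(n, mu, sigma))
   L=xbar-qnorm(1-alpha/2)*sigma/sqrt(n)
   U=xbar+qnorm(1-alpha/2)*sigma/sqrt(n)
   L<mu & mu<U   # did the interval catch the true mu?
})
mean(cover)   # should be close to 1-alpha = 0.95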

Hypothesis Test

Definition (3.7.9)

In testing \(H_0:\theta=\theta_0\)

  • the error of the first kind is to reject the null hypothesis although it is true. The probability of the error of the first kind is denoted by \(\alpha\).

  • the error of the second kind is to fail to reject the null hypothesis although it is false. The probability of the error of the second kind is denoted by \(\beta\). \(1-\beta\) is called the power of the test.

  • The p-value of a test is the probability, computed under the assumption that the null hypothesis is true, of observing a value of the test statistic as extreme as (or more extreme than) the one just observed.

Example (3.7.10)

\(X_i\sim N(\mu, \sigma^2)\), \(\sigma^2\) known, then a test for \(H_0:\mu=\mu_0\) is to reject the null hypothesis if \(Z=\vert\sqrt{n}\frac{\bar{x}-\mu_0}{\sigma}\vert\) is greater than \(z_{\alpha/2}\).

So

\[ \begin{aligned} &P(\text{reject } H_0\vert H_0 \text{ true}) = \\ &P(Z>z_{\alpha/2}\vert \mu=\mu_0) = \\ &1-P(\vert\sqrt{n}\frac{\bar{x}-\mu_0}{\sigma}\vert<z_{\alpha/2}\vert \mu=\mu_0) = \\ &1-P(-z_{\alpha/2}<\sqrt{n}\frac{\bar{x}-\mu_0}{\sigma}<z_{\alpha/2}\vert \mu=\mu_0) = \\ &1-(1-\alpha)=\alpha \end{aligned} \] and

\[ \begin{aligned} &\beta=P(\text{fail to reject } H_0\vert H_0 \text{ false}) = \\ &P(Z<z_{\alpha/2}\vert \mu=\mu_1) = \\ &P(\vert\sqrt{n}\frac{\bar{x}-\mu_0}{\sigma}\vert<z_{\alpha/2}\vert \mu=\mu_1) = \\ &P(-z_{\alpha/2}<\sqrt{n}\frac{\bar{x}-\mu_0}{\sigma}<z_{\alpha/2}\vert \mu=\mu_1) = \\ &P(-z_{\alpha/2}<\sqrt{n}\frac{\bar{x}-\mu_1+\mu_1-\mu_0}{\sigma}<z_{\alpha/2}\vert \mu=\mu_1) = \\ &P(-z_{\alpha/2}<\sqrt{n}\frac{\bar{x}-\mu_1}{\sigma}+\sqrt{n}\frac{\mu_1-\mu_0}{\sigma}<z_{\alpha/2}\vert \mu=\mu_1) = \\ &P(-z_{\alpha/2}-\sqrt{n}\frac{\mu_1-\mu_0}{\sigma}<\sqrt{n}\frac{\bar{x}-\mu_1}{\sigma}<z_{\alpha/2}-\sqrt{n}\frac{\mu_1-\mu_0}{\sigma}\vert \mu=\mu_1) = \\ &\Phi(z_{\alpha/2}-\sqrt{n}\frac{\mu_1-\mu_0}{\sigma})-\Phi(-z_{\alpha/2}-\sqrt{n}\frac{\mu_1-\mu_0}{\sigma}) \end{aligned} \]

and so we see that \(\beta\) (and therefore the power of a test) is a function of the true value of the parameter. It is often shown in the form of a power graph:

Example (3.7.11)

Say

powr=function(mu1, n=10, mu0=0, sigma=1, alpha=0.05) {
   # power (in percent) of the two-sided z test of H0: mu=mu0 when the true mean is mu1
   za=qnorm(1-alpha/2)
   beta=pnorm(za-sqrt(n)*(mu1-mu0)/sigma)-pnorm(-za-sqrt(n)*(mu1-mu0)/sigma)
   100*(1-beta)
}
curve(powr, -2, 2, lwd=2, col="blue")   # power curve as a function of the true mean
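As a sanity check, the value returned by powr can be compared against a simple Monte Carlo simulation of the test. This is only a sketch; the choice \(\mu_1=0.5\) and B=1e4 runs are arbitrary:

set.seed(115)
B=1e4; n=10; mu0=0; sigma=1; alpha=0.05; mu1=0.5
reject=replicate(B, {
   xbar=mean(rnorm(n, mu1, sigma))
   abs(sqrt(n)*(xbar-mu0)/sigma)>qnorm(1-alpha/2)   # did the test reject H0?
})
100*mean(reject)   # simulated power in percent
powr(mu1=0.5)      # formula from above; should agree up to simulation error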

Sample Size and Effect Size

An important question during the planning stage of an experiment is how many observations need to be collected. First of all, this depends on what analysis is going to be done.

  • Confidence Interval

If eventually a confidence interval is to be found, one needs to decide what length of the interval is acceptable. More specifically, one chooses the margin of error E, that is, half the length of the interval. This choice depends on the background of the experiment and on what size of interval is small enough to yield an interesting result.

Example (3.7.12)

\(X_i\sim N(\mu, \sigma^2)\), \(\sigma^2\) known, then we have the interval

\[\left(\bar{x}-z_{\alpha/2}\sigma/\sqrt{n},\bar{x}+z_{\alpha/2}\sigma/\sqrt{n}\right)\]

so \(E=z_{\alpha/2}\sigma/\sqrt{n}\), which gives \(n=\left(\frac{z_{\alpha/2}\sigma}{E}\right)^2\)

Say we want E=0.1 and we know \(\sigma=1\), then

round((qnorm(0.975)*1/0.1)^2)   # n = (z*sigma/E)^2 with z = qnorm(0.975), sigma = 1, E = 0.1
## [1] 384
  • Hypothesis Test

If we plan on carrying out a hypothesis test, we need to decide what power the test is supposed to have; a typical choice is at least 80%. From the calculation above it is clear that we also need \(\mu_1\). \(|\mu_0-\mu_1|\) is often called the effect size, the smallest deviation from the null hypothesis that is of practical interest.

Example (3.7.13)

\(X_i\sim N(\mu, \sigma^2)\), \(\sigma^2\) known, and say we decide to use \(\mu_1=0.1\). We have the equation

\[\beta = \Phi(z_{\alpha/2}-\sqrt{n}\frac{\mu_1-\mu_0}{\sigma})-\Phi(-z_{\alpha/2}-\sqrt{n}\frac{\mu_1-\mu_0}{\sigma})\]

which we want to solve for n. This has to be done numerically:

n=1
repeat {
   n=n+1   # increase n until the power (in percent) exceeds 80
   if(powr(mu1=0.1, n=n)>80) break
}
n
## [1] 785
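As a cross-check: the term \(\Phi(-z_{\alpha/2}-\sqrt{n}\frac{\mu_1-\mu_0}{\sigma})\) is usually negligible, and if it is dropped the power requirement \(1-\beta=0.8\) can be solved for n analytically, giving the familiar approximation \(n\approx\left(\frac{(z_{\alpha/2}+z_{\beta})\sigma}{|\mu_1-\mu_0|}\right)^2\), where \(z_\beta\) is the upper \(\beta\) percentile of a standard normal distribution. A quick check of this approximation:

ceiling(((qnorm(0.975)+qnorm(0.8))*1/0.1)^2)   # agrees with the search above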

Bayesian Analysis

An entirely different approach to statistics is the Bayesian one. Here one treats the parameters as random variables with a distribution called the prior, and then uses probability theory to combine the likelihood and the prior into the posterior distribution.

Example (3.7.14)

\(X_i\sim N(\mu, \sigma^2)\), \(\sigma^2\) known, and we use \(\pi(\mu)=1\) as the prior. Note that this is not a proper density because \(\int_{-\infty}^\infty 1\, d\mu=\infty\). This is acceptable as long as the posterior turns out to be a proper distribution.

First note that

\[ \begin{aligned} &\sum_{i=1}^n(x_i-\mu)^2 = \sum_{i=1}^n(x_i-\bar{x}+\bar{x}-\mu)^2 = \\ &\sum_{i=1}^n\left[(x_i-\bar{x})^2+2(x_i-\bar{x})(\bar{x}-\mu)+ (\bar{x}-\mu)^2\right] = \\ &\sum_{i=1}^n (x_i-\bar{x})^2+2(\bar{x}-\mu)\sum_{i=1}^n(x_i-\bar{x})+ n(\bar{x}-\mu)^2 = \\ &\sum_{i=1}^n (x_i-\bar{x})^2+ n(\bar{x}-\mu)^2 \end{aligned} \] because \(\sum_{i=1}^n(x_i-\bar{x})=0\), and so

\[ \begin{aligned} &f(\pmb{x},\mu)=(2\pi\sigma^2)^{-n/2} \exp \left\{-\frac1{2\sigma^2} \sum_{i=1}^n(x_i-\mu)^2 \right\}\pi(\mu) = \\ &(2\pi\sigma^2)^{-n/2} \exp \left\{-\frac1{2\sigma^2}\left[ \sum_{i=1}^n (x_i-\bar{x})^2+ n(\bar{x}-\mu)^2 \right]\right\} =\\ &(2\pi\sigma^2)^{-n/2} \exp \left\{-\frac1{2\sigma^2} \sum_{i=1}^n (x_i-\bar{x})^2 \right\}\exp \left\{-\frac{1}{2(\sigma^2/n)} (\mu-\bar{x})^2 \right\} \\ &m(\pmb{x}) =\int_{-\infty}^\infty f(\pmb{x},\mu)\, d\mu=\\ &\int_{-\infty}^\infty (2\pi\sigma^2)^{-n/2} \exp \left\{-\frac1{2\sigma^2} \sum_{i=1}^n (x_i-\bar{x})^2 \right\}\exp \left\{-\frac{1}{2(\sigma^2/n)}(\mu-\bar{x})^2 \right\} d\mu=\\ & \exp \left\{-\frac1{2\sigma^2} \sum_{i=1}^n (x_i-\bar{x})^2 \right\}n^{-n/2}\left(2\pi\sigma^2/n\right)^{-(n-1)/2}\int_{-\infty}^\infty \left(2\pi\sigma^2/n\right)^{-1/2} \exp \left\{-\frac{1}{2(\sigma^2/n)}(\mu-\bar{x})^2 \right\} d\mu=\\ & \exp \left\{-\frac1{2\sigma^2} \sum_{i=1}^n (x_i-\bar{x})^2 \right\}n^{-n/2}\left(2\pi\sigma^2/n\right)^{-(n-1)/2} \end{aligned} \] Here we used \((2\pi\sigma^2)^{-n/2}=n^{-n/2}\left(2\pi\sigma^2/n\right)^{-n/2}\), and the last integral is over a \(N(\bar{x},\sigma^2/n)\) density and therefore equals 1.

Finally

\[ \begin{aligned} &f(\mu\vert \pmb{x}) = \frac{f(\pmb{x}, \mu)}{m(\pmb{x})}=\\ &\frac{\exp \left\{-\frac1{2\sigma^2} \sum_{i=1}^n (x_i-\bar{x})^2 \right\}n^{-n/2} \left(2\pi\sigma^2/n\right)^{-n/2} \exp \left\{-\frac{1}{2(\sigma^2/n)}(\mu-\bar{x})^2 \right\}}{\exp \left\{-\frac1{2\sigma^2} \sum_{i=1}^n (x_i-\bar{x})^2 \right\}n^{-n/2}\left(2\pi\sigma^2/n\right)^{-(n-1)/2}} = \\ &\frac1{\sqrt{2\pi\sigma^2/n}} \exp \left\{-\frac{1}{2(\sigma^2/n)}(\mu-\bar{x})^2 \right\} \end{aligned} \]

and so we find

\[\mu|\pmb{X=x}\sim N(\bar{x}, \sigma^2/n)\]

Say we want to find a \((1-\alpha)100\%\) credible interval for \(\mu\), that is numbers L and U such that

\[P(L<\mu<U|\pmb{X=x})=1-\alpha\]

Note that here L and U are numbers, not random variables as in the frequentist calculation above. Also note that we have one equation in two unknowns, so L and U are not uniquely defined. We can get a solution by imposing an additional condition. For example, we can find

\[P(\mu<L|\pmb{X=x})=P(\mu>U|\pmb{X=x})=\alpha/2\]

so

\[ \begin{aligned} &\alpha/2 = P(\mu<L|\pmb{X=x}) = \Phi(L; \bar{x},\sigma^2/n) \\ &L =\Phi^{-1}(\alpha/2; \bar{x},\sigma^2/n) \\ \end{aligned} \] and similarly \(U =\Phi^{-1}(1-\alpha/2; \bar{x},\sigma^2/n)\),

where \(\Phi(x;\mu,\sigma^2)\) is the cdf of a \(N(\mu,\sigma^2)\) and \(\Phi^{-1}\) is its inverse.
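Since the posterior is a normal distribution, L and U can be computed directly with qnorm. A minimal sketch; the values xbar=0.27, sigma=1, n=25 and alpha=0.05 are made up purely for illustration:

xbar=0.27; sigma=1; n=25; alpha=0.05
qnorm(c(alpha/2, 1-alpha/2), mean=xbar, sd=sigma/sqrt(n))   # (L, U)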

Notice that in this case the frequentist confidence interval and the Bayesian credible interval are numerically the same, but their interpretation is very different: the confidence interval is a statement about the random interval \((L(\pmb{X}),U(\pmb{X}))\) before the data are observed, whereas the credible interval is a probability statement about \(\mu\) given the observed data.