Let’s have a look at the waiting times:
bw <- diff(range(faithful$Waiting.Time))/50
ggplot(faithful, aes(Waiting.Time)) +
  geom_histogram(color = "black",
                 fill = "white",
                 binwidth = bw) +
  labs(x = "x", y = "Counts")
Clearly the distribution of this data is nonstandard. We previously fit a mixture of normal densities to the data. In this section we will consider several nonparametric methods. The basic problem is to estimate the density \(f(x)\) by \(\hat{f}(x)\) for each x.
The most commonly used method is called kernel density estimation. The idea is to average over the data points, giving greater weight to points near x. In general we have
\[ \hat{f}(x)=\frac{1}{nh} \sum_{i=1}^n K\left(\frac{x-X_i}{h}\right) \]
Here K is a kernel function, typically a density symmetric around 0, and h>0 is a tuning parameter called the bandwidth.
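To make the formula concrete, here is a minimal hand-rolled version using a standard normal kernel (the name kde and the evaluation points are just for illustration); it should closely match R's density with the same bandwidth:

kde <- function(x, dta, h)
  sapply(x, function(t) mean(dnorm((t - dta)/h))/h)
# compare with density(faithful$Waiting.Time, bw=4) at the same points
kde(c(60, 80), faithful$Waiting.Time, h=4)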
This is implemented in R's density routine:
fit <- density(faithful$Waiting.Time)
ggplot(faithful, aes(Waiting.Time)) +
  geom_histogram(aes(y = ..density..),
                 color = "black",
                 fill = "white",
                 binwidth = bw) +
  labs(x = "x", y = "Density") +
  geom_line(data=data.frame(x=fit$x, y=fit$y),
            aes(x, y), color="blue")
This uses a normal density as the kernel function. Several formulas are implemented for the choice of h. The default uses
bw.nrd0(faithful$Waiting.Time)
## [1] 3.987559
but it is often recommended to change that to
bw.SJ(faithful$Waiting.Time)
## [1] 2.504371
a method by Sheather and Jones.
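To use this bandwidth directly, pass the name of the selection rule to density (fit.sj is just an illustrative name):

fit.sj <- density(faithful$Waiting.Time, bw = "SJ")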
Here are the estimates for several choices of h:
fit1 <- density(faithful$Waiting.Time, bw=1.5)
fit2 <- density(faithful$Waiting.Time, bw=10)
ggplot(faithful, aes(Waiting.Time)) +
  geom_histogram(aes(y = ..density..),
                 color = "black",
                 fill = "white",
                 binwidth = bw) +
  labs(x = "x", y = "Density") +
  geom_line(data=data.frame(x=fit$x, y=fit$y),
            aes(x, y), color="blue") +
  geom_line(data=data.frame(x=fit1$x, y=fit1$y),
            aes(x, y), color="red") +
  geom_line(data=data.frame(x=fit2$x, y=fit2$y),
            aes(x, y), color="green")
It is clear that for small values of h we get a very ragged (under-smoothed) estimate, whereas for an h that is too large the estimate is over-smoothed.
There is a long list of density estimation methods and associated routines. For a review see Density estimation in R. Two libraries worth knowing about are ash and KernSmooth.
The ash library implements averaged shifted histograms, an idea due to Scott (1992).
library(ash)
fit <- ash1(bin1(faithful$Waiting.Time))
## [1] "ash estimate nonzero outside interval ab"
ggplot(faithful, aes(Waiting.Time)) +
  geom_histogram(aes(y = ..density..),
                 color = "black",
                 fill = "white",
                 binwidth = bw) +
  labs(x = "x", y = "Density") +
  geom_line(data=data.frame(x=fit$x, y=fit$y),
            aes(x, y), color="blue")
The KernSmooth library implements a method due to Wand & Jones (1995). Its main advantage is that it also works for higher-dimensional data.
library(KernSmooth)
data(geyser, package="MASS")
x <- cbind(geyser$duration, geyser$waiting)
est <- bkde2D(x, bandwidth=c(0.7, 7))
contour(est$x1, est$x2, est$fhat)  # contour plot of the 2d estimate
persp(est$fhat)  # perspective (surface) plot of the same estimate
Most methods encounter problems when the true density does not go to 0 at the edges:
df <- data.frame(x=rexp(1000, 1))
fit <- density(df$x)
bw <- diff(range(df$x))/50
df1 <- data.frame(x=seq(0, 6, length=100),
                  y=exp(-seq(0, 6, length=100)))
ggplot(df, aes(x)) +
  geom_histogram(aes(y = ..density..),
                 color = "black",
                 fill = "white",
                 binwidth = bw) +
  labs(x = "x", y = "Density") +
  geom_line(data=data.frame(x=fit$x, y=fit$y),
            aes(x, y), color="blue") +
  geom_line(data=df1,
            aes(x, y), color="black")
The reason is clear: the formula averages over data on both sides of x, but here at x=0 there is data on one side only.
A simple solution is to “mirror” the data:
fit <- density(c(df$x, -df$x))
fit$y <- 2*fit$y[fit$x>0]
fit$x <- fit$x[fit$x>0]
bw <- diff(range(df$x))/50
ggplot(df, aes(x)) +
  geom_histogram(aes(y = ..density..),
                 color = "black",
                 fill = "white",
                 binwidth = bw) +
  labs(x = "x", y = "Density") +
  geom_line(data=data.frame(x=fit$x, y=fit$y),
            aes(x, y), color="blue") +
  geom_line(data=df1,
            aes(x, y), color="black")
Notice the factor of 2, which is needed to rescale the density so that it integrates to 1. In general this rescaling can be difficult.
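As a quick numerical check (a sketch using the mirrored fit object from above), the trapezoid rule should give an integral close to 1:

# trapezoid rule: interval widths times the average of adjacent heights
with(fit, sum(diff(x)*(y[-1] + y[-length(y)])/2))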
This would work perfectly fine for a density that has slope 0 at the boundary points, for example a uniform. It is still not quite right in our case, because our density has slope -1 at x=0.
Other more complicated solutions are known but none of them is very satisfactory.
There is a version of the bootstrap that can be helpful at times. The idea is to sample from a nonparametric density estimate instead of the empirical distribution function. For this, however, we would need to discuss how to simulate from a general distribution function. Come to ESMA 5015 Simulation next time!
A well-known data set in statistics has the thicknesses (espesor) in millimeters of 485 Mexican stamps (sello) printed in 1872-1874, from the 1872 Hidalgo issue.
It is thought that the stamps from this issue are a “mixture” of different types of paper, of different thicknesses. Can we determine from the data how many different types of paper were used?
kable.nice(matrix(stamps[1:50], nrow=10))
0.060 | 0.069 | 0.070 | 0.070 | 0.071 |
0.064 | 0.069 | 0.070 | 0.070 | 0.071 |
0.064 | 0.069 | 0.070 | 0.070 | 0.071 |
0.065 | 0.070 | 0.070 | 0.070 | 0.071 |
0.066 | 0.070 | 0.070 | 0.070 | 0.071 |
0.068 | 0.070 | 0.070 | 0.070 | 0.071 |
0.069 | 0.070 | 0.070 | 0.070 | 0.071 |
0.069 | 0.070 | 0.070 | 0.070 | 0.071 |
0.069 | 0.070 | 0.070 | 0.070 | 0.071 |
0.069 | 0.070 | 0.070 | 0.071 | 0.071 |
Let’s start with
bw <- diff(range(stamps))/50
df <- data.frame(Thickness=stamps)
ggplot(df, aes(Thickness)) +
  geom_histogram(color = "black",
                 fill = "white",
                 binwidth = bw) +
  labs(x = "x", y = "Counts")
which seems to have at least two modes. This judgment, however, is tricky because it depends on the number of bins we use.
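To see how much the picture depends on the bin width, compare the histogram with half and with twice the bin width used above (the factors are arbitrary):

pushViewport(viewport(layout = grid.layout(1, 2)))
print(ggplot(df, aes(Thickness)) +
        geom_histogram(binwidth = bw/2),
      vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(df, aes(Thickness)) +
        geom_histogram(binwidth = 2*bw),
      vp=viewport(layout.pos.row=1, layout.pos.col=2))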
An alternative is to use a frequency polygon
ggplot(df, aes(Thickness)) +
  geom_freqpoly()
which seems to suggest a much larger number of modes.
Let’s instead draw the graph using a nonparametric density estimate:
ggplot(df, aes(Thickness)) +
  stat_density(geom="line")
Here it again seems like there are two modes, but this depends largely on the chosen bandwidth:
pushViewport(viewport(layout = grid.layout(1, 2)))
print(ggplot(df, aes(Thickness)) +
        stat_density(geom="line", bw=0.01),
      vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(ggplot(df, aes(Thickness)) +
        stat_density(geom="line", bw=0.001) + ylab(""),
      vp=viewport(layout.pos.row=1, layout.pos.col=2))
stat_density implements a kernel density estimator as discussed above, with choices of different kernels and bandwidth selectors. In what follows we will need to calculate these estimates explicitly, so we will use the density routine directly.
From the above it is clear that the number of modes depends on the choice of h. It is possible to show that the number of modes is a non-increasing function of h: at one extreme (h large) the estimate looks like a single normal density with one mode, and at the other (h tiny) there is a sharply peaked mode at each observation.
Let's say we want to test
\[ H_0: \text{number of modes}=1 \text{ vs. } H_a: \text{number of modes}>1 \]
Because the number of modes is a non-increasing function of h, there exists an \(h_1\) such that the density estimator has one mode for \(h>h_1\) and two or more modes for \(h<h_1\). Playing around with
fhat <- function(h, t, dta=stamps) {
  tmp <- density(dta, bw=h)
  df <- data.frame(x=tmp$x, y=tmp$y)
  if(missing(t)) return(df)  # no evaluation points: return the whole curve
  out <- approx(df, xout=t)$y  # otherwise interpolate the estimate at t
  out[!is.na(out)]
}
draw.fhat <- function(h)
  ggplot(fhat(h), aes(x, y)) + geom_line()
pushViewport(viewport(layout = grid.layout(2, 2)))
print(draw.fhat(0.01),
      vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(draw.fhat(0.005),
      vp=viewport(layout.pos.row=1, layout.pos.col=2))
print(draw.fhat(0.0075),
      vp=viewport(layout.pos.row=2, layout.pos.col=1))
print(draw.fhat(0.0068),
      vp=viewport(layout.pos.row=2, layout.pos.col=2))
we find \(h_1 \approx 0.0068\).
Is there a way to calculate the number of modes for a given h? Here is one: count how often the estimated curve switches from rising to falling. Let's also write a simple routine that automates the search for the critical bandwidth; it uses a bisection algorithm.
x.points <- seq(min(stamps), max(stamps), length = 250)
calc.num.mode <- function(y) {
  # count the local maxima: points where the curve changes from rising to falling
  m <- length(y) - 1
  z <- diff(y)
  sum(ifelse(z[-m] >= 0 & z[-1] < 0, 1, 0))
}
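As a quick sanity check of the claim that the number of modes is non-increasing in h (the bandwidths below are arbitrary), these counts should be non-increasing:

sapply(c(0.002, 0.004, 0.006, 0.008),
       function(h) calc.num.mode(fhat(h, x.points)))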
find.h <- function(num.modes, h=0.007, Show=FALSE) {
  # step h down in increments of 0.001 until we see at least num.modes modes
  repeat {
    h <- h - 0.001
    if(Show)
      cat("h =", h, " modes=",
          calc.num.mode(fhat(h, x.points)), "\n")
    if(calc.num.mode(fhat(h, x.points)) >= num.modes) break
  }
  # then bisect [h, h+0.001] to locate the critical bandwidth
  low <- h
  high <- h + 0.001
  repeat {
    h <- (low + high)/2
    if(Show)
      cat("h =", h, " modes=",
          calc.num.mode(fhat(h, x.points)), "\n")
    if(calc.num.mode(fhat(h, x.points)) < num.modes)
      high <- h
    else
      low <- h
    if(high - low < 10^-7)
      break
  }
  h
}
h1 <- find.h(1, Show = TRUE)
## h = 0.006 modes= 2
## h = 0.0065 modes= 2
## h = 0.00675 modes= 1
## h = 0.006875 modes= 1
## h = 0.0069375 modes= 1
## h = 0.00696875 modes= 1
## h = 0.006984375 modes= 1
## h = 0.006992188 modes= 1
## h = 0.006996094 modes= 1
## h = 0.006998047 modes= 1
## h = 0.006999023 modes= 1
## h = 0.006999512 modes= 1
## h = 0.006999756 modes= 1
## h = 0.006999878 modes= 1
## h = 0.006999939 modes= 1
h5 <- find.h(5)
pushViewport(viewport(layout = grid.layout(1, 2)))
print(draw.fhat(h1),
      vp=viewport(layout.pos.row=1, layout.pos.col=1))
print(draw.fhat(h5),
      vp=viewport(layout.pos.row=1, layout.pos.col=2))
So, how can we test
\[ H_0: \text{number of modes}=1 \text{ vs. } H_a: \text{number of modes}>1 \]
Here it is: draw bootstrap samples from \(\hat{f}(\cdot; h_1)\) and find the proportion of samples whose critical bandwidth \(h_1^{*}\) exceeds \(h_1\).
The idea is this: if there is indeed just one mode, then in the bootstrap samples \(h_1^{*}\) should be around \(h_1\), and so this proportion shouldn't be too small.
Notice we don't actually need \(h_1^{*}\); we just need to check whether \(h_1^{*}>h_1\), which is the case if \(\hat{f}(x^{*}; h_1)\) has at least two modes.
Note that we are not drawing bootstrap samples from “stamps” but from a density estimate, \(\hat{f}\). So this is an example of the smooth bootstrap mentioned above.
How do we draw from \(\hat{f}\)? It can be shown that if \(y_1^{*},\dots,y_n^{*}\) is a bootstrap sample from the data, then a smooth bootstrap sample is given by
\[ x_i^{*}=\bar{y}^{*} +\left(1+h_1^2/s^2\right)^{-1/2}\left(y_i^{*}-\bar{y}^{*}+h_1\epsilon_i\right) \]
where \(\epsilon_i \sim N(0,1)\) and \(s^2\) is the sample variance of the data.
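Here is a sketch of a single smooth bootstrap draw; the \(\left(1+h_1^2/s^2\right)^{-1/2}\) factor rescales the sample so that its variance stays close to that of the data:

h <- 0.0068  # roughly the h1 found above
q <- 1/sqrt(1 + h^2/var(stamps))
ystar <- sample(stamps, size=length(stamps), replace=TRUE)
xstar <- mean(ystar) + q*(ystar - mean(ystar) + h*rnorm(length(stamps)))
c(var(stamps), var(xstar))  # should be close to each other

With this we can implement the test: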
test.modes <- function(k) {
  # critical bandwidth: smallest h for which the estimate has at most k modes
  h <- find.h(k+1)
  q <- 1/sqrt(1 + h^2/var(stamps))
  B <- 1000
  hstar <- rep(0, B)
  for (i in 1:B) {
    # smooth bootstrap sample of size n=485
    ystar <- sample(stamps, size = 485, replace = TRUE)
    xstar <- mean(ystar) + q*(ystar - mean(ystar) + h*rnorm(485))
    y <- fhat(h, x.points, dta=xstar)
    if (calc.num.mode(y) > k)  # more than k modes at bandwidth h means h* > h
      hstar[i] <- 1
  }
  sum(hstar)/B  # proportion of bootstrap samples with h* > h
}
test.modes(1)
## [1] 0
and so we find strong evidence against the null; there is more than one mode.
The same method works for testing
\[ H_0: \text{number of modes}=k \text{ vs. } H_a: \text{number of modes}>k \]
and we find
for(k in 2:9)
  cat("k =", k, ", p =", test.modes(k), "\n")
## k = 2 , p = 0.31
## k = 3 , p = 0.037
## k = 4 , p = 0.007
## k = 5 , p = 0
## k = 6 , p = 0
## k = 7 , p = 0.344
## k = 8 , p = 0.783
## k = 9 , p = 0.549
So there is certainly more than one mode, with a chance of as many as 7.