Note I will assume here that you are familiar with the basic ideas of Statistics.
Most of the basic methods from Statistics apply to stochastic processes as well.
The data set ex1 has 1000 observations from some discrete-time discrete-state space Markov chain:
ex1[1:20]
## [1] 1 4 1 4 1 4 5 5 3 4 5 3 4 5 5 5 3 2 3 2
ex1[981:1000]
## [1] 2 3 2 4 5 3 4 5 5 2 4 4 5 2 4 1 5 5 2 2
We want to estimate the transition matrix. Let’s use maximum likelihood. Let \(p_{ij}=P(X_2=j|X_1=i)\) be the transition probabilitites, and let \(n_{ij}\) be the number of times the chain went from i to j, then the likelihood function is given by
\[L(p_{11},..,p_{55})=\prod_{1\le i,j\le 5}p_{ij}^{n_{ij}}\]
and we need to maximize this subject to the conditions \(\sum_{j=1}^5 p_{ij}=1\), i=1,..,5.
As always we use the log-likelihood, and then using Lagrange multipliers we have
\[\sum_{1\le i,j\le 5} n_{ij} \log p_{ij} +\sum_{i=1}^5 \lambda_i\left(\sum_{j=1}^5 p_{ij}-1\right)\].
Now
\[ \begin{aligned} &\frac{d}{dp_{ij}} = \frac{n_{ij}}{p_{ij}}-\sum_{i=1}^5 \lambda_i=0\\ &p_{ij} = n_{ij}/\sum_{i=1}^5 \lambda_i\\ &1 = (\sum_{j=1}^5 n_{ij})/\sum_{i=1}^5 \lambda_i \\ &\sum_{i=1}^5 \lambda_i = n_{i.} \\ &\hat{p}_{ij} = n_{ij}/n_{i.} \end{aligned} \]
n <- matrix(0, 5, 5)
for(i in 2:1000) {
n[ex1[i-1], ex1[i]] <- n[ex1[i-1], ex1[i]]+1
}
n. <- apply(n, 1, sum)
phat <- round(n/n., 2)
dimnames(phat) <- list(1:5, 1:5)
kable.nice(phat)
1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|
1 | 0.11 | 0.34 | 0.00 | 0.36 | 0.19 |
2 | 0.00 | 0.36 | 0.19 | 0.45 | 0.00 |
3 | 0.00 | 0.36 | 0.17 | 0.46 | 0.00 |
4 | 0.33 | 0.00 | 0.09 | 0.11 | 0.46 |
5 | 0.10 | 0.28 | 0.24 | 0.00 | 0.37 |
Above we started with the assumption that this sequence is from a Markov chain. Can we test this? To do so we have to carry out the following hypothesis test. Let
\[p_{(i,j)k}=P(X_3=k|X_1=i,X_2=j)\]
then the Markov property (plus stationarity) imply
\[p_{(i,j)k}=p_{jk}\]
Applying the likelihood ratio test and the usual Taylor approximation to the logarithm yields the chi-square test statistic
\[X^2=\sum_{i,j,k}\left(n_{ijk}-e_{ijk}\right)^2/e_{ijk}\]
where \(n_{ijk}\) is the number of transitions from i to j to k and \(e_{ijk}=n_{ij.}n_{jk.}/n_{j.}\). Under the null hypothesis of a Markov chain we X2 will have a chi-square distribution with c3 degrees of freedom.
The test is implemented in the markovchain library:
library(markovchain)
verifyMarkovProperty(ex1)
## Testing markovianity property on given data sequence
## Chi - square statistic is: 37.57234
## Degrees of freedom are: 125
## And corresponding p-value is: 1
Could this sequence actually come from independent observations? We can of course do the basic test for independence, but notice that we never saw a transition from 3 to 1, which is impossible under independence!
A store wants to find out about the times when customers entered the store. They open at 8am and close at 6pm. For 20 working days they record the times and find
## [1] 483.1900 485.7712 504.1513 511.2431 533.9519 556.3276 564.5901
## [8] 600.1509 601.2124 613.2977 615.0961 656.5382 672.0500 678.0181
## [15] 691.2101 738.8729 742.7525 751.9882 767.3979 782.9098 816.6775
## [22] 832.7054 840.4809 861.4346 869.5839 894.2626 925.1999 997.7035
## [29] 998.6705 999.6550 1011.9642 1028.8511 1049.1714
Day[[1]]
[1] "8-03-11" "8-05-46" "8-24-09" "8-31-14" "8-53-57" "9-16-20"
[7] "9-24-35" "10-00-09" "10-01-13" "10-13-18" "10-15-06" "10-56-32"
[13] "11-12-03" "11-18-01" "11-31-13" "12-18-52" "12-22-45" "12-31-59"
[19] "12-47-24" "13-02-55" "13-36-41" "13-52-43" "14-00-29" "14-21-26"
[25] "14-29-35" "14-54-16" "15-25-12" "16-37-42" "16-38-40" "16-39-39"
[31] "16-51-58" "17-08-51" "17-29-10"
so on day 1 the first customer came in at 8h03m11s and the last one at 17h29m10s.
Do the arrivals follow a Poisson distribution?
We will assume that from the details of this store we know that the arrivals of customers have independent and stationary increments, so what we need to do is test whether the interarrival times come from an exponential distribution. For this we can use the Lilliefors test. It is implemented in the package KScorrect.
First, though we need the interarrival times:
inter.arrival.times <- NULL
for(i in 1:20) {
tmp <- strsplit(Day[[i]], "-")
hours <- as.numeric(unlist(tmp)[ c(TRUE, FALSE, FALSE) ])
minutes <- as.numeric(unlist(tmp)[ c(FALSE, TRUE, FALSE) ])
seconds <- as.numeric(unlist(tmp)[ c(FALSE, FALSE, TRUE) ])
tmp <- c(8*60, hours*60+minutes+seconds/100)
inter.arrival.times <- c(inter.arrival.times,
diff(tmp))
}
Let’s look at a graph:
df <- data.frame(x=inter.arrival.times)
bw <- diff(range(inter.arrival.times))/50
ggplot(df, aes(x)) +
geom_histogram(aes(y = ..density..),
color = "black",
fill = "white",
binwidth = bw) +
labs(x = "x", y = "Density") +
stat_function(fun = dexp, colour = "blue", args=list(rate=1/mean(df$x)))
This looks ok. Now for the test:
library(KScorrect)
LcKS(inter.arrival.times, "pexp")$p.value
## [1] 0.9096
and so it seems the interarrival times do come from an expoential distribution.
Can we also test for the other conditions of a Poisson process, for example stationary increments? This implies that the number of arrivals over equal length time periods has the same distribution. Let’s check that for time periods of on hour:
N <- NULL
for(i in 1:20) {
tmp <- strsplit(Day[[i]], "-")
hours <- as.numeric(unlist(tmp)[ c(TRUE, FALSE, FALSE) ])
N <- c(N, as.numeric(table(hours)))
}
df <- data.frame(x=1:length(N), N=N)
ggplot(data=df, aes(x, N)) +
geom_point()
and it does not appear that there is a change in the distribution over time.