Statistical Analysis of Stochastic Processes

Statistics

Note: I will assume here that you are familiar with the basic ideas of statistics.

Most of the basic methods from Statistics apply to stochastic processes as well.

Example

The data set ex1 has 1000 observations from some discrete-time discrete-state space Markov chain:

ex1[1:20]
##  [1] 1 4 1 4 1 4 5 5 3 4 5 3 4 5 5 5 3 2 3 2
ex1[981:1000]
##  [1] 2 3 2 4 5 3 4 5 5 2 4 4 5 2 4 1 5 5 2 2

We want to estimate the transition matrix. Let’s use maximum likelihood. Let \(p_{ij}=P(X_2=j|X_1=i)\) be the transition probabilitites, and let \(n_{ij}\) be the number of times the chain went from i to j, then the likelihood function is given by

\[L(p_{11},..,p_{55})=\prod_{1\le i,j\le 5}p_{ij}^{n_{ij}}\]

and we need to maximize this subject to the conditions \(\sum_{j=1}^5 p_{ij}=1\), i=1,..,5.

As always we work with the log-likelihood; introducing Lagrange multipliers \(\lambda_i\), the function to maximize is

\[\sum_{1\le i,j\le 5} n_{ij} \log p_{ij} +\sum_{i=1}^5 \lambda_i\left(\sum_{j=1}^5 p_{ij}-1\right)\]

Setting the partial derivatives equal to zero:

\[ \begin{aligned} &\frac{\partial}{\partial p_{ij}}: \frac{n_{ij}}{p_{ij}}+\lambda_i=0\\ &p_{ij} = -n_{ij}/\lambda_i\\ &1 = \sum_{j=1}^5 p_{ij} = -\left(\sum_{j=1}^5 n_{ij}\right)/\lambda_i = -n_{i.}/\lambda_i \\ &\lambda_i = -n_{i.} \\ &\hat{p}_{ij} = n_{ij}/n_{i.} \end{aligned} \]

where \(n_{i.}=\sum_{j=1}^5 n_{ij}\) is the number of transitions out of state i.

n <- matrix(0, 5, 5)  # transition counts n_ij
for(i in 2:1000) {
  n[ex1[i-1], ex1[i]] <- n[ex1[i-1], ex1[i]] + 1
}
n. <- apply(n, 1, sum)  # row totals n_i.
phat <- round(n/n., 2)  # n/n. divides row i by n.[i] (column-major recycling)
dimnames(phat) <- list(1:5, 1:5)
kable.nice(phat)
       1    2    3    4    5
  1 0.11 0.34 0.00 0.36 0.19
  2 0.00 0.36 0.19 0.45 0.00
  3 0.00 0.36 0.17 0.46 0.00
  4 0.33 0.00 0.09 0.11 0.46
  5 0.10 0.28 0.24 0.00 0.37
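As a quick sanity check of the formula \(\hat{p}_{ij}=n_{ij}/n_{i.}\), we can simulate a chain with a known transition matrix and verify that the estimator recovers it (a minimal sketch; the 3-state matrix P below is made up for illustration):

```r
set.seed(111)
# hypothetical 3-state transition matrix, for illustration only
P <- matrix(c(0.5, 0.3, 0.2,
              0.1, 0.6, 0.3,
              0.4, 0.4, 0.2), 3, 3, byrow = TRUE)
# simulate a chain of length 1e5 from P
x <- numeric(1e5)
x[1] <- 1
for(i in 2:length(x)) x[i] <- sample(1:3, 1, prob = P[x[i-1], ])
# transition counts n_ij and the maximum likelihood estimate
n <- table(x[-length(x)], x[-1])
phat <- n/rowSums(n)
round(phat, 2)
```

With a run this long the estimate agrees with P to about two decimals.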

Above we started with the assumption that this sequence is from a Markov chain. Can we test this? To do so we have to carry out the following hypothesis test. Let

\[p_{(i,j)k}=P(X_3=k|X_1=i,X_2=j)\]

then the Markov property (plus stationarity) imply

\[p_{(i,j)k}=p_{jk}\]

Applying the likelihood ratio test and the usual Taylor approximation to the logarithm yields the chi-square test statistic

\[X^2=\sum_{i,j,k}\left(n_{ijk}-e_{ijk}\right)^2/e_{ijk}\]

where \(n_{ijk}\) is the number of transitions from i to j to k and \(e_{ijk}=n_{ij.}n_{jk.}/n_{j.}\). Under the null hypothesis of a Markov chain, \(X^2\) will have approximately a chi-square distribution; with \(c=5\) states the implementation below uses \(c^3=125\) degrees of freedom.
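The statistic can be computed directly from the triple and pair counts. Here is a minimal sketch on a made-up iid sequence (x below is invented for illustration; for the real data one would use ex1):

```r
set.seed(222)
x <- sample(1:3, 2000, replace = TRUE)  # made-up 3-state sequence
m <- length(x)
# triple counts n_ijk and pair counts n_ij
n3 <- table(x[1:(m-2)], x[2:(m-1)], x[3:m])
n2 <- table(x[1:(m-1)], x[2:m])
X2 <- 0
for(i in 1:3) for(j in 1:3) for(k in 1:3) {
  e <- n2[i, j]*n2[j, k]/sum(n2[j, ])  # e_ijk = n_ij n_jk / n_j.
  X2 <- X2 + (n3[i, j, k] - e)^2/e
}
X2
```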

The test is implemented in the markovchain library:

library(markovchain)
verifyMarkovProperty(ex1)
## Testing markovianity property on given data sequence
## Chi - square statistic is: 37.57234 
## Degrees of freedom are: 125 
## And corresponding p-value is: 1

Could this sequence actually come from independent observations? We can of course do the basic test for independence, but notice that we never saw a transition from 3 to 1, which would be extremely unlikely under independence because state 1 clearly has positive probability!
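The basic test treats the consecutive pairs \((X_{t-1},X_t)\) as a two-way table and applies the usual chi-square test of independence. A minimal sketch on a simulated chain (ex1 itself is not reproduced here, so the sequence below is made up; note also that overlapping pairs are not exactly independent, so the test is only approximate):

```r
set.seed(333)
# made-up 2-state Markov chain, clearly not independent
P <- matrix(c(0.7, 0.3,
              0.2, 0.8), 2, 2, byrow = TRUE)
x <- numeric(5000)
x[1] <- 1
for(i in 2:length(x)) x[i] <- sample(1:2, 1, prob = P[x[i-1], ])
# two-way table of consecutive pairs, chi-square test of independence
tbl <- table(x[-length(x)], x[-1])
chisq.test(tbl)$p.value  # very small: consecutive states are dependent
```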

Example

A store wants to find out about the times at which customers enter the store. It opens at 8am and closes at 6pm. For 20 working days the arrival times are recorded; here they are for day 1, first in minutes after midnight and then in the original hour-minute-second format:

##  [1]  483.1900  485.7712  504.1513  511.2431  533.9519  556.3276  564.5901
##  [8]  600.1509  601.2124  613.2977  615.0961  656.5382  672.0500  678.0181
## [15]  691.2101  738.8729  742.7525  751.9882  767.3979  782.9098  816.6775
## [22]  832.7054  840.4809  861.4346  869.5839  894.2626  925.1999  997.7035
## [29]  998.6705  999.6550 1011.9642 1028.8511 1049.1714
Day[[1]]
 [1] "8-03-11"  "8-05-46"  "8-24-09"  "8-31-14"  "8-53-57"  "9-16-20" 
 [7] "9-24-35"  "10-00-09" "10-01-13" "10-13-18" "10-15-06" "10-56-32"
[13] "11-12-03" "11-18-01" "11-31-13" "12-18-52" "12-22-45" "12-31-59"
[19] "12-47-24" "13-02-55" "13-36-41" "13-52-43" "14-00-29" "14-21-26"
[25] "14-29-35" "14-54-16" "15-25-12" "16-37-42" "16-38-40" "16-39-39"
[31] "16-51-58" "17-08-51" "17-29-10"

so on day 1 the first customer came in at 8h03m11s and the last one at 17h29m10s.

Do the arrivals follow a Poisson distribution?

We will assume that from the details of this store we know that the arrivals of customers have independent and stationary increments, so what we need to do is test whether the interarrival times come from an exponential distribution. For this we can use the Lilliefors test. It is implemented in the package KScorrect.

First, though, we need the interarrival times:

inter.arrival.times <- NULL
for(i in 1:20) {
  tmp <- strsplit(Day[[i]], "-")
  hours <- as.numeric(unlist(tmp)[ c(TRUE, FALSE, FALSE) ])
  minutes <- as.numeric(unlist(tmp)[ c(FALSE, TRUE, FALSE) ])
  seconds <- as.numeric(unlist(tmp)[ c(FALSE, FALSE, TRUE) ])
  # arrival times in minutes after midnight; prepend the 8am opening time
  # so that the first difference is the wait until the first customer
  tmp <- c(8*60, hours*60 + minutes + seconds/60)
  inter.arrival.times <- c(inter.arrival.times, diff(tmp))
}

Let’s look at a graph:

library(ggplot2)
df <- data.frame(x=inter.arrival.times)
bw <- diff(range(inter.arrival.times))/50  # bin width: 50 bins over the range
ggplot(df, aes(x)) +
  geom_histogram(aes(y = ..density..),
    color = "black",
    fill = "white",
    binwidth = bw) +
  labs(x = "x", y = "Density") +
  stat_function(fun = dexp, colour = "blue", args = list(rate = 1/mean(df$x)))

This looks ok. Now for the test:

library(KScorrect)
LcKS(inter.arrival.times, "pexp")$p.value
## [1] 0.9096

and so it seems the interarrival times do come from an exponential distribution.
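The idea behind the Lilliefors correction can be sketched by hand: because the rate is estimated from the same data, the ordinary Kolmogorov-Smirnov p-value would be too large, so the null distribution of the statistic is calibrated by a parametric bootstrap. A minimal sketch (not the exact algorithm used by KScorrect; the data y is made up):

```r
set.seed(555)
y <- rexp(100, rate = 0.5)  # made-up "interarrival times"
# KS statistic with the rate estimated from the data
ks.obs <- ks.test(y, "pexp", rate = 1/mean(y))$statistic
# bootstrap: simulate exponential samples, re-estimating the rate each time
ks.sim <- replicate(1000, {
  z <- rexp(length(y), rate = 1/mean(y))
  ks.test(z, "pexp", rate = 1/mean(z))$statistic
})
mean(ks.sim >= ks.obs)  # bootstrap p-value
```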

Can we also test the other conditions of a Poisson process, for example stationary increments? Stationarity implies that the numbers of arrivals over equal-length time periods all have the same distribution. Let’s check that for time periods of one hour:

N <- NULL
for(i in 1:20) {
  tmp <- strsplit(Day[[i]], "-")
  hours <- as.numeric(unlist(tmp)[ c(TRUE, FALSE, FALSE) ])
  # use a factor so that hours with no arrivals count as 0
  N <- c(N, as.numeric(table(factor(hours, levels = 8:17))))
}
df <- data.frame(x = 1:length(N), N = N)
ggplot(data = df, aes(x, N)) +
  geom_point()

and it does not appear that there is a change in the distribution over time.
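One can also check the Poisson assumption for the hourly counts with a classical dispersion test: if the counts are iid Poisson, the statistic \((n-1)s^2/\bar{N}\) is approximately chi-square with \(n-1\) degrees of freedom. A minimal sketch on simulated counts (the vector N computed above would be used the same way):

```r
set.seed(444)
Nsim <- rpois(200, lambda = 1.7)  # simulated hourly counts, for illustration
m <- length(Nsim)
D <- (m - 1)*var(Nsim)/mean(Nsim)  # dispersion statistic
# two-sided p-value from the chi-square approximation
2*min(pchisq(D, m - 1), 1 - pchisq(D, m - 1))
```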