Time Series Analysis

A time series is any data collected over a period of time. Examples are

  • Dow Jones Industrial index
  • Population of Earth
  • Number of AIDS patients at some hospital

The main feature that sets time series apart is that the assumption of independence usually fails. If I wanted to guess the size of the human population in 2019, knowing what it is in 2018 might be useful.

Example: Births in New York

data set of the number of births per month in New York city, from January 1946 to December 1959 (originally collected by Newton)

kable.nice(matrix(births[1:50], nrow=10))
26663 21672 23105 23950 21761
23598 21870 23110 23504 22874
26931 21439 21759 22238 24104
24740 21089 22073 23142 23748
25806 23709 21937 21059 23262
24364 21669 20035 21573 22907
24477 21752 23590 21548 21519
23901 20761 21672 20000 22025
23175 23479 22222 22424 22604
23227 23824 22123 20615 20894

the first step is to store the data in a time series object in R, so that we can use R’s many functions for analyzing time series data. One way to do that is the base R function ts.

by default ts assumes equal space time units of one year. For other time spans we can use the argument frequency. For example, if data is by month, use frequency=12. We can also change the starting point:

birth.ts <- ts(births, frequency=12, start=c(1946, 1))
head(birth.ts)
## [1] 26663 23598 26931 24740 25806 24364

Example: Deaths in Puerto Rico

number of deaths per month in Puerto Rico, according to the official deaths statistics of the government:

Deaths <- c(2744, 2403, 2427, 2259, 2340, 2145, 
            2382, 2272, 2258, 2393, 2268, 2516,
            2742, 2592, 2458, 2241, 2312, 2355,
            2456, 2427, 2367, 2357, 2484, 2854,
            2894, 2315, 2494, 2392, 2390, 2369, 
            2367, 2321, 2928, 3040, 2671, 2820,
            2821, 2448, 2643, 2218)
pr.deaths.ts <-  ts(Deaths, frequency=12, 
                    start=c(2015, 1))

Example: Dow Jones Industrial index

Data set has the daily closing values of the Dow Jones Industrial Index from January 2000 to November 2018 and the weekly closing values from January 1985 to November 2018.

the data set is available at dow.jones.rds

dow.jones <- readRDS("C:\\Users\\Wolfgang\\Dropbox\\teaching\\Computational-Statistics-with-R\\dow.jones.rds")
kable.nice(head(dow.jones$Weekly))
Date Close
1985-01-28 1277
1985-02-04 1289
1985-02-11 1282
1985-02-18 1275
1985-02-25 1299
1985-03-04 1269

Again we want to turn this into a time series object. Here however we have the problem that the time variable is not equal spaced (sometimes the New York stock exchange is not open on a Monday or a Friday). One way to do this is to use the package zoo:

library(zoo)
dj.ts <- zoo(dow.jones$Weekly$Close,
     order.by=as.Date(as.character(dow.jones$Weekly$Date)))

Graphs

There is an obvious graph for a time series, namely a line graph by time. A ts object already has a plot function, so

Example: Births in New York

library(ggfortify)
autoplot(birth.ts)

Example: Deaths in Puerto Rico

autoplot(pr.deaths.ts)

here we have a strong seasonal trend, deaths go up a lot in the winter. And of course we have a spike around September 2017!

Example: Dow Jones

autoplot(dj.ts)


Decomposition

In general a time series consists of some (or all) of these parts:

  • a baseline
  • a trend
  • seasonal component(s)
  • unusual parts
  • random fluctuation

Example: Deaths in Puerto Rico

here we seem to have all of these:

  • a baseline: roughly 2300 deaths per month
  • a trend: from around 2250 in 2015 to 2500 in 2018
  • seasonal component: winter month vs rest of year
  • unusual parts: September 2017

One of the tasks in a time series analysis is to decompose the series into these parts.

Example: Births

births.dec <- decompose(birth.ts)
plot(births.dec)

so we see a clear trend and seasonal component

Example: Deaths in PR

an alternative to decompose is the stl routine, which uses loess to do the fitting:

pr.deaths.dec <- stl(pr.deaths.ts, "periodic")
plot(pr.deaths.dec)

again a clear trend and seasonal component.

Example: Dow Jones

dj.dec <- stl(dj.ts, "periodic")
## Error in na.fail.default(as.ts(x)): missing values in object

and this gives an error. That is because this time series has irregular time points. It can be quite a chore to “fix” this problem. If the time periods are somewhat regular (in our case they are almost one week) it is easier to make a new series with a regular time scale:

dj.alt.ts <- ts(dow.jones$Weekly$Close, frequency = 52,
                start=c(1985, 1))
dj.alt.dec <- stl(dj.alt.ts, "periodic")
plot(dj.alt.dec)

which shows a clear trend but not any seasonality. We also see that the variation is increasing over time.

ARIMA models

As we said before, the special feature of a time series is a dependence between neighboring points. This suggests a way to analyse a time series by analyzing the correlation between time points.

Example: Births

Let’s define a new variable: \(Y_i=X_i-X_{i-1}\), that is the change in the number of births from one month to the next.

y <- diff(births)
plot(y)

here we now longer have any pattern, which shows us that this series consisted only of a trend (which we eliminated in y).

There are some R routines to do this for us for a time series:

Example: PR Deaths

acf(pr.deaths.ts)

which shows a clear seasonal pattern.


the above is an example of a general model for time series called ARIMA (auto-regressive integrated moving average).

Here are a couple of examples:

  • AR(1) = ARIMA(1, 0, 0)

\[ X_i=c+\alpha_1 X_{i-1} +\epsilon_i \]

for the births data we saw that \(X_i-X_{i-1}\) was essentially random noise, so it seems that series can be modeled as an AR(1).

  • AR(2) = ARIMA(2, 0, 0)

\[ X_i=c+\alpha_1 X_{i-1}+ \alpha_2 X_{i-2} +\epsilon_i \] - MA(1)=AR(0, 0, 1)

\[ X_i=c+\beta_1 \epsilon_{i-1} + \epsilon_i \]

finally ARIMA combines both variations.

Example: Simulated data

let’s create some artificial data to see what is going on:

eps <- rnorm(100)
x <- sort(runif(100, 1, 10))
x.ar <- ts(5 +  x[-1] + 0.2* x[-100] + eps[-100])
plot(x.ar)

acf(x.ar)

x.ma <- ts(5 + 2* eps[-1] + eps[-100]) 
plot(x.ma)

acf(x.ma)

so the acf plot of x.ma shows us that indeed we have a moving average process, with a lag of 1 or 2.

Fitting an ARIMA model can be done with the auto.arima routine from the forcast package:

library(forecast)
auto.arima(x.ar)
## Series: x.ar 
## ARIMA(0,1,1) with drift 
## 
## Coefficients:
##           ma1   drift
##       -0.8492  0.1159
## s.e.   0.0626  0.0168
## 
## sigma^2 estimated as 1.096:  log likelihood=-143.16
## AIC=292.31   AICc=292.57   BIC=300.07

Spectral Analysis

The most general way to analyse a time series is a spectral analysis. In essence, we assume that

\[ X_t=\sum_{j=1}^k \left(A_j \sin(2 \pi \nu_j t) +B_j \cos(2 \pi \nu_j t) \right) \] and a spectral analysis amounts to estimating the parameters \(A_j, B_j, \nu_j\).

A fuller discussion of this is beyond our course. It would require first a discussion Fourier analysis, one of the most widely used technics in mathematics and engineering.

The usual starting point is a look at the periodogram

Example: Births

par(mfrow=c(2, 2))
spec.pgram(birth.ts, spans = c(3, 3))
spec.pgram(birth.ts, spans = c(3, 5))
spec.pgram(birth.ts, spans = c(7, 11))
spec.pgram(birth.ts, spans = c(7, 21))

and so we see that larger values of span do more smoothing and show the underlying patters

Example: PR Deaths

spec.pgram(pr.deaths.ts, span= c(3, 3))

Example: Dow Jones

spec.pgram(dj.alt.ts, span= c(3, 9))

and as we know there is no seasonal component here.