A time series is any data collected over a period of time. Examples are
The main feature that sets time series apart is that the assumption of independence usually fails. If I wanted to guess the size of the human population in 2019, knowing what it is in 2018 might be useful.
data set of the number of births per month in New York city, from January 1946 to December 1959 (originally collected by Newton)
kable.nice(matrix(births[1:50], nrow=10))
26663 | 21672 | 23105 | 23950 | 21761 |
23598 | 21870 | 23110 | 23504 | 22874 |
26931 | 21439 | 21759 | 22238 | 24104 |
24740 | 21089 | 22073 | 23142 | 23748 |
25806 | 23709 | 21937 | 21059 | 23262 |
24364 | 21669 | 20035 | 21573 | 22907 |
24477 | 21752 | 23590 | 21548 | 21519 |
23901 | 20761 | 21672 | 20000 | 22025 |
23175 | 23479 | 22222 | 22424 | 22604 |
23227 | 23824 | 22123 | 20615 | 20894 |
the first step is to store the data in a time series object in R, so that we can use R’s many functions for analyzing time series data. One way to do that is the base R function ts.
by default ts assumes equal space time units of one year. For other time spans we can use the argument frequency. For example, if data is by month, use frequency=12. We can also change the starting point:
birth.ts <- ts(births, frequency=12, start=c(1946, 1))
head(birth.ts)
## [1] 26663 23598 26931 24740 25806 24364
number of deaths per month in Puerto Rico, according to the official deaths statistics of the government:
Deaths <- c(2744, 2403, 2427, 2259, 2340, 2145,
2382, 2272, 2258, 2393, 2268, 2516,
2742, 2592, 2458, 2241, 2312, 2355,
2456, 2427, 2367, 2357, 2484, 2854,
2894, 2315, 2494, 2392, 2390, 2369,
2367, 2321, 2928, 3040, 2671, 2820,
2821, 2448, 2643, 2218)
pr.deaths.ts <- ts(Deaths, frequency=12,
start=c(2015, 1))
Data set has the daily closing values of the Dow Jones Industrial Index from January 2000 to November 2018 and the weekly closing values from January 1985 to November 2018.
the data set is available at dow.jones.rds
dow.jones <- readRDS("C:\\Users\\Wolfgang\\Dropbox\\teaching\\Computational-Statistics-with-R\\dow.jones.rds")
kable.nice(head(dow.jones$Weekly))
Date | Close |
---|---|
1985-01-28 | 1277 |
1985-02-04 | 1289 |
1985-02-11 | 1282 |
1985-02-18 | 1275 |
1985-02-25 | 1299 |
1985-03-04 | 1269 |
Again we want to turn this into a time series object. Here however we have the problem that the time variable is not equal spaced (sometimes the New York stock exchange is not open on a Monday or a Friday). One way to do this is to use the package zoo:
library(zoo)
dj.ts <- zoo(dow.jones$Weekly$Close,
order.by=as.Date(as.character(dow.jones$Weekly$Date)))
There is an obvious graph for a time series, namely a line graph by time. A ts object already has a plot function, so
library(ggfortify)
autoplot(birth.ts)
autoplot(pr.deaths.ts)
here we have a strong seasonal trend, deaths go up a lot in the winter. And of course we have a spike around September 2017!
autoplot(dj.ts)
In general a time series consists of some (or all) of these parts:
here we seem to have all of these:
One of the tasks in a time series analysis is to decompose the series into these parts.
births.dec <- decompose(birth.ts)
plot(births.dec)
so we see a clear trend and seasonal component
an alternative to decompose is the stl routine, which uses loess to do the fitting:
pr.deaths.dec <- stl(pr.deaths.ts, "periodic")
plot(pr.deaths.dec)
again a clear trend and seasonal component.
dj.dec <- stl(dj.ts, "periodic")
## Error in na.fail.default(as.ts(x)): missing values in object
and this gives an error. That is because this time series has irregular time points. It can be quite a chore to “fix” this problem. If the time periods are somewhat regular (in our case they are almost one week) it is easier to make a new series with a regular time scale:
dj.alt.ts <- ts(dow.jones$Weekly$Close, frequency = 52,
start=c(1985, 1))
dj.alt.dec <- stl(dj.alt.ts, "periodic")
plot(dj.alt.dec)
which shows a clear trend but not any seasonality. We also see that the variation is increasing over time.
As we said before, the special feature of a time series is a dependence between neighboring points. This suggests a way to analyse a time series by analyzing the correlation between time points.
Let’s define a new variable: \(Y_i=X_i-X_{i-1}\), that is the change in the number of births from one month to the next.
y <- diff(births)
plot(y)
here we now longer have any pattern, which shows us that this series consisted only of a trend (which we eliminated in y).
There are some R routines to do this for us for a time series:
acf(pr.deaths.ts)
which shows a clear seasonal pattern.
the above is an example of a general model for time series called ARIMA (auto-regressive integrated moving average).
Here are a couple of examples:
\[ X_i=c+\alpha_1 X_{i-1} +\epsilon_i \]
for the births data we saw that \(X_i-X_{i-1}\) was essentially random noise, so it seems that series can be modeled as an AR(1).
\[ X_i=c+\alpha_1 X_{i-1}+ \alpha_2 X_{i-2} +\epsilon_i \] - MA(1)=AR(0, 0, 1)
\[ X_i=c+\beta_1 \epsilon_{i-1} + \epsilon_i \]
finally ARIMA combines both variations.
let’s create some artificial data to see what is going on:
eps <- rnorm(100)
x <- sort(runif(100, 1, 10))
x.ar <- ts(5 + x[-1] + 0.2* x[-100] + eps[-100])
plot(x.ar)
acf(x.ar)
x.ma <- ts(5 + 2* eps[-1] + eps[-100])
plot(x.ma)
acf(x.ma)
so the acf plot of x.ma shows us that indeed we have a moving average process, with a lag of 1 or 2.
Fitting an ARIMA model can be done with the auto.arima routine from the forcast package:
library(forecast)
auto.arima(x.ar)
## Series: x.ar
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## -0.8492 0.1159
## s.e. 0.0626 0.0168
##
## sigma^2 estimated as 1.096: log likelihood=-143.16
## AIC=292.31 AICc=292.57 BIC=300.07
The most general way to analyse a time series is a spectral analysis. In essence, we assume that
\[ X_t=\sum_{j=1}^k \left(A_j \sin(2 \pi \nu_j t) +B_j \cos(2 \pi \nu_j t) \right) \] and a spectral analysis amounts to estimating the parameters \(A_j, B_j, \nu_j\).
A fuller discussion of this is beyond our course. It would require first a discussion Fourier analysis, one of the most widely used technics in mathematics and engineering.
The usual starting point is a look at the periodogram
par(mfrow=c(2, 2))
spec.pgram(birth.ts, spans = c(3, 3))
spec.pgram(birth.ts, spans = c(3, 5))
spec.pgram(birth.ts, spans = c(7, 11))
spec.pgram(birth.ts, spans = c(7, 21))
and so we see that larger values of span do more smoothing and show the underlying patters
spec.pgram(pr.deaths.ts, span= c(3, 3))
spec.pgram(dj.alt.ts, span= c(3, 9))
and as we know there is no seasonal component here.