Fundamentals of Statistics contains material of various lectures and courses of H. Lohninger on statistics, data analysis and chemometrics.


Time Series - Establishing ARIMA models

The process of finding appropriate ARIMA models has been studied intensively. As a result, detailed guidelines exist. The method described in [Box and Jenkins, 1970] is referred to as the "Box-Jenkins approach."
 

1. Model Selection

In the model selection phase, a single model is chosen. This requires determining the orders p, d, and q of an ARIMA[p,d,q]-model. In this phase, it is important to collect as much relevant information about the time series as possible.

The first steps are to filter out trends and to remove seasonal effects. Inspecting the correlation function helps to reveal the best choice of d. Heuristics exist as to which filter to use, depending on the shape of the correlation function: for instance, whether it is descending, alternates between positive and negative values, has peaks, or is periodic.

Heuristics also guide the selection of an appropriate model for time series with p ≤ 2 and q ≤ 2. Surprisingly, the majority of time series can be modeled very well with such simple models.

The autocorrelation function (ACF) and the partial autocorrelation function (PACF) can be used to determine p and q of the ARIMA[p,d,q]-model. Both are computed for a limited number of time lags τ, e.g. 20, and confidence intervals (e.g. 95% intervals) are calculated. Time lags τ at which the function lies outside the confidence interval are candidates for p and q: such lags in the ACF indicate that an MA[τ] model should be used, and such lags in the PACF indicate that an AR[τ] model may be applicable.
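The lag inspection described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete identification procedure: the test series is a simulated AR[1] process chosen for the example, the band ±1.96/√n is the usual large-sample approximation of a 95% interval for white noise (an assumption, not a rule taken from the text), and the helper names `acf`, `pacf`, and `significant_lags` are hypothetical.

```python
import math
import random


def acf(x, max_lag):
    """Sample autocorrelation r(0..max_lag) of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]


def pacf(r):
    """Partial autocorrelations for lags 1..len(r)-1 (Durbin-Levinson recursion)."""
    result, prev = [], []
    for k in range(1, len(r)):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        result.append(phi_kk)
    return result


def significant_lags(values, first_lag, bound):
    """Lags whose value lies outside the band [-bound, +bound]."""
    return [first_lag + i for i, v in enumerate(values) if abs(v) > bound]


# Hypothetical test data: an AR[1] process x(t) = 0.7 * x(t-1) + noise.
random.seed(0)
x = [0.0]
for _ in range(499):
    x.append(0.7 * x[-1] + random.gauss(0.0, 1.0))

r = acf(x, 20)                       # ACF for lags 0..20
p = pacf(r)                          # PACF for lags 1..20
bound = 1.96 / math.sqrt(len(x))     # approximate 95% band (assumption)
acf_lags = significant_lags(r[1:], 1, bound)
pacf_lags = significant_lags(p, 1, bound)
print("ACF lags outside band: ", acf_lags)
print("PACF lags outside band:", pacf_lags)
```

For an AR[1] series like this one, the PACF is expected to stand out mainly at lag 1 while the ACF decays gradually over several lags; isolated lags just outside the band can also occur by chance.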
 

2. Parameter Estimation

In order to estimate the time series value x(t) with an ARIMA[p,d,q]-model, p, d, and q have to be selected first. The number of differencing steps d determines how often the original time series is differenced before the respective formula is applied. Differencing is required for filtering trends.
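The effect of d can be illustrated with a short sketch in which each step replaces the series by the differences of consecutive values, x(t) - x(t-1). The function name `difference` and the example series are hypothetical and chosen only for illustration.

```python
def difference(x, d=1):
    """Apply d differencing steps: each step maps x(t) to x(t) - x(t-1)."""
    for _ in range(d):
        x = [b - a for a, b in zip(x, x[1:])]
    return x


# Hypothetical series with a quadratic trend, x(t) = t^2.
series = [t * t for t in range(6)]   # [0, 1, 4, 9, 16, 25]
print(difference(series, 1))         # [1, 3, 5, 7, 9]  (a linear trend remains)
print(difference(series, 2))         # [2, 2, 2, 2]     (the trend is removed)
```

Each step shortens the series by one value, so a series of length n has n - d values left after d steps.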

When p, d, and q of an ARIMA model are given, the parameters α_i and β_j can be estimated. This is done by minimizing (some function of) the error, i.e. the distance between the original time series and the time series produced by the model. When differencing is used, i.e. d > 0, the errors are computed for the d-th difference of the time series. The "least squares approach" is the most common technique; it minimizes the sum of squared errors.

Depending on the overall task, other performance measures may be formulated to measure the quality of the model. The least squares criterion is often used as the default, but other measures may be more reasonable for a given application.
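The least squares idea can be made concrete for the simplest case, a zero-mean AR[1] model x(t) = α·x(t-1) + e(t), where the minimum of the squared errors has a closed form. This is only a sketch under that assumption: the simulated data and the names `fit_ar1` and `sum_squared_errors` are hypothetical, and models with MA terms (q > 0) generally need an iterative minimization instead, because their errors depend nonlinearly on the parameters.

```python
import random


def fit_ar1(x):
    """Least squares estimate of alpha in x(t) = alpha * x(t-1) + e(t).

    Minimizes sum((x(t) - alpha * x(t-1))**2); assumes a zero-mean series.
    """
    num = sum(cur * prev for prev, cur in zip(x, x[1:]))
    den = sum(prev * prev for prev in x[:-1])
    return num / den


def sum_squared_errors(x, alpha):
    """Sum of squared one-step errors of the AR[1] model with parameter alpha."""
    return sum((cur - alpha * prev) ** 2 for prev, cur in zip(x, x[1:]))


# Hypothetical data: simulated AR[1] process with alpha = 0.6.
random.seed(1)
x = [0.0]
for _ in range(1999):
    x.append(0.6 * x[-1] + random.gauss(0.0, 1.0))

alpha_hat = fit_ar1(x)
print("estimated alpha:", round(alpha_hat, 3))
print("squared error at estimate:", round(sum_squared_errors(x, alpha_hat), 1))
```

Replacing `sum_squared_errors` by another measure, for example the sum of absolute errors, would change the estimate and usually removes the closed-form solution.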
 

3. Performance Checking

To check the performance, it is important to use independent test sets consisting of time series which have not yet been involved in the modeling process. The error on these independent test sets is compared to that obtained with other models. Usually, the error is a value obtained by applying some function to the difference between the observed and the forecast values.
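A comparison on independent data might look like the following sketch. It makes several assumptions that are not part of the text: the test set is simply the last part of one simulated series (a chronological split), the error function is the root mean squared error, and the competing model is a trivial baseline that always forecasts the process mean 0. The names `rmse` and `fit_ar1` are hypothetical.

```python
import math
import random


def rmse(observed, forecast):
    """Root mean squared error between observed and forecast values."""
    total = sum((o - f) ** 2 for o, f in zip(observed, forecast))
    return math.sqrt(total / len(observed))


def fit_ar1(x):
    """Least squares estimate of alpha for a zero-mean AR[1] model."""
    num = sum(cur * prev for prev, cur in zip(x, x[1:]))
    den = sum(prev * prev for prev in x[:-1])
    return num / den


# Hypothetical data: simulated AR[1] process with alpha = 0.8.
random.seed(2)
x = [0.0]
for _ in range(999):
    x.append(0.8 * x[-1] + random.gauss(0.0, 1.0))

# The test part is not used for estimating the parameter.
train, test = x[:700], x[700:]
alpha_hat = fit_ar1(train)

# One-step-ahead forecasts: each test value is predicted from its predecessor.
previous = [train[-1]] + test[:-1]
ar1_forecast = [alpha_hat * v for v in previous]
baseline_forecast = [0.0] * len(test)   # competing model: always forecast 0

rmse_ar1 = rmse(test, ar1_forecast)
rmse_baseline = rmse(test, baseline_forecast)
print("RMSE AR[1]:   ", round(rmse_ar1, 3))
print("RMSE baseline:", round(rmse_baseline, 3))
```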

Box and Jenkins advise taking a look at the autocorrelation functions of the time series and of the errors. If the latter contains any suspicious peaks, the model does not exploit all the available information. Moreover, it is reasonable to evaluate the performance of ARIMA models of higher order: ARIMA[p+1,d,q] and ARIMA[p,d,q+1]. This shows whether models of higher order improve the forecasts. If a higher-order model does not provide better forecasts, the model of lower order is preferred, because it has fewer parameters. In order to avoid under- and overdifferencing, the models with lower and higher d (ARIMA[p,d-1,q] and ARIMA[p,d+1,q]) should also be tested. Finally, more complex models may be checked.
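The check of the error autocorrelation can be sketched as follows. The example compares the errors of a fitted AR[1] model with those of a deliberately too simple model that always forecasts 0; the simulated data, the ±1.96/√n band, and all function names are assumptions made for this illustration.

```python
import math
import random


def acf(x, max_lag):
    """Sample autocorrelation r(0..max_lag) of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]


def fit_ar1(x):
    """Least squares estimate of alpha for a zero-mean AR[1] model."""
    num = sum(cur * prev for prev, cur in zip(x, x[1:]))
    den = sum(prev * prev for prev in x[:-1])
    return num / den


# Hypothetical data: simulated AR[1] process with alpha = 0.7.
random.seed(3)
x = [0.0]
for _ in range(999):
    x.append(0.7 * x[-1] + random.gauss(0.0, 1.0))

alpha_hat = fit_ar1(x)
errors_ar1 = [cur - alpha_hat * prev for prev, cur in zip(x, x[1:])]
errors_zero = x[1:]                  # errors of a model that always forecasts 0

bound = 1.96 / math.sqrt(len(errors_ar1))   # approximate 95% band (assumption)
r_ar1 = acf(errors_ar1, 10)
r_zero = acf(errors_zero, 10)
peaks_ar1 = [k for k in range(1, 11) if abs(r_ar1[k]) > bound]
peaks_zero = [k for k in range(1, 11) if abs(r_zero[k]) > bound]
print("peaks in AR[1] errors:       ", peaks_ar1)
print("peaks in zero-forecast errors:", peaks_zero)
```

Peaks in the errors of the too simple model indicate structure that it leaves unexplained, whereas the errors of an adequate model should show few or no lags outside the band.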