## How to handle autocorrelation

Written by San (SPSS, Excel & Stata – Data mining and Econometric Modeller)

Autocorrelation is the correlation of a time series with its own past and future values. Generally, usage has a tendency to remain in the same state from one observation to the next. This specific form of ‘persistence’ causes positive autocorrelation.

In regression analysis using panel data, autocorrelation of the error terms violates the OLS assumption that the error terms are uncorrelated. It is important to note that it does not bias the OLS coefficient estimates. However, the standard errors tend to be underestimated.

For example, let’s say a model is specified as:

Y_{t} = a + bX_{t} + m_{t}

where,

Y_{t} – Dependent variable at time t

a – Constant

X_{t} – Independent variable at time t

m_{t}– Error term at time t

Because of the panel nature of the data, the error terms for different years are correlated with one another. For example, the error term in year t is correlated with the error term in year t-1. This is because panel data tends to follow trends. Autocorrelation is expressed as follows:

Cov(m_{t}, m_{s}) = E(m_{t}m_{s}) ≠ 0

This indicates that autocorrelation occurs whenever the error term for period *t* is correlated with the error term for a different period *s*.
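As a rough illustration of this condition, the sketch below simulates error terms with first-order persistence and estimates their sample autocorrelation. The persistence coefficient of 0.7, the sample size and the helper names `simulate_ar1_errors` and `sample_autocorr` are assumptions chosen for the example; they do not come from the model above.

```python
import numpy as np

def simulate_ar1_errors(n, rho, seed=0):
    """Simulate m_t = rho * m_{t-1} + e_t with standard normal shocks e_t."""
    rng = np.random.default_rng(seed)
    shocks = rng.standard_normal(n)
    m = np.zeros(n)
    m[0] = shocks[0]
    for t in range(1, n):
        m[t] = rho * m[t - 1] + shocks[t]
    return m

def sample_autocorr(m, lag):
    """Sample correlation between m_t and m_{t-lag}, for lag >= 1."""
    d = m - m.mean()
    return float(np.sum(d[lag:] * d[:-lag]) / np.sum(d * d))

# rho = 0.7 is an assumed value; rho = 0.0 gives uncorrelated errors
persistent = simulate_ar1_errors(5000, rho=0.7)
independent = simulate_ar1_errors(5000, rho=0.0)

print("lag-1 autocorrelation, persistent errors: ", round(sample_autocorr(persistent, 1), 3))
print("lag-1 autocorrelation, independent errors:", round(sample_autocorr(independent, 1), 3))
```

With persistent errors the lag-1 estimate should sit close to the assumed coefficient, while for independent errors it should be close to zero.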

Traditionally, the Durbin-Watson statistic is used to identify the presence of first-order autocorrelation, or Durbin’s *h* statistic if the explanatory variables include a lagged dependent variable. The xtabond2 procedure in Stata, however, includes the Arellano-Bond test for autocorrelation in first differences. The Arellano-Bond test has a null hypothesis of no autocorrelation and is applied to the differenced residuals. The test for an AR(1) process in first differences usually rejects the null hypothesis, because the differenced error in period *t* and the differenced error in period *t-1* both contain the error term for period *t-1*. For that reason the AR(2) test in first differences is the more informative check for autocorrelation in levels.
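The Durbin-Watson statistic itself is simple to compute from OLS residuals. The following is a minimal sketch on simulated data for a single time series, not a panel implementation; the coefficients a = 1 and b = 2, the error persistence of 0.7 and the function name `durbin_watson` are assumptions made for the example.

```python
import numpy as np

def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).

    Values near 2 suggest no first-order autocorrelation;
    values well below 2 suggest positive autocorrelation.
    """
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Simulate Y_t = a + b*X_t + m_t with AR(1) errors (all values assumed)
rng = np.random.default_rng(1)
n = 2000
x = rng.standard_normal(n)
shocks = rng.standard_normal(n)
m = np.zeros(n)
for t in range(1, n):
    m[t] = 0.7 * m[t - 1] + shocks[t]
y = 1.0 + 2.0 * x + m

# Fit OLS by least squares and take the residuals
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

dw = durbin_watson(resid)
print("estimated slope:", round(float(coef[1]), 3))
print("Durbin-Watson:  ", round(dw, 3))
```

Since the statistic is approximately 2(1 - r), where r is the lag-1 autocorrelation of the residuals, a value well below 2 is expected here, while the slope estimate should remain close to the assumed value of 2.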

After an extensive literature review and consultations with experts in this field, the following actions can be tried to reduce the autocorrelation.

- Adding other variables as independent variables
- Transforming variables into different functional forms
- Clustering on different time-invariant factors
- Experimenting with model specification

Many different variables can be introduced to the model to reduce the autocorrelation. For example, instead of only one lag of the dependent variable, additional lags can be tried, as the value one year ago may be more important than the value one quarter ago. Alternative variables, such as the gross domestic product (GDP) index instead of household income, can also be tried, as can interactions between weather and seasonal variables. Most of these variables may prove not to add any value to the model specification.
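Building the extra lags is mostly bookkeeping. The sketch below constructs a one-quarter and a four-quarter lag for a single quarterly series; the helper `add_lags` and the toy `usage` values are hypothetical. In a panel, the lags would need to be built separately within each household so that values do not spill across units.

```python
import numpy as np

def add_lags(y, lags):
    """Return (y_trimmed, lag_matrix); column j of lag_matrix holds y lagged by lags[j].

    The first max(lags) observations are dropped because their lags are unavailable.
    """
    y = np.asarray(y, dtype=float)
    max_lag = max(lags)
    cols = [y[max_lag - k : len(y) - k] for k in lags]
    return y[max_lag:], np.column_stack(cols)

# Toy quarterly series (hypothetical values)
usage = np.arange(10.0)

# Lag 1 = one quarter ago, lag 4 = one year ago
y_now, y_lagged = add_lags(usage, lags=[1, 4])

print("current values:", y_now)
print("lag 1 column:  ", y_lagged[:, 0])
print("lag 4 column:  ", y_lagged[:, 1])
```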

Many variables can be transformed into different forms and tested to see if the autocorrelation is reduced. Some of the transformations are:

- Deviation from average values
- Log form
- Exponential form
- Annualised

Generally, the deviation from the average values works best for weather data. The log form or exponential form may or may not bring improvements. Annualising the data is intended to remove seasonal effects.
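A minimal sketch of three of these transformations follows. The quarterly temperature values are made up for the example, and annualising is interpreted here as a rolling four-quarter sum, which is an assumption; the exact definition used in practice may differ.

```python
import numpy as np

def deviation_from_average(x):
    """Subtract the series mean, so values are deviations from the average."""
    x = np.asarray(x, dtype=float)
    return x - x.mean()

def annualise(x, periods_per_year=4):
    """Rolling sum over each run of `periods_per_year` consecutive observations (assumed definition)."""
    x = np.asarray(x, dtype=float)
    return np.convolve(x, np.ones(periods_per_year), mode="valid")

# Hypothetical quarterly temperatures over two years
temperature = np.array([10.0, 20.0, 30.0, 20.0, 12.0, 22.0, 28.0, 18.0])

dev = deviation_from_average(temperature)
log_temp = np.log(temperature)  # log form requires strictly positive values
annual = annualise(temperature)

print("deviation from average:", dev)
print("log form:              ", np.round(log_temp, 3))
print("rolling annual sum:    ", annual)
```

The rolling sum always covers one full set of four quarters, which is why it smooths out the seasonal pattern that is visible in the raw values.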

Clustering based on variables such as income and lot size can reduce the autocorrelation problem. This may be due to the phenomenon of seasonality, with houses reacting differently to the same weather conditions. However, the *p* values of the estimates can get worse, as the number of households within each cluster is smaller than in the whole segment. Hence, the clustering analysis can instead be used to identify outliers, and appropriate actions can then be taken.

By default, xtabond2 applies system GMM. However, the difference GMM estimator can be used instead by adding the option ‘noleveleq’. Both specifications can be tried for each variable.

Regardless of which actions are taken, the fact remains that there is a lot of persistence in the data, which causes the autocorrelation problem.