Princeton University Library Data and Statistical Services


Lag Selection in Time Series Data

When running regressions on time-series data, it is often important to include lagged values of the dependent variable as independent variables. In technical terminology, the regression is then called an autoregression; when several variables are modeled jointly in this way, it is a vector autoregression (VAR). For example, when trying to sort out the determinants of GDP, it is likely that last year's GDP is correlated with this year's GDP. If this is the case, GDP lagged by at least one year should be included on the right-hand side of the regression.

If the variable in question is persistent--that is, values in the far past are still affecting today's values--more lags will be necessary. To determine how many lags to use, several selection criteria are available. The two most common are the Akaike Information Criterion (AIC) and Schwarz's Bayesian Information Criterion (SIC/BIC/SBIC). These rules choose the lag length j to minimize: log(SSR(j)/n) + (j + 1)C(n)/n, where SSR(j) is the sum of squared residuals for the VAR with j lags and n is the number of observations; C(n) = 2 for AIC and C(n) = log(n) for BIC.
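To make the rule concrete, here is a minimal Python sketch of the formula above. The function names and the SSR values are hypothetical, chosen only for illustration; in practice SSR(j) would come from fitting each candidate regression on the same n observations.

```python
import math

def info_criterion(ssr, n, j, kind="aic"):
    """log(SSR(j)/n) + (j + 1) * C(n) / n, with C(n) = 2 for AIC and log(n) for BIC."""
    if kind == "aic":
        penalty = 2.0
    elif kind == "bic":
        penalty = math.log(n)
    else:
        raise ValueError("kind must be 'aic' or 'bic'")
    return math.log(ssr / n) + (j + 1) * penalty / n

def select_lag(ssr_by_lag, n, kind="aic"):
    """Return the lag length j with the smallest criterion value."""
    return min(ssr_by_lag, key=lambda j: info_criterion(ssr_by_lag[j], n, j, kind))

# Hypothetical sums of squared residuals for j = 0..3 lags (made-up numbers),
# all assumed to come from fits on the same n = 15 observations.
ssr_by_lag = {0: 120.0, 1: 60.0, 2: 59.0, 3: 58.5}
n = 15

best_aic = select_lag(ssr_by_lag, n, "aic")
best_bic = select_lag(ssr_by_lag, n, "bic")
print(best_aic, best_bic)
```

With these made-up inputs, the first lag cuts the SSR sharply while further lags barely reduce it, so the penalty term dominates beyond j = 1 and both criteria select one lag. Because log(n) exceeds 2 once n is 8 or more, BIC penalizes extra lags more heavily than AIC and never selects a longer lag than AIC does.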

Fortunately, in Stata 8 there is a single command that will do the math for any number of specified lags: varsoc. The data must first be declared as time series with 'tsset'. To get the AIC and BIC, simply type 'varsoc depvar' in the command window. The default maximum number of lags Stata checks is 4; to check a different number, add ', maxlag(#)' after 'varsoc depvar', where # is the maximum number of lags. If, in addition, the regression has independent variables other than the lags, include them after the 'maxlag()' option by typing 'exog(varnames)'. The output marks the optimal lag for each criterion with an asterisk. Then run the regression with that number of lags of the dependent variable on the right-hand side, along with the other independent variables.

Example:

varsoc y, maxlag(5) exog(x z)

Selection order criteria

endogenous variables:
    y

exogenous variables:
    x z

constant included in models

Sample:       6       20
Obs = 15          

-------------------------------------------------------------------------------
lag     LL        LR      df    p        FPE       AIC       HQIC       SBIC
-------------------------------------------------------------------------------
  0   -45.854        .     .     .   39.70191    6.51381     6.5123    6.65542
  1   -35.849   20.009*    1  0.000  12.04354*   5.31319*   5.31118*   5.50201*
  2   -35.837    0.024     1  0.877  13.92282    5.44493    5.44241    5.68094
  3   -35.305    1.063     1  0.302  15.13169    5.50737    5.50435    5.79059
  4   -35.233    0.145     1  0.703  17.66201    5.63103    5.62751    5.96145
  5   -35.108    0.250     1  0.617   20.7534    5.74767    5.74365     6.1253

In this output every criterion (FPE, AIC, HQIC, and SBIC), as well as the LR test, marks lag 1 with an asterisk, so the optimal number of lags is 1 and the regression should look like:

reg y l.y x z
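
As a cross-check on how to read the varsoc table, the following Python sketch copies the LL, AIC, HQIC, and SBIC columns from the output above and finds the lag that minimizes each criterion. It also tests one assumption that is not stated on this page: that the AIC column equals (-2*LL + 2*k)/T, with T = 15 observations and k = lag + 3 estimated coefficients (the constant, x, z, and the lags of y).

```python
# Columns copied from the varsoc output above; list index = lag (0..5).
ll   = [-45.854, -35.849, -35.837, -35.305, -35.233, -35.108]
aic  = [6.51381, 5.31319, 5.44493, 5.50737, 5.63103, 5.74767]
hqic = [6.5123, 5.31118, 5.44241, 5.50435, 5.62751, 5.74365]
sbic = [6.65542, 5.50201, 5.68094, 5.79059, 5.96145, 6.1253]

def argmin(values):
    """Index (here: the lag) of the smallest value in the list."""
    return min(range(len(values)), key=values.__getitem__)

# The lag each criterion selects; this should match the asterisks in the table.
chosen = {"AIC": argmin(aic), "HQIC": argmin(hqic), "SBIC": argmin(sbic)}
print(chosen)

# Assumption being tested: AIC = (-2*LL + 2*k) / T with k = lag + 3.
T = 15
aic_from_ll = [(-2.0 * ll[lag] + 2.0 * (lag + 3)) / T for lag in range(len(ll))]
max_gap = max(abs(a - b) for a, b in zip(aic, aic_from_ll))
print(round(max_gap, 5))
```

The reconstructed values agree with the printed AIC column to within the rounding of the LL column, which suggests that varsoc reports likelihood-based criteria. These are on a different scale from the SSR-based formula given earlier, but they are used the same way: pick the lag with the smallest value.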

(For further options with the varsoc command, see the Time-Series Stata manual.)

For more on lag selection, please see Time Series 101.
