Panel data, also called longitudinal data or cross-sectional time series data, are data where multiple cases (people, firms, countries, etc.) were observed at two or more time periods. An example is the National Longitudinal Survey of Youth, in which a nationally representative sample of young people was surveyed repeatedly over multiple years.
There are two kinds of information in cross-sectional time-series data: the cross-sectional information reflected in the differences between subjects, and the time-series or within-subject information reflected in the changes within subjects over time. Panel data regression techniques allow you to take advantage of these different types of information.
While it is possible to use ordinary multiple regression techniques on panel data, they may not be optimal. The estimates of coefficients derived from regression may be subject to omitted variable bias - a problem that arises when there is some unknown variable or variables that cannot be controlled for that affect the dependent variable. With panel data, it is possible to control for some types of omitted variables even without observing them, by observing changes in the dependent variable over time. This controls for omitted variables that differ between cases but are constant over time. It is also possible to use panel data to control for omitted variables that vary over time but are constant between cases.
Using Panel Data in Stata
A panel dataset should have data on n cases, over t time periods, for a total of n × t observations. Data like this are said to be in long form. In some cases your data may come in what is called wide form, with only one observation per case and a separate variable for each measure at each time period. To analyze data like this in Stata using commands for panel data analysis, you first need to convert it to long form. This can be done using Stata's reshape command. For assistance in using reshape, see Stata's online help or this web page.
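As a minimal sketch of the wide-to-long conversion described above, the following builds a tiny wide dataset and reshapes it. The variable names (id, income2001 and so on) and the values are invented for illustration only.

```stata
* Hypothetical wide-form data: one row per case, one income variable per year.
* All names and values here are made up for illustration.
clear
input id income2001 income2002 income2003
1 100 110 120
2 200 190 210
3 150 160 155
end

* Convert to long form: the stub "income" becomes a single variable,
* and the numeric suffix is stored in the new variable "year".
reshape long income, i(id) j(year)

* The result has 3 cases x 3 years = 9 observations.
list, sepby(id)
```

After the reshape, id can serve as the panel variable and year as the time variable.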
Stata provides a number of tools for analyzing panel data. The commands all begin with the prefix xt and include xtreg, xtprobit, xtsum and xttab - panel data versions of the familiar reg, probit, sum and tab commands.
To use these commands, first tell Stata that your dataset is panel data. You need to have a variable that identifies the case element of your panel (for example, a country or person identifier) and also a time variable that is in Stata date format. For information about Stata's date variable formats, see our Time Series Data in Stata page.
Sort your data by the panel variable and then by the date variable within the panel variable. Then you need to issue the tsset command to identify the panel and date variables. If your panel variable is called panelvar and your date variable is called datevar, the commands needed are:
. sort panelvar datevar
. tsset panelvar datevar
If you prefer to use menus, use the command under Statistics > Time Series > Setup and Utilities > Declare Data to be Time Series.
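The setup steps above can be tried on one of the example datasets that Stata distributes for its manuals. The sketch below assumes internet access for webuse and uses nlswork, an extract from the National Longitudinal Survey of Young Women (a related survey, not the NLSY cohort itself), in which idcode identifies the person and year is the survey year.

```stata
* Load an example panel used in Stata's documentation (requires internet access).
webuse nlswork, clear

* Declare the panel: idcode is the case identifier, year is the time variable.
sort idcode year
tsset idcode year

* Describe the pattern of observations per person.
xtdescribe

* Split the variation in each variable into between and within components.
xtsum ln_wage tenure

* Tabulate a categorical variable overall, between and within cases.
xttab south
```

In recent versions of Stata, xtset idcode year declares the same panel structure.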
Fixed, Between and Random Effects models
Fixed Effects Regression
Fixed effects regression is the model to use when you want to control for omitted variables that differ between cases but are constant over time. It lets you use the changes in the variables over time to estimate the effects of the independent variables on your dependent variable, and is the main technique used for analysis of panel data.
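One way to see why fixed effects removes time-constant omitted variables is to write the model out. The notation below is an assumption introduced for illustration (it is not used elsewhere on this page): y is the dependent variable, x an independent variable, i indexes cases, t indexes time periods, and alpha_i collects everything about case i that does not change over time.

```latex
% Model with a case-specific, time-constant effect \alpha_i
y_{it} = \alpha_i + \beta x_{it} + \varepsilon_{it}

% Averaging over time within each case
\bar{y}_i = \alpha_i + \beta \bar{x}_i + \bar{\varepsilon}_i

% Subtracting the second line from the first: \alpha_i drops out
y_{it} - \bar{y}_i = \beta \, (x_{it} - \bar{x}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)
```

Because alpha_i cancels, the coefficient is estimated only from changes within each case over time, whether or not the components of alpha_i were ever observed.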
The command for a linear regression on panel data with fixed effects in Stata is xtreg with the fe option, used like this:
xtreg dependentvar independentvar1 independentvar2 independentvar3 ... , fe
If you prefer to use the menus, the command is under Statistics > Cross-sectional time series > Linear models > Linear regression.
This is equivalent to generating dummy variables for each of your cases and including them in a standard linear regression to control for these fixed "case effects". It works best when you have relatively few cases and many time periods, as each dummy variable removes one degree of freedom from your model.
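The equivalence with case dummies can be checked on simulated data. The sketch below is illustrative only: the variable names, the seed and the data-generating process are all assumptions, and it relies on factor-variable notation (i.id) and rnormal(), which require a reasonably recent version of Stata.

```stata
* Simulate a small balanced panel: 50 cases, 6 periods each (all values made up).
clear
set seed 12345
set obs 50
gen id = _n
* u is an unobserved, time-constant case effect.
gen u = rnormal()
expand 6
bysort id: gen t = _n
* x is correlated with the case effect, so pooled OLS without dummies would be biased.
gen x = 0.5*u + rnormal()
gen y = 1 + 2*x + u + rnormal()

tsset id t

* Fixed effects via xtreg.
xtreg y x, fe

* Same model with an explicit dummy variable for each case.
regress y x i.id
```

The coefficient on x should be the same in both sets of output; the second regression simply displays the 49 case dummies that xtreg absorbs.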
Between Effects Regression
Regression with between effects is the model to use when you want to control for omitted variables that change over time but are constant between cases. It allows you to use the variation between cases to estimate the effect of the independent variables on your dependent variable.
The command for a linear regression on panel data with between effects in Stata is xtreg with the be option.
Running xtreg with between effects is equivalent to taking the mean of each variable for each case across time and running a regression on the collapsed dataset of means. As this results in loss of information, between effects are not used much in practice. Researchers who want to look at time effects without considering panel effects generally will use a set of time dummy variables, which is the same as running time fixed effects.
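The equivalence described above can be sketched with simulated data. Everything in this example (variable names, seed, data-generating process) is an assumption made for illustration, and the panel is deliberately balanced so that the two approaches line up exactly.

```stata
* Simulate a small balanced panel: 40 cases, 5 periods each (all values made up).
clear
set seed 2024
set obs 40
gen id = _n
gen u = rnormal()
expand 5
bysort id: gen t = _n
gen x = rnormal()
gen y = 1 + 2*x + u + rnormal()

tsset id t

* Between effects via xtreg.
xtreg y x, be

* The same thing by hand: collapse to one row of means per case, then run OLS.
preserve
collapse (mean) y x, by(id)
regress y x
restore

* Time effects without panel effects: OLS with a dummy for each period.
regress y x i.t
```

The first two regressions should report the same coefficient on x. The last line shows the time-dummy approach mentioned above.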
The between effects estimator is mostly important because it is used to produce the random effects estimator.
Random Effects Regression
If you have reason to believe that some omitted variables may be constant over time but vary between cases, and others may be fixed between cases but vary over time, then you can include both types by using random effects. Stata's random-effects estimator is a weighted average of fixed and between effects.
The command for a linear regression on panel data with random effects in Stata is xtreg with the re option.
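To see how the three estimators compare on the same data, the results can be stored and tabulated side by side. This sketch again assumes internet access and uses the nlswork example dataset. The choice of regressors is arbitrary, and the stored names fe_m, be_m and re_m are made up for illustration.

```stata
* Example panel from Stata's documentation (requires internet access).
webuse nlswork, clear
tsset idcode year

* Fit the same specification with each estimator and store the results.
xtreg ln_wage tenure age not_smsa south, fe
estimates store fe_m

xtreg ln_wage tenure age not_smsa south, be
estimates store be_m

xtreg ln_wage tenure age not_smsa south, re
estimates store re_m

* Show coefficients and standard errors side by side.
estimates table fe_m be_m re_m, b(%9.4f) se(%9.4f)
```

In the resulting table, the random effects coefficients would typically fall between the fixed and between effects estimates, consistent with the weighted-average description above.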
Choosing Between Fixed and Random Effects
The generally accepted way of choosing between fixed and random effects is running a Hausman test.
Statistically, fixed effects are always a reasonable thing to do with panel data (they always give consistent results), but they may not be the most efficient model to run. Random effects is a more efficient estimator, so it will give you smaller standard errors and therefore smaller P-values; you should run random effects if it is statistically justifiable to do so.
The Hausman test checks a more efficient model against a less efficient but consistent model to make sure that the more efficient model also gives consistent results.
To run a Hausman test comparing fixed with random effects in Stata, you need to first estimate the fixed effects model, save the coefficients so that you can compare them with the results of the next model, estimate the random effects model, and then do the comparison.
. xtreg dependentvar independentvar1 independentvar2 independentvar3 ... , fe
. estimates store fixed
. xtreg dependentvar independentvar1 independentvar2 independentvar3 ... , re
. estimates store random
. hausman fixed random
The Hausman test tests the null hypothesis that the coefficients estimated by the efficient random effects estimator are the same as the ones estimated by the consistent fixed effects estimator. If they are (insignificant P-value, Prob>chi2 larger than .05), then it is safe to use random effects. If you get a significant P-value, however, you should use fixed effects.
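The full sequence can be tried on simulated data where the answer is known in advance. In the sketch below, which is illustrative only, the regressor x is deliberately built to be correlated with the unobserved case effect u, which is exactly the situation in which random effects is inconsistent. Variable names, seed and parameter values are all assumptions.

```stata
* Simulate a balanced panel: 200 cases, 5 periods each (all values made up).
clear
set seed 4242
set obs 200
gen id = _n
* u is the unobserved, time-constant case effect.
gen u = rnormal()
expand 5
bysort id: gen t = _n
* x depends on u, so the random effects assumption is violated by construction.
gen x = u + rnormal()
gen y = 1 + 2*x + u + rnormal()

tsset id t

* Consistent estimator first, then the efficient one, then compare.
xtreg y x, fe
estimates store fixed
xtreg y x, re
estimates store random
hausman fixed random

* The test's P-value is also left behind in r(p).
display "Hausman P-value: " r(p)
```

With data built this way one would expect a small P-value, pointing to fixed effects, although the exact figures depend on the seed and Stata version. Replacing the line that generates x with gen x = rnormal() removes the correlation, in which case the test would usually fail to reject.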
Panel Data Analysis (fixed & random