Princeton University Library Data and Statistical Services

Data, Datasets and Variables

Data management

The DSS data management guide covers setting the working directory, keeping a log file, opening and saving a Stata data file, the Stata color-coding system, renaming, recoding, and creating variables, dropping cases, deleting variables, merge, append, frequencies, crosstabulations, and descriptive statistics.

Data Files

A data set is just a file in which rows represent observations and columns represent variables. For example, an observation might be a car, and the variables would be pieces of information about the car, such as the make, length, price, and gear-ratio:

make,length,price,gear_ratio
"AMC Concord",186,4099,3.58
"AMC Pacer",173,4749,2.53
"AMC Spirit",168,3799,3.08
"Buick Century",196,4816,2.93
"Buick Electra",222,7827,2.41
"Buick LeSabre",218,5788,2.73
"Buick Opel",170,4453,2.87

If data is already in Stata's proprietary file format, the file will have the extension .dta, for example mydata.dta. Data in this format can be read directly into Stata with the use command.

	. use mydata
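
As a minimal sketch of the use command in action, the lines below create a small .dta file from the auto example data that ships with Stata and then read it back. The file name cars_demo is hypothetical, and the // comments assume the lines are run from a do-file rather than typed at the prompt.

```stata
sysuse auto, clear                        // load the example data shipped with Stata
save cars_demo, replace                   // write cars_demo.dta (hypothetical name)

use cars_demo, clear                      // read it back; clear drops any data in memory
describe                                  // list the variables and their storage types
list make length price gear_ratio in 1/7  // the seven cars shown in the example above
```

The clear option is needed whenever unsaved data is already in memory; without it, Stata refuses to load the new file.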

If Stata gives you the error message

     no room to add more observations

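when you try to open a data file, Stata does not have enough memory allocated to hold the data. In Stata 11 and earlier, allocate more with the set memory command before opening the file; Stata 12 and later manage memory automatically.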

Stata can read data sets in various text formats as well as in Stata's proprietary format. Often you will start with data in text format, read it into Stata, and save it in Stata format.

You may also come across data in various other formats. For example, data from certain data archives is often formatted for the statistical package SPSS. A program called DBMS/Copy, available in the DSS lab as well as on Windows machines in the OIT public clusters, can convert data from SPSS and from many other formats to Stata format quickly and easily.

A common text format is the delimited file. Delimited files are most commonly tab- or comma-delimited. This just means that the variables in each observation are entered one after the other on a line and separated by tabs or commas, while the observations are separated by hard returns. The example above is actually how a comma-delimited text file would look if opened in Word.

The command syntax to read in a tab- or comma-delimited file is:

     insheet using [filename]
where filename is the name of the file that contains the tab- or comma-delimited data.
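
The sketch below shows one way to try insheet without preparing a file by hand: it writes a comma-delimited file from the auto example data and reads it back. The file name cars.csv is hypothetical, and the // comments assume a do-file.

```stata
sysuse auto, clear
* write four variables to a comma-delimited text file (cars.csv is a made-up name)
outsheet make length price gear_ratio using cars.csv, comma replace

* read the text file back in; the first line supplies the variable names
insheet using cars.csv, comma clear
describe                      // make is read as a string, the others as numeric
```

The comma option states the delimiter explicitly; if it is left off, insheet tries to work out whether the file is tab- or comma-delimited on its own.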

insheet is often used to read spreadsheets saved as "csv" (comma-delimited) files from a package such as Excel. Please note that a spreadsheet needs to be put in a "Stata-friendly" form before Stata will be able to read it in appropriately. Failure to do so may cause headaches.

For further details, type "help insheet" in Stata.

There are two commands other than insheet, infile and infix, that read other, less common types of text files. If you have space-delimited data or fixed-width data, or come across a Stata data dictionary, type "help infile" or "help infix" in Stata.

Saving Stata files (datasets)

You can use the Stata save command to save a file in Stata format:

     save [filename]
where filename is the name of your Stata file. For example:
     save myfile
will save a Stata file named "myfile.dta." This file can be read in Stata with the use command. Note that the ".dta" file extension is automatically appended to Stata files. You do not have to include the file extension on the use or save commands.

If you already have a Stata file named "myfile.dta" and wish to save an updated version of the file under the same name, then use the Stata save command with the replace option, as in:

     save [filename], replace
where filename is the name of the file you wish to replace, e.g.:
     save myfile, replace
To save an updated version of the active file, you can simply type:
     save, replace
This command will destroy the previous version of your file, so use the replace option only if you are certain that you will not need the older version of your file. There is no way to retrieve your original file once another file has written over it.
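
One cautious workflow, sketched here under the assumption that disk space is not a concern, is to keep an untouched copy under one name and save changes under another. The file names auto_original and auto_working and the variable price_k are hypothetical; the // comments assume a do-file.

```stata
sysuse auto, clear
save auto_original, replace       // untouched copy of the data

generate price_k = price/1000     // make a change (price in thousands)
save auto_working, replace        // the change goes to a separate file

use auto_original, clear          // the first version is still available
describe                          // price_k is not in this file
```

Here replace only ever overwrites files created by these same lines, so rerunning them is safe.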

Missing Values

Sometimes a variable is missing for some observations. (Missing means that there is no value: the person didn't answer the survey question, or the data could not be acquired for some other reason.) In Stata, missing values in numeric variables are represented by a period (.). By default, observations with missing values are left out of tables produced by tab and are also left out of regressions. They appear as periods in the Stata data browser and are written as periods in commands. Missing string values appear as blank cells in the browser and are written in commands as two double quotes with nothing in between them (""). What we mean by "written in commands" will make more sense a little later.

Remember that if you are saving data out of Excel, the missing values need to have been left blank for Stata to recognize them as missing.
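
To make "written in commands" concrete, here is a short sketch using the auto example data that ships with Stata, in which the numeric variable rep78 is missing for a few cars. The // comments assume a do-file.

```stata
sysuse auto, clear

count if rep78 == .           // numeric missing is written as a period
list make price if rep78 == . // show the cars with no repair record

tab rep78                     // missing observations are left out of the table
tab rep78, missing            // the missing option adds a row for them

count if make == ""           // string missing is written as two double quotes
```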

Stata Variable Types

There are two types of variables in Stata: numeric and string. A third type, date, is really a special type of numeric, as we will see. Numeric variables are simple - they contain numbers. String variables contain text, which can include any characters on the keyboard: letters, numbers, and special characters. In the auto3 data set, make is a string variable - all the others are numeric. We can do numeric calculations and statistical analysis on numeric variables, but not on string variables. String variables are usually used as identifiers for the observation.

One of the numeric variables, date, is intended to represent a date - let's say it was the date the data about each car was collected. In Stata, dates are numbers that represent the number of days since January 1, 1960. Representing dates as numbers this way allows us to do calculations on them, like measuring the length of time between two dates. But it looks weird, and Stata has a simple way to make date variables look like dates:

. format date %d

(The format command has other uses, which you can see by typing "help format".)

You can use the di command together with the d() function to display the Stata value of any date:

. di d(15apr1998)
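
The sketch below illustrates the arithmetic that the day-count representation makes possible. It uses td() and %td, the names current versions of Stata document for the date function and date format shown above; the variable name collected is hypothetical, and the // comments assume a do-file.

```stata
display td(15apr1998)                  // 13984 days since January 1, 1960
display td(15apr1998) - td(01jan1998)  // 104 days between the two dates
display %td 13984                      // shows the number as 15apr1998

clear
set obs 1
generate collected = td(15apr1998)     // stored as the number 13984
format collected %td                   // displayed as 15apr1998
list
```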

For more information on how Stata handles dates and time data, type "help datetime" in Stata.

Variable Naming Conventions

The rules for naming variables in Stata are simple:

  1. Stata is case-sensitive (price and Price are different variables), so using all lower case letters in variable names is a good idea.
  2. Variable names can contain no more than 32 characters.
  3. Variable names can contain letters, numbers, or underscores (_).
  4. Spaces and other special characters (like &, *, %, etc.) are not allowed.
  5. The first character must be a letter or underscore, not a number. Starting variable names with underscores is a really bad idea, since Stata's built-in variables begin with an underscore.
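
A quick way to see these rules enforced is to try a few names on a one-observation scratch data set, as in this sketch. The variable names are made up for illustration; capture suppresses the error message so that the return code _rc can be displayed. The // comments assume a do-file.

```stata
clear
set obs 1

generate gear_ratio = 3.58      // valid: letters and an underscore
generate price2 = 4099          // valid: a number is fine after the first character

capture generate 2price = 4099  // invalid: starts with a number
display _rc                     // nonzero return code signals the error

generate Price = 1              // allowed, but a different variable from price2 or price
generate price = 2              // Stata treats Price and price as two variables
describe
```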

Renaming Variables

Changing a variable's name in Stata is easy with the rename command:

	. rename make model
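
As a small sketch using the auto example data that ships with Stata, the rename can be checked with describe before and after. The // comments assume a do-file.

```stata
sysuse auto, clear
describe make          // the variable exists under its old name

rename make model      // old name first, new name second
describe model         // same contents and label, new name
```

Only the name changes; the values, storage type, and variable label are carried over untouched.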