Data, Datasets and Variables
A data set is just a file in which rows represent observations and columns represent variables. For example, an observation might be a car, and the variables would be pieces of information about the car, such as the make, length, price, and gear-ratio:
make,length,price,gear_ratio
"AMC Concord",186,4099,3.58
"AMC Pacer",173,4749,2.53
"AMC Spirit",168,3799,3.08
"Buick Century",196,4816,2.93
"Buick Electra",222,7827,2.41
"Buick LeSabre",218,5788,2.73
"Buick Opel",170,4453,2.87
If data is already in Stata's proprietary file format, the file will have the extension .dta, for example mydata.dta. Data in this format can be read directly into Stata with the use command.
. use mydata
If Stata gives you the error message
no room to add more observations
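when you try to open a data file, the data set needs more memory than Stata currently has allocated. In versions of Stata before 12, you can increase the allocation with the set memory command (for example, set memory 100m) and then open the file again; later versions manage memory automatically.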
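As a rough sketch of the use command in practice, the lines below create a small Stata-format file and read it back. The auto data set loaded by sysuse ships with Stata. The clear option is an addition not discussed above: it lets use proceed when unsaved data are already in memory. Note that the save line overwrites any existing mydata.dta in the working directory.

```stata
* Create a Stata-format file to work with, using the auto data shipped with Stata.
sysuse auto, clear
save mydata, replace

* Read the file back in. The .dta extension is optional.
use mydata, clear

* Check what was loaded: variable names, storage types, and number of observations.
describe
```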
Stata can read data sets in various text formats as well as in Stata's proprietary format. Often you will start with data in text format, read it into Stata, and save it in Stata format.
You may also come across data in various other formats. For example, data from certain data archives is often formatted for the statistical package SPSS. A program called DBMS/Copy, available in the DSS lab as well as on Windows machines in the OIT public clusters, can convert data from SPSS and from many other formats to Stata format quickly and easily.
A common text format is the delimited file. Delimited files are most commonly tab- or comma-delimited. This just means that the variables in each observation are entered one after the other on a line and separated by tabs or commas, while the observations are separated by hard returns. The example above is actually how a comma-delimited text file would look if opened in Word.
The command syntax to read in a tab- or comma-delimited file is:
insheet using [filename]
where filename is the name of the file that contains the tab- or comma-delimited data.
insheet is often used to read spreadsheets saved as "csv" (comma-delimited) files from a package such as Excel. Please note that a spreadsheet needs to be put in a "Stata-friendly" form before Stata will be able to read it in appropriately. Failure to do so may cause headaches.
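The following is a minimal sketch of reading a comma-delimited file with insheet. The file name cars.csv is hypothetical, and outsheet is used here only to create a file to read so that the example is self-contained. In Stata 13 and later, import delimited is the newer command for the same job.

```stata
* Type in three observations from the example data.
clear
input str20 make length price gear_ratio
"AMC Concord" 186 4099 3.58
"AMC Pacer"   173 4749 2.53
"AMC Spirit"  168 3799 3.08
end

* Write them out as a comma-delimited text file (cars.csv is a hypothetical name).
outsheet using cars.csv, comma replace

* Read the text file back in. The first line supplies the variable names.
insheet using cars.csv, comma clear
describe
list
```

The comma option tells insheet which delimiter to expect; without it, Stata tries to work out the delimiter on its own.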
There are two commands other than insheet - infile and infix - that read other, less common types of text files. If you have space-delimited data, fixed-width data, or come across a Stata data dictionary, type "help infile" or "help infix" for details.
You can use the Stata save command to save a file in Stata format:
save [filename]
where filename is the name of your Stata file. For example:
. save myfile
will save a Stata file named "myfile.dta." This file can be read in Stata with the use command. Note that the ".dta" file extension is automatically appended to Stata files. You do not have to include the file extension on the use or save commands.
If you already have a Stata file named "myfile.dta" and wish to save an updated version of the file under the same name, then use the Stata save command with the replace option, as in:
save [filename], replace
where filename is the name of the file you wish to replace, e.g.:
. save myfile, replace
To save an updated version of the active file, you can simply type:
. save, replace
This command will destroy the previous version of your file, so use the replace option only if you are certain that you will not need the older version of your file. There is no way to retrieve your original file once another file has written over it.
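Because replace cannot be undone, one cautious workflow is to copy the file before overwriting it. The sketch below assumes the auto data shipped with Stata stands in for your own data; myfile_backup and price_thousands are hypothetical names chosen for illustration.

```stata
* Load example data and save it as myfile.dta.
sysuse auto, clear
save myfile, replace

* Keep a copy of the current version on disk before changing anything.
copy myfile.dta myfile_backup.dta, replace

* Make a change and overwrite myfile.dta. The backup still holds the old data.
generate price_thousands = price / 1000
save myfile, replace
```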
Sometimes a variable is missing for some observations. (Missing means that there is no value - the person didn't answer the survey question, or the data could not be acquired for some other reason.) In Stata, missing values in numeric variables are represented by a period (.). Observations with missing values are left out of tables produced by tab (short for tabulate), and are also left out of regressions. They appear as periods in the Stata data browser and are represented by periods in commands. Missing string values appear as blank cells in the browser, and are represented in commands by two double quotes with nothing in between them (""). What we mean by "represented in commands" will make more sense a little later.
Remember that if you are saving data out of Excel, the missing values need to have been left blank for Stata to recognize them as missing.
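To illustrate how missing values are written in commands, here is a short sketch using the auto data shipped with Stata, in which the variable rep78 is missing for some cars. Blanking out one value of make is done purely for illustration.

```stata
sysuse auto, clear

* Numeric missing values are written as a period in commands.
count if rep78 == .
count if missing(rep78)        // same count, using the missing() function

* tabulate leaves missing values out unless the missing option is given.
tabulate rep78
tabulate rep78, missing

* Stata treats a missing numeric value as larger than any number,
* so exclude it explicitly when testing for large values.
count if rep78 > 4 & rep78 < .

* Missing string values are written as two double quotes.
replace make = "" in 1         // blank out one value purely for illustration
list make price if make == ""
```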
Stata Variable Types
There are two types of variables in Stata: numeric and string. A third type, date, is really a special type of numeric, as we will see. Numeric variables are simple - they contain numbers. String variables contain text which can contain any characters on the keyboard: letters, numbers, and special characters. On auto3, make is a string variable - all the others are numeric. We can do numeric calculations and statistical analysis on numeric variables - we can't on string variables. String variables are usually used as identifiers for the observation.
One of the numeric variables, date, is intended to represent a date - let's say it was the date the data about each car was collected. In Stata, dates are numbers that represent the number of days since January 1, 1960. Representing dates as numbers this way allows us to do calculations on them, like measuring the length of time between two dates. But it looks weird, and Stata has a simple way to make date variables look like dates:
. format date %d
(The format command has other uses, which you can see by typing "help format".)
You can use the di command together with the d() function to display the Stata value of any date:
. di d(15apr1998)
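A small sketch of date arithmetic, assuming a made-up data set of three observations; the variable names date and days_into_year are chosen for illustration.

```stata
clear
set obs 3

* A hypothetical collection date for each observation: 15, 16, and 17 April 1998.
generate date = d(15apr1998) + _n - 1
list                           // dates appear as plain numbers

format date %d
list                           // the same values, now displayed as dates

* Because dates are numbers, subtraction gives the days between two dates.
generate days_into_year = date - d(01jan1998)
list

display d(15apr1998)           // 13984, the number of days after January 1, 1960
```

The format command changes only how the values are displayed; the stored numbers, and any calculations on them, stay the same.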
Variable Naming Conventions
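The rules for naming variables in Stata are simple:
Names can be 1 to 32 characters long.
Names can contain letters, digits, and underscores.
Names must begin with a letter or an underscore, although names beginning with an underscore are best avoided because Stata uses them for its built-in variables.
Names are case sensitive: price, Price, and PRICE are three different variables.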
Changing a variable's name in Stata is easy with the rename command:
. rename make model
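As a brief sketch using the auto data shipped with Stata, the lines below rename a variable, confirm the change, and show that Stata rejects a name that begins with a digit. The name 2price is a deliberately invalid example.

```stata
sysuse auto, clear

* Rename make to model and confirm the new name exists.
rename make model
describe model

* A name that starts with a digit is not allowed. capture suppresses the error
* so the remaining lines still run; a nonzero _rc means the rename was rejected.
capture rename price 2price
display _rc

* Restore the original name.
rename model make
```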