Princeton University Data and Statistical Services, Princeton University Library

How to Use a Codebook

These instructions explain what information you should look for when using a codebook, as well as how to translate the information in the codebook to the statements you will need to write SAS, SPSS, or Stata programs to read and analyze the data.

Before looking for a codebook, you first need to determine whether you actually need the data or just the results of a study, e.g., how many people live in New York. Sometimes you won't need the data at all; you can just use one of the many statistical reports or abstracts available in the library. If you do need the data for your own analyses, then you need to find a study or studies that investigated your topic and carefully read the codebook to make sure that the study has the kind of data you need.

Data Files

Since a codebook describes data files, it is useful at this point to discuss what data files are and the many formats in which they come. A data file is simply a computer file that has data in it. Most data files are arranged like spreadsheets, with a line of information for each observation (a person, a state, or a company) and columns of information representing different variables. The main difference is that each column in a spreadsheet holds one whole variable, whereas each variable in a data file is made up of one or more single-character columns. Sometimes the data file will have spaces between the groups of columns that make up a variable, but most times it will simply run everything together. Picture a small spreadsheet with four columns, A through D, and eight rows of values. Here is what the same information might look like in a data file:



	12345678901234

	123123.4   190
	243 32.5    12
	355 11.9383843
	412 99     239
	567123    4345
	698 45.7    23
	733 22.5     2
	856 12       0

The first line of numbers isn't actually part of the data; we've put it there so you can see how the columns in a data file relate to the columns in a spreadsheet. In this example, column A in the spreadsheet is column 1 in the data file, column B is columns 2-3, column C is columns 4-8, and column D is columns 9-14. If you look closely, you can see that the actual numbers and letters are the same in both files. Since the information in the data file is all run together, you need some way of determining where one variable ends and the next one starts. This, among many other important things, is found in the codebook.

This is the simplest format of a data file, and most will come like this, with one "line," "record," or "card" of data for each observation. Often, though, a data file will have more than one line of data for each observation. This is a hold-over from the early days of computing, when all data were entered on punch cards that had only 80 columns. If a survey had more questions than could fit on one card, researchers had to continue the data on another card. Multiple lines per observation are also particularly common in files that have information from the same observation for several years. Here is an example:

	1 1991 12123
	1 1992 45 34
	1 1993 63 88
	2 1991 34678
	2 1992 55456
	2 1993 76 44
	3 1991 44234
	3 1992 32 56
	3 1993 67 55

This file is very much like the one above, except that each observation has three lines in the file rather than just one. The information in a specific column or columns may or may not represent the same variable on every line. If questions were dropped or added in subsequent years, then the information will be different. Also, if it is an old data file, then it is likely that each card is just a continuation of data from the same time period.
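To make the idea of several lines per observation concrete, here is a minimal sketch in Python (used only for illustration; it is not one of the packages discussed on this page). It groups the sample lines above three at a time. The column positions (ID in column 1, year in columns 3-6, two values in columns 8-9 and 10-12) are assumptions made for this example, not taken from a real codebook.

```python
# Sample file from above: three lines (one per year) for each observation.
LINES = """\
1 1991 12123
1 1992 45 34
1 1993 63 88
2 1991 34678
2 1992 55456
2 1993 76 44
3 1991 44234
3 1992 32 56
3 1993 67 55
""".splitlines()

RECORDS_PER_OBS = 3  # assumed: every observation has exactly three lines


def cols(line, start, end):
    """Return the text found in 1-based columns start..end (inclusive)."""
    return line[start - 1:end]


observations = {}
for i in range(0, len(LINES), RECORDS_PER_OBS):
    group = LINES[i:i + RECORDS_PER_OBS]
    obs_id = int(cols(group[0], 1, 1))          # assumed: ID in column 1
    observations[obs_id] = {
        # assumed layout: year in 3-6, first value in 8-9, second in 10-12
        int(cols(line, 3, 6)): (int(cols(line, 8, 9)), int(cols(line, 10, 12)))
        for line in group
    }

print(observations[1][1992])
```

Each observation ends up as a small dictionary keyed by year, which makes it easy to check whether the same columns really hold the same variable on every line.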

A corollary to multiple cards is hierarchical files. Hierarchical files typically have just one line of data for each observation; however, each line may represent a different level of information. Perhaps the best example of a hierarchical file is the Current Population Survey. In the CPS file there are three types of records or lines: Household records have information that is common to everyone who lives in that household; Family records have information that is common to everyone in a particular family in that household (more than one family can live in a household); and Person records have, of course, information pertaining to one specific person in that family. All of this information is contained in one file. The household record is always first, followed by the family record, and finally the person records. Each line in the file has a variable or column denoting what type of record it is. Here is an example of what a hierarchical file might look like:

	H 12 321
	F 32 5 3
	P 45 1 5
	P 66 7 3
	P 76 9 7
	H 45 9 9
	F678 3 5
	F567 4 6
	P8992187
	P689 3 0
	P66567 9
	P554 5 9
	P 89 8 9

Hierarchical files can be very tricky to program. If you need to analyze a hierarchical file, you should come to the DSS lab and speak with a consultant about how to do so. Of course, all of these examples have just a few variables, whereas a real data file will have many, many more.
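As a rough illustration of why the record type matters, the following Python sketch (illustrative only, not a substitute for talking to a consultant) walks through the sample hierarchical file above and attaches the most recent household and family record to each person record. It assumes that the record type is in column 1 and treats everything after it as uninterpreted text; a real file such as the CPS has a much more detailed layout.

```python
# Sample hierarchical file from above; column 1 holds the record type.
RECORDS = """\
H 12 321
F 32 5 3
P 45 1 5
P 66 7 3
P 76 9 7
H 45 9 9
F678 3 5
F567 4 6
P8992187
P689 3 0
P66567 9
P554 5 9
P 89 8 9
""".splitlines()

persons = []
household = family = None
for line in RECORDS:
    rec_type, body = line[0], line[1:]
    if rec_type == "H":
        # A new household starts; any earlier family no longer applies.
        household, family = body, None
    elif rec_type == "F":
        family = body
    elif rec_type == "P":
        # Carry the current household and family down to the person level.
        persons.append({"household": household, "family": family, "person": body})
    else:
        raise ValueError(f"unknown record type {rec_type!r}")

print(len(persons))
print(persons[0])
```

The result is one flat row per person, which is usually the shape an analysis needs.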

Codebooks

Now that we know what a data file is, we can make more sense out of what a codebook is. A codebook is a technical description of the data that was collected for a particular purpose. It describes how the data are arranged in the computer file or files, what the various numbers and letters mean, and any special instructions on how to use the data properly. Like any other kind of "book," some codebooks are better than others. The best codebooks have:

  1. Description of the study: who did it, why they did it, how they did it.
  2. Sampling information: what was the population studied, how was the sample drawn, what was the response rate.
  3. Technical information about the files themselves: number of observations, record length, number of records per observation, etc.
  4. Structure of the data within the file: hierarchical, multiple cards, etc.
  5. Details about the data: columns in which specific variables can be found, whether they are character or numeric, and if numeric, what format.
  6. Text of the questions and responses: some even have how many people responded a particular way.

Even though a codebook has (or at least, should have) all of this information, not all codebooks will arrange it in the same manner. Later in this document we will show you what information you will need to write the program to read the data.

Before you decide on a particular dataset, there are some things you need to verify so that you can make good use of the data:

  1. The wording and presence of the questions and answers. In a study that is done repeatedly, the questions asked and the answers allowed can change considerably from one "wave" to the next, not to mention that some are dropped and new ones added. Also, subtle differences in wording can mean very big changes in how you interpret your results.
  2. The sampling information. A survey that was conducted to measure national attitudes toward a subject may not be good for assessing those same attitudes in specific states.
  3. Weights. Sometimes, in order to properly analyze the data, you will need to apply weights to certain variables. These weights are determined by the sampling procedure used to collect the data.
  4. Flags. Flags perform a function similar to weights in that they tell you if and when a special procedure was used to create the variable. This is common when a person refuses or cannot answer a question, but an interviewer can answer for them.
  5. The column and line location of the variables in the file. This can change from wave to wave also.

Once you have determined that a data file has what you want, you can begin the task of writing the program that will extract or subset those variables in which you are interested. The choice of which software package to use is up to you. You should be aware, however, that most of Princeton's data collection is accessible only on PUCC which has only SAS and SPSS. In any case, it is always a good idea to talk to a Consultant before you try extracting the data.

Writing the Program

Before you can write the program, you will need to be able to locate this information about each variable you will want to use:

  1. The column in which the variable you want starts.
  2. The column in which it ends, or how many columns the variable occupies.
  3. Whether the variable is numeric or character (also called alphanumeric).
  4. If the variable is numeric, how many decimal places it might have, and if it is stored in a special format such as "zoned decimal."
  5. If you are using data from several years, then you will need to make sure that the above information is the same for each year. If it is not, then you need to gather this information for each year.
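One way to keep track of this information is to write it down as a small layout table before programming anything. The Python sketch below (illustrative only; the `Field` class and `read_record` function are hypothetical names invented here) records the start column, width, and type of each variable and uses them to read the first line of the sample data file shown earlier, where A is column 1, B is columns 2-3, C is columns 4-8, and D is columns 9-14.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    start: int         # column where the variable starts (1-based)
    width: int         # number of columns it occupies
    kind: str = "num"  # "num" for numeric, "char" for character


def read_record(line, layout):
    """Slice one line of a data file into a dict of variables."""
    row = {}
    for field in layout:
        raw = line[field.start - 1:field.start - 1 + field.width]
        if field.kind == "char":
            row[field.name] = raw.strip()
        else:
            # Blank numeric fields are treated as missing (None).
            row[field.name] = float(raw) if raw.strip() else None
    return row


# Layout of the sample data file shown earlier on this page.
LAYOUT = [Field("a", 1, 1), Field("b", 2, 2), Field("c", 4, 5), Field("d", 9, 6)]

row = read_record("123123.4   190", LAYOUT)
print(row)
```

If the layout differs from year to year, keeping one such table per year makes the differences easy to spot.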

Coding when there is just one line of data for each observation:

In many instances, the data file will have one record per observation. In these instances, you will only need to know the column locations of the variables you want. Here are two examples from the General Social Survey Codebook:

This variable is coded as numeric and can be found in column 238 of the data file. As you can see from the column labeled "PUNCH" above, there are ten categories of responses to this question. Categories 8 ("Don't know") and 9 ("No answer") are often re-coded by analysts to "missing" so that they don't influence any of the statistics computed on this variable. Depending on your specific questions, category 7 ("Other party, refused to say") may also need to be coded as missing. Sometimes, variables are entered as letters instead of numbers, such as if a person's name were entered into the data file. In these instances, you must tell the computer that there are letters instead of numbers. The example below shows how to code this variable as if it were A) numeric and B) character:

     SAS:               SPSS:               Stata:
A)   partyid 238        partyid 238         _column(238) partyid %1f
B)   partyid $ 238      partyid 238 (a)     _column(238) str1 partyid %1s
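The re-coding of "Don't know" and "No answer" to missing, described above, can be sketched as follows in Python (illustrative only). The list of responses is made up for the example; only the meaning of codes 8 and 9 comes from the text.

```python
MISSING_CODES = {8, 9}  # 8 = "Don't know", 9 = "No answer"


def recode_missing(value, missing=MISSING_CODES):
    """Return None for codes that should be treated as missing."""
    return None if value in missing else value


responses = [0, 3, 8, 6, 9, 7]  # made-up partyid values for illustration
cleaned = [recode_missing(v) for v in responses]

# Statistics are then computed on the valid responses only.
valid = [v for v in cleaned if v is not None]
mean_valid = sum(valid) / len(valid)

print(cleaned)
print(mean_valid)
```

Adding 7 to `MISSING_CODES` would also drop the "Other party, refused to say" category, if your question calls for it.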

Although this codebook gives a name to the variable (partyid), not all codebooks do. Sometimes the variables are simply numbered. You do not always have to use the names or numbers provided as your own variable names, however, using the ones provided will make referring to the codebook later on much easier. This is important if you thought a variable should have only two categories of responses, but five show up in the data; you may have programmed the wrong columns or lines. It also allows comparison of results of analyses conducted on the same data by different researchers. Sometimes, the names provided are not allowable in whatever statistical package you are using because they are too long or have special characters in them. In these cases, you should refer to the user manual of whatever package you are using to determine what names are permissible. If you do change the variable names, be sure to make a list of these changes.

Often, a variable must have more than one column, such as a person's age. Here is an example of a variable that takes more than one column:

In this example, the variable occupies two columns, 275-276, in the data file. The coding for this is much the same as for the one above:

     SAS:                  SPSS:                    Stata:
A)   polviewx 275-276      polviewx 275-276         _column(275) polviewx %2f
B)   polviewx $ 275-276    polviewx 275-276 (a)     _column(275) str2 polviewx %2s

If the variable were to have more than two columns, you would simply specify the beginning and ending columns indicated. Sometimes, the codebook will tell you in which column the variable begins and how many columns it occupies (also referred to as its "length"). Look at this example from the Current Population Survey:

D A-WKSLK 2 97 (00:99) Item 22C - 1) How many weeks has ... been looking for work 2) How many weeks ago did ...start looking 3) How many weeks ago was ...laid off

It says that A-WKSLK is numeric, begins in column 97, and has a length of 2 (the instructions in the codebook explain this notation). In terms of the first example, that means this variable can be found in columns 97-98. Character variables would be indicated the same way. You can write the statements to read these variables like the ones above (a_wkslk 97-98), but if you have many variables, it would be time-consuming to calculate all the specific columns. Instead, you could do it like this:

     SAS:                 SPSS:                 Stata:
A)   @97 a_wkslk 2.       a_wkslk 97 (f2.0)     _column(97) a_wkslk %2f
B)   @97 a_wkslk $2.      a_wkslk 97 (a2)       _column(97) str2 a_wkslk %2s

You can readily see the similarities and differences among these. In all, the "2" refers to the number of columns the variable occupies in the data file, not necessarily how many digits there are in the variable (some columns may be blank). This is especially important if your data has decimals. For example, if a variable called "varname" were to have a length of 5 and 2 decimal places in it, then the coding would be as follows:

SAS:                  SPSS:                   Stata:
@124 varname 5.2      varname 124 (f5.2)      _column(124) varname %5.2f

This means that "varname" occupies a total of five columns in the data file. Two of those columns are the numbers on the right of the decimal, one is the decimal itself, and the last two columns are the numbers on the left of the decimal. Therefore, the largest number that could be coded into this space is 99.99. Once in a while, a codebook will tell you that there are "implied" decimal places. This means that the decimal was not actually entered into the data and you must assume (and correctly program) that the last however many digits are on the right of the decimal.
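The arithmetic behind lengths and implied decimals can be sketched in a few lines of Python (illustrative only; `end_column` and `parse_numeric` are hypothetical helper names). The sketch assumes that an explicit decimal point in the data takes precedence over the implied places, and that a blank field means a missing value.

```python
def end_column(start, length):
    """Last column of a variable that begins at `start` and is `length` columns wide."""
    return start + length - 1


def parse_numeric(raw, implied_decimals=0):
    """Convert a fixed-width numeric field to a float, or None if it is blank."""
    text = raw.strip()
    if not text:
        return None
    if "." in text:
        # The decimal point was actually entered, so use it as written.
        return float(text)
    # No decimal point in the data: shift by the implied number of places.
    return int(text) / 10 ** implied_decimals


print(end_column(97, 2))           # A-WKSLK: start 97, length 2
print(parse_numeric("99.99", 2))   # explicit decimal in a 5-column field
print(parse_numeric("12345", 2))   # same 5 columns, 2 implied decimals
print(parse_numeric("   42"))      # leading blanks, no decimals
```

Note how the same five columns can hold at most 99.99 with an explicit decimal point but 999.99 with two implied decimal places, which is why the codebook's statement about implied decimals must be programmed correctly.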

Coding for more than one line of data for each observation:

You need to pay special attention to how many lines there are for each observation, and on what line the variable you are interested in can be found. Codebooks differ in how they indicate the line on which a variable is found, so you must look in the introductory pages to see how this is done. Failure to keep track of what line the variable is on will result in reading from the wrong line and thus reading the wrong information for that variable.

Let's assume that in the varname example above, there are five lines of data for each observation. Let's further assume that varname is found on the first line for an observation and that a 12-column character variable called charname begins in column 155 of the third line. Here are the statements you would need to read these variables:

SAS:
/* "example" is a fileref defined earlier with a FILENAME statement */
data one;
infile example n=5;
input
#1 @124 varname 5.
#3 @155 charname $12.;
run;
SPSS:
data list file='mydata.dat' records=5
/1 varname 124-128
/3 charname 155-166 (a).
Stata:
infile dictionary {
_lines(5)
_line(1)
_column(124) varname %5f
_line(3)
_column(155) str12 charname %12s
}

As you can see, in each program you need to tell the program how many lines there are for each observation ("n=5", "records=5", and "_lines(5)"). Each program also has a different way of identifying which line you want to read ("#1", "/1", and "_line(1)"). If you wanted to read other variables from lines 1 or 3, you could simply list them together without repeating the line pointer for each variable. The program will continue reading from the same line of data until you tell it to go to the next line.
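The same bookkeeping (lines per observation, line number, starting column, width) can be written out explicitly. The Python sketch below is illustrative only: the `make_line` helper and the sample values "Smith", "Jones", 12.50, and 3.75 are made up so that the example has something to read, while the positions of varname and charname are the ones assumed above.

```python
RECORDS_PER_OBS = 5

# (name, line within the observation, start column, width, kind)
LAYOUT = [
    ("varname", 1, 124, 5, "num"),
    ("charname", 3, 155, 12, "char"),
]


def read_observations(lines, layout, records_per_obs):
    """Yield one dict per observation from a multi-line fixed-width file."""
    if len(lines) % records_per_obs:
        raise ValueError("number of lines is not a multiple of records per observation")
    for i in range(0, len(lines), records_per_obs):
        group = lines[i:i + records_per_obs]
        obs = {}
        for name, line_no, start, width, kind in layout:
            raw = group[line_no - 1][start - 1:start - 1 + width]
            obs[name] = raw.strip() if kind == "char" else float(raw)
        yield obs


def make_line(text="", at=1, length=166):
    """Build a made-up fixed-width line with `text` starting at column `at`."""
    return (" " * (at - 1) + text).ljust(length)


# Two made-up observations of five lines each.
sample = [
    make_line("12.50", at=124), make_line(), make_line("Smith", at=155),
    make_line(), make_line(),
    make_line(" 3.75", at=124), make_line(), make_line("Jones", at=155),
    make_line(), make_line(),
]

result = list(read_observations(sample, LAYOUT, RECORDS_PER_OBS))
print(result)
```

Reading from the wrong line number in `LAYOUT` would silently return the wrong columns, which is the mistake the paragraph above warns about.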

Conclusion

This has been a brief and very general introduction to data files and codebooks. We could not possibly cover everything you might encounter in using a codebook. So, if you do find something you don't understand, ask a consultant!
