Creating and Modifying Variables

Before reading this, make sure that you understand roughly what Stata variables are and how they work. Our page on Data, Datasets and Variables is a good place to start.

Variable creation commands

The basic commands for creating new variables and modifying old ones in Stata are generate (abbreviated gen), egen and replace.

The command gen variablename = something creates a new variable named variablename and sets it equal to something. Something can be a simple number, a string, a mathematical expression, or a function of other variables.

    . gen one = 1
    . gen two = 1+1
    . gen three = one+two
    . gen bmi = (weight/(height*height)) * 703
    . gen approx_test_score = round(test_score,1)

A second command, egen, is also used to create new variables. egen works with a different set of functions than gen. There is no particular logic as to why there are two variable creation commands; it's just an oddity that has to do with the way Stata was written. Most egen functions work across all observations to produce variables that summarize other variables. For example:

    . egen max_weight = max(weight)
    . egen sumofallweights = total(weight)

(The egen function total() was called sum() in older versions of Stata. Inside gen, sum() means something different: a running sum.)

For information on what gen functions do, look up "functions" in Stata's online help (help functions). For information on egen functions, look up "egen" (help egen).

The replace command is used to make changes to existing variables:

    . gen heavy=1 if bmi>=30
    . replace heavy = 0 if bmi<30

replace works with all gen functions, but not with egen functions. However, you can use replace to modify variables created by egen as well as those created by gen.

You normally want to use replace for the second and later steps in multi-step variable creations, just as we used it here. It is bad practice to "write over" existing variables, because if you make a mistake there's no way to get the original values back short of reloading the dataset.
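The difference between gen, egen and replace described above can be tried out with a short sketch. It assumes the auto example dataset that ships with Stata (loaded with sysuse auto), which is not the dataset used in the text. The new variable names (weight_tons, heavy_car and so on) and the 3500 cutoff are hypothetical, chosen only for illustration.

```stata
* Sketch only: uses Stata's bundled auto dataset, not the data from the text
sysuse auto, clear

* gen evaluates its expression one observation at a time
gen weight_tons = weight / 2000

* egen functions summarize across observations: every row receives
* the same overall maximum and mean
egen max_weight  = max(weight)
egen mean_weight = mean(weight)

* With the by() option the summary is computed within each group
egen mean_weight_origin = mean(weight), by(foreign)

* Multi-step creation: gen for the first step, replace for later steps
* (hypothetical cutoff of 3500 pounds)
gen heavy_car = 1 if weight >= 3500
replace heavy_car = 0 if weight < 3500

list make weight max_weight mean_weight_origin heavy_car in 1/5
```

Every command here writes to a new variable, so the original weight column is still available if one of the steps turns out to be wrong.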
For example, even if you decided that you only cared about gear ratio rounded to the nearest integer, using replace to overwrite the original gear ratio variable with its rounded values is not recommended. It's always better to create a new variable.

The if qualifier

The if qualifier is used to isolate a set of observations with variables meeting some particular criteria. Values of variables in a dataset are compared to values of other variables, or to numbers or strings, using logical comparison operators. This is very often used to create "dummy variables", 0-1 indicators of whether something is true or false.
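As a minimal sketch of the if qualifier at work, the commands below restrict a few commands to matching observations and then build a dummy variable. They assume the auto example dataset that ships with Stata, not the data discussed in the text. The name high_mpg and the cutoff of 25 are hypothetical.

```stata
* Sketch only: uses Stata's bundled auto dataset
sysuse auto, clear

* Restrict a command to the observations that meet a condition
list make mpg if mpg > 30
count if foreign == 1

* Build a 0-1 dummy variable in two steps (hypothetical cutoff of 25)
gen high_mpg = 1 if mpg >= 25
replace high_mpg = 0 if mpg < 25

* Check the new variable against the one it was built from
tabulate high_mpg
summarize mpg if high_mpg == 1
```

The final summarize is a quick sanity check: the minimum it reports for mpg should be no lower than the cutoff used to define high_mpg.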
Pay special attention to the double equals sign! If you are testing for equality, use a double equals sign (==). A single equals sign (=) is used to set something equal to something else.

For example, if you want to list all information for a person in your dataset whose first name is Sara, you would type:

    . list if name=="Sara"

To display the names of people with an income of less than or equal to $40,000:

    . list name if income<=40000

And to create a variable indicating people aged 65 or older:

    . gen senior_citizen = 1 if age >= 65

Combining tests: and and or

if on its own is useful if you are interested in testing for only one thing at once, such as the condition "aged 65 or older". But let's say you want to find out the mean income for women in your dataset between the ages of 25 and 34. What you need to do is take this series of tests and combine them with the and operator, &. (This example assumes that women are coded as 0 on the variable sex.)

    . sum income if sex==0 & age >=25 & age <= 34

Note that the word if appears only once, and then the tests are simply stated one after another. Also note that you need to write out each test in full:

    . sum income if age >=25 & <= 34

is not allowed.

If you want to look at cases where at least one of two or more conditions is met, the or operator, |, is needed.

    . gen child_of_immigrant=1 if birthplace_mother!="USA" | birthplace_father !="USA"
    . gen caffeinated=1 if drink=="coffee" | drink=="tea" | drink=="cola"

It is possible to combine the & and | operators. It's good practice to group the tests using parentheses:

    . gen child_of_immigrant=1 if (birthplace_mother!="USA" | birthplace_father !="USA") & birthplace=="USA"

When generating variables, it is good practice to include a test to exclude missing values. A peculiarity of Stata is that numeric missing, represented as a period (.), is treated internally as larger than any nonmissing number. So if you are testing for values greater than some number, missing values will always be included.
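The way missing values slip into a "greater than" test can be seen in a small sketch. It assumes the auto example dataset that ships with Stata, in which the repair record rep78 is missing for some cars. The variable name good_repair and the cutoff of 3 are hypothetical. The guard uses the built-in missing() function, which is true for the plain missing value (.) as well as for the extended missing values .a through .z.

```stata
* Sketch only: uses Stata's bundled auto dataset
sysuse auto, clear

* rep78 (repair record) is missing for some cars in this dataset
count if missing(rep78)

* Missing counts as larger than any number, so this test also
* picks up the cars whose repair record is unknown
count if rep78 > 3

* An explicit guard leaves those cars out
count if rep78 > 3 & !missing(rep78)

* The same guard when creating a variable (hypothetical cutoff of 3);
* cars with an unknown repair record stay missing on good_repair
gen good_repair = 1 if rep78 > 3 & !missing(rep78)
replace good_repair = 0 if rep78 <= 3
```

The difference between the second and third counts is the number of cars that an unguarded test would have misclassified.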
This can produce very strange results. If you don't know what age some people in your dataset are, you don't want to include them in a variable indicating senior citizen-ness, especially if other variables in your dataset indicate that these people are still in high school, or pregnant!

    . gen senior_citizen = 1 if age > 65 & age ~=.
    . gen tall = 1 if height>=72 & height~=.
    . replace tall = 0 if height < 72

The moral is, always check your variable creation statements and then ask yourself, "What is happening to the missings?"

Subscripting

Individual values of Stata variables can be accessed using subscripts. A subscript indexes the case number of a variable: var1[5] refers to the fifth observation of var1. For example, given a dataset of

    var1  var2  var3
       1     1     1
       2     4     3
       3     9     5

the command

    . gen var4 = var2[var3]

would produce

    var4
       1   (var2[1])
       9   (var2[3])
       .   (var2[5])

The last value is missing because the dataset has only three observations, so var2[5] does not exist.

In general, subscripting variables by other variables might not seem all that useful. An exception is the special internal Stata variable _n. _n is just a variable containing the case number of each observation. Another special internal variable is _N, which contains the number of cases in the current dataset (or, equivalently, the maximum case number). For example,

    . gen var5 = var3[_n]

is equivalent to

    . gen var5 = var3

because it simply sets each element of var5 equal to the corresponding element of var3. However, imagine that your dataset contained one observation per day and was in daily order. Then subscripts built from _n give you the previous day's and the next day's values:

    . gen lagvar3 = var3[_n-1]
    . gen leadvar3 = var3[_n+1]

Here lagvar3 is missing in the first observation (lagvar3[1]==.), which has no previous day, and leadvar3 is missing in the last observation (leadvar3[_N]==.), which has no next day.

Another use of _n is "filling in" gaps in your dataset. Imagine that you have a dataset with population information over time, and the population is missing for a few months. You don't want to lose that data, so you can use the _n variable to fill in the missing values with the nearest preceding nonmissing value.

    . sort time
    . replace population = population[_n-1] if population==.
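The lag and fill-in steps above can be sketched on a small made-up dataset. The five observations below are invented for illustration, and the names lag_population and population_filled are hypothetical. Unlike the command in the text, this sketch fills the gaps in a copy of the variable, so the original population values are preserved.

```stata
* Sketch only: a made-up dataset with gaps in population
clear
input time population
1 100
2 .
3 .
4 130
5 .
end

sort time

* Previous observation's value; missing for the first observation
gen lag_population = population[_n-1]

* Carry the last nonmissing value forward in a copy of the variable.
* replace works through the observations in order, so a value filled
* in for one row is available when the next row is processed.
gen population_filled = population
replace population_filled = population_filled[_n-1] if population_filled == .

list
```

Because replace proceeds in sort order, a run of consecutive gaps is filled from the same earlier value. If the very first observation is missing it stays missing, since there is nothing before it to copy.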