Exploring your Data

Examining your Data

You can get basic information about your data and variables with the describe command (abbreviated d):
. u mydata
. d
If you want to display all (or some) of the variable values for all (or some) observations, use the list command (abbreviated l):
. l
# lists the value of every variable for every observation in the dataset (usually not a good idea!)
. l sex income
# lists the value of sex and income for every observation
. l name if income>50000
# lists the names of people with an income of more than 50,000
. l in 1/10
# lists the values in observations 1 through 10

Summary Statistics

Two commands that are useful for getting basic descriptive statistics for your variables are summarize and tabulate (abbreviated sum and tab respectively). sum gives the number of valid observations, mean, standard deviation, minimum and maximum values for any variables you specify. You can do the entire dataset at once:
. sum
or just a subset of variables:
. sum income age

The tab command gives you a frequency distribution (for one variable) or a crosstabulation (for two).
. tab sex
# frequency distribution for sex
. tab sex race
# crosstabulation with sex in the rows and race in the columns
. tab sex race, row col
# same as above, but with row and column percentages
. tab sex race, row col chi2
# same as above, but also calculates Pearson's chi-squared for the hypothesis that race and sex are independent

. tab sex race, row col chi2

           |               race
       sex |     White      Black      Other |     Total
-----------+---------------------------------+----------
    Female |         5          3          2 |        10
           |     50.00      30.00      20.00 |    100.00
           |     50.00      42.86      66.67 |     50.00
-----------+---------------------------------+----------
      Male |         5          4          1 |        10
           |     50.00      40.00      10.00 |    100.00
           |     50.00      57.14      33.33 |     50.00
-----------+---------------------------------+----------
     Total |        10          7          3 |        20
           |     50.00      35.00      15.00 |    100.00
           |    100.00     100.00     100.00 |    100.00

          Pearson chi2(2) =   0.4762   Pr = 0.788
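
As a check on the table above, the Pearson statistic can be reproduced by hand. This is only a worked sketch: the symbols O (observed count), E (expected count under independence), and N (total number of observations) are notation introduced here and do not appear in Stata's output. Because both row totals are 10, the expected counts are the same for females and males.

```latex
\begin{align*}
E_{ij} &= \frac{(\text{row total}_i)(\text{column total}_j)}{N},
\qquad
E_{\text{White}} = \frac{10 \cdot 10}{20} = 5,\quad
E_{\text{Black}} = \frac{10 \cdot 7}{20} = 3.5,\quad
E_{\text{Other}} = \frac{10 \cdot 3}{20} = 1.5 \\[4pt]
\chi^2 &= \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \\
&= \underbrace{\frac{(5-5)^2}{5} + \frac{(3-3.5)^2}{3.5} + \frac{(2-1.5)^2}{1.5}}_{\text{Female}}
 + \underbrace{\frac{(5-5)^2}{5} + \frac{(4-3.5)^2}{3.5} + \frac{(1-1.5)^2}{1.5}}_{\text{Male}} \\
&= 2 \cdot \frac{0.25}{3.5} + 2 \cdot \frac{0.25}{1.5} \approx 0.1429 + 0.3333 = 0.4762 \\[4pt]
\text{df} &= (2-1)(3-1) = 2,
\qquad
\Pr(\chi^2_2 > 0.4762) = e^{-0.4762/2} \approx 0.788
\end{align*}
```

The closed form for the p-value holds only for 2 degrees of freedom. A p-value this large gives no evidence against independence of race and sex in this sample.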
(Notice that we've started getting into performing statistical tests!) The tab command usually only makes sense for categorical variables. Trying to tab an income variable with hundreds of different values wouldn't be very enlightening. Stata sensibly refuses to do a two-way table when the table would be excessively large.

The summarize command can be combined with the tabulate command to produce summaries of one variable for each value of another. The following table shows separate summaries of income for males and females.
. tab sex, sum(income)

            |         Summary of income
        sex |        Mean   Std. Dev.       Freq.
------------+------------------------------------
     Female |       27100   12251.531          10
       Male |       44100   23703.961          10
------------+------------------------------------
      Total |       35600   20329.911          20

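
A similar by-group breakdown can be tried without the tutorial's data. The sketch below is illustrative only: it assumes Stata's bundled auto dataset (loaded with sysuse) in place of mydata, with price standing in for income and foreign for sex. It also shows tabstat, which lets you choose which statistics appear.

```stata
* Illustrative only: uses Stata's bundled auto dataset, not the tutorial's mydata
sysuse auto, clear

* Mean, standard deviation, and frequency of price for each level of foreign
tabulate foreign, summarize(price)

* The same breakdown with tabstat, adding the minimum and maximum
tabstat price, by(foreign) statistics(n mean sd min max)
```
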
The commands table and tabstat also produce tables of summary statistics; look them up in the online help if you need something more detailed than tabulate and summarize will give you.

To test whether there is a significant difference in the means between two groups, use the command ttest:
. ttest income, by(sex)

Two-sample t test with equal variances

------------------------------------------------------------------------------
   Group |     Obs        Mean   Std. Err.   Std. Dev.    [95% Conf. Interval]
---------+--------------------------------------------------------------------
  Female |      10       27100    3874.274    12251.53    18335.78    35864.22
    Male |      10       44100    7495.851    23703.96    27143.21    61056.79
---------+--------------------------------------------------------------------
combined |      20       35600    4545.906    20329.91    26085.31    45114.69
---------+--------------------------------------------------------------------
    diff |              -17000    8437.878               -34727.32    727.3229
------------------------------------------------------------------------------
Degrees of freedom: 18

                Ho: mean(Female) - mean(Male) = diff = 0

     Ha: diff < 0              Ha: diff != 0              Ha: diff > 0
       t =  -2.0147              t =  -2.0147              t =  -2.0147
   P < t =   0.0296        P > |t| =   0.0591          P > t =   0.9704

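
The t statistic in this output can be reconstructed from the group means, standard deviations, and sample sizes. The following is a worked sketch; the symbols (the pooled variance and the subscripts F and M for the two groups) are notation introduced here, not part of Stata's output.

```latex
\begin{align*}
s_p^2 &= \frac{(n_F - 1)\,s_F^2 + (n_M - 1)\,s_M^2}{n_F + n_M - 2}
       = \frac{9\,(12251.53)^2 + 9\,(23703.96)^2}{18}
       \approx 3.560 \times 10^{8},
\qquad s_p \approx 18867.7 \\[4pt]
\mathrm{SE}(\text{diff}) &= s_p \sqrt{\frac{1}{n_F} + \frac{1}{n_M}}
       = 18867.7 \sqrt{\frac{1}{10} + \frac{1}{10}} \approx 8437.9 \\[4pt]
t &= \frac{\bar{x}_F - \bar{x}_M}{\mathrm{SE}(\text{diff})}
   = \frac{27100 - 44100}{8437.9} \approx -2.0147,
\qquad \text{df} = n_F + n_M - 2 = 18
\end{align*}
```

The two-sided p-value of 0.0591 is just above the conventional 0.05 threshold, which agrees with the 95% confidence interval for the difference (-34727.32 to 727.32) narrowly including zero.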

Regression

regress (abbreviated reg) is the command that runs ordinary (least-squares) linear regression. A bivariate regression of income on age would look like this:
. reg income age

Stata has dozens of commands for different types of regressions as well as other statistical procedures. In their simplest form, most follow the pattern
. commandname dependentvariable independentvariable1 independentvariable2 . . .

Check the manual or online help to see any syntax quirks for a given command. For more information on regressions, and in particular on how to interpret regression results, see our analysis page Interpreting Regression Results.

Predicted Values

The predict command is used after regression commands to calculate predicted values, residuals, and other quantities based on the regression results, and store them in new variables, which can then be analyzed or used in further calculations. Each regression command has its own default and options for the predict command that can be looked up in the manual, but the basic structure of the command is simple. We will use ordinary regression as an example. First, run a regression:
. reg income age sex

Then, to store the predicted values (the default) in a variable called predicted_income, simply type:
. predict predicted_income

To find the residuals and store them in a variable called r, type:
. predict r, resid
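
The regress-then-predict workflow can be tried end to end without the tutorial's data. The do-file sketch below is an illustration under stated assumptions: it uses Stata's bundled auto dataset in place of mydata, and the variable names price_hat and r are hypothetical choices made here.

```stata
* Illustrative only: uses Stata's bundled auto dataset, not the tutorial's mydata
sysuse auto, clear

* Regress price on mileage and weight
regress price mpg weight

* Store the fitted values (the default for predict after regress)
predict price_hat

* Store the residuals
predict r, resid

* With a constant in the model, the residuals average to zero (up to rounding)
summarize r

* Compare observed values, fitted values, and residuals for the first five cars
list price price_hat r in 1/5
```

Note that predict always uses the most recent estimation results, so it must be run before another estimation command replaces them.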