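Regression With Stata, Frequently Asked Questions

On this page, I go over frequently asked regression questions and operations using Stata. They include: interpreting regression output, using dummy variables, including quadratic terms, log transformations, interactions, and multicollinearity.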
The data files used in the examples are Stata installation sample data accessible with the command sysuse. Anyone with access to Stata can access the sample data.

Interpreting regression output

Suppose I am investigating the relationship between types of cars and their miles per gallon. My hypothesis is that luxury models are gas guzzlers. I am testing this hypothesis using 1978 auto data. I use weight as a proxy for luxury models, as I expect luxury cars to be heavier. It also seems to make sense that heavier cars would use more gas. At the command window, type:

    sysuse auto

This brings our sample data into Stata. Next, try:

    regress mpg weight

Stata outputs analysis of variance (anova) results along with the regression results. The anova table is at the top left, and the regression results are at the bottom. The dependent variable here is miles per gallon (mpg), and its name is shown at the top left of the regression results table. The weight here is measured in pounds. The coefficient for weight is shown in the Coef. column. Std. Err. is the standard error, t is the t test statistic, P>|t| is the p value, and the last two columns give the 95% confidence interval. The results can be written in regression equation form as:

    predicted MPG = 39.44 - 0.006*WEIGHT

For each pound increase in auto weight, miles per gallon decrease by 0.006, and the effect is statistically significant at least at the 99% level (a p value shown as 0.000 is less than 0.0005). You can see that the standard error is very small relative to the coefficient, so the absolute value of the t test statistic is large. You can tell the statistical significance through the p value: when it is less than 0.05, the coefficient is significant at the 95% level, and when it is less than 0.01, it is significant at the 99% level. The constant (_cons) is the intercept of the regression line, or the starting point: predicted mpg would be about 39 for a car with no weight. This does not make sense as such, because no car weighs zero pounds; the intercept simply anchors the regression line.
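The interpretation above can be checked against the results Stata stores after regress. The following is a minimal sketch, not part of the original page; the 3,000-pound car is a hypothetical value chosen only for illustration.

```stata
* Minimal sketch: refit the simple regression and reuse its stored results.
sysuse auto, clear
regress mpg weight

* _b[] holds the coefficients of the most recent estimation command.
display "intercept = " _b[_cons]
display "slope     = " _b[weight]

* Predicted mpg for a hypothetical 3,000-pound car (illustrative value).
display "predicted mpg at 3,000 lb = " _b[_cons] + _b[weight]*3000

* Two-sided p value for weight, recomputed from the t statistic
* and the residual degrees of freedom stored in e(df_r).
display "p value = " 2*ttail(e(df_r), abs(_b[weight]/_se[weight]))
```

Reading the coefficients from _b[] rather than retyping rounded values avoids rounding error in the prediction.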
The top right corner lists information associated with the anova and the regression output. The total number of observations used for the analysis is 74. The F test statistic with 1 numerator degree of freedom and 72 denominator degrees of freedom is 134, and it is statistically significant at the 99% level, because the p value is 0.000. I will come back to the R-squared and adjusted R-squared in the next model. Root MSE is the square root of the mean squared error (MS Residual in the anova table), and is the standard deviation of the error term, that is, of what is not explained by the model.

What I did earlier is a simple regression with just one predictor variable. Now, I want to control for whether the cars are U.S. models or non-U.S. models, in addition to weight, in predicting miles per gallon. This is an example of a multiple regression. Variables that have a binary outcome, like U.S. vs. non-U.S. models, are called dummy variables. The interpretation of such a variable is easier if you code it as 0 or 1. Here, the variable foreign is coded 0 for U.S. (domestic) cars and 1 for non-U.S. (foreign) cars.

    predicted MPG = 41.68 - 0.0066*WEIGHT - 1.65*FOREIGN

You can plug 0 into foreign to estimate the MPG for domestic cars, and 1 for foreign cars: so predicted MPG is 1.65 less for foreign cars than for domestic cars of the same weight. Controlling for foreign, heavier models still use more gas: each one-pound increase in weight results in 0.0066 less mpg.

Notice that foreign is not statistically significant at any conventional level of significance in this model. So can we say foreign, after all, is not important in estimating mpg? Here, it is very important that you distinguish statistical and substantive significance. Statistical significance is based on the p value, which is the probability of obtaining a sample estimate at least as extreme as the one observed, assuming the null hypothesis of no relationship is true. In addition, statistical significance can change by getting more observations, or by fitting the regression line better.
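A minimal sketch of the multiple regression described above, assuming the model is fit as regress mpg weight foreign (the command itself does not appear in the text). It also recomputes the adjusted R-squared by hand as 1 - (1 - R-squared)(n - 1)/(n - k - 1), with n observations and k predictors; the 3,000-pound weight is again a hypothetical value.

```stata
sysuse auto, clear
regress mpg weight foreign

* Predicted mpg for a domestic (foreign = 0) and a foreign (foreign = 1) car
* at the same hypothetical weight of 3,000 pounds.
display "domestic: " _b[_cons] + _b[weight]*3000
display "foreign:  " _b[_cons] + _b[weight]*3000 + _b[foreign]

* R-squared and adjusted R-squared as stored by regress.
display "R-squared          = " e(r2)
display "adjusted R-squared = " e(r2_a)

* Adjusted R-squared by hand: e(N) is n and e(df_m) is k,
* the number of predictors in the model.
display "by hand            = " 1 - (1 - e(r2))*(e(N) - 1)/(e(N) - e(df_m) - 1)
```

The last two displayed values should agree, which is a quick way to confirm how the adjustment uses the sample size and the number of predictors.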
Later you can see the change in the statistical significance of foreign by making an adjustment to the model.

In the earlier model, R-squared was 0.65, meaning about 65% of the variance in mpg is explained by the model. In this regression I got an R-squared of 0.6627, so by adding one variable I am explaining about 1 percentage point more of the variance in mpg. Adjusted R-squared adjusts the value of the R-squared for the number of predictors relative to the sample size. Naturally, R-squared will not get smaller when you add variables, but the adjusted R-squared takes the number of variables into account. It can be useful when you have many variables and a small sample size. The formula to get the adjusted R-squared is

    adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - k - 1)

where n is the number of observations and k is the number of predictors. In the earlier model, adjusted R-squared was 0.6467, and in the current model it is 0.6532. So this model still explains mpg better.

We jumped right into regression, but there is a whole series of assumptions we are making in running regression analyses. In your study, you need to check the data to see if the regression assumptions are met. UCLA has very good sites where they discuss regression diagnostics.

Using dummy variables

I mentioned dummy variables earlier when I included foreign in the model. I have another categorical variable, repair rating, whose effect on mpg I am interested in seeing. The repair rating, called rep78, ranges from 1 to 5, 1 being more repairs and 5 being fewer repairs. Here, the repair rating could be treated as a continuous value, but since it only has five values and I consider it a categorical variable, I will make each of the values into a dummy variable. This kind of situation is more common with variables like ethnicity or occupation, where the assignment of numbers is rather arbitrary and the quantity does not have a meaning. An easy way to create dummy variables from a multiple-category variable like rep78 is to use the tabulate command.
    tab rep78, gen(repair)

This creates five dummies, repair1 through repair5, one for each value of rep78. You can see the new variables Stata created by scrolling the variable window to the bottom. Notice that the tabulation shows the total as 69, when the total number of records is 74. It turns out that five cars have their repair ratings missing. Stata drops cases with missing values altogether when running regressions. So in the next model you can see that the total number of cases used in the analysis is 69.

Of the five categories, I can include four, one fewer than the total number of categories, in the model, as one of them will be a reference category. The coefficients will be interpreted in reference to the excluded category.

    predicted MPG = 27.36 - 6.36*REPAIR1 - 8.24*REPAIR2 - 7.93*REPAIR3 - 5.70*REPAIR4

The coefficients of the repair dummies are in reference to repair rating 5. So cars with repair rating 1 get about 6.36 less mpg than cars with repair rating 5, cars with repair rating 2 get about 8.24 less mpg than cars with repair rating 5, and so on. It makes some sense that cars with a better repair rating use less gas: they may be constructed to be more efficient. Each dummy is 0 or 1, so to compute the predicted mpg, you can plug in 1 for the rating you are looking at, and 0 for the others. When a car has a repair rating of 5, the predicted mpg is 27.36. When a car has a repair rating of 1, the predicted mpg is 27.36 - 6.36 = 21.

Some people are confused when I tell them to exclude a category to make it into a reference group. If you have only one set of dummies and want to include them all, you can fit a model with all the dummies but tell Stata that there already is a constant. I do not recommend this if you have multiple sets of dummy variables, such as marital status (single, married, divorced, etc.) AND ethnicity (white, black, hispanic, asian, etc.), as it can get confusing where the intercept went.
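The two dummy-variable models described above can be sketched as follows. The commands are not shown in the text, so this is an assumption about how they were run: in particular, I assume that "tell Stata that there already is a constant" refers to regress's hascons option. The last command is an alternative the page does not use, factor-variable notation, which needs Stata 11 or later.

```stata
sysuse auto, clear

* Create repair1-repair5, one indicator per observed value of rep78.
tabulate rep78, generate(repair)

* Rating 5 (repair5) is left out and serves as the reference category.
regress mpg repair1 repair2 repair3 repair4

* All five dummies with hascons, which tells regress that the regressors
* already span a constant, so no separate intercept is added.
regress mpg repair1 repair2 repair3 repair4 repair5, hascons

* Alternative without creating dummies by hand: factor-variable notation,
* with rating 5 as the base (reference) level.
regress mpg ib5.rep78
```

The first and last models are the same model, so their intercepts (the predicted mpg for rating 5) and their dummy coefficients should match.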
This time, the coefficients are the predicted mpg for each repair rating instead of differences in reference to the excluded category. The predicted values are the same either way.

Including quadratic terms

In this data, I happen to know that the relationship between mpg and weight is quadratic, and therefore the square of the weight is necessary to improve the fit. How would I know that the quadratic term is necessary? One way is to plot the residuals of the model without a quadratic term against the suspected predictor variable. Here I suspect weight is quadratic, so I plot the residuals of the model without the square term against weight. I see that the plot shows a curve, a sign that the error term is related to weight quadratically.

I can also examine the linear fit and the quadratic fit between mpg and weight. In the following graphs, though, I am not controlling for foreign.

    graph twoway (scatter mpg weight) (lfit mpg weight)
    graph twoway (scatter mpg weight) (qfit mpg weight)

The quadratic seems to be a better fit in these graphs, so I include it in the model. Now the coefficient of foreign is significant at the 95% level, and the absolute value of the effect is larger. The equation will be:

    predicted MPG = 56.54 - 0.017*WEIGHT + 0.0000016*WEIGHTsquared - 2.2*FOREIGN

The effect of weight is the derivative of this equation with respect to weight: -0.017 + 2(0.00000159)*weight, or -0.017 + 0.00000318*weight. If I evaluate the effect of weight at the mean (3019), it is -0.017 + 0.0000032(3019) = -0.007: mpg decreases by about 7 for an additional 1000 pounds.

Log transformations

If the distribution of a variable has a positive skew, taking the natural logarithm of the variable sometimes helps fit the variable into a model. Log transformations make a positively skewed distribution more normal. Also, when a change in the dependent variable is related to a percentage change in an independent variable, or vice versa, the relationship is better modeled by taking the natural log of either or both of the variables.
For example, I estimate a person's wage based on education, experience, and region of residence using Stata's sample data nlsw88, an extract from the 1988 National Longitudinal Survey of Young Women.

    sysuse nlsw88
    reg wage grade tenure south

It looks ok, but when I look at the distribution of tenure, it looks somewhat skewed.

    histogram tenure

So I compute the natural log of tenure.

    gen lntenure=ln(tenure)
    histogram lntenure

It seems to have overshot a little, but looks somewhat normal. I try a regression with the logged tenure. The R-squared has gotten a little higher, so taking the natural log seems to have helped fit it into the model better. When the independent variable but not the dependent variable is logged, a one percent change in the independent variable is associated with a change of 1/100 times the coefficient in the dependent variable.

    predicted wage = -1.639 + 0.681*GRADE + 0.774*LNTENURE - 1.134*SOUTH

So a one percent increase in tenure is associated with an increase in the wage of 0.01 x 0.774, or about $0.0077.

Now I examine the wage, and find that it is very skewed.

    histogram wage

So I take the natural log of wage, and look at the distribution of logged wage.

    gen lnwage=ln(wage)
    histogram lnwage

The distribution looks much more normal. Now I run the same regression with the logged wage as the dependent variable.

    reg lnwage grade tenure south

When the dependent variable but not an independent variable is logged, a one-unit change in the independent variable is associated with a change of approximately 100 times the coefficient percent in the dependent variable.

    predicted lnwage = 0.666 + 0.085*GRADE + 0.026*TENURE - 0.150*SOUTH

In this data, tenure is measured in years: so a one-year increase in tenure increases the wage by 100 x 0.026 %, or about 2.6%.

If we log both the dependent and an independent variable, then we are looking at elasticity: a percentage change in X results in a percentage change in Y.
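The three log specifications discussed here can be sketched in one run. The commands for the models with lntenure are not shown in the text, so the variable lists below are assumptions based on the reported equations. The "exact" percent change is an addition for comparison with the 100-times-the-coefficient rule of thumb.

```stata
sysuse nlsw88, clear

* ln() returns missing when tenure is 0, so any such observations
* drop out of regressions that use lntenure.
generate lntenure = ln(tenure)
generate lnwage   = ln(wage)

* Level-log: a 1% change in tenure shifts wage by about _b[lntenure]/100.
regress wage grade lntenure south
display "wage change per 1% tenure = " _b[lntenure]/100

* Log-level: one more year of tenure changes wage by about
* 100 times the coefficient, in percent.
regress lnwage grade tenure south
display "approx. % change = " 100*_b[tenure]
display "exact % change   = " 100*(exp(_b[tenure]) - 1)

* Log-log: the coefficient on lntenure is read as an elasticity.
regress lnwage grade lntenure south
display "elasticity = " _b[lntenure]
```

For small coefficients the approximate and exact percent changes are close; they diverge as the coefficient grows.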
    predicted lnwage = 0.659 + 0.084*GRADE + 0.136*LNTENURE - 0.151*SOUTH

A one percent increase in tenure is estimated to result in about a 0.136% increase in wage.

Interactions

Between a Dummy and a Continuous Variable

When I included foreign in the gas model earlier, I was examining the effect of weight controlling for foreign (or foreign controlling for weight). There is one slope for weight, shared by both groups, and the effect of foreign was reflected as a different intercept from domestic. Now suppose I think that the effect of weight on mpg is different for foreign and domestic cars. So I am thinking that foreign and domestic cars not only have different intercepts but also have different slopes. So I compute the interaction between foreign and weight by multiplying them, and include it in the model.

    predicted MPG = 39.65 - 0.006*WEIGHT + 9.27*FOREIGN - 0.004*FOREIGN*WEIGHT

Here, 39.65 is the intercept and -0.006 is the slope for domestic cars, and 39.65 + 9.27 or 48.92 is the intercept and -0.006 - 0.004 or -0.01 is the slope for foreign cars. Predicted mpg for domestic cars evaluated at the mean weight is 39.65 - 0.006(3019) = 21.54, and for foreign cars it is 48.92 - 0.01(3019) = 18.73.

The difference may be easier to see in a graph. You can save the predicted values by issuing the predict command. I call the predicted values predicted2 for the model that includes the interaction and predicted1 for the model that excludes it.

    predict predicted2

Then I plotted observed values as dots and predicted values as lines, separately for domestic and foreign cars, for the model that includes an interaction term. You can see that the intercepts and the slopes are a bit different between the two.

    graph twoway (scatter mpg weight) (line predicted2 weight), by(foreign)

Here I plotted the same without the interaction term. You can see that their slopes are about the same.
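A minimal sketch of how predicted1 and predicted2 might be produced. The text does not show the regress commands, so two things here are assumptions: that the model without the interaction is regress mpg weight foreign, and the name forweight for the product term, which is hypothetical.

```stata
sysuse auto, clear

* Model without the interaction; fitted values saved as predicted1.
regress mpg weight foreign
predict predicted1

* Interaction term: equals weight for foreign cars and 0 for domestic cars.
* The variable name forweight is a hypothetical choice.
generate forweight = foreign*weight
regress mpg weight foreign forweight
predict predicted2

* Intercept and slope implied for each group.
display "domestic intercept = " _b[_cons]
display "domestic slope     = " _b[weight]
display "foreign intercept  = " _b[_cons] + _b[foreign]
display "foreign slope      = " _b[weight] + _b[forweight]

* Sort so that line plots connect the points in order of weight.
sort foreign weight
```

Sorting by weight within foreign matters for the line plots, because line connects points in the order of the data.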
    graph twoway (scatter mpg weight) (line predicted1 weight), by(foreign)

Between Two Continuous Variables

Suppose I suspect that the effect of weight on mpg differs by the value of length. So I compute a weight-length interaction and include it in the model.

    predicted MPG = 67.45 - 0.01*WEIGHT - 0.18*LENGTH + 0.0000376*WEIGHT*LENGTH

The change in predicted MPG for a one-unit change in weight is -0.01 + 0.0000376(length), and the change in predicted MPG for a one-unit change in length is -0.18 + 0.0000376(weight).

Multicollinearity

Multicollinearity is a condition where independent variables are strongly correlated with each other. When multicollinearity exists in your model, you may see very high standard errors and low t statistics, unexpected changes in coefficient magnitudes or signs, or non-significant coefficients despite a high R-squared. Stata drops perfectly collinear independent variables with warnings. If the collinearity is high but not perfect, you may want to check for multicollinearity.

You can check for multicollinearity by running a regression with each of the predictor variables as the dependent variable, against all the other predictors. Then examine how much of the variable's variance is independent of the other predictors. Using the same auto data, let's check if we observe multicollinearity.

Putting quietly at the beginning of the regress command suppresses the output. I executed the command to get the R-squared, which is saved in Stata's internal memory as e(r2). To learn more about saved results, type help ereturn in the Command window.

The variable foreign seems to be ok, having about 62% of its variance independent of other predictors. But less than 2% of the variance of weight and weight2 is independent of other predictors. Weight2 is computed from weight, so it is understandable.

The same values can be computed by using a regress postestimation command, estat vif. This time, you run the whole model including the dependent variable. 1/VIF gives the same values as the 1-R2 we computed earlier.
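The check described above can be sketched as follows. The exact commands are not shown in the text, so this assumes weight2 is the square of weight and that the full model is the quadratic model with foreign from the earlier section.

```stata
sysuse auto, clear
generate weight2 = weight^2

* Auxiliary regressions: each predictor on the other two.
* 1 - R-squared is the share of its variance not explained by the others.
quietly regress weight weight2 foreign
display "weight:  1 - R2 = " 1 - e(r2)
quietly regress weight2 weight foreign
display "weight2: 1 - R2 = " 1 - e(r2)
quietly regress foreign weight weight2
display "foreign: 1 - R2 = " 1 - e(r2)

* Same quantities from the full model: the 1/VIF column of estat vif.
regress mpg weight weight2 foreign
estat vif
```

Each displayed 1 - R2 should match the 1/VIF entry for the same variable in the estat vif table.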
The VIF column shows by how much the variance of each coefficient is inflated because that predictor is correlated with the other predictors (its standard error is inflated by the square root of the VIF). We see that the variance for foreign is only modestly inflated, but the variances for weight and weight2 are inflated substantially. What can we do to address this problem? We may be able to reduce the multicollinearity by centering, which is subtracting the mean from the predictor values before generating the square term.

Again, here, I execute the summarize command to get the mean, which is saved as r(mean). Type help return in the Command window to learn more about r-class results.

Let's check to see if the centered weights have corrected the multicollinearity we observed in the weights. The correlation between weight and weight2 is 0.99, but the correlation between centered_weight and centered_weight2 is 0.14. Now 1/VIF shows that 61% of centered_weight's and 93% of centered_weight2's variances are independent of other variables.

I used centering to show an example of how to correct for multicollinearity, but in this case, it may not really have been necessary. If you compare regression results using weights and centered weights, you see that the overall R-squared and the p-values for weights are not so different between the two models. So you do not always have to do this centering when you include a square term in the model. It may be more of an issue when two supposedly different but very closely related variables are included and show the conditions described earlier: standard errors are substantially high, coefficients' magnitudes and signs are unexpected, or coefficients are not significant while the R-squared is high.
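The centering steps described above can be sketched as follows. The variable names centered_weight and centered_weight2 come from the text; the commands themselves are not shown there, so this is one plausible way to produce them.

```stata
sysuse auto, clear
generate weight2 = weight^2

* summarize leaves the mean in r(mean); subtract it before squaring.
quietly summarize weight
generate centered_weight  = weight - r(mean)
generate centered_weight2 = centered_weight^2

* Correlation between the linear and squared terms, before and after centering.
correlate weight weight2
correlate centered_weight centered_weight2

* Refit with the centered terms and check the variance inflation factors.
regress mpg centered_weight centered_weight2 foreign
estat vif
```

Centering only re-expresses the same quadratic, so the overall fit of the model does not change; what changes is the correlation between the linear and squared terms.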