# Adding categorical variables to OLS regression models

Our example of a categorical explanatory variable is egp, based on Erikson, Goldthorpe and Portocarero’s class schema [Eri92]. In our implementation, egp consists of five classes. The classes can be seen as ranked, although the placement of the Routine non-manual class is questionable. In any case, the best way to add social class to the regression model is to decompose (recode) it into a set of dummy variables, one fewer than the number of categories. Since we have five classes, four of them need to be represented by dummy variables, and the omitted one will serve as the reference category.

Class | Frequency | Percent of sample
---|---|---
1 Upper service class | 328 | 7.9
2 Lower service class | 1 181 | 28.6
3 Routine non-manual | 1 248 | 30.2
4 Skilled workers | 648 | 15.7
5 Unskilled workers | 637 | 15.4
Valid | 4 042 | 97.9
Missing | 85 | 2.1
Total | 4 127 | 100

Class schema of Erikson, Goldthorpe and Portocarero.

**First, create the necessary set of dummy variables, egp1 to egp4, by recoding egp.** Let unskilled workers (class 5) be the reference category.
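For readers who want to see the recoding logic outside SPSS or Stata, the following is a minimal Python sketch of dummy coding. The function name `make_dummies` and the short list of example values are illustrative assumptions, not part of the original data or syntax; missing values are assumed to be represented as `None` and are kept missing in every dummy.

```python
def make_dummies(egp_values, n_classes=5, reference=5):
    """Return a dict of 0/1 indicator lists, one per non-reference class.

    Hypothetical helper: classes are assumed to be coded 1..n_classes,
    and missing values are assumed to be None.
    """
    dummies = {}
    for k in range(1, n_classes + 1):
        if k == reference:
            continue  # the reference category gets no dummy variable
        dummies[f"egp{k}"] = [
            None if v is None else int(v == k) for v in egp_values
        ]
    return dummies


# Illustrative values only, not taken from the survey data.
egp = [1, 3, 5, None, 2, 4]
dummies = make_dummies(egp)
for name in sorted(dummies):
    print(name, dummies[name])
```

A respondent in the reference category (class 5) scores 0 on all four dummies, which is why only four variables are needed for five classes.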

**Next, add the set of dummies, egp1 - egp4, to the previous regression model, and answer two questions**: Does social class significantly improve upon our model? How do we interpret the coefficients?

**SPSS tip**

Add the set of dummy variables in a second block in the menus or by adding a second ‘/METHOD ENTER’ subcommand to the syntax.

**Stata tip**

Two steps are needed in Stata: first estimate the model with regress, then use the test post-estimation command to perform the F-test that answers the first question.

The most important parts of the output are shown below:

Tables 1.6, 1.7 and 1.8. Regression analysis with two models - SPSS output

#### SPSS interpretation

The first table shows the R square, that is, the squared multiple correlation coefficient, for the basic model and for the one with class added. The R square for the first model is 0.319, which means that the basic model accounts for 31.9 per cent of the variance in the dependent variable. The R square for the second model is 0.348, an increase of 0.029. Is this improvement in the R square statistically significant? This question can be answered with the help of the F Change statistic, which is 40.97 (df1=4, df2=3672), with a highly significant probability value (p<0.001). This outcome indicates that social class should be added to the model.
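As a rough check on the F Change statistic, it can be approximated from the two R square values and the degrees of freedom reported above. The sketch below is an illustration only: the function name `f_change` is an assumption, and because the R square values are rounded to three decimals the result will be close to, but not exactly, the 40.97 that SPSS reports.

```python
def f_change(r2_restricted, r2_full, df1, df2):
    """Approximate F Change from the R squares of two nested models.

    df1 is the number of added parameters, df2 the residual degrees
    of freedom of the larger model.
    """
    gain_per_parameter = (r2_full - r2_restricted) / df1
    unexplained_per_df = (1.0 - r2_full) / df2
    return gain_per_parameter / unexplained_per_df


# Values as reported in the text (R squares rounded to three decimals).
f_stat = f_change(0.319, 0.348, df1=4, df2=3672)
print(round(f_stat, 2))
```

The small gap between this value and the SPSS figure is presumably due to the rounding of the R square values.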

The F-test can be used to compare any nested models. In our case, model 1 is nested within model 2. The F statistic is computed from the residual sums of squares of the two models, found in the ANOVA table. The sample size is n=3680, K=8 is the number of parameters in model 2 (including the constant), and H=4 is the difference in the number of parameters between the two models.
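Written out, the standard F statistic for comparing two nested models takes the following form. The symbols $RSS_1$ and $RSS_2$ are labels introduced here for the residual sums of squares of model 1 and model 2; n, K and H are as defined above.

$$
F = \frac{(RSS_1 - RSS_2)/H}{RSS_2/(n-K)}
$$

With n=3680, K=8 and H=4, the statistic has H=4 and n-K=3672 degrees of freedom, which matches the df1 and df2 reported for the F Change statistic.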

The coefficients are interpreted as the difference in wages between a given class and the reference category, controlling for education, age and gender. The Upper Service Class (egp1) earns NOK 16.67 more per hour than Unskilled Workers (the reference category), controlling for the other variables in the model.
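The phrase "controlling for the other variables" can be illustrated with a small Python sketch. Only the egp1 coefficient of 16.67 comes from the text; the function `predicted_wage` and the value standing in for the combined contribution of the constant, education, age and gender are hypothetical placeholders chosen purely for illustration.

```python
B_EGP1 = 16.67  # egp1 coefficient reported in the text (NOK per hour)


def predicted_wage(egp1, controls_part):
    """Predicted hourly wage for a person in class 1 (egp1=1) or in the
    reference category (egp1=0).

    controls_part is a hypothetical stand-in for the constant plus the
    contributions of education, age and gender.
    """
    return controls_part + B_EGP1 * egp1


# Two people with identical education, age and gender (hypothetical value).
controls_part = 120.0
upper_service = predicted_wage(1, controls_part)
unskilled = predicted_wage(0, controls_part)
difference = upper_service - unskilled
print(round(difference, 2))
```

Whatever value the controls take, the predicted difference between the two classes equals the egp1 coefficient, which is what the interpretation above expresses.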

Table 1.9. Regression analysis with two models - Stata output

#### Stata interpretation

The F-test can be used to compare any nested models. In our case, model 1 is nested within model 2. The F statistic is computed from the residual sum of squares for the two models found in the ANOVA part of the table. In addition to the above output for model 2, we need the output from model 1 to perform the F-test. The sample size is n=3680, K=8 is the number of parameters in model 2, and H=4 is the difference in the number of parameters in the two models.

However, the test post-estimation command is a shortcut to the same result. It tests the null hypothesis that the regression coefficients of the egp dummy variables are all zero. The F-statistic is highly significant, indicating that social class should be added to the model. The result is identical to the F Change test in SPSS.

The coefficients are interpreted as the difference in wages between a given class and the reference category, controlling for education, age and gender. The Upper Service Class (egp1) earns NOK 16.67 more per hour than Unskilled Workers (the reference category), controlling for the other variables in the model.

#### References

- [Eri92] Erikson, R. & J.H. Goldthorpe (1992): The constant flux. A study of class mobility in industrial societies. Oxford: Clarendon Press.