Adding categorical variables to OLS regression models

Our example of a categorical explanatory variable is egp, based on Erikson, Goldthorpe and Portocarero’s class schema [Eri92]. In our implementation, egp consists of five classes. The classes can be seen as ranked, although the placement of the Routine non-manual class is debatable. In any case, the best way to add social class to the regression model is to decompose (recode) it into a set of dummy variables, one fewer than the number of categories. Since we have five classes, four of them are represented by dummy variables and the omitted one serves as the reference category.

Table 1.5. EGP classes, frequency table

Class                      Frequency    Percent of sample
1 Upper service class            328                  7.9
2 Lower service class          1 181                 28.6
3 Routine non-manual           1 248                 30.2
4 Skilled workers                648                 15.7
5 Unskilled workers              637                 15.4
Valid                          4 042                 97.9
Missing                           85                  2.1
Total                          4 127                100

Class schema of Erikson, Goldthorpe and Portocarero.

First, create the necessary set of dummy variables, egp1 to egp4, by recoding egp. Let unskilled workers be the reference category.

SPSS syntax
recode egp (1=1) (2,3,4,5=0) into egp1.
recode egp (2=1) (1,3,4,5=0) into egp2.
recode egp (3=1) (1,2,4,5=0) into egp3.
recode egp (4=1) (1,2,3,5=0) into egp4.

Stata syntax
tab egp, gen(egp)
This command creates the full set of dummy variables, including egp5 (the reference category), which is left out of the regression model.

Next, add the set of dummies, egp1 - egp4, to the previous regression model, and answer two questions: Does social class significantly improve upon our model? How do we interpret the coefficients?

SPSS tip
Add the set of dummy variables in a second block in the menus or by adding a second ‘/METHOD ENTER’ subcommand to the syntax.

Stata tip
Two steps are needed in Stata: first estimate the model with regress, then use the test command to perform the F-test that answers the first question.

SPSS solution
REGRESSION
/MISSING LISTWISE
/DESCRIPTIVES
/STATISTICS COEFF OUTS R ANOVA CHANGE
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT wage
/METHOD=ENTER edyears age female
/METHOD ENTER egp1 to egp4.

The most important parts of the output are shown below:

Tables 1.6, 1.7 and 1.8. Regression analysis with two models - SPSS output

SPSS interpretation

The first table shows R square, the coefficient of determination (the squared multiple correlation coefficient), for the basic model and for the model with class added. R square for the first model is 0.319. How should this be interpreted? Education, age and gender together account for about 32 percent of the variance in wages. R square for the second model is 0.348, an increase of 0.029. Is this improvement statistically significant? The F Change statistic answers this question: it is 40.97 (df1=4, df2=3672), with a highly significant probability value (p<0.001). This outcome indicates that social class should be added to the model.

The F-test can be used to compare any nested models. In our case, model 1 is nested within model 2. The F statistic is computed from the residual sum of squares found in the ANOVA table. The sample size is n=3680, K=8 is the number of parameters in model 2, and H=4 is the difference in the number of parameters in the two models.
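The formula itself is not reproduced in the text. A standard form of the F-test for nested models, using the n, K and H defined above, can be sketched as follows. The symbols RSS_1 and RSS_2 (residual sums of squares of models 1 and 2) and R_1^2, R_2^2 (their R squares) are notation assumed for this illustration, and the R square version holds only when both models are estimated on the same cases.

```latex
F = \frac{(RSS_1 - RSS_2)/H}{RSS_2/(n-K)}
  = \frac{(R_2^2 - R_1^2)/H}{(1 - R_2^2)/(n-K)}
  \approx \frac{(0.348 - 0.319)/4}{(1 - 0.348)/3672}
  \approx 40.8
```

Plugging in the rounded R square values gives roughly 40.8, close to the reported F Change of 40.97; the small gap presumably comes from rounding R square to three decimals. The denominator degrees of freedom, n - K = 3680 - 8 = 3672, match df2 in the output.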

The coefficients are interpreted as the difference in wages between a given class and the reference category, controlling for education, age and gender. The Upper Service Class (egp1) earns NOK 16.67 more per hour than Unskilled Workers (the reference category), controlling for the other variables in the model.
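To see why each dummy coefficient is a contrast with the reference category, model 2 can be written out as below. The coefficient symbols (\beta_0 to \beta_7, \varepsilon) and the shorthand x for fixed values of education, age and gender are assumptions made for this sketch; they do not appear in the output.

```latex
\begin{aligned}
\mathit{wage} &= \beta_0 + \beta_1\,\mathit{edyears} + \beta_2\,\mathit{age} + \beta_3\,\mathit{female} \\
              &\quad + \beta_4\,\mathit{egp1} + \beta_5\,\mathit{egp2} + \beta_6\,\mathit{egp3} + \beta_7\,\mathit{egp4} + \varepsilon \\
E[\mathit{wage} \mid \mathit{egp}=1, x] - E[\mathit{wage} \mid \mathit{egp}=5, x] &= \beta_4
\end{aligned}
```

For unskilled workers all four dummies are zero, so their expected wage is absorbed by the intercept and the other covariates. Switching egp1 from 0 to 1 while holding x fixed shifts the expected wage by \beta_4, which is the 16.67 reported for the Upper Service Class.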

Stata solution
. regress wage edyears age female egp1 egp2 egp3 egp4
. test egp1 egp2 egp3 egp4

Table 1.9. Regression analysis with two models - Stata output

Stata interpretation

The F-test can be used to compare any nested models. In our case, model 1 is nested within model 2. The F statistic is computed from the residual sum of squares for the two models found in the ANOVA part of the table. In addition to the above output for model 2, we need the output from model 1 to perform the F-test. The sample size is n=3680, K=8 is the number of parameters in model 2, and H=4 is the difference in the number of parameters in the two models.
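The manual computation described above could look roughly like the sketch below. It relies on the stored results e(rss) and e(df_r) that regress leaves behind; the scalar names (rss_m1, rss_m2, df_m2, n_added, f_change) are made up for this illustration. Model 1 is restricted to the estimation sample of model 2 so that both models use the same cases.

```stata
* Model 2: basic model plus the four class dummies
quietly regress wage edyears age female egp1 egp2 egp3 egp4
scalar rss_m2 = e(rss)      // residual sum of squares, model 2
scalar df_m2  = e(df_r)     // residual degrees of freedom, n - K

* Model 1: basic model, same cases as model 2
quietly regress wage edyears age female if e(sample)
scalar rss_m1 = e(rss)      // residual sum of squares, model 1

* F statistic for the H = 4 added parameters and its p-value
scalar n_added  = 4
scalar f_change = ((rss_m1 - rss_m2)/n_added) / (rss_m2/df_m2)
display "F(" n_added ", " df_m2 ") = " %8.2f f_change
display "p = " %6.4f Ftail(n_added, df_m2, f_change)
```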

However, the test post-estimation command is a shortcut to the solution. It tests the null hypothesis that the regression coefficients of the egp dummy variables are all zero. The F-statistic is highly significant, indicating that social class should be added to the model. The result is identical to the F-change test in SPSS.
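As an alternative to creating the dummies by hand, the same model can be sketched with Stata's factor-variable notation. This assumes Stata 11 or later and that egp is stored as integer codes 1 to 5; it is not the approach used in the text above. The ib5. prefix makes category 5 (unskilled workers) the reference category, and testparm tests all class coefficients jointly.

```stata
* Same model, with egp expanded into dummies on the fly (base = class 5)
regress wage edyears age female ib5.egp

* Joint F-test that all egp coefficients are zero
testparm i.egp
```

Under these assumptions the joint test should match the one obtained from the test command on egp1 to egp4, since the two specifications are the same model written differently.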

The coefficients are interpreted as the difference in wages between a given class and the reference category, controlling for education, age and gender. The Upper Service Class (egp1) earns NOK 16.67 more per hour than Unskilled Workers (the reference category), controlling for the other variables in the model.

References