
# Chapter 1: Refreshing OLS regression

First, the multiple regression model for the population and its assumptions are presented in brief. The main part combines theory with exercises on essential complexities: categorical explanatory variables, adding non-linearity, adding statistical interaction, and a review of statistical tests. Those who are familiar with these topics may skip the first chapter.

# The OLS model

Multilevel analysis can be seen as an extension of ordinary least squares (OLS) regression that allows for more complex error structures. It is therefore essential to refresh basic regression skills. We will do this by estimating and interpreting variations on a simple model for hourly wages among employees. The example has the advantage of having a continuous dependent variable and explanatory variables that will be familiar to everyone: years of education, age and gender.

The analyses will be variations of the following multiple regression model with hourly wage (Y) as the dependent variable and years of education (X_{1}), age (X_{2}), and female gender (X_{3}) as independent variables. The regression constant (intercept) and the regression coefficients are denoted by Greek letters (beta) and the residual (error) term by an ‘e’. The index ‘i’ applies to all employees (units) in the sample.

Y_{i} = β_{0} + β_{1}X_{1i} + β_{2}X_{2i} + β_{3}X_{3i} + e_{i}

The regression model for the population rests on two sets of assumptions, about the specification of the model and about the residuals. Let us quickly review these assumptions:

A model should be correctly specified:

1. All relevant x-variables should be included, and irrelevant ones eliminated.
2. The relationships between the x-variables and Y are linear.
3. The model is additive, without statistical interaction between the x-variables.

Assumption 2 can be relaxed by building non-linearity into the model, for instance by adding polynomials. Assumption 3 can be relaxed by adding interaction terms to the model.

The assumptions about the residuals:

- The residuals should have a mean (expected value) of zero in the population.
- The residuals should have equal variance for subgroups of all x-variables (homoscedasticity).
- The residuals are uncorrelated with each other and with the x-variables.
- The residuals should be normally distributed.
- The x-variables should not be perfectly correlated, pairwise or group-wise (no multicollinearity).

The regression equation for the sample is normally expressed in Roman letters:

Y_{i} = b_{0} + b_{1}X_{1i} + b_{2}X_{2i} + b_{3}X_{3i} + e_{i}
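Since the rest of the chapter leans on these sample estimates, it may help to see the least squares principle in miniature. The following Python sketch is illustrative only and is not part of the SPSS/Stata exercises: it fits a one-predictor version of the model with the closed-form OLS formulas, on made-up data rather than the wage data.

```python
# Illustrative only: fit y = b0 + b1*x by ordinary least squares,
# using the closed-form solutions for simple (one-predictor) regression.
def ols_simple(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx      # slope
    b0 = my - b1 * mx   # intercept
    return b0, b1

# Toy data generated from y = 10 + 2x with no noise,
# so OLS recovers the coefficients exactly.
x = [1, 2, 3, 4, 5]
y = [10 + 2 * xi for xi in x]
b0, b1 = ols_simple(x, y)
print(b0, b1)  # → 10.0 2.0
```

With several x-variables, as in the wage model, the same principle applies but the solution requires matrix algebra, which SPSS and Stata handle internally.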

# Preliminary analysis

The first step is to create the set of variables to be used in the analysis. This sometimes requires recoding or other operations on variables in the data set. Our data set is ready to use and has only a minimal number of variables. Let us familiarize ourselves with them by looking at descriptive statistics and then more carefully at the dependent variable by means of a histogram.

The first exercise consists of opening the file and learning the variable names.

- Download and open the file wage89.sav (SPSS) or wage89.dta (Stata).
- Next, present descriptive statistics for the variables and show the distribution of the dependent variable, *wage*, in a histogram.

*You can copy, paste and run this syntax.

*Descriptive, chapter 1, page 2, exercise 2.

DESCRIPTIVES VARIABLES=wage edyears age female
  /STATISTICS=MEAN STDDEV MIN MAX.

Table 1.1. Descriptive statistics - SPSS output

*Histogram, chapter 1, page 2, exercise 2.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=wage MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: wage=col(source(s), name("wage"))
  GUIDE: axis(dim(1), label("Hourly wage in NOK"))
  GUIDE: axis(dim(2), label("Frequency"))
  ELEMENT: interval(position(summary.count(bin.rect(wage))), shape.interior(shape.square))
END GPL.

Figure 1.1. Histogram - SPSS output

*You can copy, paste and run this syntax.

*Descriptive, chapter 1, page 2, exercise 2.

. summarize wage edyears age female

Table 1.2. Descriptive Statistics - Stata output

Variable | Obs | Mean | Std. Dev. | Min | Max |
---|---|---|---|---|---|
wage | 3759 | 90.15 | 30.31 | 25 | 343.75 |
edyears | 4127 | 2.69 | 2.56 | 0 | 11 |
age | 4127 | 39.65 | 12.36 | 16 | 74 |
female | 4127 | .47 | .50 | 0 | 1 |

Only two decimals are shown.

*Histogram, chapter 1, page 2, exercise 2.

. histogram wage, frequency

Figure 1.2. Histogram - Stata output

# Basic OLS regression models

Let us estimate the regression model, first by using the familiar regression routine in SPSS and Stata and then by using the Mixed procedures for estimating multilevel models. In SPSS, use *Regression* to estimate the regression of wage on years of education, age and gender. In Stata, use *Regress* to estimate the model. We keep wage as the dependent variable, although most wage analyses are based on the natural logarithm of wage.

Table 1.3. Regression coefficients^{a} - SPSS output

The table shows the unstandardized (metric) regression coefficients, their standard errors, the standardized regression coefficients (SPSS only), the t-value and its probability level, and 95 per cent confidence intervals for the regression coefficients (Stata only). To evaluate the statistical significance of a regression coefficient, we use the t-statistic, which is approximately normally distributed in large samples. The implicit statistical hypotheses are:

H_{0} : β = 0 and H_{1} : β ≠ 0

The test statistic is the t-ratio formed by dividing the estimated regression coefficient by its standard error:

t = b / s_{b}

The critical value in the two-sided test with a level of significance of five per cent is 1.96 in absolute value. In our example, all coefficients have very large t-ratios, and they are statistically significant at any conventional level.
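To make the test concrete, here is a minimal Python sketch of the t-ratio and its large-sample (normal-approximation) two-sided p-value. Only the coefficient 4.87 comes from the example; the standard error is invented for illustration.

```python
import math

def t_ratio(b, se):
    # t-ratio: estimated coefficient divided by its standard error.
    return b / se

def two_sided_p(t):
    # Two-sided p-value from the standard normal distribution,
    # valid as a large-sample approximation to the t-distribution.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

# Coefficient from the text; the standard error 0.25 is hypothetical.
t = t_ratio(4.87, 0.25)
print(round(t, 2), two_sided_p(t) < 0.05)  # → 19.48 True
```

Any t-ratio larger than 1.96 in absolute value is significant at the five per cent level in this approximation.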

- How do we interpret the regression coefficients of the continuous variables: years of education and age?
- What about the interpretation of female, which is a dummy variable?

*Edyears*: for each added year of education, we expect the hourly wage to increase by NOK 4.87, controlling for age and gender. Alternatively, the marginal effect of one added year of education is to increase the expected wage by NOK 4.87.

*Female*: since female is a dummy variable, the regression coefficient is simply the (adjusted) difference in means between the two categories. Since the difference is negative, women (1) earn NOK 17.60 less than men (0) on average, controlling for age and education.
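The mean-difference reading holds exactly when the dummy is the only regressor; with controls it becomes an adjusted difference. A Python sketch on made-up wages (not the course data) shows the exact case:

```python
# Illustrative: with a single dummy regressor, the OLS slope equals the
# difference in group means. Wages below are made up, not the course data.
y = [100.0, 110.0, 120.0, 95.0, 100.0, 105.0]
female = [0, 0, 0, 1, 1, 1]

# OLS slope of y on the dummy, via the covariance formula.
n = len(y)
mx = sum(female) / n
my = sum(y) / n
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(female, y)) / \
     sum((xi - mx) ** 2 for xi in female)

# Group means computed directly.
mean_men = sum(yi for yi, f in zip(y, female) if f == 0) / female.count(0)
mean_women = sum(yi for yi, f in zip(y, female) if f == 1) / female.count(1)
print(b1, mean_women - mean_men)  # → -10.0 -10.0
```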

# Adding categorical variables to OLS regression models

Our example of a categorical explanatory variable is egp, based on Erikson, Goldthorpe and Portocarero's class schema [Eri92]. In our implementation, egp consists of five classes. The classes can be seen as being in ranked order, but the placement of the Routine non-manual class is questionable. In any case, the best way to add social class to the regression model is to decompose (recode) social class into a set of dummy variables, one less than the number of categories. Since we have five classes, four of them need to be represented by dummy variables and the omitted one will serve as the reference category.

Class | Frequency | Percent of sample |
---|---|---|
1 Upper service class | 328 | 7.9 |
2 Lower service class | 1 181 | 28.6 |
3 Routine non-manual | 1 248 | 30.2 |
4 Skilled workers | 648 | 15.7 |
5 Unskilled workers | 637 | 15.4 |
Valid | 4 042 | 97.9 |
Missing | 85 | 2.1 |
Total | 4 127 | 100 |

Class schema of Erikson, Goldthorpe and Portocarero.

**First, create the necessary set of dummy variables, egp1 to egp4 by recoding egp.** Let unskilled workers be the reference category.

**Next, add the set of dummies, egp1 - egp4, to the previous regression model, and answer two questions**: Does social class significantly improve upon our model? How do we interpret the coefficients?

**SPSS tip**

Add the set of dummy variables in a second block in the menus or by adding a second ‘/METHOD ENTER’ subcommand to the syntax.

**Stata tip**

Two steps are needed in Stata; first estimate the model and then use the test command after regress to perform the F-test to answer the first question.
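The recode logic behind the exercise is the same in any language. As a neutral illustration (Python rather than SPSS or Stata syntax), one way to map egp onto the four dummies, with unskilled workers (code 5) as the reference, is:

```python
# Illustrative sketch of the recode: decompose the five-category egp
# variable into four dummies, egp1-egp4, leaving category 5
# (unskilled workers) as the reference (all dummies zero).
def egp_dummies(egp):
    # Returns (egp1, egp2, egp3, egp4) for one respondent.
    return tuple(1 if egp == k else 0 for k in (1, 2, 3, 4))

print(egp_dummies(1))  # upper service class → (1, 0, 0, 0)
print(egp_dummies(5))  # unskilled workers, reference → (0, 0, 0, 0)
```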

The most important parts of the output are shown below:

Table 1.6, 1.7 and 1.8. Regression analysis with two models - SPSS output

#### SPSS interpretation

The first table shows the R square, the squared multiple correlation coefficient, for the basic model and for the model with class added. The R square for the first model is 0.319. How should this be interpreted? The second is 0.348, an increase of 0.029. Is this improvement in the R square statistically significant? This question can be answered with the help of the F Change statistic, which is 40.97 (df1=4, df2=3672), with a highly significant probability value (p<0.001). This outcome indicates that social class should be added to the model.

The F-test can be used to compare any nested models. In our case, model 1 is nested within model 2. The F statistic is computed from the residual sums of squares found in the ANOVA table: F = [(RSS_{1} - RSS_{2})/H] / [RSS_{2}/(n - K)]. The sample size is n=3680, K=8 is the number of parameters in model 2 (including the constant), and H=4 is the difference in the number of parameters between the two models.
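For readers who want to verify the arithmetic, the F-change statistic can be reproduced approximately from the two R squares reported above (approximately, because the printed R squares are rounded). This Python sketch uses the R-square form of the F formula, which is algebraically equivalent to the residual-sum-of-squares form:

```python
# F-change test for H added parameters, computed from the R squares of
# the restricted and full models. k_full counts all parameters in the
# full model, including the constant.
def f_change(r2_restricted, r2_full, n, k_full, h):
    return ((r2_full - r2_restricted) / h) / ((1 - r2_full) / (n - k_full))

# Values quoted in the text: R2 = 0.319 and 0.348, n = 3680, K = 8, H = 4.
F = f_change(0.319, 0.348, n=3680, k_full=8, h=4)
print(round(F, 1))  # → 40.8 (SPSS reports 40.97 from unrounded sums of squares)
```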

The coefficients are interpreted as the difference in wages between a given class and the reference category, controlling for education, age and gender. The Upper Service Class (egp1) earns NOK 16.67 more per hour than Unskilled Workers (the reference category), controlling for the other variables in the model.

Table 1.9. Regression analysis with two models - Stata output

#### Stata interpretation

The F-test can be used to compare any nested models. In our case, model 1 is nested within model 2. The F statistic is computed from the residual sums of squares for the two models, found in the ANOVA part of the table: F = [(RSS_{1} - RSS_{2})/H] / [RSS_{2}/(n - K)]. In addition to the above output for model 2, we need the output from model 1 to perform the F-test. The sample size is n=3680, K=8 is the number of parameters in model 2 (including the constant), and H=4 is the difference in the number of parameters between the two models.

However, the *test* post-estimation command is a shortcut to the solution. We test the null hypothesis that the regression coefficients of the egp dummy variables are all zero. The F statistic is highly significant, indicating that social class should be added to the model. The result is identical to the F-change test in SPSS.

The coefficients are interpreted as the difference in wages between a given class and the reference category, controlling for education, age and gender. The Upper Service Class (egp1) earns NOK 16.67 more per hour than Unskilled Workers (the reference category), controlling for the other variables in the model.

# Adding non-linearity to OLS regression models

The linearity assumption may be in conflict with theory and earlier research, which indicate non-linear relationships between an explanatory variable (X) and the dependent variable (Y). Non-linearity can be added to an OLS model in two ways. The most common is to add the quadratic version of a continuous variable to the model; the second is to decompose the x-variable into a set of dummy variables. In our case, *edyears* and *age* are candidates for such treatment. In wage equations, age is expected to show a non-linear relationship to wage. This is normally modelled by adding age squared to the model. Together, age and age squared can describe a curvilinear relationship with one turning point. If the expected relationship is even more complex, it is possible to add a cubic term.

The next task is to create *age squared* and re-estimate the model with both versions of age. Try to do this in SPSS or Stata. For simplicity’s sake, we drop the set of dummy variables for social class. Also create a graph of the relationship between wage and age.

The SPSS syntax is not repeated here, but the output should look like this:

Table 1.10. Regression analysis with two age variables - SPSS output

The coefficient of age squared is clearly statistically significant and indicates that the relationship between age and wage is not linear. Note that the coefficients of *edyears* and *female* are only slightly changed, while the age coefficient and the constant are dramatically different as a result of *agesqr* being added. These changes are quite normal: the dramatic increase in the age coefficient is the result of adding the highly correlated age squared variable. This correlation is high but not perfect, so the no-multicollinearity assumption is not violated, and since both age variables have statistically significant coefficients, it is not normally seen as problematic. Adding the squared term means that the two age coefficients cannot be interpreted separately. The signs of the coefficients reveal the rough form of the relationship: the positive coefficient for age and the negative one for age squared could indicate an increasing function of wage by age until a turning point is reached, from which point the function starts to decrease. The turning point can be calculated as -β_{age} / 2β_{agesqr}. This shows that the function turns at 49.6 years of age.
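As a quick check of the turning-point arithmetic, the following Python sketch recomputes it. The agesqr magnitude (0.029638) is the value quoted in the text; the age coefficient of 2.94 is an assumed value chosen to be consistent with the reported turning point of 49.6 years, since the full coefficient table is not reproduced here.

```python
# Turning point of a quadratic age profile: -b_age / (2 * b_agesq).
b_age = 2.94         # assumed, derived from the reported turning point
b_agesq = -0.029638  # magnitude quoted in the text; negative: inverse U-shape

turning_point = -b_age / (2 * b_agesq)
print(round(turning_point, 1))  # → 49.6
```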

The best way to examine the relationship between *wage* and *age* is to plot predicted wage by age. This can be done in SPSS, Stata or Excel. We have to eliminate the control variables by assuming a set of values, normally their means or zero. Since both edyears and female have meaningful zero points, let us choose the latter option. In SPSS, we can now create a new variable for the predicted wage by age for men with only compulsory education in this way:

We would obtain better precision by using more decimals for the square terms (0.029638).

Next, in SPSS, use graph scatter to create the following plot. Both Stata and especially Excel are more flexible than SPSS for creating graphs. Both SPSS and Stata will compute the predicted value for all respondents, not only those that were used to estimate the wage equation. In the following graph, listwise exclusion from the regression analysis is applied, and age is cut at 67.

Figure 1.3. Testing for non-linearity - SPSS output

The coefficient of age squared is clearly statistically significant and indicates that the relationship between age and wage is not linear. Note that the coefficients of *edyears* and *female* are only slightly changed, while the age coefficient and the constant are dramatically different as a result of *agesqr* being added. These changes are quite normal: the dramatic increase in the age coefficient is the result of adding the highly correlated age squared variable. This correlation is high but not perfect, so the no-multicollinearity assumption is not violated, and since both age variables have statistically significant coefficients, it is not normally seen as problematic. Adding the squared term means that the two age coefficients cannot be interpreted separately. The signs of the coefficients reveal the rough form of the relationship: the positive coefficient for age and the negative one for age squared could indicate an increasing function of wage by age until a turning point is reached, after which point the function starts to decrease. The turning point can be calculated as -β_{age} / 2β_{agesqr}. This shows that the function turns at 49.6 years of age.

The best way to examine the relationship between wage and age is to plot predicted wage by age. This can be done in Stata, SPSS or Excel. We have to eliminate the control variables by assuming a set of values, normally their means or zero. Since both *edyears* and *female* have meaningful zero points, let us choose the latter option. In Stata, we can now create a new variable for the predicted wage by age for men with only compulsory education as follows:

We could obtain better precision by using more decimals for the square terms (0.029638).

Next, use the Graphics menu to define the line graph of pwage to age. The command and the graph follow below:

#### Additional exercise:

Does the linearity assumption hold for years of education?

# Adding interaction terms to OLS regression models

Do men and women profit equally from an added year of education? This question can be answered by adding the education by gender interaction term to the model.

#### SPSS solution
The interaction term is simply the product of the two variables, *female* and *edyears*. In SPSS, we can create a new variable called *edfem* as follows:

Let us add this term to the model and re-estimate:

Table 1.12. Regression analysis with interaction term - SPSS output
First, we see that the coefficient of the statistical interaction term is statistically significant. This means that the interaction effect should not be ignored. How should it be interpreted? The coefficient of the interaction term is the difference in the effect of education between women and men. The coefficient of *edyears* is no longer a general (main) effect, but the effect of education for men, i.e. when female=0. In other words, the marginal effect of adding one year of education is estimated to be 4.842 for men and 4.842 - 0.677 = 4.165 for women.
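The two marginal effects can be verified directly. In this Python sketch, the coefficients 4.842 and -0.677 are the values quoted above:

```python
# Marginal effect of one extra year of education in an interaction model:
# b_edyears + b_edfem * female, with coefficients quoted in the text.
b_edyears = 4.842  # effect of education for men (female = 0)
b_edfem = -0.677   # difference in the education effect, women vs men

def education_effect(female):
    return b_edyears + b_edfem * female

print(round(education_effect(0), 3))  # men → 4.842
print(round(education_effect(1), 3))  # women → 4.165
```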

To show this more clearly, it is best to work from the equation, replacing the x-symbols by variable names and *edfem* by its components.

Y_{i} = b_{0} + b_{1}Edyears_{1i} + b_{2}Age_{2i} + b_{3}Agesqr_{3i} + b_{4}Female + b_{5}Edyears*Female + e_{i}

Now, the effects of education and female cannot be interpreted separately. *Edyears* appears in two terms with the coefficients b_{1} and b_{5}. Let us create two new equations, one for men and one for women, where we substitute the genders with their codes.

Men:

Y_{i} = b_{0} + b_{1}Edyears_{1i} + b_{2}Age_{2i} + b_{3}Agesqr_{3i} + b_{4}*0 + b_{5}Edyears*0 + e_{i}

Women:

Y_{i} = b_{0} + b_{1}Edyears_{1i} + b_{2}Age_{2i} + b_{3}Agesqr_{3i} + b_{4}*1 + b_{5}Edyears*1 + e_{i}

The actual regression coefficient for years of education is now (b_{1} + b_{5}Female). For men this reduces to b_{1}, and for women, the coefficient of years of education is b_{1}+b_{5}.

#### Stata solution

The interaction term is simply the product of the two variables, *female* and *edyears*. In Stata, we can create a new variable called *edfem* as follows:

Let us add this term to the model and re-estimate:

First, we see that the coefficient of the statistical interaction term is statistically significant at the 0.05 level. This means that the interaction effect should not be ignored. How should it be interpreted? The coefficient of the interaction term is the difference in the effect of education between women and men. The coefficient of *edyears* is no longer a general (main) effect, but the effect of education for men, i.e. when female=0. In other words, the marginal effect of adding one year of education is estimated to be 4.842 for men and 4.842 - 0.677 = 4.165 for women.

To show this more clearly, it is best to work from the equation, replacing the x-symbols by variable names and *edfem* by its components:

Y_{i} = b_{0} + b_{1}Edyears_{1i} + b_{2}Age_{2i} + b_{3}Agesqr_{3i} + b_{4}Female + b_{5}Edyears*Female + e_{i}

Now, the effects of education and female cannot be interpreted separately. Edyears appears in two terms with the coefficients b_{1} and b_{5}. Let us create two new equations, one for men and one for women, where we substitute the genders with their codes.

Men:

Y_{i} = b_{0} + b_{1}Edyears_{1i} + b_{2}Age_{2i} + b_{3}Agesqr_{3i} + b_{4}*0 + b_{5}Edyears*0 + e_{i}

Women:

Y_{i} = b_{0} + b_{1}Edyears_{1i} + b_{2}Age_{2i} + b_{3}Agesqr_{3i} + b_{4}*1 + b_{5}Edyears*1 + e_{i}

The actual regression coefficient or effect of years of education is now (b_{1} + b_{5}Female). For men this reduces to b_{1}, and for women, the coefficient of years of education is b_{1}+b_{5}.

### Final comments

As an additional exercise, you can redo the example using the natural logarithm of wage as the dependent variable. Note that the interpretation of the regression coefficients changes. The regression coefficient of education will now show the approximate proportional change in wages if one year of education is added.
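The proportional-change reading is an approximation that works well for small coefficients; the exact relative change implied by a log-wage coefficient b is exp(b) - 1. A Python sketch with a hypothetical coefficient (not estimated from these data):

```python
import math

# A hypothetical log-wage education coefficient of 0.05 reads as roughly
# a 5 per cent wage increase per extra year; the exact figure is exp(b) - 1.
b = 0.05
approx = b               # approximate proportional change
exact = math.exp(b) - 1  # exact proportional change
print(round(approx, 4), round(exact, 4))  # → 0.05 0.0513
```

The gap between the two readings grows with the size of the coefficient, so for large effects the exact formula should be preferred.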

- [Eri92] Erikson, R. & J.H. Goldthorpe (1992): *The Constant Flux: A Study of Class Mobility in Industrial Societies*. Oxford: Clarendon Press.