# Adding non-linearity to OLS regression models

The linearity assumption may conflict with theory and earlier research indicating a non-linear relationship between an explanatory variable (X) and the dependent variable (Y). There are two main ways to add non-linearity to an OLS model. The most common is to add the squared (quadratic) version of a continuous variable to the model. The second is to decompose the X-variable into a set of dummy variables. In our case, edyears and age are candidates for such treatment. In wage equations, age is expected to show a non-linear relationship to wage. This is normally modelled by adding age squared to the model. Together, age and age squared can describe a curved relationship with a single turning point. If the expected relationship is even more complex, it is possible to add a cubic term.
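
Written out for the wage example, the quadratic specification can be sketched as follows. The symbols $\beta_l$ and $\beta_q$ (the linear and quadratic age coefficients) are notation assumed here for illustration, and the dots stand for the remaining covariates in the model.

$$
\text{wage} = \beta_0 + \beta_l\,\text{age} + \beta_q\,\text{age}^2 + \dots + \varepsilon,
\qquad
\frac{\partial\,\text{wage}}{\partial\,\text{age}} = \beta_l + 2\beta_q\,\text{age}
$$

The effect of one more year of age therefore depends on age itself, which is why the two age coefficients have to be read together.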

The next task is to create age squared and re-estimate the model with both versions of age. Try to do this in SPSS or Stata. For simplicity’s sake, we drop the set of dummy variables for social class. Also create a graph of the relationship between wage and age.

SPSS solution:

The SPSS syntax is not repeated here, but the output should look like this:

Table 1.10. Regression analysis with two age variables - SPSS output

The coefficient of age squared is clearly statistically significant, which indicates that the relationship between age and wage is not linear. Note that the coefficients of edyears and female change only slightly, while the age coefficient and the constant are dramatically different once agesqr is added. Both changes are quite normal; the dramatic increase in the age coefficient results from adding the highly correlated age squared variable. Including both terms introduces strong multicollinearity, but since both age variables have statistically significant coefficients, this is not normally seen as problematic. Adding the squared term means that the two age coefficients cannot be interpreted separately. Their signs reveal the rough form of the relationship: the positive coefficient for age and the negative one for age squared indicate that wage increases with age until a turning point is reached, after which it starts to decrease. The turning point can be calculated as $-\beta_l / (2\beta_q)$, where $\beta_l$ is the coefficient of age and $\beta_q$ the coefficient of age squared. With the rounded coefficients 2.978 and -0.030, the function turns at 49.6 years of age.
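
The turning-point calculation can be checked with a few lines of code. The following is a minimal Python sketch, not part of the SPSS or Stata solutions; the function name `turning_point` is a hypothetical helper, and the coefficients are the ones quoted in this section.

```python
# Turning point of wage = b0 + b_age*age + b_agesqr*age^2, i.e. the age at
# which the derivative b_age + 2*b_agesqr*age equals zero.

def turning_point(b_age: float, b_agesqr: float) -> float:
    """Return the age at which the quadratic age profile turns."""
    if b_agesqr == 0:
        raise ValueError("no turning point: the squared term is zero")
    return -b_age / (2 * b_agesqr)

# Coefficients as quoted in the text, with the squared term rounded ...
rounded = turning_point(2.978, -0.030)
# ... and with the extra decimals the text gives for the squared term.
precise = turning_point(2.978, -0.029638)

print(f"rounded squared term:   {rounded:.1f}")
print(f"more decimals (agesqr): {precise:.1f}")
```

With the rounded squared term this reproduces the 49.6 years reported above. With the extra decimals the result moves slightly above 50, so the turning point is sensitive to rounding of the squared coefficient.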

The best way to examine the relationship between wage and age is to plot predicted wage by age. This can be done in SPSS, Stata or Excel. We have to hold the control variables constant at a chosen set of values, normally their means or zero. Since both edyears and female have meaningful zero points, let us choose the latter option. In SPSS, we can now create a new variable for the predicted wage by age for men with only compulsory education in this way:

Compute pwage = 19.374 + 2.978*age+(-0.030)*agesqr.

We would obtain better precision by using more decimals for the squared term (0.029638 rather than 0.030).
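
To see how much the rounding matters, the predicted age profile can be evaluated with both versions of the squared coefficient. This is a rough Python sketch under the same assumptions as the formula above (edyears = 0, female = 0); the names `predicted_wage` and `gap` and the age grid are illustrative choices, not part of the original analysis.

```python
# Predicted wage by age for the reference group (edyears = 0, female = 0),
# using the constant and age coefficient quoted in the text.
CONSTANT = 19.374
B_AGE = 2.978

def predicted_wage(age: float, b_agesqr: float) -> float:
    """Evaluate the fitted quadratic age profile at a given age."""
    return CONSTANT + B_AGE * age + b_agesqr * age ** 2

ages = range(20, 68)  # hypothetical age grid, capped at 67 as in the graph

# Difference between the prediction with more decimals and the rounded one.
gap = {a: predicted_wage(a, -0.029638) - predicted_wage(a, -0.030) for a in ages}

for a in (20, 40, 67):
    print(a, round(predicted_wage(a, -0.029638), 2), round(gap[a], 2))
```

The gap grows with the square of age and reaches roughly 1.6 (in the units of the wage variable) at age 67, so the rounding is most visible at the upper end of the curve.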

Next, in SPSS, use graph scatter to create the following plot. Stata, and especially Excel, are more flexible than SPSS when it comes to creating graphs. Both SPSS and Stata will compute the predicted value for all respondents, not only for those used to estimate the wage equation. In the following graph, listwise exclusion from the regression analysis is applied, and age is cut at 67.

Figure 1.3. Testing for non-linearity - SPSS output

Stata solution:

Table 1.11. Regression analysis with two age variables - Stata output

The coefficient of age squared is clearly statistically significant, which indicates that the relationship between age and wage is not linear. Note that the coefficients of edyears and female change only slightly, while the age coefficient and the constant are dramatically different once agesqr is added. Both changes are quite normal; the dramatic increase in the age coefficient results from adding the highly correlated age squared variable. Including both terms introduces strong multicollinearity, but since both age variables have statistically significant coefficients, this is not normally seen as problematic. Adding the squared term means that the two age coefficients cannot be interpreted separately. Their signs reveal the rough form of the relationship: the positive coefficient for age and the negative one for age squared indicate that wage increases with age until a turning point is reached, after which it starts to decrease. The turning point can be calculated as $-\beta_l / (2\beta_q)$, where $\beta_l$ is the coefficient of age and $\beta_q$ the coefficient of age squared. With the rounded coefficients 2.978 and -0.030, the function turns at 49.6 years of age.
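
How strongly age and age squared are correlated can be illustrated with a small calculation. The sketch below uses Python and a purely hypothetical sample with one respondent at each age from 18 to 67; it does not use the survey data behind the tables, and `pearson` is a helper written for this illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

ages = list(range(18, 68))        # hypothetical: one respondent per age, 18-67
agesqr = [a ** 2 for a in ages]   # the squared term added to the model

r = pearson(ages, agesqr)
print(f"corr(age, agesqr) = {r:.3f}")
```

In this artificial sample the correlation is very close to 1, which is the kind of overlap between the two age terms that makes their separate coefficients unstable and hard to interpret one at a time.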

The best way to examine the relationship between wage and age is to plot predicted wage by age. This can be done in Stata, SPSS or Excel. We have to hold the control variables constant at a chosen set of values, normally their means or zero. Since both edyears and female have meaningful zero points, let us set them to zero. In Stata, we can now create a new variable for the predicted wage by age for men with only compulsory education as follows:

generate pwage = 19.374 + 2.978*age+(-0.030)*agesqr

We could obtain better precision by using more decimals for the squared term (0.029638 rather than 0.030).

Next, use the Graphics menu to define a line graph of pwage by age. The command and the graph follow below:

. twoway (line pwage age, sort)

Figure 1.4. Testing for non-linearity - Stata output