
# Chapter 3: Non-linear associations

In the previous chapter, we assumed that associations between variables could be described by a straight line (or a linear function). The assumption was not unwarranted. The Norwegian mean education length seemed to have been rising at a relatively constant long-term rate. Thus, it made sense to represent it as a linear function of people’s birth year.

The Swedish sample, however, does not fit quite so well with the linear model. As shown in Figure 8, the line that indicates the cohorts’ mean education lengths tends to be above the linear regression line for those born between 1940 and 1960, and below the line for the rest. Thus, the growth in mean education length seems to have slowed over time. The fit of the regression line is better than in the Norwegian case (R^{2} is 0.2), but in the Swedish case the fit can be made even better by fitting a curved regression line instead of a straight one. Some types of curved lines can be fitted with the OLS method. These regression lines are called curvilinear, and one such type, the quadratic regression line, is particularly popular because of its relative flexibility: it accommodates a wide range of different curve shapes.

Figure 8. Scatterplot with linear regression line. Swedish ESS data

# The quadratic regression line

A quadratic regression line can be expressed by the following function:

ŷ_{i} = a + b_{1}∙x_{i} + b_{2}∙x_{i}^{2}

Some of you may recognise this as a second-degree polynomial function. Note that the independent variable x appears in two different additive terms on the right-hand side of the equals sign, first in its original form x_{i} and then as x_{i}^{2}, which is shorthand for x_{i}∙x_{i} (i.e. x_{i} multiplied by itself).

There are two different b coefficients here, b_{1} and b_{2}. Both must be computed by the regression analysis program. This is no problem as long as the factors we multiply the coefficients by (here x_{i} and x_{i}^{2}) are not too strongly correlated with each other. Unfortunately, the original year of birth variable and its squared version are too strongly correlated, but this problem can be reduced if we subtract 1900 from the birth year of every person in the dataset, thus making birth year a two-digit variable with values starting at 5 instead of 1905. This has already been done for you in the specially prepared ESS EduNet regression module data set. You must do it yourself if you analyse other ESS data sets.
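The effect of subtracting 1900 can be checked with a small sketch. It is written in Python rather than SPSS, and it assumes a made-up set of birth years (one person per year from 1905 to 1974) instead of the ESS data; the function name `pearson` is our own. It computes the correlation between birth year and squared birth year, first for four-digit and then for two-digit years.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical birth years (an assumption, not the ESS sample).
four_digit = list(range(1905, 1975))
two_digit = [year - 1900 for year in four_digit]

# Correlation between the variable and its square, before and after
# subtracting 1900.
r_four = pearson(four_digit, [x * x for x in four_digit])
r_two = pearson(two_digit, [x * x for x in two_digit])

print(f"four-digit years: r = {r_four:.6f}")
print(f"two-digit years:  r = {r_two:.6f}")
```

With these made-up years the second correlation is lower than the first, although it remains high: the subtraction reduces the problem without removing it.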

# Using SPSS to carry out a quadratic regression analysis

You can follow the instructions below, or use the SPSS syntax:

*Syntax for example in chapter 3. Copy this syntax, paste it into a syntax window and run it.

*The following command causes the cases to be weighted by the design weight variable 'dweight'.

*The following commands cause SPSS to select for analysis those cases that belong to the Swedish sample (the cases whose value on the country variable is SE) and have lower values than 1975 on the birth year variable (& stands for AND, while < stands for 'less than').

*In this process, the commands create a filter variable (filter_$) with value 1 for the selected cases and value 0 for the non-selected cases.

*The values of the variable created by the following commands are the squared values of the two-digit 'birthyear' variable.

*These commands instruct SPSS to run a blockwise regression analysis with the variable 'birthyear' as independent variable in the initial model and to add the variable 'sqbirthyear' as a second independent variable in an expanded model.

Proceed as follows:

- Open the data set ‘Regression’ that you have downloaded from Nesstar WebView.
- Use the design weight to weight the cases.
- Select the Swedish cases. Remember to include only those respondents who were born before 1975.
- Compute the square of the two-digit birth year variable by first selecting ‘Compute’ from the ‘Transform’ menu. Then type a name for the variable in the ‘Target variable’ field (for example ‘sqbirthyear’) and an explanatory label in the field that pops up when you click ‘Type & Label’ (for example ‘Squared two-digit year of birth’). Finally, put the function to be used in the ‘Numeric expression’ field (birthyear * birthyear). Click ‘OK’. Information about the new variable will be added as a new row at the bottom of the Data Editor’s ‘Variable View’ window. Go to ‘Data View’ by double-clicking the new row’s first (shaded) cell if you want to view the new variable’s values.
- Finally, trick SPSS into performing a curvilinear regression analysis on the regression function y_{i} = a + b_{1}∙x_{i} + b_{2}∙x_{i}^{2} + e_{i} by adding the squared two-digit birth year as another independent variable beside the two-digit birth year variable in its original form. The technical procedure is essentially the same as before:
  - Click ‘Regression’ and ‘Linear’ on the ‘Analyze’ menu. Put the name of the education length variable in the ‘Dependent’ field and the name of the two-digit birth year variable in the ‘Independents’ field. From here, we could proceed by adding the name of the squared two-digit birth year variable in the same field as the name of the two-digit variable, but in order to demonstrate the difference between a linear and a curvilinear regression model, we click ‘Next’, put the name of the squared variable in the blank field (Figure 9) and finish with ‘OK’.
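For readers who want to see what the blockwise procedure computes, here is a minimal sketch in Python (not SPSS). It assumes a handful of made-up birth years and education lengths rather than the ESS data, ignores the design weight, and uses helper names of our own (`solve`, `ols`, `r_squared`). It fits the linear model first and then the model with the squared term added, using ordinary least squares in both cases.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def ols(columns, y):
    """Return OLS coefficients [a, b_1, ...] via the normal equations."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(r[p] * r[q] for r in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(r[p] * yi for r, yi in zip(design, y)) for p in range(k)]
    return solve(xtx, xty)


def r_squared(coefs, columns, y):
    """Share of the variance in y accounted for by the fitted model."""
    fitted = [coefs[0] + sum(b * col[i] for b, col in zip(coefs[1:], columns))
              for i in range(len(y))]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up illustration data (an assumption, not the Swedish ESS sample).
birthyear = [5, 15, 25, 35, 45, 55, 65, 74]
education = [6.0, 7.6, 9.0, 10.2, 11.2, 12.0, 12.6, 13.0]
sqbirthyear = [x * x for x in birthyear]

# Block 1: linear model. Block 2: squared term added.
model1 = ols([birthyear], education)
model2 = ols([birthyear, sqbirthyear], education)
r2_model1 = r_squared(model1, [birthyear], education)
r2_model2 = r_squared(model2, [birthyear, sqbirthyear], education)

print("Model 1 (a, b):", model1, "R2:", r2_model1)
print("Model 2 (a, b1, b2):", model2, "R2:", r2_model2)
```

The squared variable is treated exactly like any other independent variable, which is why a program for linear regression can be used to fit a curved line.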

Figure 9. Running a quadratic regression analysis blockwise

By using the ‘Next’ option, we have made SPSS compute coefficients for two different models: first, a_{2} and b_{3} of the model (function) y_{i} = a_{2} + b_{3}∙x_{i} + e_{i}, and second, a_{1}, b_{1}, and b_{2} of the model y_{i} = a_{1} + b_{1}∙x_{i} + b_{2}∙x_{i}^{2} + e_{i}. (We use different coefficient subscripts in the two functions because the coefficient values are different.) The resulting coefficients are presented in Table 3.

Table 3. SPSS output: Blockwise quadratic regression coefficients

The constant value (the a_{2}) of model 1 is very different from the one we estimated for Norway in example 2, see Table 1. The reason is that the zero point of the birth year variable now corresponds to year 1900 rather than year 0. Thus, those who dare extrapolate may interpret the constant as the predicted education length of a person born in year 1900.
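The effect of moving the zero point can be written out for the linear model. In this sketch, x_{i} stands for the four-digit birth year and x'_{i} for the two-digit version (the prime notation is ours, not the module's); a and b are the constant and slope obtained with the four-digit variable.

```latex
\begin{aligned}
\hat{y}_{i} &= a + b \cdot x_{i}, \qquad x_{i} = x'_{i} + 1900 \\
            &= a + b \cdot (x'_{i} + 1900) \\
            &= (a + 1900 \cdot b) + b \cdot x'_{i}
\end{aligned}
```

The slope is unchanged, while the constant becomes a + 1900∙b, which is the predicted value for x'_{i} = 0, i.e. for a person born in 1900.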

The b coefficient of the linear Model 1 is of the same order of magnitude as the Norwegian b coefficient (0.108 versus 0.097). As expected, however, the analysis indicates that the linear model is not the best choice in the Swedish case. The b_{2} coefficient of the quadratic Model 2 is small in absolute value (-0.001), but it is large enough to have a discernible impact on the regression curve. This can be seen from Figure 10, where the regression line (based on the Model 2 coefficients) clearly rises at a decreasing rate as the birth year value increases. Figure 10 seems to corroborate our expectation that a quadratic regression line would follow the cohort mean education length more closely than the straight regression line in Figure 8 does. The procedures used to create Figure 10 have been described in chapter 1. The only difference is that one must tick ‘Quadratic’ instead of ‘Linear’ in the last dialogue box of the ‘Chart Editor’ to get the curved regression line of Figure 10.
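Why a small negative b_{2} makes the curve rise at a decreasing rate can be seen from the slope of the quadratic function, sketched here with the coefficient names of Model 2:

```latex
\frac{d\hat{y}}{dx} = b_{1} + 2 \cdot b_{2} \cdot x
```

With a negative b_{2}, the slope shrinks as x grows. Taking the rounded value b_{2} = -0.001 at face value, the slope falls by 0.002 for each additional birth year, so over 50 birth cohorts it falls by about 0.1 years of education per birth year.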

Figure 10. Scatterplot with quadratic regression line. Swedish ESS data

Finally, Table 4 reveals an (admittedly small) increase in R^{2} from Model 1 to Model 2, which indicates that the latter model fits the data somewhat better than the former. See more about this in the next chapter.

Table 4. SPSS output: Blockwise quadratic regression goodness of fit statistics

Regression analyses based on the function type y_{i} = a_{1} + b_{1}∙x_{i} + b_{2}∙x_{i}^{2} + e_{i} can produce regression lines of many shapes. The shapes vary with the signs and values of the computed coefficients. Figures that illustrate the effects of coefficient sign changes can be found below.

Here are some examples that demonstrate the curve shapes that can be created by means of quadratic functions. For instance, Figure A2 presents the curve that corresponds to the function y = 6 - 2∙x + 0.5∙x^{2}. It descends towards the right for low values of x. The reason is that, for these values, the negative term - 2∙x dominates over the positive term + 0.5∙x^{2}. For higher values of x, however, the positive term dominates over the negative one, because the squared x grows much faster than x itself and, consequently, compensates for the smaller absolute value of b_{2} (0.5) compared with b_{1} (-2). (While x = 1 implies x^{2} = 1, x = 2 implies x^{2} = 4, and so forth.)
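The interplay between the two terms can be tabulated with a few lines of Python. This is only an illustrative sketch: the function name `quadratic` and the chosen range of x values are our own, while the coefficients are those of the function y = 6 - 2∙x + 0.5∙x^{2}.

```python
def quadratic(x, a=6.0, b1=-2.0, b2=0.5):
    """Value of y = a + b1*x + b2*x^2 for the example coefficients."""
    return a + b1 * x + b2 * x * x


# Tabulate the linear term, the squared term and y for x = 0, 1, ..., 6.
values = {}
for x in range(7):
    linear_term = -2.0 * x
    squared_term = 0.5 * x * x
    values[x] = quadratic(x)
    print(f"x={x}  -2x={linear_term:6.1f}  0.5x^2={squared_term:6.1f}  "
          f"y={values[x]:6.1f}")

# The curve turns where the slope b1 + 2*b2*x equals zero.
turning_point = -(-2.0) / (2 * 0.5)
print("turning point at x =", turning_point)
```

The y values fall until x = 2 and rise afterwards, which is the point where the growing squared term starts to outweigh the linear term.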


## Exercise

Compute the squared ‘General subjective health’ variable and add it to the model you used in the exercise in chapter 2. What do the results tell you? Make a chart with a quadratic line by following the steps described in chapter 1, but tick ‘Quadratic’ instead of ‘Linear’ in the final dialogue box while in the ‘Chart editor’.