
# Chapter 6: Multiple linear regression analysis

Multiple regression is regression analysis with more than one independent variable. We used multiple linear regression in a technical sense in the previous chapter when we split the country variable into two dummy variables. However, these dummies were derived from one single source variable. In this chapter, we demonstrate regression with multiple independent variables that are not derived from a single source variable.

# Why include more than one independent variable?

Dependent variables are almost always associated with more than one other variable. Some of these other variables will normally be associated with each other, which means that they have some of their association with the dependent variable in common. Unless we include them all in the regression analysis, we cannot distinguish this common part of the association from the association that is ‘unique’ to each. For instance, we may be interested in the association between gender and occupational status. But besides being associated with gender, occupational status is also associated with educational level, professional experience and so on, which, in turn, may be associated with gender. To find out how much of the association between gender and occupational status occurs independently of gender differences in education and experience, we need to include educational level and professional experience as independent variables in our analysis. (This makes it possible to compare men and women with equal education and experience.) The inclusion of other variables is particularly important if we aim to study causal effects of independent variables. Not including relevant independent variables increases the risk of overlooking causal associations or of confounding them with non-causal ones. It also keeps the unexplained dependent variable variance high and the reliability of our coefficient estimates low.

We introduce multiple regression with an example based on Polish data. The task is to estimate how year of birth and the respondents’ parents’ education levels are associated with the respondents’ length of education. The parents’ education variables are ordinal. We could, however, follow the not uncommon practice of treating such variables as metric. The alternative strategy, which allows us to keep persons whose parents’ education is unknown in the active dataset, is to treat the variables as nominal and recode them into sets of dummy variables. There is a chance that this will improve the model’s fit with the data, and it gives us a larger and perhaps more representative dataset.

But these nice features come at the cost of having to estimate more coefficients, which may decrease the accuracy of the coefficient estimates we wanted to compute in the first place. However, this accuracy also depends on the fit of our model and on the number of persons in our active dataset, and, since the ESS single-country datasets are quite large, we would normally be willing to estimate a few dummy variable sets with a limited number of variables in each, rather than treating ordinal variables as metric variables. In particular, we would do this if it substantially increased the fit of our model. Unfortunately, there is no statistical test that can tell us whether a model based on a dummy variable set is better than a model based on the original ordinal variable. You can use the dummy variable solution if you are able to keep the number of dummy variables small, but you can also choose the ordinal variable model if its adjusted R^{2} is higher than the dummy-based model’s adjusted R^{2} or if the merits of the latter are too small to make up for its shortcomings. (In this case, the adjusted R^{2}, which is reported alongside the R^{2} and which takes account of the number of coefficients included in the model, is better suited for comparing models’ fits than the unadjusted R^{2} is.)
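To make the comparison concrete, the following minimal Python sketch computes the adjusted R^{2} with the standard formula 1 − (1 − R^{2})(n − 1)/(n − k − 1), where n is the number of cases and k the number of independent variables. The function name `adjusted_r2` and all numbers in the example are made up for illustration; they are not taken from the Polish analysis.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with k independent variables and n cases."""
    if n - k - 1 <= 0:
        raise ValueError("need more cases than estimated coefficients")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical comparison (invented figures, not results from this chapter):
# an ordinal-as-metric model with 3 independent variables versus
# a dummy-based model with 15 independent variables.
n_cases = 1500
adj_ordinal = adjusted_r2(0.300, n_cases, 3)
adj_dummy = adjusted_r2(0.310, n_cases, 15)

# The dummy model is preferred on this criterion only if its adjusted
# R-squared is higher despite the 12 extra coefficients.
print(round(adj_ordinal, 4), round(adj_dummy, 4))
```

With a large sample, the penalty for a handful of extra dummy variables is small, which is why the text is willing to accept dummy sets for ESS single-country data.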

# Multiple regression with dummy variables

Now, let us look at the dummy variable solution. The regression function has the same general form as the one we saw in chapter 5. It is additive, with a long series of terms joined by plus signs lined up on the right-hand side as follows:

y_{i} = a + b_{1}∙x_{1i} + b_{2}∙x_{2i} + ……+ b_{k}∙x_{ki} + e_{i}

(The terms between term number 2 and term number k have been replaced by dots.)

A fully specified version of the function used in the analysis presented in this chapter, plus an overview of variables and value codes, can be found below:

Appendix. The regression function used in chapter 6

The regression function of the analysis presented in chapter 6 can be expressed as follows:

y_{i} = a + b_{1}∙x_{Birthyear i} + b_{2}∙x_{Fathered 1 i} + b_{3}∙x_{Fathered 2 i} + b_{4}∙x_{Fathered 4 i} + b_{5}∙x_{Fathered 5 i} + b_{6}∙x_{Fathered 6 i} + b_{7}∙x_{Fathered 7 i} + b_{8}∙x_{Fathered 8 i} + b_{9}∙x_{Mothered 1 i} + b_{10}∙x_{Mothered 2 i} + b_{11}∙x_{Mothered 4 i} + b_{12}∙x_{Mothered 5 i} + b_{13}∙x_{Mothered 6 i} + b_{14}∙x_{Mothered 7 i} + b_{15}∙x_{Mothered 8 i} + e_{i}

where

- y_{i} is person i’s education length
- x_{Birthyear i} is person i’s birth year

The father’s education reference category is ‘Lower secondary or second stage of basic’. The following variables are the father’s education dummy variable set:

- x_{Fathered 1 i} has value 1 if i’s father has not completed an education, and value 0 if he has
- x_{Fathered 2 i} has value 1 if i’s father’s highest education is primary or first stage of basic, and value 0 if it is not
- x_{Fathered 4 i} has value 1 if i’s father’s highest education is upper secondary, and value 0 if it is not
- x_{Fathered 5 i} has value 1 if i’s father’s highest education is post-secondary, non-tertiary, and value 0 if it is not
- x_{Fathered 6 i} has value 1 if i’s father’s highest education is first stage of tertiary, and value 0 if it is not
- x_{Fathered 7 i} has value 1 if i’s father’s highest education is second stage of tertiary, and value 0 if it is not
- x_{Fathered 8 i} has value 1 if i’s father’s highest education is unknown, and value 0 if it is not

Similarly, the mother’s education reference category is ‘Lower secondary or second stage of basic’. The mother’s education dummy variable set has the same categories as the father’s education dummy variable set. For example:

- x_{Mothered 1 i} has value 1 if i’s mother has not completed an education, and value 0 if she has

And so forth.

Except for the constant and the residual, each of the terms in the function is a product of a regression coefficient and a variable. By choosing this additive form, we make the assumption that the ‘effect’ of one independent variable on the dependent variable is measured by the size of its own b-coefficient, and that this ‘effect’ is independent of the other variables and coefficients. The independent variables may still affect each other, but this does not preclude us from assuming that the effect of an independent variable on the dependent variable is unaffected by the other independent variables.
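The additivity assumption can be illustrated with a small Python sketch. It evaluates an additive regression function for hypothetical persons and shows that the predicted difference associated with one dummy variable equals its own b-coefficient, whatever values the other variables take. The constant, the coefficients, the variable names and the birth year codes below are all invented for illustration; they are not the estimates reported in Table 11.

```python
def predict(constant, coefficients, person):
    """Additive regression function: constant plus the sum of b * x terms."""
    return constant + sum(b * person.get(name, 0) for name, b in coefficients.items())


# Invented values, chosen only to keep the arithmetic easy to follow.
constant = 4.0
coefficients = {
    "birthyear": 0.125,
    "mother_upper_secondary": 1.5,
    "father_upper_secondary": 2.0,
}


def mother_effect(birthyear, father_dummy):
    """Predicted difference between mother dummy = 1 and mother dummy = 0."""
    base = {"birthyear": birthyear, "father_upper_secondary": father_dummy}
    with_mother = dict(base, mother_upper_secondary=1)
    without_mother = dict(base, mother_upper_secondary=0)
    return (predict(constant, coefficients, with_mother)
            - predict(constant, coefficients, without_mother))


# Two very different hypothetical persons give the same difference,
# namely the mother dummy's own coefficient (1.5).
effect_a = mother_effect(birthyear=30, father_dummy=0)
effect_b = mother_effect(birthyear=60, father_dummy=1)
print(effect_a, effect_b)
```

If we believed that the ‘effect’ of the mother’s education differed by birth cohort or by the father’s education, a purely additive function like this one could not express it.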

The independent variables in our model are listed in Table 11. Note that both the father’s education reference category and the mother’s education reference category are ‘Lower secondary or second stage of basic’.

In order to utilise the information provided by persons whose mother’s or father’s education is not known by creating valid dummy variables for those groups, we must redefine as non-missing the corresponding values (don’t know etc.) of the parents’ education variables. (Use the procedures proposed here, or the one used in the syntax file of the present analysis, which applies the VALUE function to create father’s and mother’s education variables with no ‘User-missing’ values. These variables are subsequently used to create the dummy variable sets.)

Syntax performing the steps described below

*The following command causes the cases to be weighted by the design weight variable 'dweight'.

*The following commands cause SPSS to select for analysis those cases that belong to the Polish sample (value PL on country variable) and have lower values than 1975 on the birth year variable (& stands for AND, < stands for 'less than').

*In this process, the commands create a filter variable (filter_$) with value 1 for the selected cases and value 0 for the non-selected cases.

*Computes new 'father's highest level of education' variable where user-missing values have been redefined as non-missing.

*The following commands make SPSS compute one dummy variable for each level of the respondents' fathers' highest education, including one dummy for those who have not supplied information about their father's education.

*Those whose fathers have a ‘Lower secondary or second stage of basic’ education will be used as a reference category.

*Hence, this dummy variable will not be included in the regression.

*Computes new 'mother's highest level of education' variable where user-missing values have been redefined as non-missing.

*The following commands make SPSS compute one dummy variable for each level of the respondents' mothers' highest education, including one dummy for those who have not supplied information about their mother's education.

*Those whose mothers have a ‘Lower secondary or second stage of basic’ education will be used as a reference category.

*Hence, this dummy variable will not be included in the regression.

*Deletes newly created variables that will not be needed any more.

*Executes regression analysis with birth year, a set of dummy variables indicating mother's highest education and a set of dummies indicating father's highest education entered as independent variables in three blocks.

If you want to develop your skills by preparing a replica of this analysis from scratch by means of the SPSS menus, you must start by creating the parents’ education dummy variable sets. You have to make one variable set for the mother’s education level and another set for the father’s education level. Proceed as suggested in chapter 5 where we explained how you could create a set of dummy variables. You must make one dummy variable for each of the mother’s education levels except for the reference category, which is ‘lower secondary or second stage of basic education’. The same applies to the father’s education. In addition, you should use the procedure explained above to redefine “unknown mother’s education” and “unknown father’s education” as non-missing values, and then create one dummy variable for each of these two categories as well.

Use the ‘Compute Variable’ module from the ‘Transform’ menu. Start, for instance, by creating the dummy variable that assigns the value 1 to those whose mother has no completed education and the value 0 to all other respondents. To achieve this you must open the dialogue box and give the new variable a name (e.g. ‘Mnoeducation’) and a label (e.g. ‘Mother has no education’). Then, on the right-hand side of the equals sign, type ANY(mothered,0). You must type ‘mothered’ before the comma in the ANY-function because ‘mothered’ is the name of the original mothers’ education variable, and you have to type 0 after the comma because those whose mother has no education have been assigned the value 0 on that variable. Thus, to create a dummy variable by means of the ANY-function, you have to put the name of the original nominal variable before the comma in that function, and the value of the relevant category on that variable after the comma.

Finish this step by clicking ‘ok’ and proceed by creating the dummy variable that assigns the value 1 to those whose mothers have a primary or first stage of basic education and the value 0 to all other respondents. Type Meduprimary = ANY(mothered,1), and insert a suitable variable label, e.g. 'Mother has primary or first stage of basic education'. You must put 1 after the comma in the ANY-function because 1 is the value on the ‘mothered’ variable of those whose mother has a primary or first stage of basic education. Use the same procedure to create the remaining 5 mother's education dummy variables. Remember to put the appropriate values after the comma in the ANY-function. Find out what the appropriate values are by right-clicking the variable label ‘Mother’s highest education’ in the dialogue box’s variable list. Continue by clicking ‘Variable information’. A box that contains information about value codes opens. The column on the left-hand side gives you the numerical values to be used in the ANY-function, while the column on the right-hand side gives you the corresponding value labels. Go on using the same procedures to create the father's education dummy variables.

You will soon discover that it is very time-consuming to create all the dummy variables by means of the SPSS menu system. Alternatively, you could create the first dummy variable in this way, paste the corresponding syntax to a syntax window by clicking ‘paste’ instead of ‘ok’, and then proceed by copying this syntax and pasting 6 copies of it beneath the original (one copy for each of the 6 remaining mother’s education dummy variables). Finish by editing the variable names, variable labels and ANY-functions of the 6 syntax copies according to the instructions given above, select the entire syntax (all seven commands) and run it by clicking the blue arrow on the toolbar. Use the same procedure to create the father’s education dummies.
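For readers who want to see the logic of the dummy coding outside SPSS, here is a minimal Python sketch of the same idea: one 0/1 variable per category, mirroring what ANY(mothered,value) returns, with the reference category left out. The function name `make_dummies`, the example codes and the use of code 2 for the reference category and code 8 for ‘unknown’ are assumptions made for illustration; check the value codes of your own data file before reusing them.

```python
def make_dummies(values, reference, prefix):
    """Return one 0/1 list per category found in `values`, except `reference`.

    Each dummy plays the role of ANY(variable, code) in SPSS:
    1 if the case has that code, 0 otherwise.
    """
    categories = sorted(set(values))
    if reference not in categories:
        raise ValueError("reference category does not occur in the data")
    return {
        f"{prefix}{code}": [1 if v == code else 0 for v in values]
        for code in categories
        if code != reference
    }


# Hypothetical mother's education codes for six respondents.
# 0 = no completed education, 1 = primary, 2 = lower secondary (reference),
# 3 = upper secondary, 8 = unknown (already redefined as a valid value).
mothered = [0, 1, 2, 3, 2, 8]
dummies = make_dummies(mothered, reference=2, prefix="M")

for name, column in dummies.items():
    print(name, column)
```

Respondents in the reference category have the value 0 on every dummy in the set, which is what makes them the comparison group in the regression.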

The rest is easy. You open the linear regression dialogue box, put the education length variable in the dependent variable field and the birth year variable in the independent variables field. Then you click ‘next’ and add all 7 mother’s education dummy variables. Finally, you click ‘next’ once more, add the father’s education dummy variables, tick the ‘R-squared change’ statistics option, and finish by clicking ‘ok’.

The results obtained from analysing the Polish sample are presented in Table 10 and Table 11. The independent variables are added to the model blockwise. The R^{2} values and statistical tests reported in Table 10 show that the explained dependent variable variance increases as we add new blocks of dummy variables. The largest increase occurs when the mother’s education variables are introduced. But note that this does not prove that the average person’s mother’s education has a larger effect on his or her education length than the father’s education. Switching the block sequence would cause the introduction of the father’s education dummies to raise the R^{2} value even more than the mother’s education dummies do here. The reason is that mothers with high education levels tend to be married to men with high education levels. Therefore, if the mother’s education dummies are associated with the offspring’s education length, the father’s education dummies will be associated with their education length as well, and, if we add one of the sets without adding the other, the former set can be expected to contribute two components to the ‘explanation’ of the variance of the dependent variable: One unique component and one component that it has in common with the other dummy set. This common component of the ‘explained’ variance cannot be counted twice, so the subsequent extension of the model by the second set of dummy variables will only increase the ‘explained’ part of the dependent variable’s variance by a component that is unique to that set.
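The point about block order can be checked on a toy example. The Python sketch below fits ordinary least squares by solving the normal equations and compares the R^{2} obtained from a mother dummy alone, a father dummy alone, and both together. The eight cases are invented so that the two dummies are positively associated with each other; they are not ESS data, and the single dummy per parent is a simplification of the dummy sets used in the chapter.

```python
def ols_r2(columns, y):
    """R-squared of an OLS regression of y on the given columns plus a constant."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations: (X'X | X'y). Assumes X'X is non-singular.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [u - factor * v for u, v in zip(A[r], A[col])]
    coefs = [A[r][p] / A[r][r] for r in range(p)]
    mean_y = sum(y) / n
    ss_res = sum((y[i] - sum(c * x for c, x in zip(coefs, X[i]))) ** 2
                 for i in range(n))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Invented toy data: highly educated mothers tend to have highly educated partners.
mother = [0, 0, 0, 0, 1, 1, 1, 1]
father = [0, 0, 0, 1, 0, 1, 1, 1]
years = [8, 9, 8, 11, 11, 13, 14, 13]

r2_mother = ols_r2([mother], years)
r2_father = ols_r2([father], years)
r2_both = ols_r2([mother, father], years)

# R-squared change for each dummy when it is entered as the second block.
gain_mother_second = r2_both - r2_father
gain_father_second = r2_both - r2_mother
print(round(r2_mother, 3), round(gain_mother_second, 3))
print(round(r2_father, 3), round(gain_father_second, 3))
```

In this toy data, whichever dummy is entered first is credited with both its unique component and the common component, so its R^{2} change is much larger than when it is entered second.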

The coefficients table (Table 11, Model 3) lends support to the idea that both the mother’s and father’s education levels have effects on their children’s education length. If we choose a 1% significance level, we can conclude that, on average, those whose mothers or fathers have no completed education tend to have a shorter education than those whose fathers or mothers have a lower secondary education. Similarly, we find that, on average, those whose mothers or fathers have an upper secondary education tend to have a longer education than those whose fathers or mothers have a lower secondary education, and so forth.

Table 10. SPSS output: Multiple regression goodness of fit statistics

It is also worth noting that the estimated slope of the regression line that describes the association between year of birth and education length decreases as new variables are added to the model. Model 1 gives an estimate of 0.117. When the mother’s education variables are added in Model 2, the estimate decreases to 0.08, and when we supplement these variables with the father’s education variables, we arrive at an estimate of 0.072. This decrease in the estimates is caused by the association between birth year and parents’ education levels plus the fact that there is a tendency for children to inherit their parents’ education levels. Younger parents tend to have higher education levels than older parents, and parents with high education levels tend to have children with high education levels. Thus, part of the association between birth year and education length is mediated through the offspring’s inheritance of their parents’ educational attainment opportunities, and when the parents’ education levels are entered as independent variables in the regression model, this part of the association is subtracted from the birth year variable’s regression coefficient and reappears as an integrated part of the parents’ education variables’ coefficients.
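The shift in the birth year coefficient can be written out using a standard identity for ordinary least squares, sketched here under the assumption that Model 1 and Model 3 are estimated on the same cases with the same weights. The superscripts (1) and (3) refer to the models, and d_j is introduced here for illustration: it denotes the slope obtained when parents' education dummy j is regressed on birth year.

```latex
% Birth year slope in Model 1 = slope in Model 3
% plus the part carried by the 14 parents' education dummies.
b^{(1)}_{\text{Birthyear}}
  = b^{(3)}_{\text{Birthyear}}
  + \sum_{j=1}^{14} b^{(3)}_{j}\, d_{j}

% With the estimates quoted in the text:
0.117 = 0.072 + \sum_{j=1}^{14} b^{(3)}_{j}\, d_{j}
\quad\Longrightarrow\quad
\sum_{j=1}^{14} b^{(3)}_{j}\, d_{j} = 0.045
```

Read this way, roughly 0.045 of the original 0.117 slope is the part that ‘reappears’ in the parents’ education coefficients once those variables are in the model.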

Similar things happen to the coefficients of the mother’s education variables when the model is extended by adding the father’s education variables. We can interpret this as a sign that some of the effects that the mother’s education appears to have on her child’s education (according to the coefficients estimated for Model 2) are actually attributable to the father’s education level.

Remember that the constant term is an estimate of the average education length of those who have value 0 on all independent variables. Here, people with such values are supposed to have been born in year 1900 (of whom there are none in the data set), and to have two parents who had both completed a lower secondary education.

Table 11. SPSS output: Multiple regression coefficients

## Exercise

Run multiple regression with ‘Total hours normally worked per week’ as the dependent variable and gender and various job characteristics (‘Current job’) as independent variables. Interpret the results. Which job characteristics are associated with what effects?