# Chapter 7: Regression based on samples from several countries

What if we wish to use data from several countries simultaneously in a multiple regression analysis? This may create problems because the ESS survey sample is stratified, with country as the stratifying variable, while the SPSS ordinary linear regression module presupposes a non-stratified, simple random sample. SPSS offers two extensions of linear regression analysis that may alleviate this problem: a module for complex survey analysis and a mixed models module that handles multilevel analysis. You may want to explore what these modules offer if you plan to do regression analysis on data from many countries. If you only use individual-level variables and data from a few countries, ordinary linear regression analysis may be an admissible option, but then you may have to take special precautions. You could, for example, weight the cases by the product of the ESS design weight and the ESS population size weight. (Use the ‘Compute’ feature in the ‘Transform’ menu to compute the product.) However, weighting the cases without making adequate adjustments to the standard error estimates may corrupt the statistical tests. Alternatively, to get more accurate statistical tests, you could skip weighting and enter country dummy variables as independent variables in your model.
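As a minimal illustration of the weight product mentioned above, the following Python sketch (not SPSS syntax) multiplies a design weight by a population size weight for each case. The variable name `pweight`, the name `combweight` and all values are assumptions made up for the example; only `dweight` is named in this chapter.

```python
# Hypothetical respondents; the weight values are invented for illustration.
# 'dweight' = design weight, 'pweight' = population size weight (assumed name).
respondents = [
    {"cntry": "NO", "dweight": 1.0, "pweight": 0.25},
    {"cntry": "GB", "dweight": 0.5, "pweight": 2.0},
    {"cntry": "PL", "dweight": 1.5, "pweight": 1.0},
]


def add_combined_weight(rows):
    """Attach the product of design weight and population size weight to each row."""
    for row in rows:
        row["combweight"] = row["dweight"] * row["pweight"]
    return rows


add_combined_weight(respondents)
for row in respondents:
    print(row["cntry"], row["combweight"])
```

The sketch only computes the weight; it does nothing about the standard error problem described above.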

But the latter solution presupposes that the associations between dependent and independent variables are constant across countries, which is frequently not the case. Therefore, the best ordinary linear regression solution might be to drop weighting and use regression models that allow regression slope coefficients to vary between countries. This can be achieved either by running separate regression analyses for each country (which brings us back to the solutions discussed in the previous chapters) or by supplementing our models with so-called interaction terms (which are computed by multiplying each country dummy by every other independent variable, or at least by every other variable whose association with the dependent variable varies substantially between countries). It goes without saying that the total number of terms and coefficients in such a model may become excessively high if we include many independent variables and countries. We therefore recommend that, if the number of countries and interaction terms proliferates, and, in particular, if you want to assess the weighted mean association between variables across a group of countries rather than country-specific associations, you should drop ordinary linear regression and use the ‘General linear model’ program under SPSS’s ‘Complex samples’ module instead. On the next page, we will demonstrate how you can perform regression analyses with interaction terms.
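To give a feel for how quickly such a model grows, here is a small Python sketch that counts the coefficients when every country dummy is multiplied by every other independent variable. The function name and the example figures (20 countries, 10 predictors) are hypothetical.

```python
def count_coefficients(n_countries, n_predictors):
    """Count terms in a model with country dummies and full country interactions.

    One country serves as the reference category, so there are
    n_countries - 1 dummies, each multiplied by every other predictor.
    """
    dummies = n_countries - 1
    interactions = dummies * n_predictors
    total = 1 + dummies + n_predictors + interactions  # 1 = intercept
    return {"dummies": dummies, "interactions": interactions, "total": total}


# Three countries and one predictor: intercept + 2 dummies + 1 slope + 2 interactions.
print(count_coefficients(3, 1))
# A hypothetical larger analysis.
print(count_coefficients(20, 10))
```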

# Interaction terms: Example

In the following example, we abstain from excesses and present a simple model in which we use data from Poland, Great Britain and Norway (which is the reference country). The focus is still on the association between year of birth and length of education, but now the model also includes the country dummies and the country * birth year interaction terms. The regression function can be expressed as follows:

y_{i} = a + b_{1}∙x_{GBi} + b_{2}∙x_{PLi} + b_{3}∙x_{Birthyear i} + b_{4}∙x_{GB * Birthyear i} + b_{5}∙x_{PL * Birthyear i} + e_{i}

where:

x_{Birthyear i} is person i’s birth year

x_{GBi} is a dummy variable with values Great Britain = 1, other countries = 0

x_{PLi} is a dummy variable with values Poland = 1, other countries = 0

x_{GB * Birthyear i} is the product of x_{GBi} and x_{Birthyear i} (an interaction term)

x_{PL * Birthyear i} is the product of x_{PLi} and x_{Birthyear i} (an interaction term)

For members of the British sample, the value of the first interaction term is identical to their birth year value, while, for Norwegians and Poles, its value is 0.

Similarly, for members of the Polish sample, the value of the second interaction term is identical to their birth year value, while it is 0 for others.

Those who belong to the Norwegian sample are assigned the value 0 on both dummy variables as well as on both interaction terms. Hence, for Norwegians the regression function reduces to y_{i} = a + b_{3}∙x_{Birthyear i} + e_{i}, and the estimate of the coefficient b_{3} is therefore an estimate of the association between birth year and education length among Norwegians. (In other words, it is an estimate of the slope of the regression line for the association between birth year and education length in the Norwegian population.)

For the British, the regression function reduces to:

y_{i} = a + b_{1}∙x_{GBi} + b_{3}∙x_{Birthyear i} + b_{4}∙x_{GB * Birthyear i} + e_{i}

And since, for the British, x_{Birthyear i} = x_{GB * Birthyear i}, we can express their function as follows:

y_{i} = a + b_{1}∙x_{GBi} + (b_{3} + b_{4})∙x_{Birthyear i} + e_{i}

Thus, the slope of the British regression line can be estimated by taking the sum of b_{3} and b_{4}, while the coefficient b_{4} is an estimate of the difference between the British and the Norwegian regression line slopes. Similarly, b_{5} is an estimate of the difference between the Polish and the Norwegian slopes. Finally, note that, because birth year is entered as a two-digit number, the value 0 corresponds to the year 1900. The a-coefficient is therefore an estimate of the mean education length of Norwegians who were born in the year 1900 (and a dubious one at that, since there are no 104-year-olds in the sample), while b_{1} could be seen as an estimate of the mean education difference between 104-year-old Britons and 104-year-old Norwegians. (What could b_{2} be seen as an estimate of?)
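The way the regression function reduces for each country can be checked numerically. The Python sketch below uses hypothetical coefficient values (they are not estimates from the ESS data) and confirms that the Norwegian slope equals b_{3} and the British slope equals b_{3} + b_{4}.

```python
# Hypothetical coefficient values chosen only for illustration.
coef = {"a": 8.0, "b1": 1.0, "b2": -2.0, "b3": 0.10, "b4": -0.04, "b5": 0.02}


def predict(country, birthyear, c=coef):
    """Predicted education length from the model with country * birth year terms."""
    gb = 1 if country == "GB" else 0  # British dummy
    pl = 1 if country == "PL" else 0  # Polish dummy
    return (c["a"] + c["b1"] * gb + c["b2"] * pl + c["b3"] * birthyear
            + c["b4"] * gb * birthyear   # GB * birth year interaction term
            + c["b5"] * pl * birthyear)  # PL * birth year interaction term


def slope(country):
    """Change in the prediction when birth year increases by one."""
    return predict(country, 1) - predict(country, 0)


for country in ("NO", "GB", "PL"):
    print(country, round(predict(country, 0), 3), round(slope(country), 3))
```

The first printed number for each country is the predicted value at birth year 0, which is where the intercept and the country dummy coefficients come into play.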

To perform a regression analysis based on this model, we must first compute the interaction term variables by multiplying the country dummy variables by the birth year variable. Just as in previous chapters, we use the ‘Compute’ feature in the ‘Transform’ menu to create the new variables. Start, for example, with the Great Britain * Year of birth interaction. Give the product of these two variables a name and a label. (In the example presented here, we gave it the label ‘Lives in Great Britain x two-digit year of birth’.) Next, instruct SPSS to compute this product. In the present example we did this by typing ‘Greatbritain * birthyear’ in the ‘Numeric Expression’ field, and clicking ‘OK’. (The asterisk * is the multiplication sign, ‘Greatbritain’ is the name of Great Britain’s country dummy variable, and ‘birthyear’ is the two-digit birth year variable’s name.) Follow the same steps to create the Poland * Year of birth interaction term.

Finally, use the same procedures that have been demonstrated in the previous chapters to run the regression analysis. Here, we have put the birth year variable and the two country dummy variables in the first ‘Independent(s)’ field, and the two interaction terms in the field that appears when we click ‘Next’, so that we can use the F Change statistic to test whether the model that includes the interaction terms fits the data better than the model that does not include these terms. Remember to tick ‘R squared change’ in the ‘Statistics’ dialogue box.

Syntax that performs these procedures

*The following command causes the cases to be weighted by the design weight variable 'dweight'.

*The following commands cause SPSS to select for analysis those cases that belong to the British, Polish or Norwegian sample (values GB, PL and NO on the country variable) and have lower values than 1975 on the birth year variable (& stands for AND, while | stands for OR, and < stands for 'less than'). *In this process, the commands create a filter variable (filter_$) with value 1 for the selected cases and value 0 for the non-selected cases. *Change the last part of line 2 (which starts after the first equals sign) if you wish to select other cases (if you do this, you should also change the variable label, which can be found within double quotation marks on line 3).

*Use this command to create a dummy variable that assigns value 1 to members of the Polish sample and 0 to the other selected cases.

*Computes a dummy variable that assigns the value 1 to members of the British sample and 0 to the rest.

*Compute interaction terms.

*Command to run regression with interaction terms.
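The logic of the F Change test used above can be sketched outside SPSS. The Python example below is a rough illustration on synthetic data (the country lines and the noise pattern are invented, and the helper names are hypothetical): it fits the model without interaction terms and the model with them by least squares, and compares their residual sums of squares.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def residual_ss(X, y):
    """Residual sum of squares of the least squares fit of y on the columns of X."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((v - sum(b * x for b, x in zip(beta, r))) ** 2 for r, v in zip(X, y))


# Synthetic data: invented intercepts and slopes, plus a fixed noise pattern.
true_lines = {"NO": (8.0, 0.10), "GB": (9.0, 0.06), "PL": (6.0, 0.12)}
X1, X2, y = [], [], []
i = 0
for country, (intercept, slope) in true_lines.items():
    gb = 1.0 if country == "GB" else 0.0
    pl = 1.0 if country == "PL" else 0.0
    for birthyear in range(20, 75, 5):  # two-digit birth years
        noise = ((i * 7) % 5 - 2) * 0.3
        X1.append([1.0, gb, pl, float(birthyear)])               # model 1
        X2.append(X1[-1] + [gb * birthyear, pl * birthyear])     # model 2
        y.append(intercept + slope * birthyear + noise)
        i += 1

rss1 = residual_ss(X1, y)
rss2 = residual_ss(X2, y)
n, k2, added = len(y), len(X2[0]), 2
# F Change: improvement per added term relative to model 2's residual variance.
f_change = ((rss1 - rss2) / added) / (rss2 / (n - k2))
print(round(rss1, 3), round(rss2, 3), round(f_change, 2))
```

The statistic has `added` and `n - k2` degrees of freedom, where `k2` counts all coefficients of the larger model including the intercept. The sketch ignores weighting and the survey design, so it only illustrates the comparison of nested models.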

# Interpreting results of regression with interaction terms: Example

Table 12 shows that adding interaction terms, and thus letting the model take account of the differences between the countries with respect to birth year effects on education length, increases the R^{2} value somewhat, and that the improvement in the model’s fit is statistically significant. Correspondingly, the model 2 part of Table 13 shows that both the Polish and the British associations between birth year and education length are significantly different from the Norwegian one at the 5% level. The estimates indicate that the Polish regression line is a notch steeper than the Norwegian one (0.097 + 0.02 = 0.117 as against 0.097), while the British line seems to be less steep (0.097 – 0.037 = 0.06), as can be seen from the negative sign of the estimate of the interaction term’s coefficient. However, only the slope of the British regression line is significantly different from the Norwegian slope at the 1% level. Table 13 also shows that the mean Polish education length starts at a lower level than the Norwegian one in the older cohorts. (The coefficient of the Polish country dummy is negative and significantly different from 0, which indicates that the Polish regression line crosses the dependent variable axis at a lower education length value than the Norwegian line does.)

Note that the model 1 estimate of the birth year coefficient (0.087) is a non-weighted mean of the three countries’ coefficients. To obtain an unbiased estimate of the mean coefficient, that is, an estimate that takes account of the countries’ population size differences, it is necessary to weight the cases with the combined population size / design weight. Using ordinary case weighting and regression analysis may produce better slope estimates (in the proximity of 0.078), but the statistical tests cannot be trusted. If correct statistical tests are an issue, you could use the ‘Complex samples’ procedure below. (Weighted least squares regression, with the population size / design weight as the weighting variable, would also be better than OLS regression on weighted cases.)

Complex samples procedure

*Computes weight variable to be used in analyses aimed at estimating mean values for groups of countries.

*The following commands cause SPSS to select for analysis those cases that belong to the British, Polish or Norwegian sample (values GB, PL and NO on the country variable) and have lower values than 1975 on the birth year variable. *In this process the commands create a filter variable (filter_$) with value 1 for the selected cases and value 0 for the non-selected cases. *Change the last part of line 2 (which starts after the first equals sign) if you wish to select other cases (if you do this, you should also change the variable label, which can be found within double quotation marks on line 3).

*Use this command to create a dummy variable that assigns value 1 to members of the Polish sample and 0 to the other selected cases.

*Creates a dummy variable that assigns the value 1 to members of the British sample and 0 to the rest.

*Preparation of Plan file for Complex Samples analysis of ESS data. *The PLAN FILE command tells SPSS to store the plan file in the root directory of drive C, and you must change the name of the directory if you want the file to be stored somewhere else. *If you are using the dataset downloaded from ESS EduNet, please ignore the warning 'This procedure does not check the consistency of the working data file with the plan file. We recommend looking at the output table or the plan file to check consistency before performing selection or analysis'.

*Runs Complex Samples General Linear Model. *Note: If the plan file is not stored in the root directory of drive C, please insert the correct directory name in the PLAN FILE command.
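Why weighting changes the pooled slope estimate can be illustrated with a deliberately simplified Python sketch. It assumes two hypothetical countries, A and B, with the same birth year values, the same intercept and different slopes, and it leaves out the country dummies. Under these assumptions the case-weighted pooled slope is the weighted mean of the two country slopes. All names and numbers are invented, and the sketch says nothing about standard errors, which is what the Complex Samples procedure is needed for.

```python
def weighted_slope(x, y, w):
    """Slope of a case-weighted least squares line of y on x."""
    total = sum(w)
    x_mean = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_mean = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - x_mean) * (yi - y_mean) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_mean) ** 2 for wi, xi in zip(w, x))
    return sxy / sxx


birthyears = [30, 40, 50, 60, 70]    # two-digit birth years
slopes = {"A": 0.06, "B": 0.12}      # hypothetical country slopes


def pooled_slope(weight_a, weight_b):
    """Pooled slope when each case in country A or B gets the given weight."""
    x, y, w = [], [], []
    for country, weight in (("A", weight_a), ("B", weight_b)):
        for by in birthyears:
            x.append(by)
            y.append(10.0 + slopes[country] * by)  # same intercept in both countries
            w.append(weight)
    return weighted_slope(x, y, w)


print(round(pooled_slope(1, 1), 4))  # equal weights
print(round(pooled_slope(3, 1), 4))  # country A counts three times as much
```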

Regression lines based on weighted model 1 estimates are presented in Figure 15, whereas Figure 16 demonstrates the more nuanced story told by the estimates obtained by using model 2, which includes interaction terms. Ultimately, it is up to you to decide whether the improved detail of the results presented in Figure 16, compared with those presented in Figure 15, is worth the effort and complexity of interpretation required if model 1 is replaced with model 2.

Note also that the use of interaction terms is not limited to cases in which the associations between dependent and independent variables vary between countries. In principle, it can be used in all cases where one variable’s association with the dependent variable varies with the value of another variable. The association between education and subsequent occupational career may, for instance, depend on people’s gender and social or ethnic background. In such cases, you can create interaction terms by multiplying education variables by gender or ethnicity variables etc.

# Exercise

First, compute a dummy variable set based on the nominal variable ‘Legal marital status’ (or use the set computed for the second exercise in chapter 5). You can use married as the reference category and thus skip making a dummy for those who are married. Secondly, if you do not already have one, compute a dummy-coded gender variable (women = 1, men = 0). Thirdly, create interaction terms by multiplying the gender dummy by all the ‘Legal marital status’ dummies. Finally, run a multiple regression analysis on the British sample, with gender, the marital status dummies and the interaction terms on the independents list and ‘Total hours normally worked per week in main job, overtime included’ as the dependent variable. Try to interpret the resulting coefficients and statistical tests. You should observe clearly significant interaction effects, with marital status having different associations with working hours among men than among women.