# Chapter 5: Nominal independent variables

How can we assess the association between a metric dependent variable and a nominal independent variable, i.e. an independent variable that has more than two qualitatively different values? One such variable is country of residence. The original coding of the country variable makes it unfit for use in regression analyses. Regression analysis requires numerical variable values, but this variable has strings of letters as value codes. This can be fixed by recoding into numerical values, but what numerical values should we choose? There is no meaningful way in which we can create a generally applicable numerical ranking of countries of residence. Unfortunately, that is what we have to create if we want to represent more than two different countries by one single variable in a regression analysis.

Rather than using one single variable, the solution is to recode the country variable into a set of dichotomous variables. In chapter 1, we saw that we can use gender, a dichotomous variable with no natural ranking of its two values, as an independent variable. We could use the same method to compare two countries: deselect the other countries, assign the value 1 to those who live in one of the two countries and 0 to those who live in the other, and, finally, use this dichotomous, numerical variable as an independent variable. If we are comparing three or more countries and want to see how living in one rather than another of them is associated with the average value of our dependent variable, we can extend this technique by creating a set of dichotomous variables with 1 and 0 values.

Dichotomous variables with 1 and 0 values are called ‘dummy variables’. Hence, the technique has been dubbed the dummy variable method. This is how we proceed: We choose one country as our ‘reference category’ and make one dummy variable for each of the other countries. A particular country’s dummy is coded as follows: Persons are assigned value 1 if they live in that country and value 0 if they do not. Thus, those who belong to our reference country sample are assigned the value 0 on all dummy variables.
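For readers who want to see the coding rule outside SPSS, the following is a minimal Python sketch of the dummy variable method described above. The function name `make_dummies` and the six-person toy sample are assumptions made for illustration only; they are not part of the ESS data or of SPSS.

```python
def make_dummies(values, reference):
    """Return one 0/1 list per category, skipping the reference category.

    values    -- list of category codes, one per person (e.g. country codes)
    reference -- the category that gets 0 on every dummy
    """
    categories = sorted(set(values) - {reference})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# Hypothetical toy sample: country of residence for six persons.
countries = ["NO", "GB", "PL", "GB", "NO", "PL"]
dummies = make_dummies(countries, reference="NO")

print(dummies["GB"])  # [0, 1, 0, 1, 0, 0]
print(dummies["PL"])  # [0, 0, 1, 0, 0, 1]
```

Note that no dummy is created for the reference country: its members are identified by having 0 on all the dummies.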

# Example, part one: Create dummy variables

If you choose to use the syntax, you should still read through the following text.

Syntax for example

* The following command causes the cases to be weighted by the design weight variable 'dweight'.

* The following commands cause SPSS to select for analysis those cases that belong to the British, Polish or Norwegian sample (values GB, PL and NO on the country variable) and have values lower than 1975 on the birth year variable.
* In this process, the commands create a filter variable (filter_$) with value 1 for the selected cases and value 0 for the non-selected cases.
* Change the last part of line 2 (which starts after the first equals sign) if you wish to select other cases (if you do this, you should also change the variable label, which can be found within double quotation marks on line 3).

* Use this command to create a dummy variable that assigns value 1 to members of the Polish sample and 0 to the other selected cases.

* Creates a dummy variable that assigns the value 1 to members of the British sample and 0 to the rest.

* Runs regression with the two dummy variables as independent variables and length of education as dependent variable.

Let us say that we wish to estimate the association between country and length of education and that we wish to use data from Great Britain, Poland and Norway. First, we must choose a reference country. This choice has some implications for what kind of information the analysis will produce. (It does not affect the assessment of whether or not there is an association between the variables but, of course, it affects what comparisons are made between individual countries.) You may, therefore, prefer to choose a reference country that is particularly suited as a basis with which to compare the other countries, but you may also prefer to choose a reference country that is represented by a relatively large sample, because this improves the accuracy of the regression coefficients as estimates of population coefficients.

There are no small country samples in the ESS data set, so you do not have to worry about sample sizes when comparing countries using ESS data, but other variables’ values may be distributed differently. Check how frequently the different values of a variable occur by going to ‘Descriptive statistics’ and ‘Frequencies’ on the ‘Analyze’ menu. Find the variables of interest on the variable list on the left, select them, put them in the ‘Variables’ field on the right and click ‘OK’. Frequency tables will appear in the output window. It is always advisable to check the value distributions of the different variables before you perform transformations on them or use them in regression analyses. Use histograms to check the distributions of metric variables.
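The same kind of frequency check can be sketched in Python. This is only an illustration of what a frequency table reports; the list `countries` is made-up toy data, not ESS output.

```python
from collections import Counter

# Hypothetical toy data: one country code per respondent.
countries = ["GB", "PL", "NO", "GB", "PL", "GB"]

freq = Counter(countries)  # counts how often each value occurs
n = len(countries)

# Print each value with its count and its share of all cases.
for value, count in sorted(freq.items()):
    print(f"{value}: {count} ({100 * count / n:.1f}%)")
```

A category with very few cases would show up here immediately, which is the point of checking before choosing a reference category.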

If some of the persons have missing values or ‘don’t know’ answers etc. on the nominal variable, you may also want to create a separate dummy variable for this category of persons, with value 1 for those who have missing values and value 0 for all the others. If you do not do this, you run the risk of including people with missing values in your reference category, which is probably not where you would want them to appear, or, alternatively, you will fail to utilise all your data, which is normally not a good idea if the number of missing values is large. (You may have to take special precautions to compute a dummy for persons with missing values.) Here, however, we present an example where there are no missing values.
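As a rough illustration of the idea, the sketch below adds a separate dummy for persons with missing values, so that they are not counted as members of the reference category. The function name, the choice of `None` and the empty string as missing codes, and the toy data are all assumptions; real data sets use their own missing value codes.

```python
def dummies_with_missing(values, reference, missing_codes=(None, "")):
    """Dummy-code a nominal variable and add a separate 'missing' dummy.

    Persons whose value is one of missing_codes get 0 on every category
    dummy and 1 on the 'missing' dummy, so they do not end up in the
    reference category. (Assumes no real category is called 'missing'.)
    """
    missing = set(missing_codes)
    categories = sorted(
        {v for v in values if v not in missing and v != reference}
    )
    out = {cat: [1 if v == cat else 0 for v in values] for cat in categories}
    out["missing"] = [1 if v in missing else 0 for v in values]
    return out


# Hypothetical marital status codes with two missing answers.
status = ["married", None, "single", "married", "", "divorced"]
coded = dummies_with_missing(status, reference="married")

print(coded["missing"])  # [0, 1, 0, 0, 1, 0]
print(coded["single"])   # [0, 0, 1, 0, 0, 0]
```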

Say that for some reason we choose to compare Britons and Poles with Norwegians. We therefore choose the Norwegian sample as our reference category and create one dummy variable for each of the other two countries: one on which Britons are assigned value 1 while Poles and Norwegians are assigned 0, and one on which Poles are assigned 1 while Britons and Norwegians are assigned 0. These variables can be computed in several ways. Here, we propose to use the following procedure: Use the ‘Compute Variable’ module from the ‘Transform’ menu. After opening the dialogue box, give the new variable a name (e.g. GreatBritain, since SPSS variable names cannot contain spaces) and a label (e.g. ‘Lives in Great Britain’). Then, on the right-hand side of the equals sign, type ANY(cntry,'GB'), or select the ANY(?,?) function from the ‘Functions and special variables’ list by clicking it before clicking the arrow. Then replace the first question mark with the country variable’s name and the second question mark with the string value of Britons (GB) in single quotes. (See how to get information about value codes here.) Finally, click ‘OK’.

Figure 13. Computing a dummy variable

Repeat the procedure to create a variable for members of the Polish sample. Check Figure 13 to see which changes you have to make to the commands in the dialogue box to create this second dummy variable. If you want to compare additional countries with the reference country, you must include them in the active data set and create dummy variables for each one of them. You can also compare groups of countries, but in this case you must use the population size weight in order to take account of their different population sizes. A link to information about the uses of this weight can be found here. See also our discussion of problems associated with pooling of country samples in chapter 7.

The coding of our two new dummy variables is displayed in Table 7.

| | Dummy for Poland | Dummy for Great Britain |
|---|---|---|
| Value Poles | 1 | 0 |
| Value Britons | 0 | 1 |
| Value Norwegians | 0 | 0 |

# Example, part two: regression with dummies

In order to estimate the association between the country variable and the education length variable we must use both these variables simultaneously in one single regression analysis. The regression function will look like this:

y_{i} = a + b_{1}∙x_{1i} + b_{2}∙x_{2i} + e_{i},

where y_{i} represents the education length values, x_{1i} the ‘Great Britain’ dummy variable values, and x_{2i} the ‘Poland’ dummy variable values. We use the linear regression dialogue box and enter the variables as shown in Figure 14.

This regression analysis produces the results presented in Table 8 and Table 9.

Table 8. SPSS output: Dummy variable regression goodness of fit statistics

Table 8 tells us that the differences between the mean education lengths of the three country samples ‘explain’ 5.3% of that variable’s total variance.

Note also that we can use the F-statistic to test whether there are any differences between the mean education lengths of the three countries’ populations. In our example, the result of this test is displayed as a Sig. value in the ANOVA table that we obtain when we run the regression analysis. This result is also displayed in the Model summary table, provided that we click the ‘Statistics’ button in the linear regression dialogue box and tick the ‘R squared change’ option before we run the analysis.

This F-test can be interpreted as a test of whether the set of dummy variables, conceived as a block of variables, explains any variance at all at the population level. This is a useful option because the variable set may have statistically significant aggregate effects even if none of the individual dummies have such effects when tested separately by means of the t-statistic (i.e. by means of the Sig. values obtained in the coefficients table).
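To make the link between explained variance and the block F-test concrete, here is a small Python sketch. It relies on the standard identity F = (R²/k) / ((1 − R²)/(n − k − 1)), where k is the number of dummies and n the number of cases, and on the fact that a regression on a full set of dummies predicts each group's mean. The data are made up, the function names are illustrative, and the sketch ignores design weights, so it does not reproduce the SPSS tables of this example.

```python
def r_squared(y, groups):
    """Share of the variance in y that is explained by group membership."""
    n = len(y)
    grand_mean = sum(y) / n
    sums, counts = {}, {}
    for yi, g in zip(y, groups):
        sums[g] = sums.get(g, 0.0) + yi
        counts[g] = counts.get(g, 0) + 1
    means = {g: sums[g] / counts[g] for g in sums}
    ss_total = sum((yi - grand_mean) ** 2 for yi in y)
    # Residuals are deviations from the person's own group mean.
    ss_resid = sum((yi - means[g]) ** 2 for yi, g in zip(y, groups))
    return 1 - ss_resid / ss_total


def f_statistic(r2, n, k):
    """F-statistic for the block of k dummies in a sample of n cases."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


# Hypothetical toy data: years of education in three country samples.
country = ["NO", "NO", "NO", "GB", "GB", "GB", "PL", "PL", "PL"]
educ = [12, 14, 13, 11, 12, 13, 10, 11, 12]

r2 = r_squared(educ, country)
f = f_statistic(r2, n=len(educ), k=2)  # two dummies: GB and PL
print(round(r2, 3), round(f, 3))
```

Turning the F-value into a Sig. value requires the F-distribution with k and n − k − 1 degrees of freedom, which SPSS looks up for us.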

The Constant coefficient shown in Table 9 can be interpreted as an estimate of the mean education length (in years) of people belonging to the reference category (i.e. Norwegians born before 1975). Why is that? Take a look at the regression function y_{i} = a + b_{1}∙x_{1i} + b_{2}∙x_{2i} + e_{i}. It results in the following predicted y-values: ŷ_{i} = a + b_{1}∙x_{1i} + b_{2}∙x_{2i}. We know that Norwegians’ values on both x-variables are 0. Hence, the only term on the right-hand side of the latter function that can differ from 0 is a, i.e. the constant term, which, according to Table 9, is estimated at 13.214. The implication is that, if the person whose x-values we put into the function is a Norwegian, that person’s predicted length of education is 13.214 years, and, since this analysis does not distinguish between different types of Norwegians, this must be an estimate of the mean education length of all Norwegians born before 1975.

The coefficient of the ‘Great Britain’ variable should be interpreted as an estimate of the difference in mean education lengths between those aged 30+ who live in Great Britain and those (30+) who live in Norway. If we insert the x-values of a person who lives in Great Britain, the function will read: y_{i} = a + b_{1}∙1 + b_{2}∙0 + e_{i}. Thus, the predicted value of the dependent variable is a + b_{1}, which has been estimated at 13.214 – 1.206 = 12.008; hence the estimated mean education length of those who live in Great Britain is 12.008 years (i.e. 1.206 years shorter than the Norwegian mean value). Similarly, the estimated mean education length of those who live in Poland is 13.214 – 1.988 = 11.226 years, i.e. 1.988 years shorter than the Norwegian mean value.
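The interpretation of the constant and the dummy coefficients can be checked on a toy example. The sketch below fits the regression function by ordinary least squares, solving the normal equations with a small hand-written elimination routine. The helper `ols`, the variable names and the nine made-up observations are assumptions for illustration; SPSS uses its own routines and, in this chapter, weighted ESS data.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination. X is a list of rows."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Pick the largest pivot in this column for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Hypothetical toy data; Norway (NO) is the reference category.
country = ["NO", "NO", "NO", "GB", "GB", "GB", "PL", "PL", "PL"]
educ = [12, 14, 13, 11, 12, 13, 10, 11, 12]

# Each row: constant term, Great Britain dummy, Poland dummy.
X = [[1, int(c == "GB"), int(c == "PL")] for c in country]
a, b1, b2 = ols(X, educ)

print(round(a, 3), round(b1, 3), round(b2, 3))
```

In this toy sample the group means are 13 (Norway), 12 (Great Britain) and 11 (Poland), so the constant comes out at the Norwegian mean and the two coefficients at the differences from it, which mirrors the reading of Table 9 given above.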

The Sig. values shown in Table 9 also reveal that, as long as we choose a significance level above 0.1%, both the British and the Polish mean education lengths are significantly shorter than the Norwegian one (i.e. significantly shorter in the statistical sense of the word).

# Exercises

- One ESS variable, ‘Doing last 7 days’, has already been coded as a set of dummy variables. Select the British sample and choose one activity as a reference category (for example, paid work) and use this set of dummies as the independent variable in a regression analysis with ‘How happy are you’ as the dependent variable.
- Recode ‘Legal marital status’ as a set of dummies and use this set as an independent variable in a regression analysis with ‘How happy are you’ as the dependent variable. Use the above-mentioned criteria when you choose the reference category.